Skip to content

Conversation

@jaswdr
Copy link

@jaswdr jaswdr commented Dec 16, 2025

Description

The query is slow due to the COLLATE "C" clause that was introduced in #217 and prevents the usage of indexes. I confirmed this by temporarily removing it locally and running compaction on 1 million random records multiple times. With COLLATE "C" removed, compaction time dropped from 15–20 seconds to under one second. This means that ideally the clause should be replaced/removed from the query. However, that is not possible because it is required to correctly order the results and removing it would produce incorrect ordering and introduce a risk of data corruption. I can definitely go this route, but it will require a significant amount of time/days to proper test and validate not only the code but specially the migration path, but let me know if you would be interested on waiting more time, or alternatively I can sync with you and talk about it on the high level.

As a result, I took a different approach by standardizing the encoding of the bucket_name field in the query and adding an index. The reasoning behind this is to avoid that PG keeps converting between different encodings ("C" and the database default). This change resulted in a 30–40% performance improvement, addressing the original concern raised in #400. More details on how this was tested in below section. Having that said I acknowledge that this is not the most optimized solution, that would be the removal of COLLATE "C" that I mentioned before. The tradeoff I'm doing here is to give value in a reasonable time with a solution that is less intrusive and doesn't require a significant amount of changes. Please be aware that I assumed here that the solution should target the query itself, rather than optimizing the entire compaction process. If this turns out to not be true please let me know.

Testing

Unit/Integration tests

cd ./modules/module-postgres-storage && pnpm run test
...
 Test Files  5 passed (5)
      Tests  73 passed (73)
   Start at  17:40:43
   Duration  22.70s (transform 575ms, setup 1.35s, collect 1.06s, tests 20.14s, environment 0ms, prepare 42ms)

Manual tests

The way I tested this was by first extracting the query from Typescript code and write an isolated test script, the actions this script does are as following:

  1. Cleanup/Drops the database to make sure we start from scratch
  2. Creates the table used in the query
  3. Inserts 1 million records to use for testing
  4. Run the query once to warm-up cache
  5. Run the query again multiple times and record the timing as baseline
  6. Create the index
  7. Run the query once after creating the index to warm-up cache
  8. Run the query again multiple times and record the timing for comparison with baseline

Manual test script: test-optimization.sh, please noticed that I don't have the information on the average distribution of operations that customers have in their table, so I for this test I used arbitrary values.

Example of execution:

DROP DATABASE
Inserting 1M rows...

=== BASELINE ===
Timing is on.
Time: 4.732 ms
Time: 3.112 ms
Time: 3.204 ms
Time: 2.687 ms
Time: 2.737 ms

Creating index...

=== OPTIMIZED ===
Timing is on.
Time: 4.109 ms
Time: 2.427 ms
Time: 1.577 ms
Time: 1.551 ms
Time: 1.540 ms

@changeset-bot
Copy link

changeset-bot bot commented Dec 16, 2025

⚠️ No Changeset found

Latest commit: 065502f

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@CLAassistant
Copy link

CLAassistant commented Dec 16, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@rkistner rkistner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd prefer to avoid an additional index here if possible - I'm hoping the existing primary key index can be used.

The original COLLATE 'C' may have been a mistake. What would the impact be if that is removed?

Note that we don't need to process buckets in any specific order (bucket_name field), but we do need to process the operations within a bucket in strict op_id order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants