Create idx_bucket_data_compact index to optimize Postgres compact query #442

jaswdr · 2025-12-16T16:48:12Z

Description

The query is slow due to the COLLATE "C" clause that was introduced in #217 and prevents the usage of indexes. I confirmed this by temporarily removing it locally and running compaction on 1 million random records multiple times. With COLLATE "C" removed, compaction time dropped from 15–20 seconds to under one second. This means that ideally the clause should be replaced/removed from the query. However, that is not possible because it is required to correctly order the results and removing it would produce incorrect ordering and introduce a risk of data corruption. I can definitely go this route, but it will require a significant amount of time/days to proper test and validate not only the code but specially the migration path, but let me know if you would be interested on waiting more time, or alternatively I can sync with you and talk about it on the high level.

As a result, I took a different approach by standardizing the encoding of the bucket_name field in the query and adding an index. The reasoning behind this is to avoid that PG keeps converting between different encodings ("C" and the database default). This change resulted in a 30–40% performance improvement, addressing the original concern raised in #400. More details on how this was tested in below section. Having that said I acknowledge that this is not the most optimized solution, that would be the removal of COLLATE "C" that I mentioned before. The tradeoff I'm doing here is to give value in a reasonable time with a solution that is less intrusive and doesn't require a significant amount of changes. Please be aware that I assumed here that the solution should target the query itself, rather than optimizing the entire compaction process. If this turns out to not be true please let me know.

Testing

Unit/Integration tests

cd ./modules/module-postgres-storage && pnpm run test
...
 Test Files  5 passed (5)
      Tests  73 passed (73)
   Start at  17:40:43
   Duration  22.70s (transform 575ms, setup 1.35s, collect 1.06s, tests 20.14s, environment 0ms, prepare 42ms)

Manual tests

The way I tested this was by first extracting the query from Typescript code and write an isolated test script, the actions this script does are as following:

Cleanup/Drops the database to make sure we start from scratch
Creates the table used in the query
Inserts 1 million records to use for testing
Run the query once to warm-up cache
Run the query again multiple times and record the timing as baseline
Create the index
Run the query once after creating the index to warm-up cache
Run the query again multiple times and record the timing for comparison with baseline

Manual test script: test-optimization.sh, please noticed that I don't have the information on the average distribution of operations that customers have in their table, so I for this test I used arbitrary values.

Example of execution:

DROP DATABASE
Inserting 1M rows...

=== BASELINE ===
Timing is on.
Time: 4.732 ms
Time: 3.112 ms
Time: 3.204 ms
Time: 2.687 ms
Time: 2.737 ms

Creating index...

=== OPTIMIZED ===
Timing is on.
Time: 4.109 ms
Time: 2.427 ms
Time: 1.577 ms
Time: 1.551 ms
Time: 1.540 ms

changeset-bot · 2025-12-16T16:48:15Z

⚠️ No Changeset found

Latest commit: 065502f

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

CLAassistant · 2025-12-16T16:48:21Z

All committers have signed the CLA.

rkistner

We'd prefer to avoid an additional index here if possible - I'm hoping the existing primary key index can be used.

The original COLLATE 'C' may have been a mistake. What would the impact be if that is removed?

Note that we don't need to process buckets in any specific order (bucket_name field), but we do need to process the operations within a bucket in strict op_id order.

Create idx_bucket_data_compact index

065502f

jaswdr mentioned this pull request Dec 16, 2025

[Postgres Storage] Inefficient index usage when compacting #400

Open

rkistner reviewed Dec 19, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Create idx_bucket_data_compact index to optimize Postgres compact query #442

Create idx_bucket_data_compact index to optimize Postgres compact query #442

jaswdr commented Dec 16, 2025 •

edited

Loading

Uh oh!

changeset-bot bot commented Dec 16, 2025

Uh oh!

CLAassistant commented Dec 16, 2025 •

edited

Loading

Uh oh!

rkistner left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Create idx_bucket_data_compact index to optimize Postgres compact query #442

Are you sure you want to change the base?

Create idx_bucket_data_compact index to optimize Postgres compact query #442

Conversation

jaswdr commented Dec 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Testing

Unit/Integration tests

Manual tests

Uh oh!

changeset-bot bot commented Dec 16, 2025

⚠️ No Changeset found

Uh oh!

CLAassistant commented Dec 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rkistner left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jaswdr commented Dec 16, 2025 •

edited

Loading

CLAassistant commented Dec 16, 2025 •

edited

Loading