Create idx_bucket_data_compact index to optimize Postgres compact query #442
Description
The query is slow because of the `COLLATE "C"` clause introduced in #217, which prevents index usage. I confirmed this by temporarily removing the clause locally and running compaction on 1 million random records multiple times: with `COLLATE "C"` removed, compaction time dropped from 15–20 seconds to under one second. Ideally, then, the clause would be replaced or removed from the query. That is not straightforward, however, because the clause is required to order the results correctly; removing it naively would produce incorrect ordering and introduce a risk of data corruption. I can definitely go that route, but it would take a significant amount of time (days) to properly test and validate not only the code but especially the migration path. Let me know if you would prefer to wait for that, or alternatively we can sync and discuss it at a high level.
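For context, a minimal reproduction of the collation/index interaction on a throwaway table (illustrative names, not the actual schema):

```sql
-- Illustrative only: the demo table/index names are made up for this example.
CREATE TABLE demo (name text);
CREATE INDEX demo_name_idx ON demo (name);  -- built under the column's default collation

-- With enough rows, the planner can use demo_name_idx for this ordering:
EXPLAIN SELECT * FROM demo ORDER BY name;

-- It cannot use demo_name_idx here: the ORDER BY collation ("C") does not
-- match the collation the index was built with, so Postgres falls back to a sort.
EXPLAIN SELECT * FROM demo ORDER BY name COLLATE "C";

-- An index declared with the matching collation can serve the same ORDER BY:
CREATE INDEX demo_name_c_idx ON demo (name COLLATE "C");
EXPLAIN SELECT * FROM demo ORDER BY name COLLATE "C";
```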
As a result, I took a different approach: standardizing the collation of the bucket_name field in the query and adding an index (sketched below). The reasoning is to stop PG from repeatedly converting between different collations ("C" and the database default). This change resulted in a 30–40% performance improvement, addressing the original concern raised in #400; more details on how this was tested are in the section below. That said, I acknowledge this is not the most optimized solution; that would be the removal of `COLLATE "C"` mentioned above. The tradeoff I'm making here is delivering value in a reasonable time with a solution that is less intrusive and doesn't require a significant amount of changes. Please be aware that I assumed the solution should target the query itself, rather than optimizing the entire compaction process; if that turns out to be wrong, please let me know.
Testing
Unit/Integration tests
Manual tests
The way I tested this was by first extracting the query from the TypeScript code and writing an isolated test script; the actions this script performs are as follows:
Manual test script: test-optimization.sh. Please note that I don't have information on the average distribution of operations in customers' tables, so for this test I used arbitrary values.
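For reference, a rough psql sketch of this kind of isolated benchmark; the table shape, row counts, and stand-in query are made up here, and the real steps live in test-optimization.sh:

```sql
-- Made-up benchmark shape; the real compact query and steps are in
-- test-optimization.sh. All names below are assumptions for illustration.
\timing on

-- Seed ~1M random rows (arbitrary value distribution, as noted above).
CREATE TABLE IF NOT EXISTS bench_bucket_data (bucket_name text, name text);
INSERT INTO bench_bucket_data (bucket_name, name)
SELECT 'bucket-' || (i % 50), md5(random()::text)
FROM generate_series(1, 1000000) AS i;

-- Time an ordered scan under COLLATE "C" before the index exists...
EXPLAIN (ANALYZE, BUFFERS)
SELECT bucket_name, name
FROM bench_bucket_data
ORDER BY bucket_name COLLATE "C", name COLLATE "C";

-- ...then add a matching-collation index and time the same scan again.
CREATE INDEX bench_compact_idx
    ON bench_bucket_data (bucket_name COLLATE "C", name COLLATE "C");
ANALYZE bench_bucket_data;

EXPLAIN (ANALYZE, BUFFERS)
SELECT bucket_name, name
FROM bench_bucket_data
ORDER BY bucket_name COLLATE "C", name COLLATE "C";
```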
Example of execution: