Dryrun caching #8486
Conversation
Force-pushed from e4ec61f to e5f7754
dryrun failing due to unrelated error in
Force-pushed from af4028d to 8932af9
Force-pushed from 7f0b230 to 163789c
I won't merge this before next week.
BenWu left a comment
It looks like the test failure is due to reset-stage-env not having a dependency on test_routines, so stage UDFs are getting deleted.
bigquery_etl/dryrun.py
Outdated
```python
try:
    # write to temporary file first, then atomically rename
    # this prevents race conditions where readers get partial files
    temp_file = Path(str(cache_file) + f".tmp.{os.getpid()}")
```
I think there's a possible edge case if this is called from a thread pool where the threads have the same PID; an extra random string might be better.
Suggested change:
```diff
- temp_file = Path(str(cache_file) + f".tmp.{os.getpid()}")
+ temp_file = Path(str(cache_file) + f".tmp.{os.getpid()}.{os.urandom(4).hex()}")
```
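For context, a minimal sketch of the full write-then-rename pattern with the suggested random suffix, assuming a pickle-based cache file (`write_cache` is an illustrative name, not the PR's actual helper):

```python
import os
import pickle
from pathlib import Path


def write_cache(cache_file: Path, payload) -> None:
    """Write a cache entry atomically: dump to a uniquely named temp
    file, then rename it over the target so concurrent readers never
    observe a partially written file."""
    # the PID alone can repeat across threads of one process, so a
    # random suffix is appended as suggested above
    temp_file = Path(f"{cache_file}.tmp.{os.getpid()}.{os.urandom(4).hex()}")
    try:
        with open(temp_file, "wb") as f:
            pickle.dump(payload, f)
        os.replace(temp_file, cache_file)  # atomic on the same filesystem
    finally:
        temp_file.unlink(missing_ok=True)  # no-op if the rename succeeded
```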
bigquery_etl/dryrun.py
Outdated
```python
table_cache_key = hashlib.sha256(table_identifier.encode()).hexdigest()
cache_file = cache_dir / f"table_metadata_{table_cache_key}.pkl"

try:
```
The cache read and write code is almost identical between the metadata and the dry run result. Moving the common code into a common function would make this easier to maintain.
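One possible shape for that shared helper, sketched under the assumption that both caches are pickle files with a freshness window (`read_cache` and its parameters are hypothetical, not from the PR):

```python
import pickle
import time
from pathlib import Path
from typing import Any, Optional


def read_cache(cache_file: Path, ttl_seconds: int) -> Optional[Any]:
    """Return the cached value if the entry exists and is fresh, else None."""
    try:
        if time.time() - cache_file.stat().st_mtime > ttl_seconds:
            return None  # entry expired
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    except (OSError, pickle.PickleError):
        # treat missing or unreadable entries as cache misses
        return None
```

Both the table-metadata path and the dry-run-result path could then call the same read/write pair instead of duplicating the logic.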
Force-pushed from 7fbd6bb to 63b77e5
Integration report for "Address feedback"
```python
project=project_name,
dataset=dataset_name,
table=table_name,
```
It turns out this particular change is causing problems with deploying ETLs to stage:
Description
This PR adds a cache to DryRun which caches query and table schemas for a configurable amount of time, for as long as the query content stays the same. This slightly reduces the schema generation time that runs before table artifact deploys (locally it saved 3 to 4 minutes).
Reviewer, please follow this checklist
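To make the description concrete, here is a rough sketch of how a content-keyed, TTL-bounded dry-run cache could fit together; all names below are illustrative, and `run_dry_run` stands in for whatever actually issues the dry-run request:

```python
import hashlib
from pathlib import Path


def cached_dry_run(sql: str, cache_dir: Path, ttl_seconds: int = 3600):
    # key the entry on the query content so any edit to the SQL
    # invalidates it, independent of the time-based expiry
    key = hashlib.sha256(sql.encode()).hexdigest()
    cache_file = cache_dir / f"dry_run_{key}.pkl"

    result = read_cache(cache_file, ttl_seconds)  # see sketch above
    if result is None:
        result = run_dry_run(sql)  # hypothetical call to the dry-run API
        write_cache(cache_file, result)  # see atomic-write sketch above
    return result
```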