-
-
Notifications
You must be signed in to change notification settings - Fork 133
feat(experimental): Add schema change support for BigQuery #499
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
c897c20 to
607d30c
Compare
| pg_escape = { version = "0.1.1", default-features = false } | ||
| pin-project-lite = { version = "0.2.16", default-features = false } | ||
| postgres-replication = { git = "https://github.com/MaterializeInc/rust-postgres", default-features = false, rev = "c4b473b478b3adfbf8667d2fbe895d8423f1290b" } | ||
| postgres-replication = { git = "https://github.com/iambriccardo/rust-postgres", default-features = false, rev = "31acf55c7e5c2244e5bb3a36e7afa2a01bf52c38" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟠 Severity: HIGH
Supply Chain Risk: Unverified Git Dependencies
Two critical PostgreSQL driver dependencies (postgres-replication and tokio-postgres) were switched from the trusted Materialize fork to a personal GitHub fork. These handle authentication and replication stream parsing. Unlike crates.io packages, git dependencies lack security vetting, can be force-pushed or deleted, and create single points of failure. The same risk applies to line 77 (tokio-postgres).
Helpful? Add 👍 / 👎
💡 Fix Suggestion
Suggestion: Replace the personal GitHub fork (https://github.com/iambriccardo/rust-postgres) with a trusted source. Options: (1) If possible, use the official crates.io versions of postgres-replication and tokio-postgres; (2) If custom patches are required, fork to your organization's GitHub account with proper access controls and security review processes; (3) If changes are ready, publish to crates.io after security review; (4) Document the specific changes required from the fork and track upstream merge status. This affects both line 60 (postgres-replication) and line 77 (tokio-postgres). Before making changes, identify what patches or modifications exist in the personal fork that differ from upstream to ensure functionality isn't broken.
Schema Evolution Support for BigQuery Destinations
This PR introduces end-to-end schema evolution support for BigQuery destinations, enabling replication pipelines to detect, track, and apply schema changes from source PostgreSQL tables.
Design Overview
DDL Event Trigger
Schema changes are captured transactionally using a PostgreSQL DDL event trigger. When an
ALTER TABLEexecutes, the trigger fires within the same transaction, emitting a logical replication message before the transaction commits. This guarantees schema change events are ordered consistently with data changes in the replication stream.Schema Versioning
Schema versions are tracked using
snapshot_id:snapshot_id=0(base schema)snapshot_id=start_lsnof the DDL logical message, providing total ordering and stable identifiersOn restart, the system loads the schema version with the largest
snapshot_id <= confirmed_flush_lsn.Replication Masks
A replication mask is a bitmask determining which columns of a
TableSchemaare actively replicated. This decouples column-level filtering from schema loading by:Relationmessages to determine which columns to replicateReplicatedTableSchema, a wrapper exposing only active columnsThis allows columns to be added/removed from a publication without breaking the pipeline.
Destination Table Metadata
Introduces
DestinationTableMetadatato store destination-side table metadata, replacing previous table mappings with a more extensible structure. The system tracks applied schemas in its own store rather than querying the destination catalog, storing:snapshot_id: last successfully applied schemareplication_mask: column mask for the applied schemaprevious_snapshot_id: schema before current (used for recovery)Transactional Safety
Schema change atomicity varies by destination, so we use a two-phase state approach:
When a schema change begins, we transition to
Applyingbefore modifying the destination. If failure occurs mid-change, this state signals corruption and enables recovery flows. Recovery usesprevious_snapshot_idto identify the last known-good schema.Note: BigQuery lacks atomic DDL for multiple statements, so failed schema changes cannot safely roll back and require manual intervention.
Schema Diffing
When a
Relationevent arrives, the destination receives aReplicatedTableSchema. Diffing is performed at the destination layer, giving each destination full control over handling schema changes. TheReplicatedTableSchema::diff()method provides a simple changeset, but destinations can implement custom logic.How It Works
ALTER TABLERelationmessage with updated column informationRelationEventcontaining the newReplicatedTableSchemaOther Changes
describe_table_schemaPostgreSQL function returns schema data in a consistent structure for both initial copy and DDL eventsReplicatedTableSchemais supplied with each event, eliminating schema loading in destinations and enforcing invariants via the type systemFuture Work
RelationEventis required