Replies: 1 comment
Hi @marrov,

Quarantine records are definitely an interesting topic in data engineering. While PySpark's compute engine has a placeholder for such functionality, it was never fully implemented. In fact, I recall that in the early branches of this repository, a commit by @vestalisvirginis introduced a

The concept is straightforward, and I agree that it would be a valuable addition to cuallee's functionality. However, based on my experience implementing quarantine records (particularly in use cases like an Asset Manager receiving data from well-known vendors like FactSet or Bloomberg), I’ve found that quarantine mechanisms may not always be worth the added complexity. Here's why:

- Dataset Management Overhead
- Increased Complexity in Automation
- Schema Maintenance Challenges
- Philosophical Conflict

That said, adding a quarantine feature could enhance the value of validation checks. The challenge lies in exposing the relevant predicates from the underlying compute engine (e.g., PySpark) through the user-facing interface, such as cuallee's Check.

Let me think through some potential approaches for implementing this. In the meantime, I'd love to hear your thoughts and ideas as well. I hope this perspective is helpful!
This is a question that I have seen repeatedly from DEs working with cuallee: can we use the checks in cuallee to enable a quarantining feature?

With cuallee you get the number of rows that fail your checks but not the rows themselves, AFAIK. The use case here is that, instead of just having that number as a statistic collected in a table, users would like to use the defined checks to split the incoming df into one that passes the tests and one that does not. This way the bad records can be investigated and either re-processed into the clean table or dropped entirely.

Any thoughts on this? Or maybe this idea is an anti-pattern altogether? Would love to hear your thoughts about this.
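To make the use case concrete, here is a minimal sketch of the split in plain Python. It does not use cuallee's API: the `split_by_checks` helper, the check names, and the sample rows are all hypothetical, and rows are modelled as dicts rather than a real DataFrame.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Predicate = Callable[[Row], bool]


def split_by_checks(
    rows: List[Row], checks: Dict[str, Predicate]
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (clean, quarantined) using named row-level predicates.

    Hypothetical helper for illustration only; not part of cuallee.
    """
    clean: List[Row] = []
    quarantined: List[Row] = []
    for row in rows:
        # Collect the name of every check this row fails.
        failed = [name for name, passes in checks.items() if not passes(row)]
        if failed:
            # Keep the reasons with the record so it can be investigated later.
            quarantined.append({**row, "failed_checks": failed})
        else:
            clean.append(row)
    return clean, quarantined


# Assumed example checks and data (not taken from the discussion).
checks: Dict[str, Predicate] = {
    "id_is_not_null": lambda r: r["id"] is not None,
    "price_is_positive": lambda r: r["price"] is not None and r["price"] > 0,
}
rows: List[Row] = [
    {"id": 1, "price": 10.0},
    {"id": None, "price": 5.0},
    {"id": 3, "price": -2.0},
]

clean, quarantined = split_by_checks(rows, checks)
```

On a Spark DataFrame the same idea would presumably be two `filter` calls on a combined boolean column and its negation. One caveat there: a predicate that evaluates to null drops the row from both sides unless nulls are handled explicitly.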