-
Notifications
You must be signed in to change notification settings - Fork 0
FEAT: Reapply mask to discarded mutations #412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
- following what Fede (federica.brando@irbbarcelona.org) did in the PR #153
- Also adding a new feature to extract flagged regions to a bed file. - Adding wrapper function to clean-up main. e On branch 118-reapply-mask-discarded-mutation
e Your branch is ahead of 'origin/118-reapply-mask-discarded-mutation' by 1 commit.
- add output as pipeline in 'module.config' - add output in `filtermaf` module - add output in the `mutationpreprocessing`subworkflow
- Added ANNOTATEPANELS with the VEP and other annotation steps - CREATEPANELS now only contains the creation of the modalities of panels
- Bed files emited by ANNOTATEPANEL are used by MUTPREPROCESSING - Added arguments to CREATECONSENSUSPANELS for the ANNOTATEPANEL subworkflow.
- Get the flagged bed into 0-based half-open coordinates. - Useful to use bedtools and filter the positions from the panels.
- publishDir was wrongly done, already corrected - uncommented the filtering logic in CREATEPANELS although it is not doing anything as of now
- Created new process using bedtools to remove positions from panel tsv file using the flagged bed obtained from MUTPREPROCESSING - Added this process to the CREATE_PANELS subworkflow - Added metadata about this process - Added flagged bed as input to CREATEPANELS
-Changed to ANNOTATEPANELS
- Also removed debug printing of channels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request introduces a new filtering mechanism to remove variants from annotated panels that overlap with flagged genomic regions, significantly refactors the panel creation workflow into two distinct subworkflows (ANNOTATEPANELS and CREATEPANELS), and reapplies code reviews from PR #153. The changes enable masking of sites where filters (repetitive_variant, cohort_n_rich, other_sample_SNP) were applied to ensure these regions are properly excluded from downstream analysis.
Changes:
- Added a new FILTER_CAPTURED_PANEL module that uses bedtools intersect to remove variants overlapping flagged regions and track removed variants
- Split panel creation into two workflows: ANNOTATEPANELS (handles VEP annotation, custom annotation, and domain annotation) and CREATEPANELS (applies filtering and generates final panels)
- Refactored filter_cohort.py with improved code structure, logging, documentation, and a new extract_flagged_regions_bed function to output flagged positions in BED format
- Updated workflow dependencies so MUTPREPROCESSING outputs flagged_bed that feeds into CREATEPANELS for filtering
Reviewed changes
Copilot reviewed 10 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| workflows/deepcsa.nf | Reorganized workflow to use ANNOTATEPANELS before MUTPREPROCESSING, then CREATEPANELS with flagged_bed input; updated references from CREATEPANELS to ANNOTATEPANELS outputs where appropriate |
| subworkflows/local/mutationpreprocessing/main.nf | Added flagged_bed emission from FILTERBATCH output to enable downstream filtering |
| subworkflows/local/createpanels/createpanels/main.nf | Refactored to accept complete_annotated_panel and flagged_bed inputs, applies FILTERCAPTUREDPANEL before generating panels, removed annotation logic moved to ANNOTATEPANELS |
| subworkflows/local/createpanels/createpanels/meta.yml | Added empty meta.yml file (requires documentation) |
| subworkflows/local/createpanels/annotatepanels/main.nf | New subworkflow containing annotation logic (VEP, custom annotation, domain annotation) previously in createpanels, generates initial panels for MUTPREPROCESSING |
| subworkflows/local/createpanels/annotatepanels/meta.yml | Comprehensive documentation for the annotation workflow with detailed input/output descriptions |
| modules/local/filtermaf/main.nf | Added flagged_muts output emission for flagged-pos.bed file |
| modules/local/filter_captured_panel/main.nf | New module that filters annotated panel using bedtools intersect to remove variants overlapping flagged regions, outputs both filtered panel and removed variants |
| modules/local/filter_captured_panel/meta.yml | Documentation for the filtering module |
| conf/modules.config | Split FILTERBATCH configuration to publish .bed files, disabled publishing for MERGEBATCH |
| conf/tools/panels.config | Updated process paths from CREATEPANELS to ANNOTATEPANELS for VEP and related processes |
| bin/utils.py | Improved documentation for add_filter function with proper parameter descriptions |
| bin/filter_cohort.py | Major refactoring: split monolithic main function into separate functions (flag_repetitive_variants, flag_cohort_n_rich, flag_other_samples_snp, flag_gnomad_snp, expand_filter_column, extract_flagged_regions_bed), added logging, improved documentation, added BED output generation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Publish FILTERCAPTUREDPANEL output into createpanels folder - Add bedtools to versions instead of python
FerriolCalvet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good Marta!
I added some minor comments in some files but I also have some more general comments that I would like to discuss before merging, mainly because there are different masking scenarios that probably I did not acknowledge properly :pray-hands:
(I am trying to explain what I understood to make sure that I got it properly)
The way it works for the masking is by removing the positions from the panels let's say that there is a flagged position in the middle of an exon, this would result in the exon in the BED regions of the consensus panel being split in two right? and on that position not being in the consensus panel TSV.
OncodriveFML requires a BED file with the coordinates of the different genomic regions, this is probably not impacting at all, but we could look at the consequence of this change outside omega.
Initially I thought that a possibility could be to "just" putting the depths to 0 of that position and then this could have a similar effect to the final definitions of panels. The reality is that with the current implementation this works perfect for positions that need to be masked for the entire cohort, but does not allow the flexibility of removing a position only in one sample.
We can discuss whether we want to merge this implementation and then work on an additional processing that enables the sample-specific tuning of depths that could then impact the panel building process downstream.
Minor more general comments:
- do we want to use the ymls? I guess it does no harm here, but we should discuss it, and probably apply it as part of the clean branch
| label 'process_single' | ||
| label 'process_medium_high_memory' | ||
|
|
||
| container "community.wave.seqera.io/library/bedtools_pybedtools_pandas_pip_pruned:78080da05d53636d" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be good to take the Dockerfile and upload the image to our dockerhub and the dockerfile to our docker github repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revise where else it is used and maybe replace everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is in createcapturedpanels
| LOG = logging.getLogger("filter_cohort") | ||
|
|
||
| # Globals | ||
| FILTERS = ["cohort_n_rich", "cohort_n_rich_uni", "other_sample_SNP", "repetitive_variant", "cohort_n_rich_threshold", "gnomAD_SNP", "nanoseq_snp", "nanoseq_noise"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We briefly discussed this off line and I still believe that it would be nice to directly add a new parameter to set this list of filters to apply to the background.
I am removing cohort_n_rich_threshold from this list since it is not applied cohort-wise but has some sample specificity. This filter would benefit from the sample specific filtering with the deths (see overall comment)
I am removing repetitive_variant since we do not necessarily want to filter those out (this would benefit from having a parameter to indicate this list of filters to be applied to the background)
| FILTERS = ["cohort_n_rich", "cohort_n_rich_uni", "other_sample_SNP", "repetitive_variant", "cohort_n_rich_threshold", "gnomAD_SNP", "nanoseq_snp", "nanoseq_noise"] | |
| FILTERS = ["cohort_n_rich", "cohort_n_rich_uni", "other_sample_SNP", "gnomAD_SNP", "nanoseq_snp", "nanoseq_noise"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we agree on using the filter_criteria and filter_criteria_somatic to define which of these flags should be used for masking the panels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, just to clarify and make this right. This are the default values of the filters:
filter_criteria = ["notcontains NM20", "notcontains n_rich", "notcontains cohort_n_rich_threshold", "notcontains cohort_n_rich", "notcontains no_pileup_support", "notcontains low_mappability", "notcontains not_covered", "notcontains nanoseq_noise"]
filter_criteria_somatic = ["notcontains gnomAD_SNP", "notcontains nanoseq_snp"]I was thinking that we could use this same parameters and let the user define which positions to filter depths/panels as well (which will be the ones in the mutations), but using the sample-specific parameters in the depth=0 filtering and the others in this approach. What do you think, is this what you were thinking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand it properly yes, basically all the filters that are applied should be removed from the background, either at cohort-level or sample-level.
|
|
||
| return maf_df | ||
|
|
||
| def extract_flagged_regions_bed(maf_df: pd.DataFrame, maf_file_name: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we go for the implementation where we apply sample specific filters, we would need to report this information per sample as well
|
Thank you for the review Ferriol! I will check the comments tomorrow, but I wanted to answer the general comment so everything is here in case we need it. You're right, removing a position from the panel makes the regions in the bed to be split. About the ymls, I think it could be useful for new people trying to understand how the pipeline works. But I am okay with moving them to the clean branch. |
|
Some time ago I made some fixes to prevent the 0s problem Fede experienced above (4551e17), so hopefully this should not be a problem now. I don't remember if there were other methods for which this was also a problem but I think it is worth testing it again. And yes, agree with the ymls, I would leave the ones you created here and then try to add updated ones for all steps in the clean branch. let's discuss soon! |
|
We have decided to try applying the option of forcing the depth to 0 for those positions to be removed in specific samples. If it is too complex, we could merge this branch and do it later. |
Copilot summary
This pull request introduces a new filtering step for annotated panels to remove variants overlapping flagged genomic regions, updates the panel creation workflow to integrate this filtering, and makes several related configuration and workflow changes. Additionally, some code reviews from PR153 are reapplied
Key changes:
New module for filtering captured panels:
filter_captured_panelprocess (modules/local/filter_captured_panel/main.nf) and its metadata (modules/local/filter_captured_panel/meta.yml), which filters annotated panels usingbedtools intersectto remove variants overlapping flagged regions. Outputs both filtered panels and a list of removed variants. [1] [2]Integration of filtering into panel creation workflow:
subworkflows/local/createpanels/createpanels/main.nf, previouslymain.nf) to include the new filtering step (FILTERCAPTUREDPANEL). The workflow now takes aflagged_bedinput fromMUTPREPROCESSINGand generates panels from the filtered annotated panel. The output now includes the list of removed variants. [1] [2] [3]subworkflows/local/createpanels/annotatepanels/main.nf). [1] [2]Workflow and configuration updates:
workflows/deepcsa.nf) to use the new, more modularCREATEPANELSandANNOTATEPANELSsubworkflows.flagged_bedoutput from theFILTERBATCHprocess, enabling downstream filtering.FILTERBATCHprocess inconf/modules.configto publish.bedfiles for flagged mutations, and adjustedMERGEBATCHand related processes to disable publishing..bedfiles to theFILTER_BATCHprocess inmodules/local/filtermaf/main.nf.