Skip to content

Conversation

@m-huertasp
Copy link
Collaborator

Copilot summary

This pull request introduces a new filtering step for annotated panels to remove variants overlapping flagged genomic regions, updates the panel creation workflow to integrate this filtering, and makes several related configuration and workflow changes. Additionally, some code reviews from PR153 are reapplied

Key changes:

New module for filtering captured panels:

  • Added the filter_captured_panel process (modules/local/filter_captured_panel/main.nf) and its metadata (modules/local/filter_captured_panel/meta.yml), which filters annotated panels using bedtools intersect to remove variants overlapping flagged regions. Outputs both filtered panels and a list of removed variants. [1] [2]

Integration of filtering into panel creation workflow:

  • Refactored the panel creation workflow (subworkflows/local/createpanels/createpanels/main.nf, previously main.nf) to include the new filtering step (FILTERCAPTUREDPANEL). The workflow now takes a flagged_bed input from MUTPREPROCESSING and generates panels from the filtered annotated panel. The output now includes the list of removed variants. [1] [2] [3]
  • Removed the annotation and customization steps from the main panel creation workflow, moving them to a separate annotation workflow (subworkflows/local/createpanels/annotatepanels/main.nf). [1] [2]

Workflow and configuration updates:

  • Updated the main workflow (workflows/deepcsa.nf) to use the new, more modular CREATEPANELS and ANNOTATEPANELS subworkflows.
  • Modified the mutation preprocessing workflow to emit the flagged_bed output from the FILTERBATCH process, enabling downstream filtering.
  • Updated the FILTERBATCH process in conf/modules.config to publish .bed files for flagged mutations, and adjusted MERGEBATCH and related processes to disable publishing.
  • Added output of flagged mutation .bed files to the FILTER_BATCH process in modules/local/filtermaf/main.nf.

- following what Fede (federica.brando@irbbarcelona.org) did
in the PR #153
- Also adding a new feature to extract flagged
regions to a bed file.
- Adding wrapper function to clean-up main.

e On branch 118-reapply-mask-discarded-mutation
e Your branch is ahead of 'origin/118-reapply-mask-discarded-mutation' by 1 commit.
- add output as pipeline in 'module.config'
- add output in `filtermaf` module
- add output in the `mutationpreprocessing`subworkflow
- Added ANNOTATEPANELS with the VEP and other annotation steps
- CREATEPANELS now only contains the creation of the modalities of
  panels
- Bed files emited by ANNOTATEPANEL are used by MUTPREPROCESSING
- Added arguments to CREATECONSENSUSPANELS for the ANNOTATEPANEL
subworkflow.
- Get the flagged bed into 0-based half-open
coordinates.
- Useful to use bedtools and filter the positions
from the panels.
- publishDir was wrongly done, already corrected
- uncommented the filtering logic in CREATEPANELS
although it is not doing anything as of now
- Created new process using bedtools to remove positions from
panel tsv file using the flagged bed obtained from MUTPREPROCESSING
- Added this process to the CREATE_PANELS subworkflow
- Added metadata about this process
- Added flagged bed as input to CREATEPANELS
- Also removed debug printing of channels
@m-huertasp m-huertasp self-assigned this Jan 19, 2026
@m-huertasp m-huertasp added new-feature New functionality being added. discuss Any potential implementation that requires discussion labels Jan 19, 2026
@m-huertasp m-huertasp linked an issue Jan 19, 2026 that may be closed by this pull request
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request introduces a new filtering mechanism to remove variants from annotated panels that overlap with flagged genomic regions, significantly refactors the panel creation workflow into two distinct subworkflows (ANNOTATEPANELS and CREATEPANELS), and reapplies code reviews from PR #153. The changes enable masking of sites where filters (repetitive_variant, cohort_n_rich, other_sample_SNP) were applied to ensure these regions are properly excluded from downstream analysis.

Changes:

  • Added a new FILTER_CAPTURED_PANEL module that uses bedtools intersect to remove variants overlapping flagged regions and track removed variants
  • Split panel creation into two workflows: ANNOTATEPANELS (handles VEP annotation, custom annotation, and domain annotation) and CREATEPANELS (applies filtering and generates final panels)
  • Refactored filter_cohort.py with improved code structure, logging, documentation, and a new extract_flagged_regions_bed function to output flagged positions in BED format
  • Updated workflow dependencies so MUTPREPROCESSING outputs flagged_bed that feeds into CREATEPANELS for filtering

Reviewed changes

Copilot reviewed 10 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
workflows/deepcsa.nf Reorganized workflow to use ANNOTATEPANELS before MUTPREPROCESSING, then CREATEPANELS with flagged_bed input; updated references from CREATEPANELS to ANNOTATEPANELS outputs where appropriate
subworkflows/local/mutationpreprocessing/main.nf Added flagged_bed emission from FILTERBATCH output to enable downstream filtering
subworkflows/local/createpanels/createpanels/main.nf Refactored to accept complete_annotated_panel and flagged_bed inputs, applies FILTERCAPTUREDPANEL before generating panels, removed annotation logic moved to ANNOTATEPANELS
subworkflows/local/createpanels/createpanels/meta.yml Added empty meta.yml file (requires documentation)
subworkflows/local/createpanels/annotatepanels/main.nf New subworkflow containing annotation logic (VEP, custom annotation, domain annotation) previously in createpanels, generates initial panels for MUTPREPROCESSING
subworkflows/local/createpanels/annotatepanels/meta.yml Comprehensive documentation for the annotation workflow with detailed input/output descriptions
modules/local/filtermaf/main.nf Added flagged_muts output emission for flagged-pos.bed file
modules/local/filter_captured_panel/main.nf New module that filters annotated panel using bedtools intersect to remove variants overlapping flagged regions, outputs both filtered panel and removed variants
modules/local/filter_captured_panel/meta.yml Documentation for the filtering module
conf/modules.config Split FILTERBATCH configuration to publish .bed files, disabled publishing for MERGEBATCH
conf/tools/panels.config Updated process paths from CREATEPANELS to ANNOTATEPANELS for VEP and related processes
bin/utils.py Improved documentation for add_filter function with proper parameter descriptions
bin/filter_cohort.py Major refactoring: split monolithic main function into separate functions (flag_repetitive_variants, flag_cohort_n_rich, flag_other_samples_snp, flag_gnomad_snp, expand_filter_column, extract_flagged_regions_bed), added logging, improved documentation, added BED output generation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Publish FILTERCAPTUREDPANEL output into createpanels folder
- Add bedtools to versions instead of python
Copy link
Collaborator

@FerriolCalvet FerriolCalvet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good Marta!
I added some minor comments in some files but I also have some more general comments that I would like to discuss before merging, mainly because there are different masking scenarios that probably I did not acknowledge properly :pray-hands:

(I am trying to explain what I understood to make sure that I got it properly)
The way it works for the masking is by removing the positions from the panels let's say that there is a flagged position in the middle of an exon, this would result in the exon in the BED regions of the consensus panel being split in two right? and on that position not being in the consensus panel TSV.
OncodriveFML requires a BED file with the coordinates of the different genomic regions, this is probably not impacting at all, but we could look at the consequence of this change outside omega.

Initially I thought that a possibility could be to "just" putting the depths to 0 of that position and then this could have a similar effect to the final definitions of panels. The reality is that with the current implementation this works perfect for positions that need to be masked for the entire cohort, but does not allow the flexibility of removing a position only in one sample.
We can discuss whether we want to merge this implementation and then work on an additional processing that enables the sample-specific tuning of depths that could then impact the panel building process downstream.

Minor more general comments:

  • do we want to use the ymls? I guess it does no harm here, but we should discuss it, and probably apply it as part of the clean branch

label 'process_single'
label 'process_medium_high_memory'

container "community.wave.seqera.io/library/bedtools_pybedtools_pandas_pip_pruned:78080da05d53636d"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good to take the Dockerfile and upload the image to our dockerhub and the dockerfile to our docker github repo

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revise where else it is used and maybe replace everywhere

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is in createcapturedpanels

LOG = logging.getLogger("filter_cohort")

# Globals
FILTERS = ["cohort_n_rich", "cohort_n_rich_uni", "other_sample_SNP", "repetitive_variant", "cohort_n_rich_threshold", "gnomAD_SNP", "nanoseq_snp", "nanoseq_noise"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We briefly discussed this off line and I still believe that it would be nice to directly add a new parameter to set this list of filters to apply to the background.

I am removing cohort_n_rich_threshold from this list since it is not applied cohort-wise but has some sample specificity. This filter would benefit from the sample specific filtering with the deths (see overall comment)

I am removing repetitive_variant since we do not necessarily want to filter those out (this would benefit from having a parameter to indicate this list of filters to be applied to the background)

Suggested change
FILTERS = ["cohort_n_rich", "cohort_n_rich_uni", "other_sample_SNP", "repetitive_variant", "cohort_n_rich_threshold", "gnomAD_SNP", "nanoseq_snp", "nanoseq_noise"]
FILTERS = ["cohort_n_rich", "cohort_n_rich_uni", "other_sample_SNP", "gnomAD_SNP", "nanoseq_snp", "nanoseq_noise"]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we agree on using the filter_criteria and filter_criteria_somatic to define which of these flags should be used for masking the panels

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, just to clarify and make this right. This are the default values of the filters:

filter_criteria = ["notcontains NM20", "notcontains n_rich", "notcontains cohort_n_rich_threshold", "notcontains cohort_n_rich", "notcontains no_pileup_support", "notcontains low_mappability", "notcontains not_covered", "notcontains nanoseq_noise"]
filter_criteria_somatic = ["notcontains gnomAD_SNP", "notcontains nanoseq_snp"]

I was thinking that we could use this same parameters and let the user define which positions to filter depths/panels as well (which will be the ones in the mutations), but using the sample-specific parameters in the depth=0 filtering and the others in this approach. What do you think, is this what you were thinking?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it properly yes, basically all the filters that are applied should be removed from the background, either at cohort-level or sample-level.


return maf_df

def extract_flagged_regions_bed(maf_df: pd.DataFrame, maf_file_name: str):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we go for the implementation where we apply sample specific filters, we would need to report this information per sample as well

@m-huertasp
Copy link
Collaborator Author

Thank you for the review Ferriol! I will check the comments tomorrow, but I wanted to answer the general comment so everything is here in case we need it.

You're right, removing a position from the panel makes the regions in the bed to be split.
Regarding the option of just forcing the depths to be 0 for a position, I thought that this option was discarded due to the problems when computing omega and other values using the depth as denominator, I assume this is still a problem, right?. I was following Fede's suggestion in the issue, although it is true that this cannot be applied per sample.
Maybe we could try the option of forcing the depths to be 0 in each specific sample depth when the position is masked. Then add a step in which all the depths = 0 are removed. Could this be a good idea? Maybe we could combine this with the approach we have here so we avoid removing the same positions in all the samples each time?

About the ymls, I think it could be useful for new people trying to understand how the pipeline works. But I am okay with moving them to the clean branch.

@FerriolCalvet
Copy link
Collaborator

FerriolCalvet commented Jan 20, 2026

Some time ago I made some fixes to prevent the 0s problem Fede experienced above (4551e17), so hopefully this should not be a problem now. I don't remember if there were other methods for which this was also a problem but I think it is worth testing it again.
I would try to avoid having to filter positions from panels again after this first masking steps, but we can see how things work.

And yes, agree with the ymls, I would leave the ones you created here and then try to add updated ones for all steps in the clean branch.

let's discuss soon!

@m-huertasp
Copy link
Collaborator Author

We have decided to try applying the option of forcing the depth to 0 for those positions to be removed in specific samples. If it is too complex, we could merge this branch and do it later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

discuss Any potential implementation that requires discussion new-feature New functionality being added.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mask sites where mutations have been discarded

3 participants