Reformat OLGA, make GLIPH2 optional #73
Merge ps-ph

- make gliph2 optional for compare workflow, due to lack of scaling for large datasets
- added Annotate subworkflow for processes run on CDR3 sequences only, independent of sample
- reformatted OLGA generation probability calculation to run once on all CDR3 sequences in a cohort in a parallelized manner
- reformatted usage of OLGA in both Compare (tcrsharing) and Sample to use the OLGA lookup table
- starting another round of reformatting to reduce code redundancy
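The "run once on all CDR3 sequences, in parallel" idea above boils down to deduplicating the cohort's CDR3 list and splitting it into fixed-size chunks. A minimal sketch in Python (the chunk size and sequences here are illustrative, not the pipeline defaults):

```python
# Split a deduplicated, sorted CDR3 list into fixed-size chunks,
# each of which would be handed to a separate OLGA task.
olga_chunk_length = 2  # illustrative value; the real default lives in nextflow.config
cdr3s = sorted(set(["CASSA", "CASSB", "CASSA", "CASSC"]))
chunks = [cdr3s[i:i + olga_chunk_length]
          for i in range(0, len(cdr3s), olga_chunk_length)]
# chunks -> [["CASSA", "CASSB"], ["CASSC"]]
```

Deduplicating first means each unique CDR3's pgen is computed exactly once, regardless of how many samples it appears in.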
Unit Test Results: 10 tests, 10 ✅, 3m 29s ⏱️. Results for commit abefc74.
Pull request overview
This PR refactors how OLGA-based generation probabilities are computed and integrated, centralizing annotation in a new subworkflow and wiring those outputs into both sample-level and comparison-level analyses, while also making GLIPH2 execution optional and exposing OLGA chunking as a parameter. It also simplifies sample statistics outputs, updates concatenation formats to TSV, and refreshes configuration/metadata for Cirro and licensing.
Changes:
- Introduces an `ANNOTATE` subworkflow (and underlying `modules/local/annotate` + `modules/local/olga`) to generate a sorted, deduplicated CDR3 list, run OLGA in chunks, and emit global `concat_cdr3_sorted` and `cdr3_pgen` channels for reuse in SAMPLE and COMPARE.
- Refactors COMPARE/TCR sharing logic to consume the pre-annotated CDR3+pgen data, removes the old `COMPARE_CONCATENATE` module, and gates GLIPH2 execution behind a new `params.use_gliph2` flag.
- Normalizes sample statistics outputs (flat CSVs without subdirectories), updates compare concatenation to `.tsv` with stricter column selection/filters, and exposes `olga_chunk_length` and GLIPH2 usage as configurable parameters (including Cirro integration).
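The reuse pattern the changes describe is a classic lookup-table join: one cohort-wide pgen table is merged into each sample's counts. A hedged sketch in pandas (column names like `CDR3b` and `pgen` follow the review's schema; the values are invented):

```python
import pandas as pd

# One cohort-wide pgen lookup table, computed once by the ANNOTATE subworkflow.
pgen_lookup = pd.DataFrame({"CDR3b": ["CASS1", "CASS2"],
                            "pgen": [1e-9, 2e-8]})

# A single sample's clone counts; the same CDR3 may appear multiple times.
sample_df = pd.DataFrame({"CDR3b": ["CASS1", "CASS1", "CASS2"],
                          "count": [5, 3, 1]})

# Left-merge so every sample row keeps its count and gains its pgen.
merged = sample_df.merge(pgen_lookup, on="CDR3b", how="left")
```

A left merge keeps sample rows even when a CDR3 is missing from the lookup table (its pgen becomes NaN), which is usually the safer default for per-sample summaries.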
Reviewed changes
Copilot reviewed 14 out of 16 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| workflows/tcrtoolkit.nf | Wires in new ANNOTATE subworkflow and connects its outputs into SAMPLE and COMPARE; contains a syntax issue in the SAMPLE call that prevents passing cdr3_pgen correctly. |
| subworkflows/local/sample.nf | Extends SAMPLE to take the global cdr3_pgen channel and routes it into the new OLGA_SAMPLE_MERGE/OLGA_HISTOGRAM pipeline for sample-level OLGA summaries. |
| subworkflows/local/pseudobulk_phenotype.nf | Switches phenotype-level concatenation to reuse ANNOTATE_CONCATENATE instead of the removed COMPARE_CONCATENATE process. |
| subworkflows/local/compare.nf | Updates COMPARE inputs to consume concat_cdr3_sorted and cdr3_pgen, skips GLIPH2 unless params.use_gliph2 is true, and introduces a TCRSHARING_OLGA_MERGE step before TCR-sharing calculations. |
| subworkflows/local/annotate.nf | New ANNOTATE subworkflow that concatenates, sorts, and deduplicates CDR3s, runs OLGA_CALCULATE in chunks, concatenates pgen outputs, and emits concat_cdr3_sorted and cdr3_pgen. |
| nextflow.config | Adds params.olga_chunk_length for OLGA chunk size and params.use_gliph2 to control GLIPH2 execution, aligning pipeline behavior with new options. |
| modules/local/sample/sample_calc.nf | Adjusts SAMPLE_CALC outputs to write flat CSV filenames without subdirectories and disables publishDir; stub outputs are now inconsistent with the declared output names. |
| modules/local/olga/main.nf | Replaces the old per-sample OLGA process with a set of processes for chunked OLGA pgen calculation, global concatenation, per-sample merging, and histogram generation; contains critical issues in OLGA_CALCULATE output naming and in handling NaNs when computing global OLGA bounds, plus leftover debug prints. |
| modules/local/compare/tcrsharing.nf | Simplifies TCRSHARING_CALC to consume pre-joined CDR3+pgen data, attaches pgen/log10_pgen to per-CDR3 sharing metrics, and updates downstream histogram/scatterplot scripts to read the new schema and filename. |
| modules/local/compare/compare_concatenate.nf | Removes the legacy COMPARE_CONCATENATE process in favor of the new annotate-based concatenation flow. |
| modules/local/annotate/main.nf | New module defining ANNOTATE_CONCATENATE, sorting, and deduplication processes for CDR3/TRBV; note that the all_sample_files input to ANNOTATE_CONCATENATE is declared but unused in the script. |
| bin/sample_calc.py | Aligns with the new SAMPLE_CALC outputs by writing flat sample_stats_*.csv and *_family_*.csv files and simplifying argument parsing for the counts table. |
| bin/compare_concatenate.py | Tightens concatenation to a focused set of columns, filters to productive/fully-called CDR3s, switches to TSV output, and updates concatenation details; the docstring still mentions .txt rather than the new .tsv. |
| LICENSE | Updates the copyright line to Copyright (c) 2026 Karchin Lab, Break Through Cancer. |
| .cirro/process-input.json | Wires olga_chunk_length from Cirro dataset parameters into Nextflow params to control OLGA chunk size. |
| .cirro/process-form.json | Exposes olga_chunk_length as an integer parameter in the Cirro form schema for user configuration. |
Comments suppressed due to low confidence (4)
modules/local/olga/main.nf:98
- The debug print statements `print(pgen.head())` and `print(df.head())` left in this OLGA_SAMPLE_MERGE script will run for every sample and can generate a large amount of noisy log output in production runs. Consider removing these or guarding them behind a dedicated debug flag or parameter so routine runs stay clean.
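One way to guard such prints, sketched in Python (the `TCRTOOLKIT_DEBUG` variable name is an assumption for illustration, not an existing pipeline option):

```python
import os

# Hypothetical debug gate: only emit diagnostic output when the
# (assumed) TCRTOOLKIT_DEBUG environment variable is set to "1".
DEBUG = os.environ.get("TCRTOOLKIT_DEBUG", "0") == "1"

def debug(*args):
    """Print only when the debug flag is enabled."""
    if DEBUG:
        print(*args)

debug("pgen head would print here")  # silent in routine runs
```

The same effect could be achieved with a Nextflow `params.debug` flag interpolated into the script block.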
modules/local/olga/main.nf:14
- In OLGA_CALCULATE, the output file pattern `pgen_${cdr3_chunk}` and the matching `-o "pgen_${cdr3_chunk}"` in the script will expand `${cdr3_chunk}` to the full path of the input file, producing an output path that includes directory separators (for example, `pgen_/work/.../unique_cdr3.txt.0001.txt`). Unless those directories already exist, olga-compute_pgen will fail to write the file and Nextflow will not find the declared output. Use only the basename of the chunk path in the output filename (for example, via `${cdr3_chunk.baseName}`) so the output is created in the current work directory with a valid name.
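The failure mode is easy to see with a Python stand-in for the interpolation (the path below is hypothetical):

```python
from pathlib import Path

# Stand-in for Nextflow's ${cdr3_chunk}, which expands to the full staged path.
chunk = "/work/ab/cd/unique_cdr3.txt.0001.txt"

bad = f"pgen_{chunk}"              # embeds directory separators into the name
good = f"pgen_{Path(chunk).name}"  # stays flat in the current work directory
```

Note that Python's `Path.name` keeps the extension, whereas Nextflow's `baseName` strips the final extension; either works as long as the output declaration and the `-o` argument agree.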
modules/local/olga/main.nf:111
- When computing OLGA x-axis bounds, `log_probs = merged_df['log10_pgen']` operates on a column that can contain NA values for rows where pgen is 0, because those rows are no longer filtered out before this step. Using NumPy min/max on a series containing NaNs will yield NaN bounds, which will propagate into downstream histogram binning and plotting. Filter out rows with pgen == 0 (or, equivalently, drop NaN log10_pgen values) before computing `left_bound` and `right_bound` so the global OLGA range is well-defined.
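A minimal sketch of the suggested fix, with illustrative values (the column names follow the review; the filtering step is the point):

```python
import numpy as np
import pandas as pd

# pgen == 0 rows yield NaN after the log transform.
merged_df = pd.DataFrame({"pgen": [0.0, 1e-9, 1e-6]})
merged_df["log10_pgen"] = np.log10(merged_df["pgen"].replace(0, np.nan))

# Drop the NaN rows BEFORE taking global bounds; a NaN in a raw
# NumPy min/max would otherwise poison both bounds.
log_probs = merged_df["log10_pgen"].dropna()
left_bound = log_probs.min()
right_bound = log_probs.max()
```

Downstream binning can then floor/ceil these bounds safely, since both are guaranteed finite once the NaN rows are gone.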
modules/local/olga/main.nf:105 - The values
count_tableandsample_meta.sampleare interpolated directly into the Python here-doc (pd.read_csv("${count_table}", ...)andto_csv("${sample_meta.sample}_tcr_pgen.tsv", ...)), so a filename or sample name containing double quotes or Python syntax can break out of the string literal and inject arbitrary Python code. An attacker who controls input file paths or sample names (e.g. via a samplesheet) could craft a name like"; __import__('os').system('id') #to execute arbitrary commands under the workflow user. To mitigate this, avoid embedding these values into the Python source and instead pass them as arguments or environment variables and construct the paths inside Python using trusted strings or proper escaping.
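A hedged sketch of the environment-variable approach (the `COUNT_TABLE`/`SAMPLE_ID` names are assumptions for illustration; in the pipeline they would be exported by the Nextflow script block before the here-doc runs):

```python
import os

# Simulate the environment the Nextflow script block would set up.
# A hostile-looking sample name stays inert because it is never
# spliced into Python source text.
os.environ.setdefault("COUNT_TABLE", "counts.tsv")
os.environ.setdefault("SAMPLE_ID", 'odd"; name')

count_table = os.environ["COUNT_TABLE"]
sample = os.environ["SAMPLE_ID"]
out_path = f"{sample}_tcr_pgen.tsv"  # built inside Python, no injection surface
```

The same safety holds for `sys.argv`: the interpreter receives the value as data, never as code.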
```diff
  # Drop raw list
- final_df = grouped[['CDR3b', 'total_samples', 'samples_present']]
+ final_df = grouped[['CDR3b', 'pgen', 'log10_pgen', 'total_samples', 'samples_present']]
  final_df = final_df.sort_values(by="total_samples", ascending=False)

  # Export final list
  final_df.to_csv("cdr3_sharing.tsv", sep="\t", index=False)
  EOF
```
Copilot AI · Jan 27, 2026
In TCRSHARING_CALC you now load a table that is expected to already contain pgen and log10_pgen columns (via the OLGA merge step), but the output filename was changed from cdr3_sharing_pgen.tsv to cdr3_sharing.tsv here. Please make sure any documentation, downstream consumers, or plotting scripts do not still expect the old _pgen suffix; otherwise they will fail to find the file even though the schema still includes pgen fields.
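If a transitional period is needed, downstream scripts could accept both names. A hypothetical defensive lookup (the fallback logic is an illustration, not part of this PR; the file write below only simulates the workflow output):

```python
from pathlib import Path

# Simulate the new TCRSHARING_CALC output being present.
Path("cdr3_sharing.tsv").write_text("CDR3b\tpgen\tlog10_pgen\n")

# Prefer the new filename, fall back to the legacy _pgen-suffixed one.
candidates = ["cdr3_sharing.tsv", "cdr3_sharing_pgen.tsv"]
sharing_file = next(p for p in candidates if Path(p).exists())
```

Long-term, a single canonical filename shared between producer and consumers (ideally defined once, not duplicated in each script) avoids the problem entirely.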
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>