Reformat OLGA, make GLIPH2 optional #73
Merge ps-ph

- make gliph2 optional for compare workflow, due to lack of scaling for large datasets
- added Annotate subworkflow for processes run on CDR3 sequences only, independent of sample
- reformatted OLGA generation probability calculation to run once on all CDR3 sequences in a cohort in a parallelized manner
- reformatted usage of OLGA in both Compare (tcrsharing) and Sample to use the OLGA lookup table
- starting another round of reformatting to reduce code redundancy
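The "run once on all CDR3 sequences, in parallel" idea above boils down to deduplicating the cohort's CDR3 list and splitting it into fixed-size chunks. A minimal sketch in Python (the chunk size and sequences here are illustrative, not the pipeline defaults):

```python
# Split a deduplicated, sorted CDR3 list into fixed-size chunks,
# each of which would be handed to a separate OLGA task.
olga_chunk_length = 2  # illustrative value; the real default lives in nextflow.config
cdr3s = sorted(set(["CASSA", "CASSB", "CASSA", "CASSC"]))
chunks = [cdr3s[i:i + olga_chunk_length]
          for i in range(0, len(cdr3s), olga_chunk_length)]
# chunks -> [["CASSA", "CASSB"], ["CASSC"]]
```

Deduplicating first means each unique CDR3's pgen is computed exactly once, regardless of how many samples it appears in.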
Unit Test Results: 10 tests, 10 ✅, 3m 29s ⏱️. Results for commit abefc74.
Pull request overview
This PR refactors how OLGA-based generation probabilities are computed and integrated, centralizing annotation in a new subworkflow and wiring those outputs into both sample-level and comparison-level analyses, while also making GLIPH2 execution optional and exposing OLGA chunking as a parameter. It also simplifies sample statistics outputs, updates concatenation formats to TSV, and refreshes configuration/metadata for Cirro and licensing.
Changes:
- Introduces an `ANNOTATE` subworkflow (and underlying `modules/local/annotate` + `modules/local/olga`) to generate a sorted, deduplicated CDR3 list, run OLGA in chunks, and emit global `concat_cdr3_sorted` and `cdr3_pgen` channels for reuse in SAMPLE and COMPARE.
- Refactors COMPARE/TCR sharing logic to consume the pre-annotated CDR3+pgen data, removes the old `COMPARE_CONCATENATE` module, and gates GLIPH2 execution behind a new `params.use_gliph2` flag.
- Normalizes sample statistics outputs (flat CSVs without subdirectories), updates compare concatenation to `.tsv` with stricter column selection/filters, and exposes `olga_chunk_length` and GLIPH2 usage as configurable parameters (including Cirro integration).
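The reuse pattern the changes describe is a classic lookup-table join: one cohort-wide pgen table is merged into each sample's counts. A hedged sketch in pandas (column names like `CDR3b` and `pgen` follow the review's schema; the values are invented):

```python
import pandas as pd

# One cohort-wide pgen lookup table, computed once by the ANNOTATE subworkflow.
pgen_lookup = pd.DataFrame({"CDR3b": ["CASS1", "CASS2"],
                            "pgen": [1e-9, 2e-8]})

# A single sample's clone counts; the same CDR3 may appear multiple times.
sample_df = pd.DataFrame({"CDR3b": ["CASS1", "CASS1", "CASS2"],
                          "count": [5, 3, 1]})

# Left-merge so every sample row keeps its count and gains its pgen.
merged = sample_df.merge(pgen_lookup, on="CDR3b", how="left")
```

A left merge keeps sample rows even when a CDR3 is missing from the lookup table (its pgen becomes NaN), which is usually the safer default for per-sample summaries.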
Reviewed changes
Copilot reviewed 14 out of 16 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| workflows/tcrtoolkit.nf | Wires in new ANNOTATE subworkflow and connects its outputs into SAMPLE and COMPARE; contains a syntax issue in the SAMPLE call that prevents passing cdr3_pgen correctly. |
| subworkflows/local/sample.nf | Extends SAMPLE to take the global cdr3_pgen channel and routes it into the new OLGA_SAMPLE_MERGE/OLGA_HISTOGRAM pipeline for sample-level OLGA summaries. |
| subworkflows/local/pseudobulk_phenotype.nf | Switches phenotype-level concatenation to reuse ANNOTATE_CONCATENATE instead of the removed COMPARE_CONCATENATE process. |
| subworkflows/local/compare.nf | Updates COMPARE inputs to consume concat_cdr3_sorted and cdr3_pgen, skips GLIPH2 unless params.use_gliph2 is true, and introduces a TCRSHARING_OLGA_MERGE step before TCR-sharing calculations. |
| subworkflows/local/annotate.nf | New ANNOTATE subworkflow that concatenates, sorts, and deduplicates CDR3s, runs OLGA_CALCULATE in chunks, concatenates pgen outputs, and emits concat_cdr3_sorted and cdr3_pgen. |
| nextflow.config | Adds params.olga_chunk_length for OLGA chunk size and params.use_gliph2 to control GLIPH2 execution, aligning pipeline behavior with new options. |
| modules/local/sample/sample_calc.nf | Adjusts SAMPLE_CALC outputs to write flat CSV filenames without subdirectories and disables publishDir; stub outputs are now inconsistent with the declared output names. |
| modules/local/olga/main.nf | Replaces the old per-sample OLGA process with a set of processes for chunked OLGA pgen calculation, global concatenation, per-sample merging, and histogram generation; contains critical issues in OLGA_CALCULATE output naming and in handling NaNs when computing global OLGA bounds, plus leftover debug prints. |
| modules/local/compare/tcrsharing.nf | Simplifies TCRSHARING_CALC to consume pre-joined CDR3+pgen data, attaches pgen/log10_pgen to per-CDR3 sharing metrics, and updates downstream histogram/scatterplot scripts to read the new schema and filename. |
| modules/local/compare/compare_concatenate.nf | Removes the legacy COMPARE_CONCATENATE process in favor of the new annotate-based concatenation flow. |
| modules/local/annotate/main.nf | New module defining ANNOTATE_CONCATENATE, sorting, and deduplication processes for CDR3/TRBV; note that the all_sample_files input to ANNOTATE_CONCATENATE is declared but unused in the script. |
| bin/sample_calc.py | Aligns with the new SAMPLE_CALC outputs by writing flat sample_stats_*.csv and *_family_*.csv files and simplifying argument parsing for the counts table. |
| bin/compare_concatenate.py | Tightens concatenation to a focused set of columns, filters to productive/fully-called CDR3s, switches to TSV output, and updates concatenation details; the docstring still mentions .txt rather than the new .tsv. |
| LICENSE | Updates the copyright line to Copyright (c) 2026 Karchin Lab, Break Through Cancer. |
| .cirro/process-input.json | Wires olga_chunk_length from Cirro dataset parameters into Nextflow params to control OLGA chunk size. |
| .cirro/process-form.json | Exposes olga_chunk_length as an integer parameter in the Cirro form schema for user configuration. |
Comments suppressed due to low confidence (4)
modules/local/olga/main.nf:98
- The debug print statements `print(pgen.head())` and `print(df.head())` left in this OLGA_SAMPLE_MERGE script will run for every sample and can generate a large amount of noisy log output in production runs. Consider removing these or guarding them behind a dedicated debug flag or parameter so routine runs stay clean.
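One way to guard such prints, sketched in Python (the `TCRTOOLKIT_DEBUG` variable name is an assumption for illustration, not an existing pipeline option):

```python
import os

# Hypothetical debug gate: only emit diagnostic output when the
# (assumed) TCRTOOLKIT_DEBUG environment variable is set to "1".
DEBUG = os.environ.get("TCRTOOLKIT_DEBUG", "0") == "1"

def debug(*args):
    """Print only when the debug flag is enabled."""
    if DEBUG:
        print(*args)

debug("pgen head would print here")  # silent in routine runs
```

The same effect could be achieved with a Nextflow `params.debug` flag interpolated into the script block.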
modules/local/olga/main.nf:14
- In OLGA_CALCULATE, the output file pattern `pgen_${cdr3_chunk}` and the matching `-o "pgen_${cdr3_chunk}"` in the script will expand `${cdr3_chunk}` to the full path of the input file, producing an output path that includes directory separators (for example, `pgen_/work/.../unique_cdr3.txt.0001.txt`). Unless those directories already exist, olga-compute_pgen will fail to write the file and Nextflow will not find the declared output. Use only the basename of the chunk path in the output filename (for example, via `${cdr3_chunk.baseName}`) so the output is created in the current work directory with a valid name.
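The failure mode is easy to see with a Python stand-in for the interpolation (the path below is hypothetical):

```python
from pathlib import Path

# Stand-in for Nextflow's ${cdr3_chunk}, which expands to the full staged path.
chunk = "/work/ab/cd/unique_cdr3.txt.0001.txt"

bad = f"pgen_{chunk}"              # embeds directory separators into the name
good = f"pgen_{Path(chunk).name}"  # stays flat in the current work directory
```

Note that Python's `Path.name` keeps the extension, whereas Nextflow's `baseName` strips the final extension; either works as long as the output declaration and the `-o` argument agree.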
modules/local/olga/main.nf:111
- When computing OLGA x-axis bounds, `log_probs = merged_df['log10_pgen']` operates on a column that can contain NA values for rows where pgen is 0, because those rows are no longer filtered out before this step. Using NumPy min/max on a series containing NaNs will yield NaN bounds, which will propagate into downstream histogram binning and plotting. Filter out rows with pgen == 0 (or, equivalently, drop NaN log10_pgen values) before computing `left_bound` and `right_bound` so the global OLGA range is well-defined.
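A minimal sketch of the suggested fix, with illustrative values (the column names follow the review; the filtering step is the point):

```python
import numpy as np
import pandas as pd

# pgen == 0 rows yield NaN after the log transform.
merged_df = pd.DataFrame({"pgen": [0.0, 1e-9, 1e-6]})
merged_df["log10_pgen"] = np.log10(merged_df["pgen"].replace(0, np.nan))

# Drop the NaN rows BEFORE taking global bounds; a NaN in a raw
# NumPy min/max would otherwise poison both bounds.
log_probs = merged_df["log10_pgen"].dropna()
left_bound = log_probs.min()
right_bound = log_probs.max()
```

Downstream binning can then floor/ceil these bounds safely, since both are guaranteed finite once the NaN rows are gone.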
modules/local/olga/main.nf:105 - The values
count_tableandsample_meta.sampleare interpolated directly into the Python here-doc (pd.read_csv("${count_table}", ...)andto_csv("${sample_meta.sample}_tcr_pgen.tsv", ...)), so a filename or sample name containing double quotes or Python syntax can break out of the string literal and inject arbitrary Python code. An attacker who controls input file paths or sample names (e.g. via a samplesheet) could craft a name like"; __import__('os').system('id') #to execute arbitrary commands under the workflow user. To mitigate this, avoid embedding these values into the Python source and instead pass them as arguments or environment variables and construct the paths inside Python using trusted strings or proper escaping.
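A hedged sketch of the environment-variable approach (the `COUNT_TABLE`/`SAMPLE_ID` names are assumptions for illustration; in the pipeline they would be exported by the Nextflow script block before the here-doc runs):

```python
import os

# Simulate the environment the Nextflow script block would set up.
# A hostile-looking sample name stays inert because it is never
# spliced into Python source text.
os.environ.setdefault("COUNT_TABLE", "counts.tsv")
os.environ.setdefault("SAMPLE_ID", 'odd"; name')

count_table = os.environ["COUNT_TABLE"]
sample = os.environ["SAMPLE_ID"]
out_path = f"{sample}_tcr_pgen.tsv"  # built inside Python, no injection surface
```

The same safety holds for `sys.argv`: the interpreter receives the value as data, never as code.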
```diff
  # Drop raw list
- final_df = grouped[['CDR3b', 'total_samples', 'samples_present']]
+ final_df = grouped[['CDR3b', 'pgen', 'log10_pgen', 'total_samples', 'samples_present']]
  final_df = final_df.sort_values(by="total_samples", ascending=False)

  # Export final list
  final_df.to_csv("cdr3_sharing.tsv", sep="\t", index=False)
  EOF
```
Copilot AI · Jan 27, 2026
In TCRSHARING_CALC you now load a table that is expected to already contain pgen and log10_pgen columns (via the OLGA merge step), but the output filename was changed from cdr3_sharing_pgen.tsv to cdr3_sharing.tsv here. Please make sure any documentation, downstream consumers, or plotting scripts do not still expect the old _pgen suffix; otherwise they will fail to find the file even though the schema still includes pgen fields.
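If a transitional period is needed, downstream scripts could accept both names. A hypothetical defensive lookup (the fallback logic is an illustration, not part of this PR; the file write below only simulates the workflow output):

```python
from pathlib import Path

# Simulate the new TCRSHARING_CALC output being present.
Path("cdr3_sharing.tsv").write_text("CDR3b\tpgen\tlog10_pgen\n")

# Prefer the new filename, fall back to the legacy _pgen-suffixed one.
candidates = ["cdr3_sharing.tsv", "cdr3_sharing_pgen.tsv"]
sharing_file = next(p for p in candidates if Path(p).exists())
```

Long-term, a single canonical filename shared between producer and consumers (ideally defined once, not duplicated in each script) avoids the problem entirely.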
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>