diff --git a/README.md b/README.md index a05b39f..2e8fc6b 100644 --- a/README.md +++ b/README.md @@ -2,6 +2,7 @@ * [Pairwise Alignment and Dotplot](Pairwise_Alignment/pairwise_alignment.jl) ([![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/BioJulia/BioTutorials/HEAD?labpath=Pairwise_Alignment%2Fpairwise_alignment.ipynb)) * [RNA-Seq Coverage](RNASeq_Coverage/rnaseq_coverage.jl) (outdated) +* [RNASeqTools TSS](RNASeqTools_TSS/rnaseqtools_tss.ipynb) The programs under this repository are distributed in the public domain unless otherwise specified. diff --git a/RNASeqTools_TSS/Project.toml b/RNASeqTools_TSS/Project.toml new file mode 100644 index 0000000..6e5f0fe --- /dev/null +++ b/RNASeqTools_TSS/Project.toml @@ -0,0 +1,7 @@ +name = "RNASeqTools_TSS" +uuid = "22ec6cf8-324d-4466-8826-e36db40d94b7" +authors = ["Malte Siemers "] +version = "0.1.0" + +[deps] +RNASeqTools = "c929925e-1c85-4562-9573-d68085edd19a" diff --git a/RNASeqTools_TSS/rnaseqtools_tss.ipynb b/RNASeqTools_TSS/rnaseqtools_tss.ipynb new file mode 100644 index 0000000..22e5ee4 --- /dev/null +++ b/RNASeqTools_TSS/rnaseqtools_tss.ipynb @@ -0,0 +1,484 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# RNASeq data analysis using RNASeqTools\n", + "\n", + "Requirements:\n", + "\n", + "* Julia 1.8.5\n", + "* RNASeqTools 0.3.4\n", + "* sra-tools 3.0.3\n", + "* samtools 1.6\n", + "* bwa-mem2 2.2.1\n", + "\n", + "In this tutorial, we will first download a few files (genome, annotation, sequencing reads) with sra-tools.\n", + "We will then align the reads to the genome using bwa-mem2. We will then use the various Packages from BioJulia\n", + "through the interface of RNASeqTools to load the alignments file, compute the coverage from it and do some\n", + "simple signal detection in the coverage." + ], + "metadata": {} + }, + { + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Updating git-repo `https://github.com/maltesie/RNASeqTools#v0.3.4`\n", + " Resolving package versions...\n", + " No Changes to `~/BioTutorials/Project.toml`\n", + " No Changes to `~/BioTutorials/Manifest.toml`\n" + ] + } + ], + "cell_type": "code", + "source": [ + "using Pkg\n", + "Pkg.add(url=\"https://github.com/maltesie/RNASeqTools#v0.3.4\")\n", + "\n", + "using RNASeqTools" + ], + "metadata": {}, + "execution_count": 1 + }, + { + "cell_type": "markdown", + "source": [ + "## Downloading the data\n", + "\n", + "Our data will be from V. Cholerae, we need a genome and its annotation: First go to [NCBI](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_013085075.1/),\n", + "after clicking Download, select .fasta and .gff and click Download again. Then extract the content of the archive\n", + "directory /ncbi_dataset/data/GCF_013085075.1/ into the folder containing this notebook. The folder should then\n", + "look like:\n" + ], + "metadata": {} + }, + { + "outputs": [], + "cell_type": "code", + "source": [ + "# RNASeqTools_TSS\n", + "#   ├── GCF_013085075.1_ASM1308507v1_genomic.fna\n", + "#   ├── genomic.gff\n", + "#   ├── Manifest.toml\n", + "#   ├── Project.toml\n", + "#   ├── rnaseqtools_tss.ipynb\n", + "#   └── rnaseqtools_tss.jl" + ], + "metadata": {}, + "execution_count": 2 + }, + { + "cell_type": "markdown", + "source": [ + "NCBI stores reads in its internal format SRA, we can use prefetch and fastq-dump from the sra-tools to\n", + "download reads and convert them to the fastq format. The reads in sample SRR1602510 come from a TEX-treated\n", + "sample and can be used to identify primary transcription start sites." + ], + "metadata": {} + }, + { + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "2023-04-16T14:16:23 prefetch.3.0.2: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.\n", + "2023-04-16T14:16:23 prefetch.3.0.2: 1) Downloading 'SRR1602510'...\n", + "2023-04-16T14:16:23 prefetch.3.0.2: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.\n", + "2023-04-16T14:16:23 prefetch.3.0.2: Downloading via HTTPS...\n", + "2023-04-16T14:16:58 prefetch.3.0.2: HTTPS download succeed\n", + "2023-04-16T14:16:59 prefetch.3.0.2: 'SRR1602510' is valid\n", + "2023-04-16T14:16:59 prefetch.3.0.2: 1) 'SRR1602510' was downloaded successfully\n", + "Read 6686697 spots for ~/BioTutorials/RNASeqTools_TSS/SRR1602510.sra\n", + "Written 6686697 spots for ~/BioTutorials/RNASeqTools_TSS/SRR1602510.sra\n" + ] + } + ], + "cell_type": "code", + "source": [ + "download_prefetch([\"SRR1602510\"], @__DIR__)" + ], + "metadata": {}, + "execution_count": 3 + }, + { + "cell_type": "markdown", + "source": [ + "## Aligning reads with bwa-mem\n", + "\n", + "RNASeqTools provides a wrapper to bwa-mem2. We will use the genome file from the NCBI dataset downloaded earlier.\n", + "align_mem also calls samtools to sort and index the .bam file and creates a .log file containing a useful report\n", + "from samtools stats" + ], + "metadata": {} + }, + { + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[bwa_index] Pack FASTA... 0.02 sec\n", + "* Entering FMI_search\n", + "init ticks = 481045140\n", + "ref seq len = 8170862\n", + "binary seq ticks = 273930990\n", + "build suffix-array ticks = 3818121870\n", + "ref_seq_len = 8170862\n", + "count = 0, 2146471, 4085431, 6024391, 8170862\n", + "BWT[5860163] = 4\n", + "CP_SHIFT = 6, CP_MASK = 63\n", + "sizeof CP_OCC = 64\n", + "pos: 1021358, ref_seq_len__: 1021357\n", + "max_occ_ind = 127669\n", + "build fm-index ticks = 1029061770\n", + "Total time taken: 1.9117\n", + "-----------------------------\n", + "Executing in SSE4.2 mode!!\n", + "-----------------------------\n", + "* SA compression enabled with xfactor: 8\n", + "* Ref file: ~/BioTutorials/RNASeqTools_TSS/GCF_013085075.1_ASM1308507v1_genomic.fna\n", + "* Entering FMI_search\n", + "* Index file found. Loading index from ~/BioTutorials/RNASeqTools_TSS/GCF_013085075.1_ASM1308507v1_genomic.fna.bwt.2bit.64\n", + "* Reference seq len for bi-index = 8170863\n", + "* sentinel-index: 5860163\n", + "* Count:\n", + "0,\t1\n", + "1,\t2146472\n", + "2,\t4085432\n", + "3,\t6024392\n", + "4,\t8170863\n", + "\n", + "* Reading other elements of the index from files ~/BioTutorials/RNASeqTools_TSS/GCF_013085075.1_ASM1308507v1_genomic.fna\n", + "* Index prefix: ~/BioTutorials/RNASeqTools_TSS/GCF_013085075.1_ASM1308507v1_genomic.fna\n", + "* Read 0 ALT contigs\n", + "* Done reading Index!!\n", + "* Reading reference genome..\n", + "* Binary seq file = ~/BioTutorials/RNASeqTools_TSS/GCF_013085075.1_ASM1308507v1_genomic.fna.0123\n", + "* Reference genome size: 8170862 bp\n", + "* Done reading reference genome !!\n", + "\n", + "------------------------------------------\n", + "1. Memory pre-allocation for Chaining: 836.0454 MB\n", + "2. Memory pre-allocation for BSW: 1437.7021 MB\n", + "3. Memory pre-allocation for BWT: 463.8851 MB\n", + "------------------------------------------\n", + "* Threads used (compute): 6\n", + "* No. of pipeline threads: 2\n", + "\n", + "[0000] read_chunk: 60000000, work_chunk_size: 60000108, nseq: 612246\n", + "\t[0000][ M::kt_pipeline] read 612246 sequences (60000108 bp)...\n", + "[0000] Reallocating initial memory allocations!!\n", + "[0000] Calling mem_process_seqs.., task: 0\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] read_chunk: 60000000, work_chunk_size: 60000108, nseq: 612246\n", + "\t[0000][ M::kt_pipeline] read 612246 sequences (60000108 bp)...\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 612246 reads in 37.874 CPU sec, 6.111 real sec\n", + "[0000] Calling mem_process_seqs.., task: 1\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] read_chunk: 60000000, work_chunk_size: 60000108, nseq: 612246\n", + "\t[0000][ M::kt_pipeline] read 612246 sequences (60000108 bp)...\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 612246 reads in 26.144 CPU sec, 4.164 real sec\n", + "[0000] Calling mem_process_seqs.., task: 2\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] read_chunk: 60000000, work_chunk_size: 60000022, nseq: 605572\n", + "\t[0000][ M::kt_pipeline] read 605572 sequences (60000022 bp)...\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 612246 reads in 24.630 CPU sec, 3.970 real sec\n", + "[0000] Calling mem_process_seqs.., task: 3\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] read_chunk: 60000000, work_chunk_size: 60000000, nseq: 600000\n", + "\t[0000][ M::kt_pipeline] read 600000 sequences (60000000 bp)...\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 605572 reads in 26.113 CPU sec, 4.178 real sec\n", + "[0000] Calling mem_process_seqs.., task: 4\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] read_chunk: 60000000, work_chunk_size: 60000000, nseq: 600000\n", + "\t[0000][ M::kt_pipeline] read 600000 sequences (60000000 bp)...\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 600000 reads in 26.549 CPU sec, 4.292 real sec\n", + "[0000] Calling mem_process_seqs.., task: 5\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 600000 reads in 28.320 CPU sec, 4.784 real sec\n", + "[0000] read_chunk: 60000000, work_chunk_size: 60000000, nseq: 600000\n", + "\t[0000][ M::kt_pipeline] read 600000 sequences (60000000 bp)...\n", + "[0000] Calling mem_process_seqs.., task: 6\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] read_chunk: 60000000, work_chunk_size: 60000000, nseq: 600000\n", + "\t[0000][ M::kt_pipeline] read 600000 sequences (60000000 bp)...\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 600000 reads in 24.202 CPU sec, 3.937 real sec\n", + "[0000] Calling mem_process_seqs.., task: 7\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] read_chunk: 60000000, work_chunk_size: 60000000, nseq: 600000\n", + "\t[0000][ M::kt_pipeline] read 600000 sequences (60000000 bp)...\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 600000 reads in 25.809 CPU sec, 4.162 real sec\n", + "[0000] Calling mem_process_seqs.., task: 8\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] read_chunk: 60000000, work_chunk_size: 60000000, nseq: 600000\n", + "\t[0000][ M::kt_pipeline] read 600000 sequences (60000000 bp)...\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 600000 reads in 26.271 CPU sec, 4.180 real sec\n", + "[0000] Calling mem_process_seqs.., task: 9\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] read_chunk: 60000000, work_chunk_size: 60000000, nseq: 600000\n", + "\t[0000][ M::kt_pipeline] read 600000 sequences (60000000 bp)...\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 600000 reads in 26.599 CPU sec, 4.232 real sec\n", + "[0000] Calling mem_process_seqs.., task: 10\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 600000 reads in 28.122 CPU sec, 4.761 real sec\n", + "[0000] read_chunk: 60000000, work_chunk_size: 4438700, nseq: 44387\n", + "\t[0000][ M::kt_pipeline] read 44387 sequences (4438700 bp)...\n", + "[0000] Calling mem_process_seqs.., task: 11\n", + "[0000] 1. Calling kt_for - worker_bwt\n", + "[0000] 2. Calling kt_for - worker_aln\n", + "[0000] read_chunk: 60000000, work_chunk_size: 0, nseq: 0\n", + "[0000] 3. Calling kt_for - worker_sam\n", + "\t[0000][ M::mem_process_seqs] Processed 44387 reads in 2.363 CPU sec, 0.422 real sec\n", + "[0000] read_chunk: 60000000, work_chunk_size: 0, nseq: 0\n", + "[0000] Computation ends..\n", + "No. of OMP threads: 6\n", + "Processor is running @2995.297530 MHz\n", + "Runtime profile:\n", + "\n", + "\tTime taken for main_mem function: 54.84 sec\n", + "\n", + "\tIO times (sec) :\n", + "\tReading IO time (reads) avg: 13.04, (13.04, 13.04)\n", + "\tWriting IO time (SAM) avg: 16.40, (16.40, 16.40)\n", + "\tReading IO time (Reference Genome) avg: 0.00, (0.00, 0.00)\n", + "\tIndex read time avg: 0.01, (0.01, 0.01)\n", + "\n", + "\tOverall time (sec) (Excluding Index reading time):\n", + "\tPROCESS() (Total compute time + (read + SAM) IO time) : 54.82\n", + "\tMEM_PROCESS_SEQ() (Total compute time (Kernel + SAM)), avg: 49.18, (49.18, 49.18)\n", + "\n", + "\t SAM Processing time (sec):\n", + "\t--WORKER_SAM avg: 3.25, (3.25, 3.25)\n", + "\n", + "\tKernels' compute time (sec):\n", + "\tTotal kernel (smem+sal+bsw) time avg: 45.92, (45.92, 45.92)\n", + "\t\tSMEM compute avg: 13.50, (13.61, 13.41)\n", + "\t\tSAL compute avg: 4.46, (4.51, 4.43)\n", + "\t\t\t\tMEM_SA avg: 2.68, (2.69, 2.67)\n", + "\n", + "\t\tBSW time, avg: 26.45, (26.46, 26.43)\n", + "\n", + "Important parameter settings: \n", + "\tBATCH_SIZE: 512\n", + "\tMAX_SEQ_LEN_REF: 256\n", + "\tMAX_SEQ_LEN_QER: 128\n", + "\tMAX_SEQ_LEN8: 128\n", + "\tSEEDS_PER_READ: 500\n", + "\tSIMD_WIDTH8 X: 16\n", + "\tSIMD_WIDTH16 X: 8\n", + "\tAVG_SEEDS_PER_READ: 64\n", + "[bam_sort_core] merging from 2 files and 1 in-memory blocks...\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": "Base.ProcessChain(Base.Process[Process(`\u001b[4msamtools\u001b[24m \u001b[4mindex\u001b[24m \u001b[4m~/BioTutorials/RNASeqTools_TSS/SRR1602510_1.bam\u001b[24m`, ProcessExited(0)), Process(`\u001b[4msamtools\u001b[24m \u001b[4mstats\u001b[24m \u001b[4m~/BioTutorials/RNASeqTools_TSS/SRR1602510_1.bam\u001b[24m`, ProcessExited(0))], Base.DevNull(), Base.DevNull(), Base.DevNull())" + }, + "metadata": {}, + "execution_count": 4 + } + ], + "cell_type": "code", + "source": [ + "align_mem(joinpath(@__DIR__, \"SRR1602510_1.fastq.gz\"), joinpath(@__DIR__, \"GCF_013085075.1_ASM1308507v1_genomic.fna\"))" + ], + "metadata": {}, + "execution_count": 4 + }, + { + "cell_type": "markdown", + "source": [ + "## Annotation of alignments and exploration of the data\n", + "\n", + "To annotate the alignments, we need to read the alignments file and the annotation file, then we can\n", + "explore the annotated alignments." + ], + "metadata": {} + }, + { + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The first 5 alignments with annotation:\n", + "\n", + "Single Alignment with 1 part(s):\n", + " [1, 59] on read1 - [1929524, 1929582] on NZ_CP047295.1 (-) with edit distance 0 - IGR:cds-WP_001048829.1:cds-WP_001194704.1 (69% in annotation)\n", + "Single Alignment with 1 part(s):\n", + " [1, 41] on read1 - [2589427, 2589467] on NZ_CP047295.1 (+) with edit distance 0 - IGR:cds-WP_001129565.1:cds-WP_001068472.1 (100% in annotation)\n", + "Single Alignment with 1 part(s):\n", + " [1, 96] on read1 - [2807000, 2807095] on NZ_CP047295.1 (-) with edit distance 0 - rRNA:rna-GTF72_RS12795 (99% in annotation)\n", + "Single Alignment with 1 part(s):\n", + " [1, 100] on read1 - [1031772, 1031871] on NZ_CP047296.1 (+) with edit distance 0 - IGR:cds-WP_001882860.1:cds-WP_001132600.1 (59% in annotation)\n", + "Single Alignment with 1 part(s):\n", + " [1, 100] on read1 - [486446, 486545] on NZ_CP047295.1 (+) with edit distance 0 - 5UTR:cds-WP_000567318.1 (100% in annotation)\n", + "\n", + "From 5899560 mapped fragments, 2002875 map to intergenic regions, 2074559 to ribosomal RNA and 254223 to tRNAs.\n" + ] + } + ], + "cell_type": "code", + "source": [ + "#read features with types [\"rRNA\", \"tRNA\", \"CDS\"] from the .gff file nad use [\"ID\", \"Name\", \"locus_tag\"] parameters to name the feature\n", + "ann = Features(joinpath(@__DIR__, \"genomic.gff\"), [\"rRNA\", \"tRNA\", \"CDS\"]; name_keys=[\"ID\", \"Name\", \"locus_tag\"])\n", + "\n", + "#add annotations for the regions upstream (\"5UTR\") and downstream (\"3UTR\") of each \"CDS\"\n", + "addutrs!(ann; cds_type=\"CDS\", five_type=\"5UTR\", five_length=200, three_type=\"3UTR\", three_length=100)\n", + "\n", + "#fill all regions without annotation with \"IGR\"\n", + "addigrs!(ann; igr_type=\"IGR\")\n", + "\n", + "alignments = AlignedReads(joinpath(@__DIR__, \"SRR1602510_1.bam\"))\n", + "\n", + "#annotate each alignment with the feature with the largest overlap.\n", + "annotate!(alignments, ann)\n", + "\n", + "println(\"The first 5 alignments with annotation:\\n\")\n", + "for (i, aln) in enumerate(alignments)\n", + " show(aln)\n", + " i>=5 && break\n", + "end\n", + "\n", + "#count for each annotation type the number of alignments with a matching annotation\n", + "counter = Dict(t=>0 for t in types(ann))\n", + "for aln in alignments\n", + " for fragment in aln\n", + " counter[type(fragment)] += 1\n", + " end\n", + "end\n", + "println(\"\\nFrom $(length(alignments)) mapped fragments, $(counter[\"IGR\"]) map to intergenic regions, $(counter[\"rRNA\"]) to ribosomal RNA and $(counter[\"tRNA\"]) to tRNAs.\")" + ], + "metadata": {}, + "execution_count": 5 + }, + { + "cell_type": "markdown", + "source": [ + "## Computation of coverage and potential TSS from it" + ], + "metadata": {} + }, + { + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": "RNASeqTools.Coverage on 2 reference sequences" + }, + "metadata": {}, + "execution_count": 6 + } + ], + "cell_type": "code", + "source": [ + "#coverage is computed with a default normalization of 1/Million reads\n", + "forward_file, reverse_file = compute_coverage(joinpath(@__DIR__, \"SRR1602510_1.bam\"))\n", + "coverage = Coverage(forward_file, reverse_file)\n", + "\n", + "#if we dont want to put the coverage into files first, we can compute it directly from the alignments:\n", + "coverage = Coverage(alignments)" + ], + "metadata": {}, + "execution_count": 6 + }, + { + "cell_type": "markdown", + "source": [ + "Now we can call peaks in the difference along the coverage as positions of putative TSS. The maxdiffpositions\n", + "function computes maxima in the difference along the coverage with a minimum step parameter and a\n", + "minimum ratio towards the background within a certain region around the maximum." + ], + "metadata": {} + }, + { + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Found 5138 putative TSS\n", + "The first 5 TSS according to the set parameters:\n", + "\n", + "Feature: [159, 159] on NZ_CP047295.1 (-) - annotation: DIFF:reverse_1\n", + "parameters: Name=reverse_1;diff=10.509258317569445;from=NA;height=10.509258317569445\n", + "Feature: [1424, 1424] on NZ_CP047295.1 (+) - annotation: DIFF:forward_1\n", + "parameters: Name=forward_1;diff=9.153224986270164;from=NA;height=11.187274983219089\n", + "Feature: [2440, 2440] on NZ_CP047295.1 (+) - annotation: DIFF:forward_2\n", + "parameters: Name=forward_2;diff=43.56257076798948;from=NA;height=45.08810826570117\n", + "Feature: [2891, 2891] on NZ_CP047295.1 (+) - annotation: DIFF:forward_3\n", + "parameters: Name=forward_3;diff=83.90456237414317;from=NA;height=87.8031582016286\n", + "Feature: [3242, 3242] on NZ_CP047295.1 (-) - annotation: DIFF:reverse_2\n", + "parameters: Name=reverse_2;diff=12.54330831451837;from=NA;height=12.712812480930781\n" + ] + } + ], + "cell_type": "code", + "source": [ + "tss_features = maxdiffpositions(coverage; min_diff=5, min_ratio=1.3, compute_within=3)\n", + "println(\"Found $(length(tss_features)) putative TSS\\nThe first 5 TSS according to the set parameters:\\n\")\n", + "for (i, tss) in enumerate(tss_features)\n", + " show(tss)\n", + " i>=5 && break\n", + "end\n", + "\n", + "#maxdiffpositions returns a Features struct that can be saved to disc into a .gff file:\n", + "write(joinpath(@__DIR__, \"tss.gff\"), tss_features)" + ], + "metadata": {}, + "execution_count": 7 + }, + { + "cell_type": "markdown", + "source": [ + "---\n", + "\n", + "*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*" + ], + "metadata": {} + } + ], + "nbformat_minor": 3, + "metadata": { + "language_info": { + "file_extension": ".jl", + "mimetype": "application/julia", + "name": "julia", + "version": "1.8.5" + }, + "kernelspec": { + "name": "julia-1.8", + "display_name": "Julia 1.8.5", + "language": "julia" + } + }, + "nbformat": 4 +} diff --git a/RNASeqTools_TSS/rnaseqtools_tss.jl b/RNASeqTools_TSS/rnaseqtools_tss.jl new file mode 100644 index 0000000..cc96d5a --- /dev/null +++ b/RNASeqTools_TSS/rnaseqtools_tss.jl @@ -0,0 +1,106 @@ +# # RNASeq data analysis using RNASeqTools +# +# Requirements: +# +# * Julia 1.8.5 +# * RNASeqTools 0.3.4 +# * sra-tools 3.0.3 +# * samtools 1.6 +# * bwa-mem2 2.2.1 +# +# In this tutorial, we will first download a few files (genome, annotation, sequencing reads) with sra-tools. +# We will then align the reads to the genome using bwa-mem2. We will then use the various Packages from BioJulia +# through the interface of RNASeqTools to load the alignments file, compute the coverage from it and do some +# simple signal detection in the coverage. + +using Pkg +Pkg.add(url="https://github.com/maltesie/RNASeqTools#v0.3.4") + +using RNASeqTools + +# ## Downloading the data +# +# Our data will be from V. Cholerae, we need a genome and its annotation: First go to [NCBI](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_013085075.1/), +# after clicking Download, select .fasta and .gff and click Download again. Then extract the content of the archive +# directory /ncbi_dataset/data/GCF_013085075.1/ into the folder containing this notebook. The folder should then +# look like: +# +# ``` +# RNASeqTools_TSS +#   ├── GCF_013085075.1_ASM1308507v1_genomic.fna +#   ├── genomic.gff +#   ├── Manifest.toml +#   ├── Project.toml +#   ├── rnaseqtools_tss.ipynb +#   └── rnaseqtools_tss.jl +# ``` +# NCBI stores reads in its internal format SRA, we can use prefetch and fastq-dump from the sra-tools to +# download reads and convert them to the fastq format. The reads in sample SRR1602510 come from a TEX-treated +# sample and can be used to identify primary transcription start sites. + +download_prefetch(["SRR1602510"], @__DIR__) + +# ## Aligning reads with bwa-mem +# +# RNASeqTools provides a wrapper to bwa-mem2. We will use the genome file from the NCBI dataset downloaded earlier. +# align_mem also calls samtools to sort and index the .bam file and creates a .log file containing a useful report +# from samtools stats + +align_mem(joinpath(@__DIR__, "SRR1602510_1.fastq.gz"), joinpath(@__DIR__, "GCF_013085075.1_ASM1308507v1_genomic.fna")) + +# ## Annotation of alignments and exploration of the data +# +# To annotate the alignments, we need to read the alignments file and the annotation file, then we can +# explore the annotated alignments. + +## read features with types ["rRNA", "tRNA", "CDS"] from the .gff file nad use ["ID", "Name", "locus_tag"] parameters to name the feature +ann = Features(joinpath(@__DIR__, "genomic.gff"), ["rRNA", "tRNA", "CDS"]; name_keys=["ID", "Name", "locus_tag"]) + +## add annotations for the regions upstream ("5UTR") and downstream ("3UTR") of each "CDS" +addutrs!(ann; cds_type="CDS", five_type="5UTR", five_length=200, three_type="3UTR", three_length=100) + +#fill all regions without annotation with "IGR" +addigrs!(ann; igr_type="IGR") + +alignments = AlignedReads(joinpath(@__DIR__, "SRR1602510_1.bam")) + +#annotate each alignment with the feature with the largest overlap. +annotate!(alignments, ann) + +println("The first 5 alignments with annotation:\n") +for (i, aln) in enumerate(alignments) + show(aln) + i>=5 && break +end + +#count for each annotation type the number of alignments with a matching annotation +counter = Dict(t=>0 for t in types(ann)) +for aln in alignments + for fragment in aln + counter[type(fragment)] += 1 + end +end +println("\nFrom $(length(alignments)) mapped fragments, $(counter["IGR"]) map to intergenic regions, $(counter["rRNA"]) to ribosomal RNA and $(counter["tRNA"]) to tRNAs.") + +# ## Computation of coverage and potential TSS from it + +#coverage is computed with a default normalization of 1/Million reads +forward_file, reverse_file = compute_coverage(joinpath(@__DIR__, "SRR1602510_1.bam")) +coverage = Coverage(forward_file, reverse_file) + +#if we dont want to put the coverage into files first, we can compute it directly from the alignments: +coverage = Coverage(alignments) + +# Now we can call peaks in the difference along the coverage as positions of putative TSS. The maxdiffpositions +# function computes maxima in the difference along the coverage with a minimum step parameter and a +# minimum ratio towards the background within a certain region around the maximum. + +tss_features = maxdiffpositions(coverage; min_diff=5, min_ratio=1.3, compute_within=3) +println("Found $(length(tss_features)) putative TSS\nThe first 5 TSS according to the set parameters:\n") +for (i, tss) in enumerate(tss_features) + show(tss) + i>=5 && break +end + +#maxdiffpositions returns a Features struct that can be saved to disc into a .gff file: +write(joinpath(@__DIR__, "tss.gff"), tss_features)