.. _calcFunc: ================================================== Session 8: Counting reads and calculating profiles ================================================== In the last parts of the course, you have used the analysis module of the `IMP3 `_ pipeline. In this session, you see how IMP3 calculates functional profiles and the depth of coverage of genes. Data ---- In this session, you can use a .gff file and .bam file prepared on crunchomics (or you can use your own output from :ref:`session 06`): .. code-block:: console cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES ls The commands you will run are taken from the IMP3 workflow, so the results should be similar to the results you got in :ref:`session 06`. Running featureCounts for all genes ----------------------------------- As the first part of the session, we will use `feature counts `_ to extract the number of reads per called gene. Change to your own directory to write your outputs. You may also want to start an interactive job on crunchomics (replacing ``omics-cn003`` with whatever node is empty). .. code-block:: console cd ~/personal sinfo -o "%n %e %m %a %c %C" srun -c 4 -n 1 -N 1 --mem=4G -w omics-cn003 --pty bash The idea of featureCounts is very simple: you supply the GFF file, which contains the positions of the genes (or open reading frames), and the `.bam file `_, which contains for all the reads the positions where they map. Feature counts just counts how many reads map on each of the open reading frames. In the code below, we use the IMP3 conda environment which contains featureCounts from the subread software package (``IMP_annotation.yaml``). .. code-block:: console conda activate /zfs/omics/projects/metatools/TOOLS/IMP3/conda/8fab667c26067cc13be382368594bb18 First, check out the help menu of ``featureCounts`` to make sure you understand the code: .. code-block:: console featureCounts -h Then run ``featureCounts`` (no need to change anything in the command, unless you deviated from the tutorial so far): .. code-block:: console featureCounts -p -O -t CDS -g ID -o mg.CDS_counts.tsv -s 0 -a /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/annotation_CDS_RNA_hmms.gff -T 4 /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/mg.reads.sorted.bam Your output (in ``mg.CDS_counts.tsv``) contains a line with the full call to featureCounts and a header line and then one line for all features in the GFF file. As you can see in the header line, the first field in every line gives the actual feature and the last column contains the number of reads per feature. Running featureCounts for KO profiles ------------------------------------- featureCounts can also get numbers of reads per level of the same annotations, e.g. KOs. To run the calculation for the open reading frames that were annotated with a KO, we first need to filter the GFF file, because featureCounts is a bit unflexible with its input. .. code-block:: console grep "kegg_ko=" /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/annotation_CDS_RNA_hmms.gff >> annotation_KEGG.gff .. admonition:: Task Now, you can run the ``featureCounts`` command using ``kegg_ko`` as the ``-g`` argument, and with the filtered GFF file as input. Name the output ``mg.KEGG_counts.tsv``. The rest of the command stays as above. Have a look at the output to see what you've generated. .. admonition:: Comment Files with counts generated in this way can be used for differential abundance analysis, e.g. by `DESeq2 `_. Here is an `example script `_ to summarize featureCounts outputs from several assembled samples. Calculating average depth of coverage values -------------------------------------------- When working with functional profiles, sometimes you will be interested in how many copies you had of a gene and how abundant they are. For example, you may have identified a function of interest. With your GFF file, it is quite easy to identify the genes that have this function (see :ref:`session 07 `). Now, you could look at the number of reads that map to each of those genes in the featureCounts output from the start of this lesson. But this only tells you how many reads overlapped. If the genes have different lengths, this is not the best measure of their abundance. In this case, you would want to have the average coverage. Here is how you can calculate it. We use the same conda environment as above. First, we use a small script to get the length of all contigs. We need this to make a nicely formatted file with all the genes. .. code-block:: console export PERL5LIB=$CONDA_PREFIX/lib/site_perl/5.26.2 perl /zfs/omics/projects/metatools/TOOLS/IMP3/workflow/scripts/fastaNamesSizes.pl /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/mg.assembly.merged.fa > mg.assembly.length.txt conda deactivate For the next step, you need to use a different conda environment, which includes the `bedtools `_ suite. bedtools is good juggling coordinates (positions) on the assemblies. .. code-block:: console conda activate /zfs/omics/projects/metatools/TOOLS/IMP3/conda/76e02d85877600c41ac84cb7bc27a189 This step will make a histogram file which contains for every gene the number of positions with a given coverage. .. code-block:: console sortBed -g mg.assembly.length.txt -i /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/annotation_CDS_RNA_hmms.gff |\ awk '$4 < $5' |\ coverageBed -hist -sorted -g mg.assembly.length.txt -b /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/mg.reads.sorted.bam -a "stdin" |\ grep -v "^all" > helper_bedfile paste <(cat helper_bedfile | cut -f9 | cut -f1 -d ";" | sed -e "s/ID=//g") \ <(cut -f10,11,12,13 helper_bedfile) > mg.gene_depth.hist A small script combines the histograms to an average value per gene: .. code-block:: console perl /zfs/omics/projects/metatools/TOOLS/IMP3/workflow/scripts/calcAvgCoverage.pl mg.gene_depth.hist > mg.gene_depth.avg conda deactivate In addition to looking at the coverage for specific genes of interest, you can use this file for example for the kind of overvire visualization the IMP3 report gave you. .. admonition:: Perl If you want to, have a look at the two scripts used today ``fastaNamesSizes.pl`` and ``calcAvgCoverage.pl`` in ``/zfs/omics/projects/metatools/TOOLS/IMP3/workflow/scripts/`` to see some simple `Perl `_ code. End of today's lesson. Next time, we will leave the functional questions behind and turn to genome reconstructions (aka MAGs).