Session 8: Counting reads and calculating profiles

In the last parts of the course, you have used the analysis module of the IMP3 pipeline. In this session, you see how IMP3 calculates functional profiles and the depth of coverage of genes.

Data

In this session, you can use a .gff file and .bam file prepared on crunchomics (or you can use your own output from session 06):

cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES
ls

The commands you will run are taken from the IMP3 workflow, so the results should be similar to the results you got in session 06.

Running featureCounts for all genes

As the first part of the session, we will use feature counts to extract the number of reads per called gene.

Change to your own directory to write your outputs. You may also want to start an interactive job on crunchomics (replacing omics-cn003 with whatever node is empty).

cd ~/personal
sinfo -o "%n %e %m %a %c %C"
srun -c 4 -n 1 -N 1 --mem=4G -w omics-cn003 --pty bash

The idea of featureCounts is very simple: you supply the GFF file, which contains the positions of the genes (or open reading frames), and the .bam file, which contains for all the reads the positions where they map. Feature counts just counts how many reads map on each of the open reading frames. In the code below, we use the IMP3 conda environment which contains featureCounts from the subread software package (IMP_annotation.yaml).

conda activate /zfs/omics/projects/metatools/TOOLS/IMP3/conda/8fab667c26067cc13be382368594bb18

First, check out the help menu of featureCounts to make sure you understand the code:

featureCounts -h

Then run featureCounts (no need to change anything in the command, unless you deviated from the tutorial so far):

featureCounts -p -O -t CDS -g ID -o mg.CDS_counts.tsv -s 0 -a /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/annotation_CDS_RNA_hmms.gff -T 4 /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/mg.reads.sorted.bam

Your output (in mg.CDS_counts.tsv) contains a line with the full call to featureCounts and a header line and then one line for all features in the GFF file. As you can see in the header line, the first field in every line gives the actual feature and the last column contains the number of reads per feature.

Running featureCounts for KO profiles

featureCounts can also get numbers of reads per level of the same annotations, e.g. KOs. To run the calculation for the open reading frames that were annotated with a KO, we first need to filter the GFF file, because featureCounts is a bit unflexible with its input.

grep "kegg_ko=" /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/annotation_CDS_RNA_hmms.gff >> annotation_KEGG.gff

Task

Now, you can run the featureCounts command using kegg_ko as the -g argument, and with the filtered GFF file as input. Name the output mg.KEGG_counts.tsv. The rest of the command stays as above.

Have a look at the output to see what you’ve generated.

Comment

Files with counts generated in this way can be used for differential abundance analysis, e.g. by DESeq2. Here is an example script to summarize featureCounts outputs from several assembled samples.

Calculating average depth of coverage values

When working with functional profiles, sometimes you will be interested in how many copies you had of a gene and how abundant they are. For example, you may have identified a function of interest. With your GFF file, it is quite easy to identify the genes that have this function (see session 07). Now, you could look at the number of reads that map to each of those genes in the featureCounts output from the start of this lesson. But this only tells you how many reads overlapped. If the genes have different lengths, this is not the best measure of their abundance. In this case, you would want to have the average coverage.

Here is how you can calculate it.

We use the same conda environment as above.

First, we use a small script to get the length of all contigs. We need this to make a nicely formatted file with all the genes.

export PERL5LIB=$CONDA_PREFIX/lib/site_perl/5.26.2
perl /zfs/omics/projects/metatools/TOOLS/IMP3/workflow/scripts/fastaNamesSizes.pl  /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/mg.assembly.merged.fa > mg.assembly.length.txt
conda deactivate

For the next step, you need to use a different conda environment, which includes the bedtools suite. bedtools is good juggling coordinates (positions) on the assemblies.

conda activate /zfs/omics/projects/metatools/TOOLS/IMP3/conda/76e02d85877600c41ac84cb7bc27a189

This step will make a histogram file which contains for every gene the number of positions with a given coverage.

sortBed -g mg.assembly.length.txt -i /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/annotation_CDS_RNA_hmms.gff |\
      awk '$4 < $5' |\
      coverageBed -hist -sorted -g mg.assembly.length.txt -b /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/PROFILES/mg.reads.sorted.bam -a "stdin" |\
      grep -v "^all" > helper_bedfile

paste <(cat helper_bedfile | cut -f9 | cut -f1 -d ";" | sed -e "s/ID=//g") \
      <(cut -f10,11,12,13 helper_bedfile) > mg.gene_depth.hist

A small script combines the histograms to an average value per gene:

perl /zfs/omics/projects/metatools/TOOLS/IMP3/workflow/scripts/calcAvgCoverage.pl mg.gene_depth.hist > mg.gene_depth.avg
conda deactivate

In addition to looking at the coverage for specific genes of interest, you can use this file for example for the kind of overvire visualization the IMP3 report gave you.

Perl

If you want to, have a look at the two scripts used today fastaNamesSizes.pl and calcAvgCoverage.pl in /zfs/omics/projects/metatools/TOOLS/IMP3/workflow/scripts/ to see some simple Perl code.

End of today’s lesson. Next time, we will leave the functional questions behind and turn to genome reconstructions (aka MAGs).