Session 7: Take a look at annotations

In the last part of the course, you used the analysis module of the IMP3 pipeline. In this session, you will explore the annotations a bit further.

Data

In this session, you can use a .gff file prepared on crunchomics:

cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/ANNOTATION
ls

Quick look at the annotations

Here are a couple of bash tricks you can use to get a quick overview of the annotations.

Take a look at the file:

head annotation_CDS_RNA_hmms.gff
tail -n 5 annotation_CDS_RNA_hmms.gff
less annotation_CDS_RNA_hmms.gff

(You exit less with q)

How many annotated sequences are there?

wc -l annotation_CDS_RNA_hmms.gff
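Note that wc -l counts every line, including any comment or header lines starting with # that a .gff file may contain. A minimal sketch of counting only the feature lines, using a tiny made-up file (the file toy.gff and everything in it are hypothetical, not the course data):

```shell
# Build a tiny, made-up GFF file in a temporary directory (not the course data).
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gff"
printf '##gff-version 3\n' > "$toy"
printf 'contig_1\tProdigal\tCDS\t1\t300\t.\t+\t0\tID=gene_1;kegg_ko=K00001\n' >> "$toy"
printf 'contig_1\tbarrnap\trRNA\t400\t900\t.\t-\t.\tID=gene_2\n' >> "$toy"

# Count all lines, then only the lines that do not start with '#'.
total_lines=$(wc -l < "$toy")
feature_lines=$(grep -v -c '^#' "$toy")
echo "all lines: $total_lines, feature lines: $feature_lines"
```

If both numbers are the same for annotation_CDS_RNA_hmms.gff, the file has no comment lines and wc -l gives the number of annotated sequences directly.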

How many are CDS? (The columns of a .gff file are separated by tabs, written as \t in the pattern below.)

grep -c $'\tCDS\t' annotation_CDS_RNA_hmms.gff

What feature types are there in column 3?

cut -f 3 annotation_CDS_RNA_hmms.gff | sort | uniq -c

You could also have checked which feature types other than CDS occur in column 3:

grep -v $'\tCDS\t' annotation_CDS_RNA_hmms.gff | cut -f 3 | sort | uniq -c
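As an alternative to matching tab characters with grep, awk can compare column 3 directly. The sketch below runs on a tiny made-up file (toy.gff and its contents are hypothetical, not the course data) and skips comment lines when tallying the feature types:

```shell
# Build a tiny, made-up GFF file in a temporary directory (not the course data).
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gff"
printf '##gff-version 3\n' > "$toy"
printf 'contig_1\tProdigal\tCDS\t1\t300\t.\t+\t0\tID=gene_1\n' >> "$toy"
printf 'contig_1\tProdigal\tCDS\t400\t900\t.\t-\t0\tID=gene_2\n' >> "$toy"
printf 'contig_2\tbarrnap\trRNA\t10\t500\t.\t+\t.\tID=gene_3\n' >> "$toy"

# Count the lines whose third tab-separated column is exactly "CDS".
cds_count=$(awk -F'\t' '$3 == "CDS" { n++ } END { print n + 0 }' "$toy")
echo "CDS features: $cds_count"

# Tally all feature types in column 3, ignoring lines that start with '#'.
type_counts=$(awk -F'\t' '!/^#/ { n[$3]++ } END { for (t in n) print n[t], t }' "$toy" | sort -k2)
echo "$type_counts"
```

Comparing the column value exactly also avoids counting lines where the text CDS happens to appear in another column.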

How many sequences are annotated with a KEGG KO?

grep "kegg_ko" -c annotation_CDS_RNA_hmms.gff

Which KO is most common?

grep "kegg_ko" annotation_CDS_RNA_hmms.gff | sed 's#.*kegg_ko=\([^;]*\).*#\1#' | tr ',' '\n' | sort | uniq -c | sort -n
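Because sort -n sorts in ascending order, the most common KO is on the last line of the output. A sketch of reporting only that top entry, again on a tiny made-up file (toy.gff, the KO identifiers, and the note attribute are hypothetical, not the course data):

```shell
# Build a tiny, made-up GFF file in a temporary directory (not the course data).
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gff"
printf 'contig_1\tProdigal\tCDS\t1\t300\t.\t+\t0\tID=gene_1;kegg_ko=K00001,K00002;note=example\n' > "$toy"
printf 'contig_1\tProdigal\tCDS\t400\t900\t.\t-\t0\tID=gene_2;kegg_ko=K00001\n' >> "$toy"
printf 'contig_2\tProdigal\tCDS\t10\t500\t.\t+\t0\tID=gene_3;note=example\n' >> "$toy"

# Same pipeline as above, then keep the last (highest-count) line
# and print only its second field, the KO identifier.
top_ko=$(grep "kegg_ko" "$toy" \
  | sed 's#.*kegg_ko=\([^;]*\).*#\1#' \
  | tr ',' '\n' \
  | sort | uniq -c | sort -n \
  | tail -n 1 | awk '{ print $2 }')
echo "most common KO: $top_ko"
```

Storing the result in a variable such as top_ko makes it easy to reuse in a later grep. If several KOs share the highest count, only one of them is reported.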

Do it yourself (easy)

Try to adapt the code above to the Pfam domains.

Do it yourself (not so easy)

Try to find the COG annotations of the genes that are annotated with the most common KO.

Do it yourself (interesting)

Try to find the Pfam annotations of the genes that are annotated with the most common KO. Check the Pfam DB and KEGG DB for what these annotations mean. Do they correspond to a similar function?

End of today’s lesson. Next time, we will look at quantifying the functional annotations.