Session 7: Take a look at annotations
In the last part of the course, you used the analysis module of the IMP3 pipeline. In this session, you can explore the annotations a bit further.
Data
In this session, you can use a .gff file prepared on crunchomics:
cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/ANNOTATION
ls
Quick look at the annotations
Here are a couple of bash tricks you can use to get a quick overview of the annotations.
Take a look at the file:
head annotation_CDS_RNA_hmms.gff
tail -n 5 annotation_CDS_RNA_hmms.gff
less annotation_CDS_RNA_hmms.gff
(You exit less with q)
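How many annotated sequences are there? (Note that wc -l counts every line in the file, so any comment lines are included in the number.)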
wc -l annotation_CDS_RNA_hmms.gff
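If the .gff file contains comment lines starting with #, the line count overestimates the number of annotated features. The following is a minimal sketch on a made-up toy file (the contig names, tool names, and IDs are invented for illustration and are not taken from the course data) that compares the raw line count with the number of non-comment lines:

```shell
# Toy GFF with made-up contents: two comment lines and three feature lines.
toy_gff=$(mktemp)
printf '##gff-version 3\n' > "$toy_gff"
printf '# hypothetical comment line\n' >> "$toy_gff"
printf 'contig_1\tProdigal\tCDS\t1\t300\t.\t+\t0\tID=gene_1\n' >> "$toy_gff"
printf 'contig_1\tProdigal\tCDS\t350\t900\t.\t-\t0\tID=gene_2\n' >> "$toy_gff"
printf 'contig_2\tbarrnap\trRNA\t10\t1500\t.\t+\t.\tID=gene_3\n' >> "$toy_gff"

# wc -l counts every line, comments included.
total_lines=$(wc -l < "$toy_gff")
# grep -v inverts the match, -c counts: only lines that do not start with '#'.
feature_lines=$(grep -vc '^#' "$toy_gff")

echo "all lines: $total_lines, feature lines: $feature_lines"
```

On the real file, the same grep -vc '^#' call should give the number of annotated features, provided that comment lines are the only non-feature lines in it.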
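How many are CDS? (The columns of a .gff file are separated by tabs, so the pattern has tabs around CDS, not spaces.)
grep -c $'\tCDS\t' annotation_CDS_RNA_hmms.gff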
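A pattern search looks at the whole line, so it can also hit the word CDS inside the attribute column. One alternative, sketched here on a made-up two-line toy file (all names and attributes are invented for illustration), is to let awk compare column 3 directly:

```shell
# Toy GFF with made-up contents: one real CDS feature and one rRNA feature
# whose attribute column happens to mention the word CDS.
toy_gff=$(mktemp)
printf 'contig_1\tProdigal\tCDS\t1\t300\t.\t+\t0\tID=gene_1;product=CDS like protein\n' > "$toy_gff"
printf 'contig_2\tbarrnap\trRNA\t10\t1500\t.\t+\t.\tID=gene_2;note=overlaps CDS gene_1\n' >> "$toy_gff"

# Loose search: counts every line that contains CDS anywhere.
loose_count=$(grep -c 'CDS' "$toy_gff")

# Exact search: split on tabs and count lines whose third column equals CDS.
# n+0 makes awk print 0 instead of an empty string when nothing matches.
cds_count=$(awk -F'\t' '$3 == "CDS" {n++} END {print n+0}' "$toy_gff")

echo "loose: $loose_count, exact: $cds_count"
```

The awk version only tests the feature-type column, so text in the attributes cannot inflate the count.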
Which feature types are there in column 3?
cut -f 3 annotation_CDS_RNA_hmms.gff | sort | uniq -c
You could also have looked only at the feature types other than CDS in column 3:
grep -v $'\tCDS\t' annotation_CDS_RNA_hmms.gff | cut -f 3 | sort | uniq -c
How many sequences are annotated with a KEGG KO?
grep "kegg_ko" -c annotation_CDS_RNA_hmms.gff
Which KO is most common? (The output is sorted by increasing count, so the most common KO is on the last line.)
grep "kegg_ko" annotation_CDS_RNA_hmms.gff | sed 's#.*kegg_ko=\([^;]*\).*#\1#' | tr ',' '\n' | sort | uniq -c | sort -n
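To see what each step of this pipeline does, it can help to run it on a few lines you can check by eye. The sketch below uses a made-up toy file (the gene IDs, KO numbers, and the product attribute are invented for illustration; only the kegg_ko= key is taken from the command above) and reverses the sort so that the most common KO comes first:

```shell
# Toy GFF with made-up contents: two genes with a KO, one gene without.
toy_gff=$(mktemp)
printf 'contig_1\tProdigal\tCDS\t1\t300\t.\t+\t0\tID=gene_1;kegg_ko=K00001,K00002;product=toy protein\n' > "$toy_gff"
printf 'contig_1\tProdigal\tCDS\t350\t900\t.\t-\t0\tID=gene_2;kegg_ko=K00001\n' >> "$toy_gff"
printf 'contig_2\tProdigal\tCDS\t10\t500\t.\t+\t0\tID=gene_3;product=toy protein\n' >> "$toy_gff"

# Step by step:
#   grep     keeps only the lines that carry a kegg_ko attribute
#   sed      replaces each line by the value after kegg_ko= (up to the next ';')
#   tr       puts every KO of a multi-KO gene on its own line
#   sort | uniq -c   counts how often each KO occurs
#   sort -nr | head -n 1   keeps the KO with the highest count
#   awk      drops the count and prints only the KO identifier
top_ko=$(grep "kegg_ko" "$toy_gff" \
  | sed 's#.*kegg_ko=\([^;]*\).*#\1#' \
  | tr ',' '\n' \
  | sort | uniq -c | sort -nr | head -n 1 \
  | awk '{print $2}')

echo "most common KO: $top_ko"
```

Storing the result in a variable such as top_ko (a name chosen here for illustration) makes it easy to reuse the identifier in a later grep.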
Do it yourself (easy)
Try to adapt the code above to the Pfam domains.
Do it yourself (not so easy)
Try to find the COG annotations of the genes that are annotated with the most common KO.
Do it yourself (interesting)
Try to find the Pfam annotations of the genes that are annotated with the most common KO. Check the Pfam DB and KEGG DB for what these annotations mean. Do they correspond to a similar function?
End of today’s lesson. Next time, we will look at quantifying the functional annotations.