Session 2: Run preprocessing on toy data
As in most of the course, we will use the IMP3 pipeline to practice working with metagenomics data. IMP3 is implemented in Snakemake. We will also look at some commands on their own.
Remember, IMP3 has two inputs:
data
a configuration file
In this session, you can run IMP3 on crunchomics to preprocess some data.
External users
Warning
Data
For this session, there are a few different (small) files prepared on crunchomics:
cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS
Take a look at these files. Based on the lecture, can you spot the NovaSeq file?
less COSMIC_vag.test.r1.fq
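Besides paging through a file with less, it can help to summarize the quality lines. The following is a minimal sketch, not part of IMP3: the helper name quality_chars is hypothetical, and it assumes uncompressed FASTQ files with exactly four lines per read. It prints the distinct quality characters seen in the first reads of a file. Instruments that bin quality scores, which is generally the case for NovaSeq data, should show only a handful of distinct characters.

```shell
# Print the distinct quality characters found in the first N reads
# (default 1000) of an uncompressed FASTQ file.
# Assumes the standard 4-line record layout (no wrapped sequences).
quality_chars() {
    local fastq="$1"
    local n_reads="${2:-1000}"
    head -n "$(( n_reads * 4 ))" "$fastq" \
        | awk 'NR % 4 == 0' \
        | fold -w 1 \
        | LC_ALL=C sort -u \
        | tr -d '\n'
    echo
}

# Example call (run inside the READS directory):
#   quality_chars COSMIC_vag.test.r1.fq
```

Running the helper on each file in the directory and comparing the number of characters printed gives a quick hint about which file comes from which instrument.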
Configuration of the preprocessing runs
There is a config file for the preprocessing, which you can copy to your ~/personal folder.
cd ~/personal
cp /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/02_preprocessing/configs/config.preprocessing.yaml prep.1.config.yaml
This config will work as is on the normal IMP3 test data set. Change the settings in the config file to use the other data sets, e.g.:
raws:
Metagenomics: "/zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r1.fq /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r2.fq"
See here for more information.
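Before pointing the config at a pair of read files, a quick sanity check can save a failed run. The sketch below is an optional helper, not an IMP3 feature: the function names count_reads and check_pair are hypothetical, and the code assumes uncompressed FASTQ files with four lines per read. It checks that both files are readable and contain the same number of reads.

```shell
# Count reads in an uncompressed FASTQ file (4 lines per read assumed).
count_reads() {
    echo "$(( $(wc -l < "$1") / 4 ))"
}

# Check that both mate files are readable and hold the same number of reads.
# Returns 0 if the counts match, non-zero otherwise.
check_pair() {
    local r1="$1" r2="$2"
    [ -r "$r1" ] && [ -r "$r2" ] || { echo "missing or unreadable file" >&2; return 1; }
    local n1 n2
    n1=$(count_reads "$r1")
    n2=$(count_reads "$r2")
    echo "r1: $n1 reads, r2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}

# Example call (run inside the READS directory):
#   check_pair COSMIC_vag.test.r1.fq COSMIC_vag.test.r2.fq
```

If the two counts differ, the files are probably not a matching pair and should not be listed together under Metagenomics.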
Dry-run
First, you should always perform a dry run. This will detect problems in your configuration file. Check the first session if you don’t remember how to do this.
Submit the preprocessing runs
If the dry-run was successful, you’re set to submit the test run to the compute nodes. First, check which compute nodes have free slots, as in the first session.
You can submit multiple IMP3 runs at the same time, but each run needs its own config file, and each config file must set a different output directory:
outputdir: "IMP3_test_preprocessing"
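Creating one config per run can also be scripted. The following is a minimal sketch under stated assumptions: the helper name make_run_config and the file and directory names in the example call are hypothetical, and the template is assumed to contain a top-level line starting with outputdir:, as in the line above. It copies a config and replaces only the output directory; other settings such as the input reads still need to be edited by hand.

```shell
# Write a copy of a config file with a different top-level outputdir.
# Assumes the template has a line that starts with "outputdir:".
# Prints the resulting outputdir line so the change can be verified.
make_run_config() {
    local template="$1" new_config="$2" outdir="$3"
    sed "s|^outputdir:.*|outputdir: \"${outdir}\"|" "$template" > "$new_config"
    grep '^outputdir:' "$new_config"
}

# Example call (hypothetical names):
#   make_run_config prep.1.config.yaml prep.2.config.yaml IMP3_vag_preprocessing
```

The template itself is left untouched, so the same starting config can be reused for every additional run.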
For the data set with the NovaSeq data, you should also set:
nextseq: true
The data set COSMIC_vag.test.r1|2.fq comes from a sample that still contains host genomic DNA. Compare the output obtained with the default phiX filter to the output obtained when the human genome is filtered as well:
filtering:
filter: "phiX174 hg38"
We will look at last week’s output and these outputs during the session.
Summarizing FastQC results
If you have results from multiple runs and would like to summarize the FastQC results, you can use MultiQC.
cd ~/personal
conda activate /zfs/omics/projects/metatools/TOOLS/dadasnake/conda/d2cefee4ae1cdf2445ac1cd01cb982dc
export LC_ALL=en_GB.utf8
export LANG=en_GB.utf8
multiqc -n multiqc_report.html /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/FASTQC/sample*/*fastq*
conda deactivate
The report in multiqc_report.html summarizes all FastQC results that match the wildcards in the positional argument.
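To see which files the wildcards will pick up before running MultiQC, the same pattern can be expanded with ls. This is a small sketch: the helper name list_fastqc_inputs is hypothetical, and it assumes the directory layout used in the command above (one sample* folder per run, containing files with fastq in their names).

```shell
# List the files that the pattern <base>/sample*/*fastq* expands to.
# Prints nothing and returns non-zero if the pattern matches no files.
list_fastqc_inputs() {
    local base="$1"
    ls -d "$base"/sample*/*fastq* 2>/dev/null
}

# Example call:
#   list_fastqc_inputs /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/FASTQC
```

If the list is empty or contains unexpected files, adjust the pattern before generating the report.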