Session 2: Run preprocessing on toy data

As in most of the course, we will use the IMP3 pipeline, which is implemented in Snakemake, to practice working with metagenomics data. We will also look at some commands on their own.

Remember, IMP3 has two inputs:

data
a configuration file

In this session, you can run IMP3 on crunchomics to preprocess some data.

External users

If you’re not on crunchomics when following this documentation, have a look at the IMP3 documentation to learn how to install IMP3.

Warning

If you haven’t set up your environment yet, do the setup as described in the first session now.

Data

For this session, there are a few different (small) files prepared on crunchomics:

cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS

Take a look at these files. Based on the lecture, can you spot the NovaSeq file?

less COSMIC_vag.test.r1.fq
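
To compare the files without paging through each one, it can help to print the first record and tally which quality characters occur. The sketch below assumes standard four-line FASTQ records; `fastq_peek` and `qual_chars` are hypothetical helper names, not part of IMP3. Newer Illumina instruments tend to report only a few distinct quality values, so a very short tally may be a hint, although the lecture may have pointed to a different feature.

```shell
# Print the first FASTQ record (header, sequence, separator, qualities).
fastq_peek() {
  head -n 4 "$1"
}

# Tally the quality characters in the first 1000 reads, most frequent first.
# Assumes four-line records, so every fourth line is a quality string.
qual_chars() {
  awk 'NR % 4 == 0' "$1" | head -n 1000 | fold -w 1 | sort | uniq -c | sort -k1,1nr
}

# Usage (file name taken from the example data directory):
#   fastq_peek COSMIC_vag.test.r1.fq
#   qual_chars COSMIC_vag.test.r1.fq
```

Running both helpers on each file in the READS directory gives a quick side-by-side view of the headers and quality encodings.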

Configuration of the preprocessing runs

There is a config file for the preprocessing which you can copy to your ~/personal folder.

cd ~/personal
cp /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/02_preprocessing/configs/config.preprocessing.yaml prep.1.config.yaml

This config will work as is on the normal IMP3 test data set. Change the settings in the config file to use the other data sets, e.g.:

raws:
  Metagenomics: "/zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r1.fq /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r2.fq"

See here for more information.
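
A mistyped path in `raws` is an easy mistake to make when switching data sets. As a minimal sketch, the hypothetical helper below (not part of IMP3) pulls the space-separated paths from the `Metagenomics:` entry of a config and reports whether each file is readable. It assumes the entry is written on a single line, as in the example above.

```shell
# Report whether each read file listed under "Metagenomics:" is readable.
# $1 = config file; assumes the paths are space-separated on one line.
check_raws() {
  grep -E '^[[:space:]]*Metagenomics:' "$1" \
    | sed -E 's/^[^:]*:[[:space:]]*//; s/"//g' \
    | tr ' ' '\n' \
    | while read -r f; do
        [ -z "$f" ] && continue
        if [ -r "$f" ]; then echo "ok $f"; else echo "MISSING $f"; fi
      done
}

# Usage:
#   check_raws prep.1.config.yaml
```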

Dry-run

First, you should always perform a dry run. This will detect problems in your configuration file. Check the first session if you don’t remember how to do this.

Submit the preprocessing runs

If the dry-run was successful, you’re set to submit the test run to the compute nodes. First, check which compute nodes have free slots, as in the first session.

You can submit multiple IMP3 runs at the same time, but each run needs its own config file, and each config file must set a different output directory:

outputdir: "IMP3_test_preprocessing"
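
One way to prepare several runs is to derive each config from the same template and change only `outputdir`. This is a sketch with hypothetical helper names; it assumes `outputdir:` sits at the start of its own line, as in the snippet above, and that the directory name contains no `|` or `&` character.

```shell
# Write a copy of a config with a different output directory.
# $1 = existing config, $2 = new config file, $3 = new output directory
make_config() {
  sed "s|^outputdir:.*|outputdir: \"$3\"|" "$1" > "$2"
}

# Print any outputdir line that occurs in more than one of the given configs.
# No output means every config writes to its own directory.
duplicate_outputdirs() {
  grep -h '^outputdir:' "$@" | sort | uniq -d
}

# Usage:
#   make_config prep.1.config.yaml prep.2.config.yaml IMP3_test_preprocessing_2
#   duplicate_outputdirs prep.1.config.yaml prep.2.config.yaml
```

Other settings, such as the input reads, still have to be edited by hand in each copy.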

For the data set with the NovaSeq data, you should also set the following option (it is named nextseq, but it applies to this data set as well):

nextseq: true

The dataset COSMIC_vag.test.r1|2.fq comes from a sample that still contains host genomic DNA. Run it once with the default phiX filter and once with the human genome added to the filter, then compare the outputs:

filtering:
  filter: "phiX174 hg38"
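
A simple first comparison is the number of reads that survive each filter setting. The sketch below counts records in a FASTQ file, plain or gzipped, assuming four lines per record. `count_reads` is a hypothetical helper, and the paths in the usage comment are placeholders because the names of the output files depend on your run.

```shell
# Count the records in a FASTQ file (four lines per record).
# Files ending in .gz are decompressed on the fly.
count_reads() {
  case "$1" in
    *.gz) gzip -cd "$1" ;;
    *)    cat "$1" ;;
  esac | awk 'END { print NR / 4 }'
}

# Usage (placeholder paths, replace with files from your own output directories):
#   count_reads <outputdir of phiX-only run>/<preprocessed r1 file>
#   count_reads <outputdir of phiX + hg38 run>/<preprocessed r1 file>
```

If host DNA was removed, the second count should be lower than the first.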

We will look at last week’s output and these outputs during the session.

Summarizing FastQC results

If you have results from multiple runs and would like to summarize the FastQC results, you can use MultiQC.

cd ~/personal
conda activate /zfs/omics/projects/metatools/TOOLS/dadasnake/conda/d2cefee4ae1cdf2445ac1cd01cb982dc
export LC_ALL=en_GB.utf8
export LANG=en_GB.utf8
multiqc -n multiqc_report.html /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/FASTQC/sample*/*fastq*
conda deactivate

The report in multiqc_report.html summarizes all FastQC results matched by the wildcards in the positional argument.
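
Before running MultiQC, it can be worth checking what a wildcard actually matches, since a pattern that matches nothing leads to an empty report. The helper below is a hypothetical sketch: the shell expands the wildcard before the function is called, and the function prints each existing path followed by a count on standard error.

```shell
# Print each existing path among the arguments, then the total on stderr.
# An unmatched wildcard is passed through literally and is not counted.
preview_matches() {
  n=0
  for f in "$@"; do
    if [ -e "$f" ]; then
      echo "$f"
      n=$((n + 1))
    fi
  done
  echo "$n match(es)" >&2
}

# Usage:
#   preview_matches /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/FASTQC/sample*/*fastq*
```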