.. _preprun:
========================================
Session 2: Run preprocessing on toy data
========================================
As in most of the course, we will use the `IMP3 `_ pipeline to practice working with metagenomics data. IMP3 is implemented in `Snakemake `_.
We will also look at some commands on their own.
Remember, IMP3 has two inputs:
* data
* a configuration file
In this session, you can run IMP3 on crunchomics to preprocess some data.
.. admonition:: External users
| If you're not on crunchomics when following this documentation, have a look at `the IMP3 documentation `_ to learn how to install IMP3.
.. warning::
| If you haven't set up your environment yet, do so now as described in the :ref:`first session `.
Data
----
For this session, there are a few different (small) files prepared on crunchomics:
.. code-block:: console
cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS
Take a look at these files. Based on the lecture, can you spot the NovaSeq file?
.. code-block:: console
less COSMIC_vag.test.r1.fq
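To compare the files side by side, it can help to print only the first read header of each one. The following is a minimal sketch: the helper name ``first_headers`` is hypothetical (it is not part of IMP3), and it assumes the files are uncompressed and end in ``.fq``. If the original Illumina headers were kept, each header starts with the instrument ID.

```shell
# Print the file name and the first header line of every .fq file in a directory.
# first_headers is a hypothetical helper, not part of IMP3.
first_headers() {
    local dir="${1:-.}"
    local f
    for f in "$dir"/*.fq; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        printf '%s\t%s\n' "$(basename "$f")" "$(head -n 1 "$f")"
    done
}
```

Calling ``first_headers .`` inside the ``READS`` folder prints one line per file.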
Configuration of the preprocessing runs
---------------------------------------
There is a config file for the preprocessing which you can copy to your ``~/personal`` folder.
.. code-block:: console
cd ~/personal
cp /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/02_preprocessing/configs/config.preprocessing.yaml prep.1.config.yaml
This config will work as is on the normal IMP3 test data set.
Change the settings in the config file to use the other data sets, e.g.:
.. code-block:: yaml

   raws:
     Metagenomics: "/zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r1.fq /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r2.fq"
See `here `_ for more information.
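Before pointing the config at a new pair of read files, you may want to check that both files are readable and contain the same number of reads. This is only a sketch under stated assumptions: ``count_reads`` and ``check_pair`` are hypothetical helpers (not IMP3 commands), and they assume uncompressed FASTQ files with exactly four lines per read.

```shell
# Count reads in an uncompressed FASTQ file, assuming four lines per record.
count_reads() {
    local lines
    lines=$(wc -l < "$1") || return 1
    echo $(( lines / 4 ))
}

# Succeed only if both files are readable and contain the same number of reads.
# check_pair is a hypothetical helper, not an IMP3 command.
check_pair() {
    local r1="$1" r2="$2" n1 n2
    [ -r "$r1" ] && [ -r "$r2" ] || { echo "missing read file" >&2; return 1; }
    n1=$(count_reads "$r1")
    n2=$(count_reads "$r2")
    echo "r1: $n1 reads, r2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}
```

For example, ``check_pair COSMIC_vag.test.r1.fq COSMIC_vag.test.r2.fq`` prints both counts and returns a non-zero exit status if they differ.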
Dry-run
-------
First, you should always perform a dry run. This will detect problems in your configuration file. Check the :ref:`first session <testrun>` if you don't remember how to do this.
Submit the preprocessing runs
-----------------------------
If the dry-run was successful, you're set to submit the test run to the compute nodes. First, check which compute nodes have free slots, as in the first session.
You can submit multiple IMP3 runs at the same time, but each run needs its own config file, and each config file must set a different output directory.
.. code-block:: yaml
outputdir: "IMP3_test_preprocessing"
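One way to prepare several config files is to copy a template and rewrite only its ``outputdir`` line. The sketch below uses a hypothetical helper, ``make_run_config``, and assumes that the template contains a single unindented line starting with ``outputdir:`` and that the new directory name contains no ``|`` or ``&`` characters (both are special to ``sed`` here).

```shell
# Copy a config file and replace its outputdir entry.
# make_run_config is a hypothetical helper; it assumes the template
# has exactly one unindented line starting with "outputdir:".
make_run_config() {
    local template="$1" new_config="$2" outdir="$3"
    sed "s|^outputdir:.*|outputdir: \"$outdir\"|" "$template" > "$new_config"
}
```

For example, ``make_run_config prep.1.config.yaml prep.2.config.yaml IMP3_test_preprocessing_2`` writes a second config that differs from the first only in its output directory. The file and directory names in this call are placeholders.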
For the data set with the NovaSeq data, you should also set:
.. code-block:: yaml
nextseq: true
The dataset ``COSMIC_vag.test.r1|2.fq`` comes from a sample that still contains host genomic DNA. Compare the output of the default phiX filter with the output you get when the human genome is filtered as well:
.. code-block:: yaml

   filtering:
     filter: "phiX174 hg38"
We will look at last week's output and these outputs during the session.
Summarizing FastQC results
--------------------------
If you have results from multiple runs and would like to summarize the FastQC results, you can use MultiQC.
.. code-block:: console
cd ~/personal
conda activate /zfs/omics/projects/metatools/TOOLS/dadasnake/conda/d2cefee4ae1cdf2445ac1cd01cb982dc
export LC_ALL=en_GB.utf8
export LANG=en_GB.utf8
multiqc -n multiqc_report.html /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/FASTQC/sample*/*fastq*
conda deactivate
The report in ``multiqc_report.html`` summarizes all FastQC results matched by the wildcards in the positional argument.
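If the report contains fewer samples than expected, it can be useful to list what the wildcard pattern actually matches. The sketch below applies the same ``sample*/*fastq*`` pattern below a base directory; ``list_fastqc_inputs`` is a hypothetical helper and not part of MultiQC.

```shell
# List the files that the pattern sample*/*fastq* matches below a base directory.
# list_fastqc_inputs is a hypothetical helper, not part of MultiQC.
list_fastqc_inputs() {
    local base="$1"
    local f
    for f in "$base"/sample*/*fastq*; do
        [ -e "$f" ] && echo "$f"         # skip the literal pattern if nothing matched
    done
    return 0
}
```

Piping the output through ``wc -l`` gives the number of matched files.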