.. _preprun:

========================================
Session 2: Run preprocessing on toy data
========================================

As in most of the course, we will use the `IMP3 <https://imp3.readthedocs.io/en/latest/>`_ pipeline to practice working with metagenomics data. IMP3 is implemented in `Snakemake <https://snakemake.readthedocs.io/en/stable/index.html>`_.
We will also look at some commands on their own.

Remember, IMP3 has two inputs:

* data
* a configuration file

In this session, you can run IMP3 on crunchomics to preprocess some data. 


.. admonition:: External users

   | If you're not on crunchomics when following this documentation, have a look at `the IMP3 documentation <https://imp3.readthedocs.io/en/latest/>`_ to learn how to install IMP3. 


.. warning:: 

   | If you haven't set up your environment yet, do so now as described in the :ref:`first session <testrun>`.


Data
----

For this session, there are a few different (small) files prepared on crunchomics:

.. code-block:: console

   cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS


Take a look at these files. Based on the lecture, can you spot the NovaSeq file?


.. code-block:: console

   less COSMIC_vag.test.r1.fq
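
To compare several files without opening each one in ``less``, a small loop can print the read count and the first record of every file. This is only a sketch: the helper ``fastq_summary`` is a made-up name, and it assumes uncompressed FASTQ files with exactly four lines per read and file names ending in ``.fq``.

.. code-block:: console

   # Hypothetical helper: print the read count and the first record of a FASTQ file.
   # Assumes an uncompressed FASTQ with exactly four lines per read.
   fastq_summary() {
       local file="$1"
       local lines
       lines=$(wc -l < "$file")
       echo "$file: $((lines / 4)) reads"
       head -n 4 "$file"
   }

   # Loop over the FASTQ files in the current directory (assumed to end in .fq).
   for f in *.fq; do
       [ -e "$f" ] || continue   # skip the literal pattern if nothing matches
       fastq_summary "$f"
   done

Run it from the ``READS`` directory and compare the header and quality lines between files.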


Configuration of the preprocessing runs
---------------------------------------

There is a config file for the preprocessing which you can copy to your ``~/personal`` folder.

.. code-block:: console

   cd ~/personal
   cp /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/02_preprocessing/configs/config.preprocessing.yaml prep.1.config.yaml


This config will work as is on the normal IMP3 test data set.
Change the settings in the config file to use the other data sets, e.g.: 

.. code-block:: yaml

   raws:
     Metagenomics: "/zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r1.fq /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r2.fq"
   

See `here <https://imp3.readthedocs.io/en/latest/running/configuring.html>`_ for more information.


Dry-run
-------

First, you should always perform a dry run. This will detect problems in your configuration file. Check the :ref:`first session <testrun>` if you don't remember how to do this.


Submit the preprocessing runs
-----------------------------

If the dry-run was successful, you're set to submit the test run to the compute nodes. First, check which compute nodes have free slots, as in the first session.

You can submit multiple IMP3 runs at the same time, but each run needs its own config file, and each config file needs to set a different output directory.

.. code-block:: yaml

   outputdir: "IMP3_test_preprocessing"
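
One way to prepare a second config is to copy the first and replace its output directory in one step. The sketch below makes some assumptions: ``copy_config`` is a made-up helper, the names ``prep.2.config.yaml`` and ``IMP3_test_preprocessing_2`` are only examples, and ``outputdir:`` is expected to be a top-level key on a line of its own.

.. code-block:: console

   # Hypothetical helper: write a copy of a config with a different output directory.
   # Usage: copy_config SOURCE_CONFIG NEW_CONFIG NEW_OUTPUTDIR
   # Assumes "outputdir:" is a top-level key on a line of its own.
   copy_config() {
       sed "s|^outputdir:.*|outputdir: \"$3\"|" "$1" > "$2"
   }

   # Only run this if the first config exists in ~/personal.
   if [ -f ~/personal/prep.1.config.yaml ]; then
       cd ~/personal
       copy_config prep.1.config.yaml prep.2.config.yaml IMP3_test_preprocessing_2
       # Check the result; no output means the key was not found.
       grep '^outputdir:' prep.2.config.yaml
   fi

If ``grep`` prints nothing, the key was not found on its own line and the new file needs to be edited by hand.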


For the data set with the NovaSeq data, you should also set:

.. code-block:: yaml

   nextseq: true


The data set ``COSMIC_vag.test.r1|2.fq`` comes from a sample that still contains host genomic DNA. Compare the output of a run with the default phiX filter to that of a run that also filters against the human genome:

.. code-block:: yaml

   filtering:
     filter: "phiX174 hg38"


We will look at last week's output and these outputs during the session.


Summarizing FastQC results
--------------------------

If you have results from multiple runs and would like to summarize the FastQC results, you can use MultiQC.

.. code-block:: console

   cd ~/personal
   conda activate /zfs/omics/projects/metatools/TOOLS/dadasnake/conda/d2cefee4ae1cdf2445ac1cd01cb982dc
   export LC_ALL=en_GB.utf8
   export LANG=en_GB.utf8
   multiqc -n multiqc_report.html /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/FASTQC/sample*/*fastq*
   conda deactivate

The report in ``multiqc_report.html`` summarizes all FastQC results matched by the wildcards in the positional argument.
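
Because the shell expands the wildcards before MultiQC starts, it can be useful to preview what a pattern matches. The following is a minimal sketch with a made-up helper, ``show_matches``, which lists the existing paths a pattern expands to and counts them.

.. code-block:: console

   # Hypothetical helper: list the existing paths a pattern expands to and count them.
   show_matches() {
       local n=0
       for path in "$@"; do
           [ -e "$path" ] || continue   # an unmatched pattern stays literal; skip it
           echo "$path"
           n=$((n + 1))
       done
       echo "$n paths matched"
   }

   show_matches /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/FASTQC/sample*/*fastq*

A count of zero suggests a typo in the path or pattern, or that you are not on crunchomics.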