.. _preprun:

========================================
Session 2: Run preprocessing on toy data
========================================

Like in most of the course, we will use the `IMP3 `_ pipeline to practice working with metagenomics data. IMP3 is implemented in `Snakemake `_. We will also look at some commands on their own.

Remember, IMP3 has two inputs:

* data
* a configuration file

In this session, you can run IMP3 on crunchomics to preprocess some data.

.. admonition:: External users

   | If you're not on crunchomics when following this documentation, have a look at `the IMP3 documentation `_ to learn how to install IMP3.

.. warning::

   | If you haven't set up your environment yet, do the setup as described in the :ref:`first session ` now.

Data
----

For this session, there are a few different (small) files prepared on crunchomics:

.. code-block:: console

   cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS

Take a look at these files. Based on the lecture, can you spot the NovaSeq file?

.. code-block:: console

   less COSMIC_vag.test.r1.fq

Configuration of the preprocessing runs
---------------------------------------

There is a config file for the preprocessing which you can copy to your ``~/personal`` folder.

.. code-block:: console

   cd ~/personal
   cp /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/02_preprocessing/configs/config.preprocessing.yaml prep.1.config.yaml

This config will work as is on the normal IMP3 test data set. Change the settings in the config file to use the other data sets, e.g.:

.. code-block:: yaml

   raws:
     Metagenomics: "/zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r1.fq /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r2.fq"

See `here `_ for more information.

Dry-run
-------

First, you should always perform a dry run. This will detect problems in your configuration file.
Check the :ref:`first session <testrun>` if you don't remember how to do this.

Submit the preprocessing runs
-----------------------------

If the dry run was successful, you're set to submit the test run to the compute nodes. First, check which compute nodes have free slots, as in the first session.

You can submit multiple IMP3 runs at the same time, but you need a different config file for each run, and each config file must set a different output directory.

.. code-block:: yaml

   outputdir: "IMP3_test_preprocessing"

For the data set with the NovaSeq data, you should also set

.. code-block:: yaml

   nextseq: true

The dataset ``COSMIC_vag.test.r1|2.fq`` comes from a sample that still contains host genomic DNA. Compare the output obtained with the default phiX filter to the output obtained when the human genome is filtered as well:

.. code-block:: yaml

   filtering:
     filter: "phiX174 hg38"

We will look at last week's output and these outputs during the session.

Summarizing FastQC results
--------------------------

If you have results from multiple runs and would like to summarize the FastQC results, you can use MultiQC.

.. code-block:: console

   cd ~/personal
   conda activate /zfs/omics/projects/metatools/TOOLS/dadasnake/conda/d2cefee4ae1cdf2445ac1cd01cb982dc
   export LC_ALL=en_GB.utf8
   export LANG=en_GB.utf8
   multiqc -n multiqc_report.html /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/FASTQC/sample*/*fastq*
   conda deactivate

The report in ``multiqc_report.html`` summarizes all FastQC results found with the wildcards in the positional argument.
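
Because the report only covers what the wildcard matches, it can help to list the matched paths before running MultiQC. The following is a minimal sketch, not part of the course setup: the variable names ``FASTQC_DIR`` and ``n`` are only illustrative, and the directory is the one used in the ``multiqc`` command in this section.

.. code-block:: console

   # Directory taken from the multiqc command in this section
   FASTQC_DIR=/zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/FASTQC
   n=0
   for f in "$FASTQC_DIR"/sample*/*fastq*; do
       # An unmatched wildcard stays as literal text, so check that the path exists
       if [ -e "$f" ]; then
           echo "$f"
           n=$((n + 1))
       fi
   done
   echo "matched $n paths"

If this reports zero matched paths, check the path and the wildcard pattern before running MultiQC.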