.. _runBWA:
=================================================
Session 3: Run BWA, inspect and filter the output
=================================================
In this part of the course, we will not use the `IMP3 `_ pipeline. You have done some mapping already in the :ref:`second session `.
Today, we will look at some of the commands IMP3 uses on their own.
In this session, you can use the environments IMP3 provides on crunchomics to preprocess data with BWA-MEM.
.. admonition:: External users
| If you've got a working install of BWA and samtools, you can use those (We're using samtools version 1.9).
.. warning::
| If you haven't set up your conda enviroment yet, do the set up as described in the :ref:`first session ` now.
Data
----
For this session, there are a few small files prepared on crunchomics:
.. code-block:: console
cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS
ls
.. code-block:: console
less COSMIC_vag.test.r1.fq
In addition, you can use reference files from the IMP3 databases or the test set:
.. code-block:: console
ls /zfs/omics/projects/metatools/DB/filtering/
ls /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/REFERENCES
Running BWA-MEM indexing
------------------------
Normally, you first need to build an index of you references. You can do this like so:
.. code-block:: console
cd ~/personal
cp /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/REFERENCES/mg.assembly.merged.fa .
conda activate /zfs/omics/projects/metatools/TOOLS/IMP3/conda/76e02d85877600c41ac84cb7bc27a189
bwa index mg.assembly.merged.fa
.. admonition:: RESOURCES!
| Don't do this on the frontend, if you are using a real genome (it's okay for the test data).
.. warning::
| You need to copy the reference file to your personal space or scratch, because you can't write into the examples folder on metatools.
You could also use the indices in the example data folder.
Running mapping with BWA-MEM
----------------------------
In the next step, you run the actual mapping.
.. code-block:: console
bwa mem -v 1 -t 1 mg.assembly.merged.fa /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/test.mg.1.fastq /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/test.mg.2.fastq >> test_alignment.sam
You can find the help for bwa mem by typing ``bwa mem``.
You can run other data, for example, the ``COSMIC_vag.test.r1|2.fq`` with the ``hg38.fa``.
Take a look at the alignment file
---------------------------------
The output of the BWA-MEM step is an alignment file in .sam format. There is a long header, but after that (for example at the end of the file), you see the alignment information as discussed in the lecture.
.. code-block:: console
tail test_alignment.sam
Converting the alignment file to binary (compressed) format
-----------------------------------------------------------
Usually, we will not want to store the big .sam file on our disk. Instead, we convert it to .bam format.
.. code-block:: console
samtools view --threads 1 -bS test_alignment.sam > test_alignment.bam
You could also do this in one step with the mapping:
.. code-block:: console
bwa mem -v 1 -t 1 /zfs/omics/projects/metatools/DB/filtering/hg38.fa /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r1.fq /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r2.fq |\
samtools view --threads 1 -bS - > COSMIC_alignment.bam
You can always look at a .bam file with ``samtools view``:
.. code-block:: console
samtools view COSMIC_alignment.bam | less
Filtering the .bam file
-----------------------
You can use ``samtools view`` also to filter the reads. For example, you can use ``-f 4`` to find only reads which don't map your reference.
.. code-block:: console
samtools view -f 4 COSMIC_alignment.bam | less
Likewise, ``-F 4`` only gives you reads which map. But be careful, there's also a second read.
Consult `explain flags `_ to find other flags you may want to set.
`This page `_ contains the manual of samtools.
More complex filtering
----------------------
In IMP3, we use this command to remove reads mapping to a host genome:
.. code-block:: console
samtools merge --threads 1 -u - \
<(samtools view --threads 1 -u -f 4 -F 264 COSMIC_alignment.bam) \
<(samtools view --threads 1 -u -f 8 -F 260 COSMIC_alignment.bam) \
<(samtools view --threads 1 -u -f 12 -F 256 COSMIC_alignment.bam) | \
samtools view --threads 1 -bF 0x800 - | \
samtools sort --threads 1 -m 3G -n - | \
bamToFastq -i stdin -fq COSMIC_filtered.r1.fq -fq2 COSMIC_filtered.r2.fq
conda deactivate
Can you figure out what happens here?