Session 3: Run BWA, inspect and filter the output

In this part of the course, we will not use the IMP3 pipeline. You have done some mapping already in the second session. Today, we will look at some of the commands IMP3 uses on their own.

In this session, you can use the environments IMP3 provides on crunchomics to preprocess data with BWA-MEM.

External users

If you’ve got a working install of BWA and samtools, you can use those (We’re using samtools version 1.9).

Warning

If you haven’t set up your conda enviroment yet, do the set up as described in the first session now.

Data

For this session, there are a few small files prepared on crunchomics:

cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS
ls

less COSMIC_vag.test.r1.fq

In addition, you can use reference files from the IMP3 databases or the test set:

ls /zfs/omics/projects/metatools/DB/filtering/
ls /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/REFERENCES

Running BWA-MEM indexing

Normally, you first need to build an index of you references. You can do this like so:

cd ~/personal
cp /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/REFERENCES/mg.assembly.merged.fa .
conda activate /zfs/omics/projects/metatools/TOOLS/IMP3/conda/76e02d85877600c41ac84cb7bc27a189
bwa index mg.assembly.merged.fa

RESOURCES!

Don’t do this on the frontend, if you are using a real genome (it’s okay for the test data).

Warning

You need to copy the reference file to your personal space or scratch, because you can’t write into the examples folder on metatools.

You could also use the indices in the example data folder.

Running mapping with BWA-MEM

In the next step, you run the actual mapping.

bwa mem -v 1 -t 1  mg.assembly.merged.fa /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/test.mg.1.fastq /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/test.mg.2.fastq >> test_alignment.sam

You can find the help for bwa mem by typing bwa mem.

You can run other data, for example, the COSMIC_vag.test.r1|2.fq with the hg38.fa.

Take a look at the alignment file

The output of the BWA-MEM step is an alignment file in .sam format. There is a long header, but after that (for example at the end of the file), you see the alignment information as discussed in the lecture.

tail test_alignment.sam

Converting the alignment file to binary (compressed) format

Usually, we will not want to store the big .sam file on our disk. Instead, we convert it to .bam format.

samtools view --threads 1 -bS test_alignment.sam > test_alignment.bam

You could also do this in one step with the mapping:

bwa mem -v 1 -t 1  /zfs/omics/projects/metatools/DB/filtering/hg38.fa /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r1.fq /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r2.fq |\
  samtools view --threads 1 -bS - > COSMIC_alignment.bam

You can always look at a .bam file with samtools view:

samtools view COSMIC_alignment.bam | less

Filtering the .bam file

You can use samtools view also to filter the reads. For example, you can use -f 4 to find only reads which don’t map your reference.

samtools view -f 4 COSMIC_alignment.bam | less

Likewise, -F 4 only gives you reads which map. But be careful, there’s also a second read.

Consult explain flags to find other flags you may want to set. This page contains the manual of samtools.

More complex filtering

In IMP3, we use this command to remove reads mapping to a host genome:

samtools merge --threads 1 -u - \
  <(samtools view --threads 1 -u  -f 4 -F 264 COSMIC_alignment.bam) \
  <(samtools view --threads 1 -u -f 8 -F 260 COSMIC_alignment.bam) \
  <(samtools view --threads 1 -u -f 12 -F 256 COSMIC_alignment.bam)  | \
 samtools view --threads 1 -bF 0x800 - | \
 samtools sort --threads 1 -m 3G -n - | \
 bamToFastq -i stdin -fq COSMIC_filtered.r1.fq -fq2 COSMIC_filtered.r2.fq

 conda deactivate

Can you figure out what happens here?