Session 3: Run BWA, inspect and filter the output
In this part of the course, we will not use the IMP3 pipeline. You have done some mapping already in the second session. Today, we will look at some of the commands IMP3 uses on their own.
In this session, you can use the environments IMP3 provides on crunchomics to preprocess data with BWA-MEM.
External users
Warning
Data
For this session, there are a few small files prepared on crunchomics:
cd /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS
ls
less COSMIC_vag.test.r1.fq
In addition, you can use reference files from the IMP3 databases or the test set:
ls /zfs/omics/projects/metatools/DB/filtering/
ls /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/REFERENCES
Running BWA-MEM indexing
Normally, you first need to build an index of you references. You can do this like so:
cd ~/personal
cp /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/REFERENCES/mg.assembly.merged.fa .
conda activate /zfs/omics/projects/metatools/TOOLS/IMP3/conda/76e02d85877600c41ac84cb7bc27a189
bwa index mg.assembly.merged.fa
RESOURCES!
Warning
You could also use the indices in the example data folder.
Running mapping with BWA-MEM
In the next step, you run the actual mapping.
bwa mem -v 1 -t 1 mg.assembly.merged.fa /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/test.mg.1.fastq /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/test.mg.2.fastq >> test_alignment.sam
You can find the help for bwa mem by typing bwa mem
.
You can run other data, for example, the COSMIC_vag.test.r1|2.fq
with the hg38.fa
.
Take a look at the alignment file
The output of the BWA-MEM step is an alignment file in .sam format. There is a long header, but after that (for example at the end of the file), you see the alignment information as discussed in the lecture.
tail test_alignment.sam
Converting the alignment file to binary (compressed) format
Usually, we will not want to store the big .sam file on our disk. Instead, we convert it to .bam format.
samtools view --threads 1 -bS test_alignment.sam > test_alignment.bam
You could also do this in one step with the mapping:
bwa mem -v 1 -t 1 /zfs/omics/projects/metatools/DB/filtering/hg38.fa /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r1.fq /zfs/omics/projects/metatools/SANDBOX/Metagenomics101/EXAMPLE_DATA/READS/COSMIC_vag.test.r2.fq |\
samtools view --threads 1 -bS - > COSMIC_alignment.bam
You can always look at a .bam file with samtools view
:
samtools view COSMIC_alignment.bam | less
Filtering the .bam file
You can use samtools view
also to filter the reads. For example, you can use -f 4
to find only reads which don’t map your reference.
samtools view -f 4 COSMIC_alignment.bam | less
Likewise, -F 4
only gives you reads which map. But be careful, there’s also a second read.
Consult explain flags to find other flags you may want to set. This page contains the manual of samtools.
More complex filtering
In IMP3, we use this command to remove reads mapping to a host genome:
samtools merge --threads 1 -u - \
<(samtools view --threads 1 -u -f 4 -F 264 COSMIC_alignment.bam) \
<(samtools view --threads 1 -u -f 8 -F 260 COSMIC_alignment.bam) \
<(samtools view --threads 1 -u -f 12 -F 256 COSMIC_alignment.bam) | \
samtools view --threads 1 -bF 0x800 - | \
samtools sort --threads 1 -m 3G -n - | \
bamToFastq -i stdin -fq COSMIC_filtered.r1.fq -fq2 COSMIC_filtered.r2.fq
conda deactivate
Can you figure out what happens here?