BWA-MEM

BWA-MEM (Burrows-Wheeler Aligner - Maximal Exact Match) is a fast and accurate alignment algorithm for mapping sequencing reads (70 bp to 1 Mbp) to a reference genome. It is the recommended algorithm from the BWA suite for most applications, particularly for Illumina reads ≥70 bp.

http://bio-bwa.sourceforge.net/
https://github.com/lh3/bwa

Key Features:

* Fast alignment using BWT (Burrows-Wheeler Transform) indexing
* Handles reads from 70 bp to 1 Mbp, including long reads (PacBio, Nanopore)
* Supports split alignments (chimeric reads, structural variants)
* Efficiently handles sequencing errors and polymorphisms
* Compatible with paired-end and single-end data
* Generates SAM output with alignment quality scores, which is typically converted to BAM with samtools

Typical Workflow:

Step 1: Index the reference genome (one-time setup):

bwa index -p ref_index reference.fasta

This creates five index files named after the prefix (ref_index.amb, ref_index.ann, ref_index.bwt, ref_index.pac, ref_index.sa) that enable fast searching. The index only needs to be built once per reference genome.
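Because the index is a one-time cost, a pipeline can check for it before rebuilding. The helper below is a minimal sketch; the function name bwa_index_complete is hypothetical, and it assumes only the five file extensions listed above.

```shell
# Hypothetical helper: succeeds only if all five BWA index files
# exist and are non-empty for the given index prefix.
bwa_index_complete() {
    for ext in amb ann bwt pac sa; do
        [ -s "$1.$ext" ] || return 1
    done
    return 0
}

# Example use (commented out so the sketch runs without bwa installed):
# bwa_index_complete ref_index || bwa index -p ref_index reference.fasta
```

The -s test also rejects zero-length files, which is one common symptom of an interrupted indexing run.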

Step 2: Align paired-end reads:

# -t 8: use 8 threads
# -M:   mark shorter split hits as secondary (for Picard compatibility)
# -R:   read group header line
# Positional arguments: index prefix, forward reads, reverse reads.
# Comments cannot follow a trailing backslash, so they are kept up here.
bwa mem \
    -t 8 \
    -M \
    -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1' \
    ref_index \
    sample_R1.fastq.gz \
    sample_R2.fastq.gz \
| samtools view -b - \
| samtools sort -@ 4 -o sample.sorted.bam -

Step 3: Index the BAM file:

samtools index sample.sorted.bam
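When the same workflow is run over many samples, the sample name and the mate file can be derived from the forward-read file name. This is a sketch that assumes the _R1.fastq.gz / _R2.fastq.gz naming used in the example above; the function names and the data/ directory are hypothetical.

```shell
# Hypothetical helpers, assuming files are named <sample>_R1.fastq.gz
# and <sample>_R2.fastq.gz as in the example above.

# Print the sample name for a forward-read file (directory stripped).
sample_from_r1() {
    base=${1##*/}
    printf '%s\n' "${base%_R1.fastq.gz}"
}

# Print the path of the matching reverse-read file.
r2_from_r1() {
    printf '%s\n' "${1%_R1.fastq.gz}_R2.fastq.gz"
}

# Example loop (commented out so the sketch runs without bwa/samtools):
# for r1 in data/*_R1.fastq.gz; do
#     s=$(sample_from_r1 "$r1")
#     bwa mem -t 8 -M -R "@RG\tID:$s\tSM:$s\tPL:ILLUMINA\tLB:lib1" \
#         ref_index "$r1" "$(r2_from_r1 "$r1")" \
#     | samtools sort -@ 4 -o "$s.sorted.bam" -
#     samtools index "$s.sorted.bam"
# done
```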

Why use BWA-MEM over BWA-ALN? BWA-MEM is generally faster and more accurate than the older BWA-backtrack algorithm (the bwa aln command, referred to here as BWA-ALN), especially for reads ≥70 bp. It seeds alignments with maximal exact matches (MEMs), which allows it to handle longer reads and tolerate more sequencing errors. BWA-MEM also natively supports split alignments (chimeric reads), making it suitable for detecting structural variants. It is not splice-aware, however: split alignments can represent RNA-seq reads that span exon junctions, but dedicated spliced aligners like STAR are preferred for RNA-seq.
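To make the idea of an exact-match seed concrete, the sketch below grows a match left and right from one matching read/reference position until a mismatch or a sequence end is hit. It is a toy illustration on plain strings, not how BWA-MEM is implemented (BWA-MEM finds its seeds through the FM-index); the function name and example sequences are made up.

```shell
# Toy illustration: extend an exact match in both directions.
# Arguments: query sequence, reference sequence, 0-based query position,
# 0-based reference position (the two positions must hold the same base).
# Prints: query_start ref_start match_length (0-based starts).
extend_exact_match() {
    awk -v query="$1" -v refseq="$2" -v i="$3" -v j="$4" 'BEGIN {
        i += 1; j += 1                      # awk strings are 1-based
        if (substr(query, i, 1) != substr(refseq, j, 1)) exit 1
        left = 0                            # bases added to the left
        while (i - left > 1 && j - left > 1 &&
               substr(query, i - left - 1, 1) == substr(refseq, j - left - 1, 1))
            left++
        right = 1                           # bases from the seed rightward
        while (i + right <= length(query) && j + right <= length(refseq) &&
               substr(query, i + right, 1) == substr(refseq, j + right, 1))
            right++
        print (i - left - 1), (j - left - 1), (left + right)
    }'
}

extend_exact_match GATTACA CCGATTTCA 1 3
```

Here the seed at query position 1 and reference position 3 grows into the four-base match GATT, which starts at query position 0 and reference position 2.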

Read Group (@RG) Tags: The -R parameter adds read group information to the BAM file, which is essential for downstream analysis with tools like GATK. The read group tags include:

- ID: Unique identifier for the read group (often flowcell.lane)
- SM: Sample name (biological sample identifier)
- PL: Platform (e.g., ILLUMINA, PACBIO)
- LB: Library identifier (useful when multiple libraries are prepared from the same sample)

This metadata enables multi-sample variant calling and helps track data provenance.
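Typing the -R string by hand is error-prone because the separators must be the two literal characters \t, which bwa converts to tabs in the SAM header. A small helper can assemble it from the four tags above; the function name and the example values are hypothetical.

```shell
# Hypothetical helper: build the string passed to "bwa mem -R".
# Arguments: ID, SM, PL, LB. The output contains literal "\t" sequences
# (backslash + t), which bwa converts to tabs in the SAM header.
build_read_group() {
    printf '@RG\\tID:%s\\tSM:%s\\tPL:%s\\tLB:%s\n' "$1" "$2" "$3" "$4"
}

build_read_group flowcell1.1 sample1 ILLUMINA lib1
```

The result can be passed directly, for example as -R "$(build_read_group flowcell1.1 sample1 ILLUMINA lib1)".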

Alignment Quality and MAPQ Scores: BWA-MEM assigns a mapping quality (MAPQ) score to each alignment. MAPQ is Phred-scaled: MAPQ = -10 log10(P), where P is the estimated probability that the alignment is incorrect. BWA-MEM caps MAPQ at 60, which corresponds to a 1/1,000,000 chance of being wrong (P = 10^(-60/10)). A MAPQ ≥ 30 (P ≤ 0.001) is generally considered high-quality. Reads with multiple equally good alignments receive MAPQ = 0, indicating ambiguous mapping.
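The Phred relationship between MAPQ and error probability can be checked with a couple of awk one-liners. This is a minimal sketch; the function names are hypothetical.

```shell
# Convert a MAPQ value to the probability that the alignment is wrong:
# P = 10^(-MAPQ/10)
mapq_to_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Convert an error probability back to a (rounded) MAPQ value:
# MAPQ = -10 * log10(P)
error_prob_to_mapq() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

mapq_to_error_prob 60    # prints 1e-06
mapq_to_error_prob 30    # prints 0.001
error_prob_to_mapq 0.01  # prints 20
```

In practice, low-confidence alignments are usually removed with a threshold, for example samtools view -b -q 30 sample.sorted.bam, which keeps only alignments with MAPQ of at least 30.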

Animation

These animations were created with Manim Community. Source scripts are in tools/animations/.

Conceptual Overview

A visual walkthrough of how BWA-MEM aligns a read to a reference:

1. A read is shown with three Maximal Exact Matches (MEMs) highlighted on the reference
2. MEMs are chained into a colinear alignment path
3. Gaps between MEMs are filled using Smith-Waterman local extension
4. The final alignment is emitted as a SAM record with CIGAR string
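The CIGAR string in the final SAM record encodes how the read lines up against the reference. As a small worked example, the sketch below computes how many reference bases an alignment covers by summing the operations that consume the reference (M, D, N, = and X, per the SAM specification). The function name is hypothetical.

```shell
# Sum the lengths of CIGAR operations that consume reference bases.
cigar_ref_span() {
    awk -v cigar="$1" 'BEGIN {
        span = 0
        # Peel off one <length><operation> pair at a time.
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if (op ~ /[MDN=X]/) span += len   # I, S, H, P do not consume reference
            cigar = substr(cigar, RLENGTH + 1)
        }
        print span
    }'
}

cigar_ref_span 50M2D48M   # prints 100
cigar_ref_span 30S70M     # prints 70
```

A soft-clipped read such as 30S70M covers only 70 reference bases, which is typical of the split alignments BWA-MEM reports for chimeric reads.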

Coming soon: Upload the rendered video to YouTube and replace this placeholder with {{< video https://www.youtube.com/embed/YOUR_VIDEO_ID >}}.

To render locally:

cd tools/animations
manim -pqh --media_dir ~/Desktop/manim_animations bwamem_conceptual.py BwamemConceptual

Step-by-Step Algorithm

A deeper dive into the Burrows-Wheeler Transform and FM-index:

1. BWT construction: step-by-step rotation and sorting of "BANANA$" to build the BWT string
2. FM-index backward search: querying "ANA" right-to-left through the BWT to find matching rows
3. Suffix Array lookup: converting FM-index row ranges to genome positions
4. MEM extension: growing a seed match left and right until a mismatch is hit

Coming soon: Upload the rendered video to YouTube and replace this placeholder with {{< video https://www.youtube.com/embed/YOUR_VIDEO_ID >}}.

To render locally:

cd tools/animations
manim -pqh --media_dir ~/Desktop/manim_animations bwamem_stepbystep.py BwamemStepByStep
