STAR

STAR (Spliced Transcripts Alignment to a Reference) is an ultrafast RNA-seq aligner designed to map reads that span splice junctions. It is one of the most widely used aligners for RNA-seq data, with common applications in gene expression quantification, transcript discovery, and fusion gene detection.

https://github.com/alexdobin/STAR
STAR manual: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

Key Features:

* Extremely fast: can align millions of reads per minute
* Accurately maps reads spanning splice junctions (exon-exon boundaries)
* Handles reads from 50 bp to several hundred bp
* Detects novel splice junctions (unannotated transcripts)
* Identifies gene fusions and chimeric transcripts
* Generates read counts per gene for differential expression analysis
* Compatible with downstream tools (Cufflinks, StringTie, DESeq2)
* Outputs in SAM/BAM format with junction information

Index Building Considerations: STAR requires significant RAM for indexing and alignment (about 32 GB is recommended for the human genome). The index size depends mainly on genome size. For best results, the --sjdbOverhang value used when building the index should match the read length of the data to be aligned.

Typical Workflow:

Step 1: Generate genome index (one-time setup):

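# Flags used below (a comment after a trailing backslash would break the command):
#   --genomeDir         output directory for the index
#   --genomeFastaFiles  reference genome FASTA
#   --sjdbGTFfile       gene annotation (GTF)
#   --sjdbOverhang      read length - 1 (99 for 100 bp reads)
#   --runThreadN        number of threads
STAR \
    --runMode genomeGenerate \
    --genomeDir /path/to/genome_index \
    --genomeFastaFiles reference.fasta \
    --sjdbGTFfile annotation.gtf \
    --sjdbOverhang 99 \
    --runThreadN 8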

Why sjdbOverhang = read length - 1? The --sjdbOverhang parameter sets how many bases of genomic sequence on each side of an annotated junction STAR includes when it builds the splice junction database. A read of length L can extend at most L-1 bases across a junction (when only 1 bp aligns on the other side), so L-1 is the ideal value. Setting it correctly helps STAR align reads whose junction falls near a read end. For reads of varying length, the ideal value is the maximum read length minus 1; the STAR manual notes that the default of 100 works about as well in most cases. Rebuild the index if your read lengths differ substantially from the value used.
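As a rough illustration of the L-1 rule, the sketch below reads FASTQ text and prints the maximum read length minus 1. The function name `sjdb_overhang` is hypothetical, and the code assumes well-formed four-line FASTQ records; it is a convenience sketch, not part of STAR.

```bash
# Hypothetical helper: read FASTQ text on stdin, print (max read length - 1).
# Assumes 4-line FASTQ records, so the sequence is line 2 of every 4.
sjdb_overhang() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { if (NR == 0) exit 1; print max - 1 }'
}

# Usage (sampling the first 100,000 reads is an arbitrary choice for speed):
# zcat sample_R1.fastq.gz | head -n 400000 | sjdb_overhang
```

For 100 bp reads this prints 99, the value passed to --sjdbOverhang in Step 1.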

Step 2: Align RNA-seq reads:

# Flags used below (a comment after a trailing backslash would break the command):
#   --readFilesIn        input FASTQ files (read 1, then read 2)
#   --readFilesCommand   command that decompresses the input (zcat for .gz)
#   --outFileNamePrefix  output file prefix
#   --outSAMtype         write a coordinate-sorted BAM directly
#   --quantMode          GeneCounts: count reads per gene
#   --outSAMattributes   SAM tags to include
#   --limitBAMsortRAM    RAM for BAM sorting, in bytes (20 GB here)
STAR \
    --runMode alignReads \
    --genomeDir /path/to/genome_index \
    --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode GeneCounts \
    --outSAMattributes NH HI AS nM NM \
    --runThreadN 8 \
    --limitBAMsortRAM 20000000000

Key Output Files:

- sample_Aligned.sortedByCoord.out.bam: Aligned reads sorted by coordinate (ready for downstream analysis)
- sample_Log.final.out: Summary statistics (mapping rate, junction counts)
- sample_SJ.out.tab: Detected splice junctions with read support
- sample_ReadsPerGene.out.tab: Gene-level read counts (for DESeq2, edgeR)
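To pull a single statistic out of sample_Log.final.out, a small helper along these lines may be enough. The function name `star_log_value` is hypothetical, and the sketch assumes the "label |<TAB>value" line layout used in that file.

```bash
# Hypothetical helper: print the value of one field from a Log.final.out file.
# Assumes each statistic is written as "<padded label> |<TAB><value>".
star_log_value() {
    # $1 = exact field label, $2 = path to a Log.final.out file
    awk -F'[|]' -v key="$1" '
        {
            label = $1; value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", label)   # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (label == key) { print value; found = 1 }
        }
        END { exit !found }' "$2"                # non-zero exit if label is missing
}

# Usage:
# star_log_value "Uniquely mapped reads %" sample_Log.final.out
```

The non-zero exit status on a missing label makes the helper usable in scripts that should stop when a log file is incomplete.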

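Understanding STAR Alignment Modes: How STAR treats gaps in spliced alignments is controlled by --alignIntronMin and --alignIntronMax:

- Default --alignIntronMin 21: genomic gaps shorter than 21 bp are treated as deletions rather than introns
- Default --alignIntronMax 0: no explicit maximum is set; the limit is instead derived from the alignment window parameters as (2^winBinNbits) * winAnchorDistNbins, which is 589,824 bp with default settings
- For organisms whose introns are much shorter or longer than this, adjust these parameters
- --alignIntronMin 20 --alignIntronMax 1000000 is typical for mammalian genomes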
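One way to sanity-check an --alignIntronMax choice is to look at the longest intron with read support in SJ.out.tab from a pilot run. The sketch below does that; the function name `longest_intron` and the default support threshold are assumptions made for illustration. Note that observed introns can never exceed the limit that was in force during that run.

```bash
# Hypothetical helper: longest intron among junctions in an SJ.out.tab file
# that have at least min_unique uniquely mapped reads.
# Columns used: 2 = first intron base, 3 = last intron base, 7 = unique reads.
longest_intron() {
    # $1 = SJ.out.tab path, $2 = minimum unique read support (default 1)
    awk -F'\t' -v min_unique="${2:-1}" '
        $7 >= min_unique {
            len = $3 - $2 + 1
            if (len > max) max = len
        }
        END { print max + 0 }' "$1"
}

# Usage:
# longest_intron sample_SJ.out.tab 5
```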

Multi-mapping reads: STAR can report multiple alignments for reads that map to multiple locations (e.g., gene families, repetitive regions). The --outFilterMultimapNmax option sets the maximum number of loci a read may map to; reads that exceed it are treated as unmapped:

- --outFilterMultimapNmax 1: Keep only uniquely mapped reads (stringent)
- --outFilterMultimapNmax 20: Keep reads that map to up to 20 loci (permissive; the default is 10)
- Multi-mappers are useful for detecting gene family expression but may complicate quantification
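To see how many alignment records come from unique versus multi-mapped reads, the NH tag (number of reported alignments for the read) can be tallied. This is a minimal sketch: the function name `count_by_nh` is hypothetical, it counts SAM records rather than reads or read pairs, and it assumes NH was requested via --outSAMattributes.

```bash
# Hypothetical helper: read SAM text on stdin and count alignment records
# whose NH tag is 1 (unique) versus greater than 1 (multi-mapped).
count_by_nh() {
    awk -F'\t' '
        /^@/ { next }                      # skip SAM header lines
        {
            for (i = 12; i <= NF; i++)     # optional tags start at column 12
                if ($i ~ /^NH:i:/) {
                    nh = substr($i, 6) + 0
                    if (nh == 1) unique++; else multi++
                    break
                }
        }
        END { printf "unique\t%d\nmulti\t%d\n", unique, multi }'
}

# Usage (assumes samtools is installed):
# samtools view sample_Aligned.sortedByCoord.out.bam | count_by_nh
```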

Gene Counting with --quantMode GeneCounts: STAR can count reads per gene during alignment, which saves running a separate counting tool (e.g., featureCounts). The output file ReadsPerGene.out.tab contains four columns:

1. Gene ID
2. Counts for unstranded RNA-seq
3. Counts for the 1st read strand aligned with RNA (equivalent to htseq-count -s yes)
4. Counts for the 2nd read strand aligned with RNA (equivalent to htseq-count -s reverse)

The first four rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) are summary counts rather than genes.

Choose the appropriate column based on your library preparation protocol (most RNA-seq is now strand-specific).
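If the library protocol is unknown, comparing the column totals is a common heuristic. The sketch below sums columns 2 to 4 while skipping the N_* summary rows at the top of the file; the function name `sum_gene_counts` and the output labels are made up for this example.

```bash
# Hypothetical helper: sum each count column of a ReadsPerGene.out.tab file.
# Rows whose first field starts with "N_" (N_unmapped, N_multimapping,
# N_noFeature, N_ambiguous) are summary rows and are skipped.
sum_gene_counts() {
    # $1 = path to a ReadsPerGene.out.tab file
    awk -F'\t' '
        $1 ~ /^N_/ { next }
        { unstranded += $2; read1 += $3; read2 += $4 }
        END {
            printf "column2_unstranded\t%d\n", unstranded
            printf "column3_read1_strand\t%d\n", read1
            printf "column4_read2_strand\t%d\n", read2
        }' "$1"
}

# Usage:
# sum_gene_counts sample_ReadsPerGene.out.tab
```

As a rule of thumb, similar totals in columns 3 and 4 suggest an unstranded library, while one column dominating suggests a stranded protocol matching that column. Treat this as a hint to confirm against the kit documentation, not as proof.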

Two-pass mapping for novel junction detection: For samples with many novel splice junctions (e.g., de novo transcriptome, non-model organisms), use two-pass mode:

# --twopassMode Basic enables two-pass mode
STAR \
    --runMode alignReads \
    --genomeDir /path/to/genome_index \
    --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --twopassMode Basic \
    --runThreadN 8

In two-pass mode, STAR first maps all reads to detect novel junctions, then inserts those junctions into the genome index and re-maps all reads in a second pass. This improves alignment accuracy for reads that cross unannotated junctions.
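To judge whether two-pass mode is likely to help, one option is to count well-supported unannotated junctions in the SJ.out.tab of an ordinary single-pass run. The function name `count_novel_junctions` and the default threshold of 3 reads are assumptions for illustration. The sketch is meant for single-pass output: as far as the STAR manual describes it, junctions found in pass 1 of a two-pass run are flagged as annotated in the final file.

```bash
# Hypothetical helper: count unannotated junctions with enough unique support.
# Columns used: 6 = annotation flag (0 = unannotated), 7 = unique reads.
count_novel_junctions() {
    # $1 = SJ.out.tab path, $2 = minimum unique read support (default 3)
    awk -F'\t' -v min_unique="${2:-3}" '
        $6 == 0 && $7 >= min_unique { n++ }
        END { print n + 0 }' "$1"
}

# Usage:
# count_novel_junctions sample_SJ.out.tab 3
```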

Chimeric/Fusion Detection: STAR can detect gene fusions and chimeric transcripts (common in cancer):

# Flags used below:
#   --chimSegmentMin  minimum chimeric segment length
#   --chimOutType     Junctions: write chimeric junctions to Chimeric.out.junction
STAR \
    --runMode alignReads \
    --genomeDir /path/to/genome_index \
    --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --chimSegmentMin 20 \
    --chimOutType Junctions \
    --runThreadN 8

The sample_Chimeric.out.junction file contains candidate fusion events that can be further validated with tools like STAR-Fusion.
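For a first look at the candidates before running a dedicated fusion caller, the junction file can be split into inter- and intrachromosomal events. This sketch relies on columns 1 and 4 holding the donor and acceptor chromosomes; the function name `count_chimeric_types` is hypothetical, and the header handling (lines starting with `#` or `chr_donorA`) is an assumption about what some STAR versions write.

```bash
# Hypothetical helper: classify chimeric junction records by whether the
# donor (column 1) and acceptor (column 4) chromosomes differ.
count_chimeric_types() {
    # $1 = path to a Chimeric.out.junction file
    awk -F'\t' '
        /^#/ || $1 == "chr_donorA" { next }   # skip comment/header lines if present
        { if ($1 != $4) inter++; else intra++ }
        END { printf "interchromosomal\t%d\nintrachromosomal\t%d\n", inter, intra }' "$1"
}

# Usage:
# count_chimeric_types sample_Chimeric.out.junction
```

These are raw record counts, so the same event supported by several reads is counted several times.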

Animation

These animations were created with Manim Community (https://www.manim.community/). Source scripts are in tools/animations/ (https://github.com/grgrzhong/quarto/tree/main/tools/animations).

Conceptual Overview

A visual walkthrough of how STAR handles spliced reads:

1. Pre-mRNA is shown with exons and introns
2. Splicing produces mature mRNA
3. An Illumina read is drawn spanning the Exon 1–Exon 2 junction
4. STAR splits the read into two segments and maps each to the correct exon
5. The detected splice junction is recorded in SJ.out.tab

Coming soon: Upload the rendered video to YouTube and replace this placeholder with {{< video https://www.youtube.com/embed/YOUR_VIDEO_ID >}}.

To render locally:

cd tools/animations
manim -pqh --media_dir ~/Desktop/manim_animations star_conceptual.py StarConceptual

Step-by-Step Algorithm

A deeper dive into STAR's internal algorithm across four steps:

1. Suffix Array index: how STAR builds its genome index from all suffixes of the reference sequence
2. MMP Seeding: Maximal Mappable Prefix seeds are found for each read segment
3. Seed clustering: colinear seeds are grouped into candidate alignment windows
4. Two-pass strategy: Pass 1 discovers novel splice junctions; Pass 2 uses them for improved alignment
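The MMP seeding idea can be illustrated with a toy example. The sketch below repeatedly finds the longest prefix of the remaining read that occurs exactly in a reference string. It uses a naive substring search, not STAR's suffix-array search, and ignores mismatches, strands, and scoring; the function name `mmp_seeds` and the sequences are made up for the example.

```bash
# Toy illustration only: sequential Maximal Mappable Prefix (MMP) seeding
# using naive exact substring search (STAR itself searches a suffix array).
mmp_seeds() {
    # $1 = reference sequence, $2 = read sequence
    awk -v ref="$1" -v read="$2" 'BEGIN {
        start = 1
        while (start <= length(read)) {
            rest = substr(read, start)
            best = 0
            # try the longest prefix first, shrinking until one is found
            for (len = length(rest); len > 0; len--) {
                pos = index(ref, substr(rest, 1, len))
                if (pos > 0) { best = len; break }
            }
            if (best == 0) { start++; continue }   # base absent from ref: skip it
            printf "read[%d-%d] -> ref[%d-%d]\n", start, start + best - 1, pos, pos + best - 1
            start += best                          # continue after this seed
        }
    }'
}

# Reference = exon (10 bp) + intron (15 bp, GT...AG) + exon (10 bp);
# the read covers the last 6 bp of exon 1 and the first 6 bp of exon 2.
mmp_seeds GATTACAGGCGTAAGTTTTTCCCAGCTGAACTGGA ACAGGCCTGAAC
# read[1-6] -> ref[5-10]
# read[7-12] -> ref[26-31]
```

The two seeds land on either side of reference positions 11 to 25, which is the intron in this toy reference; stitching such seeds together is what yields a spliced alignment.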

Coming soon: Upload the rendered video to YouTube and replace this placeholder with {{< video https://www.youtube.com/embed/YOUR_VIDEO_ID >}}.

To render locally:

cd tools/animations
manim -pqh --media_dir ~/Desktop/manim_animations star_stepbystep.py StarStepByStep
