iVar

iVar is a computational tool designed for the analysis of viral sequencing data, particularly in the context of identifying and quantifying variants within viral populations. It is commonly used in the field of virology and genomics to process high-throughput sequencing data, allowing researchers to detect mutations, track viral evolution, and understand the dynamics of viral infections.

https://andersen-lab.github.io/ivar/html/

iVar primer trimming * it is essential to remove artefactual sequences that can bias variant calling and consensus generation. * reads that truly overlap the primer regions are trimmed, while reads that do not align to primer sites are left intact. * iVar uses the BED file to locate the primer sequences at the start and end of the aligned reads and replaces those bases with ‘S’ (soft-clip) in the BAM file. * The resulting BAM file contains reads with the primers soft-clipped, preventing the primer sequence itself from introducing false variants.

iVar Quality trimming * iVar uses a sliding window approach (Default: 4) to do quality trimming. The windows slides from the 5’ end to the 3’ end and if at any point the average base quality in the window falls below the threshold (Default: 20), the remaining read is soft clipped. * If after trimming, the length of the read is greater than the minimum length specified (Default: 30), the read is written to the new trimmed BAM file.

# Index the sorted BAM before trimming
samtools index Align/sample.sorted.bam

# Trim primers
ivar trim \
    -i Align/sample.sorted.bam \  # input sorted BAM
    -b primers.bed \              # primer coordinates BED file (generated in step 0.1)
    -e \                          # include reads that extend beyond the primer region
    -p Align/sample_trimmed       # output prefix (produces sample_trimmed.bam)

why include reads that extend beyond the primer region? Including reads that extend beyond the primer region allows iVar to trim only the portion of the read that overlaps the primer, while retaining the rest of the read sequence for downstream analysis. This is important because many reads may partially overlap the primer sites, and excluding them entirely would result in loss of valuable data. By allowing reads that extend beyond the primer region, we can maximize the amount of usable data while still effectively removing primer sequences that could bias variant calling.

iVar uses a sliding window approach (Default: 4) to do quality trimming. The windows slides from the 5’ end to the 3’ end and if at any point the average base quality in the window falls below the threshold (Default: 20), the remaining read is soft clipped. If after trimming, the length of the read is greater than the minimum length specified (Default: 30), the read is written to the new trimmed BAM file. By using a sliding window rather than a single-base check, iVar avoids being “fooled” by a single bad base in the middle of a high-quality stretch. It ensures that trimming only happens when there is a consistent trend of poor data.

Call nucleotide variants from the primer-trimmed BAM using samtools mpileup piped into iVar.

samtools mpileup \
    -aa \                            # output ALL positions, including zero-coverage sites
    -B \                             # disable BAQ (per-base alignment quality) computation
    -A \                             # include anomalous read pairs, which can be informative for variant calling in amplicon data
    -d 20000 \                       # max per-position read depth cap (20,000×), raised from default 8,000× to avoid truncating high-coverage amplicon data
    --reference KT992094.1.fasta \   # reference FASTA (required when using -aa)
    -Q 0 \                           # include bases of any quality (iVar does its own filtering), to maximize data retained for variant calling
    Align/sample.bam \
| ivar variants \
    -p iVAR_Variant/sample \         # output prefix (.tsv)
    -r KT992094.1.fasta \            # reference FASTA (for amino-acid annotation)
    -g KT992094.1.gff3               # GFF3 annotation (for codon/amino-acid annotation)

samtools mpileup
- samtools mpileup processes reads in the order they appear in your BAM file. Because mpileup requires a coordinate-sorted BAM file, it encounters reads based on their starting genomic position
- As soon as it reaches the 20,000th read for that specific position, it stops accepting any further reads for that coordinate
- Passing zero for this option sets it to the highest possible value, effectively removing the depth limit. [8000]
Why disable BAQ with -B for amplicon data? BAQ (per-base alignment quality) is designed to adjust base quality scores near indels to account for potential misalignments. However, in amplicon sequencing data, which often has very high coverage and may contain many reads that partially overlap primer sites, BAQ can be overly aggressive in down-weighting bases near indels, leading to loss of true variant signals. Disabling BAQ with -B allows iVar to apply its own variant calling filters without the influence of BAQ adjustments, which can improve sensitivity for detecting variants in amplicon data where indels may be common and coverage is high enough to support confident variant calls without BAQ’s adjustments.
Why setup the read depth cap to 20,000× with -d? Setting the read depth cap to 20,000× allows for the retention of high-coverage data that is common in amplicon sequencing, while still preventing excessive memory usage and computational time during variant calling. The default depth cap of 8,000× in samtools mpileup may truncate data in high-coverage amplicon datasets, leading to loss of variant information. By raising the cap to 20,000×, we can ensure that we capture the full depth of coverage available for variant calling, which can improve sensitivity for detecting low-frequency variants. However, setting the cap too high (e.g., 1,000,000×) could lead to excessive computational overhead without providing additional benefit for variant detection, as the accuracy of variant frequency estimation may not improve significantly beyond a certain coverage level due to diminishing returns and increased noise at very high coverage levels.
Why not use higher read depth cap (e.g., 1,000,000×) for variant calling? While a higher read depth cap (e.g., 1,000,000×) could be used to ensure that no data is truncated, it may lead to excessive memory usage and computational time during variant calling, especially for high-coverage amplicon data. Setting a cap of 20,000× strikes a balance between retaining sufficient data for accurate variant calling while avoiding the computational overhead associated with processing extremely high coverage, which may not provide additional benefit for variant detection beyond a certain point due to diminishing returns in variant frequency estimation and increased noise at very high coverage levels.

Generate a consensus FASTA sequence from the primer-trimmed BAM. * This step is optional but can be useful for downstream analyses that require a consensus sequence (e.g., lineage assignment, phylogenetics). * The consensus is generated using samtools mpileup piped into iVar consensus, which applies a minimum allele frequency threshold to call bases and masks low-coverage positions. * The parameters are set to retain as much data as possible while still producing a reliable consensus sequence. For example, the minimum allele frequency threshold is set to 60 % to call a base, which allows for some within-host diversity while still calling a clear consensus. The minimum depth threshold is set to 10×, and positions below this are masked with ‘N’ to indicate uncertainty. The quality 20 means that bases with a quality score below 20 will be ignored in the consensus calling, which helps to reduce the impact of sequencing errors on the consensus sequence. The -aa flag ensures that all positions are included in the output, even those with zero coverage, which is important for generating a full-length consensus sequence that can be used for downstream analyses. * The resulting consensus FASTA can be used for lineage assignment (e.g., with Pangolin) and phylogenetic analysis, while the accompanying quality file provides information on coverage and allele frequencies at each position.

samtools mpileup \
    -aa \           # output all positions (ensures full-length consensus)
    -A \            # include anomalous read pairs
    -d 1000000 \    # raise depth cap to 1,000,000× (default 8,000× would truncate high-coverage data)
    -Q 0 \          # include all bases regardless of quality
    Align/sample.bam \
| ivar consensus \
    -p Consensus/sample \   # output prefix (.fa and .qual.txt)
    -t 0.6 \                # minimum allele frequency to call a base (60 %)
    -m 10 \                 # minimum depth; positions below this are masked with -n character
    -n N                    # character used to mask low-coverage positions

high-depth viral sequencing is that the -d parameter (maximum depth) in samtools mpileup should generally be the same—and set very high—for both ivar variants and ivar consensus.
If you use a lower depth for consensus (e.g., -d 20000) than for variants (e.g., -d 0), you risk a “silent bug” where a variant is statistically significant in your variant file but is absent or replaced by an “N” in your consensus because the consensus caller was denied access to the reads supporting that variant

--- title: "iVar" --- iVar is a computational tool designed for the analysis of viral sequencing data, particularly in the context of identifying and quantifying variants within viral populations. It is commonly used in the field of virology and genomics to process high-throughput sequencing data, allowing researchers to detect mutations, track viral evolution, and understand the dynamics of viral infections. - https://andersen-lab.github.io/ivar/html/ **iVar primer trimming** * it is essential to remove artefactual sequences that can bias variant calling and consensus generation. * reads that truly overlap the primer regions are trimmed, while reads that do not align to primer sites are left intact. * iVar uses the BED file to locate the primer sequences at the start and end of the aligned reads and replaces those bases with 'S' (soft-clip) in the BAM file. * The resulting BAM file contains reads with the primers soft-clipped, preventing the primer sequence itself from introducing false variants. **iVar Quality trimming** * iVar uses a sliding window approach (Default: 4) to do quality trimming. The windows slides from the 5' end to the 3' end and if at any point the average base quality in the window falls below the threshold (Default: 20), the remaining read is soft clipped. * If after trimming, the length of the read is greater than the minimum length specified (Default: 30), the read is written to the new trimmed BAM file. ```bash # Index the sorted BAM before trimming samtools index Align/sample.sorted.bam # Trim primers ivar trim \ -i Align/sample.sorted.bam \ # input sorted BAM -b primers.bed \ # primer coordinates BED file (generated in step 0.1) -e \ # include reads that extend beyond the primer region -p Align/sample_trimmed # output prefix (produces sample_trimmed.bam) ``` why include reads that extend beyond the primer region? Including reads that extend beyond the primer region allows iVar to trim only the portion of the read that overlaps the primer, while retaining the rest of the read sequence for downstream analysis. This is important because many reads may partially overlap the primer sites, and excluding them entirely would result in loss of valuable data. By allowing reads that extend beyond the primer region, we can maximize the amount of usable data while still effectively removing primer sequences that could bias variant calling. iVar uses a sliding window approach (Default: 4) to do quality trimming. The windows slides from the 5' end to the 3' end and if at any point the average base quality in the window falls below the threshold (Default: 20), the remaining read is soft clipped. If after trimming, the length of the read is greater than the minimum length specified (Default: 30), the read is written to the new trimmed BAM file. By using a sliding window rather than a single-base check, iVar avoids being "fooled" by a single bad base in the middle of a high-quality stretch. It ensures that trimming only happens when there is a consistent trend of poor data. Call nucleotide variants from the primer-trimmed BAM using samtools mpileup piped into iVar. ```bash samtools mpileup \ -aa \ # output ALL positions, including zero-coverage sites -B \ # disable BAQ (per-base alignment quality) computation -A \ # include anomalous read pairs, which can be informative for variant calling in amplicon data -d 20000 \ # max per-position read depth cap (20,000×), raised from default 8,000× to avoid truncating high-coverage amplicon data --reference KT992094.1.fasta \ # reference FASTA (required when using -aa) -Q 0 \ # include bases of any quality (iVar does its own filtering), to maximize data retained for variant calling Align/sample.bam \ | ivar variants \ -p iVAR_Variant/sample \ # output prefix (.tsv) -r KT992094.1.fasta \ # reference FASTA (for amino-acid annotation) -g KT992094.1.gff3 # GFF3 annotation (for codon/amino-acid annotation) ``` * samtools mpileup * samtools mpileup processes reads in the order they appear in your BAM file. Because mpileup requires a coordinate-sorted BAM file, it encounters reads based on their starting genomic position * As soon as it reaches the 20,000th read for that specific position, it stops accepting any further reads for that coordinate * Passing zero for this option sets it to the highest possible value, effectively removing the depth limit. [8000] * Why disable BAQ with `-B` for amplicon data? BAQ (per-base alignment quality) is designed to adjust base quality scores near indels to account for potential misalignments. However, in amplicon sequencing data, which often has very high coverage and may contain many reads that partially overlap primer sites, BAQ can be overly aggressive in down-weighting bases near indels, leading to loss of true variant signals. Disabling BAQ with `-B` allows iVar to apply its own variant calling filters without the influence of BAQ adjustments, which can improve sensitivity for detecting variants in amplicon data where indels may be common and coverage is high enough to support confident variant calls without BAQ's adjustments. * Why setup the read depth cap to 20,000× with `-d`? Setting the read depth cap to 20,000× allows for the retention of high-coverage data that is common in amplicon sequencing, while still preventing excessive memory usage and computational time during variant calling. The default depth cap of 8,000× in samtools mpileup may truncate data in high-coverage amplicon datasets, leading to loss of variant information. By raising the cap to 20,000×, we can ensure that we capture the full depth of coverage available for variant calling, which can improve sensitivity for detecting low-frequency variants. However, setting the cap too high (e.g., 1,000,000×) could lead to excessive computational overhead without providing additional benefit for variant detection, as the accuracy of variant frequency estimation may not improve significantly beyond a certain coverage level due to diminishing returns and increased noise at very high coverage levels. * Why not use higher read depth cap (e.g., 1,000,000×) for variant calling? While a higher read depth cap (e.g., 1,000,000×) could be used to ensure that no data is truncated, it may lead to excessive memory usage and computational time during variant calling, especially for high-coverage amplicon data. Setting a cap of 20,000× strikes a balance between retaining sufficient data for accurate variant calling while avoiding the computational overhead associated with processing extremely high coverage, which may not provide additional benefit for variant detection beyond a certain point due to diminishing returns in variant frequency estimation and increased noise at very high coverage levels. Generate a consensus FASTA sequence from the primer-trimmed BAM. * This step is optional but can be useful for downstream analyses that require a consensus sequence (e.g., lineage assignment, phylogenetics). * The consensus is generated using samtools mpileup piped into iVar consensus, which applies a minimum allele frequency threshold to call bases and masks low-coverage positions. * The parameters are set to retain as much data as possible while still producing a reliable consensus sequence. For example, the minimum allele frequency threshold is set to 60 % to call a base, which allows for some within-host diversity while still calling a clear consensus. The minimum depth threshold is set to 10×, and positions below this are masked with 'N' to indicate uncertainty. The quality 20 means that bases with a quality score below 20 will be ignored in the consensus calling, which helps to reduce the impact of sequencing errors on the consensus sequence. The -aa flag ensures that all positions are included in the output, even those with zero coverage, which is important for generating a full-length consensus sequence that can be used for downstream analyses. * The resulting consensus FASTA can be used for lineage assignment (e.g., with Pangolin) and phylogenetic analysis, while the accompanying quality file provides information on coverage and allele frequencies at each position. ```bash samtools mpileup \ -aa \ # output all positions (ensures full-length consensus) -A \ # include anomalous read pairs -d 1000000 \ # raise depth cap to 1,000,000× (default 8,000× would truncate high-coverage data) -Q 0 \ # include all bases regardless of quality Align/sample.bam \ | ivar consensus \ -p Consensus/sample \ # output prefix (.fa and .qual.txt) -t 0.6 \ # minimum allele frequency to call a base (60 %) -m 10 \ # minimum depth; positions below this are masked with -n character -n N # character used to mask low-coverage positions ``` * high-depth viral sequencing is that the -d parameter (maximum depth) in samtools mpileup should generally be the same—and set very high—for both ivar variants and ivar consensus. * If you use a lower depth for consensus (e.g., -d 20000) than for variants (e.g., -d 0), you risk a "silent bug" where a variant is statistically significant in your variant file but is absent or replaced by an "N" in your consensus because the consensus caller was denied access to the reads supporting that variant *