Methods for Differential Expression Analysis

RNA-seq · DESeq2 · EdgeR · EBSeq

Created: Monday, July 28, 2025
Modified: Sunday, March 15, 2026

DESeq2

  • Normalization method: median of ratios
  • Model: Generalized Linear Model (GLM) with negative binomial distribution
  • Statistical test: Wald test or Likelihood Ratio Test (LRT)
  • Multiple testing correction: Benjamini-Hochberg (BH) method to control the false discovery rate (FDR)
  • Output: log2 fold change, p-value, adjusted p-value
  • Application: datasets with small to moderate number of replicates (e.g., 3-5 replicates per condition); robust to outliers and low counts; can handle complex experimental designs

Running Example Dataset

Experimental design: 2 groups × 3 biological replicates = 6 samples total, 5 genes.

  • sampleA (control): A1, A2, A3
  • sampleB (treatment): B1, B2, B3
Gene A1 A2 A3 B1 B2 B3 Expected
ACTB 95 100 105 140 150 160 non-DE (housekeeping)
GAPDH 190 200 210 280 300 320 non-DE (housekeeping)
EEF2 76 80 84 112 120 128 non-DE (housekeeping)
MYC 48 50 52 280 300 320 4× upregulated in B
TP53 152 160 168 56 60 64 4× downregulated in B
Note

The B group has ~1.5–1.6× more total reads than the A group (e.g., B1 = 868 vs A1 = 561) — a sequencing depth difference, not a biological difference. Normalization must correct for this before comparing expression levels.

Normalization

Goal: Account for differences in sequencing depth and RNA composition between samples so that counts are comparable. DESeq2 uses the median of ratios method.

Step 1 — Geometric mean per gene (pseudo-reference)

For each gene, compute the geometric mean across all 6 samples. This serves as the reference baseline.

\[\text{geo\_mean}_i = \left(\prod_{j=1}^{6} y_{ij}\right)^{1/6} = \exp\!\left(\frac{1}{6}\sum_{j=1}^{6} \ln y_{ij}\right)\]

where \(i\) = gene index, \(j\) = sample index (\(j = 1,\ldots,6\)), \(y_{ij}\) = raw count of gene \(i\) in sample \(j\), \(n = 6\) = total number of samples.

Gene Calculation Geo Mean
ACTB \((95 \times 100 \times 105 \times 140 \times 150 \times 160)^{1/6}\) 122
GAPDH \((190 \times 200 \times 210 \times 280 \times 300 \times 320)^{1/6}\) 244
EEF2 \((76 \times 80 \times 84 \times 112 \times 120 \times 128)^{1/6}\) 98
MYC \((48 \times 50 \times 52 \times 280 \times 300 \times 320)^{1/6}\) 122
TP53 \((152 \times 160 \times 168 \times 56 \times 60 \times 64)^{1/6}\) 98
Note

MYC and ACTB share the same geometric mean (~122) even though MYC is DE. This is because the geometric mean averages across both groups, reflecting the overall pool of counts — it is not an estimate of true expression.
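As a quick standalone check (plain Python, outside the R workflow), the pseudo-reference values can be recomputed from the raw counts; small differences from the rounded table values are expected:

```python
import math

counts = {
    "ACTB":  [95, 100, 105, 140, 150, 160],
    "GAPDH": [190, 200, 210, 280, 300, 320],
    "EEF2":  [76, 80, 84, 112, 120, 128],
    "MYC":   [48, 50, 52, 280, 300, 320],
    "TP53":  [152, 160, 168, 56, 60, 64],
}

# exp(mean(ln y)) — the numerically stable form of (prod y)^(1/n)
geo_mean = {g: math.exp(sum(math.log(y) for y in ys) / len(ys))
            for g, ys in counts.items()}

for g, gm in geo_mean.items():
    print(f"{g}: {gm:.1f}")
```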

Step 2 — Ratio of each sample count to the gene’s geometric mean

For every sample \(j\) and gene \(i\): \(r_{ij} = y_{ij} / \text{geo\_mean}_i\)

where \(r_{ij}\) = ratio of the raw count to the pseudo-reference, \(y_{ij}\) = raw count of gene \(i\) in sample \(j\), \(\text{geo\_mean}_i\) = geometric mean of gene \(i\) across all samples.

Gene Geo Mean A1 A2 A3 B1 B2 B3
ACTB 122 0.78 0.82 0.86 1.15 1.23 1.31
GAPDH 244 0.78 0.82 0.86 1.15 1.23 1.31
EEF2 98 0.78 0.82 0.86 1.14 1.22 1.31
MYC 122 0.39 0.41 0.43 2.30 2.46 2.62
TP53 98 1.55 1.63 1.71 0.57 0.61 0.65

Notice: ACTB, GAPDH, EEF2 all have near-identical ratios within each sample (they track the sequencing depth). MYC and TP53 deviate because they are truly differentially expressed.

Step 3 — Size factor = median of ratios per sample

The median is taken column-wise across all 5 genes for each sample. Because the DE genes (MYC, TP53) go in opposite directions, their extreme ratios cancel out and the median reflects only sequencing depth.

# Column-wise median of ratios
size_factor_A1 <- median(c(0.78, 0.78, 0.78, 0.39, 1.55))  # = 0.78
size_factor_A2 <- median(c(0.82, 0.82, 0.82, 0.41, 1.63))  # = 0.82
size_factor_A3 <- median(c(0.86, 0.86, 0.86, 0.43, 1.71))  # = 0.86
size_factor_B1 <- median(c(1.15, 1.15, 1.14, 2.30, 0.57))  # = 1.15
size_factor_B2 <- median(c(1.23, 1.23, 1.22, 2.46, 0.61))  # = 1.23
size_factor_B3 <- median(c(1.31, 1.31, 1.31, 2.62, 0.65))  # = 1.31
Sample A1 A2 A3 B1 B2 B3
Size factor 0.78 0.82 0.86 1.15 1.23 1.31

The A samples have size factors < 1 (lower depth) and B samples > 1 (higher depth), exactly reflecting the sequencing depth difference.

Step 4 — Normalized counts = raw count ÷ size factor

Gene A1 A2 A3 B1 B2 B3 Group A mean Group B mean
ACTB 122 122 122 122 122 122 122 122
GAPDH 244 244 244 243 244 244 244 244
EEF2 97 98 98 97 98 98 98 98
MYC 62 61 60 243 244 244 61 244
TP53 195 195 195 49 49 49 195 49

After normalization: ACTB, GAPDH, and EEF2 have identical levels across both groups (as expected). MYC shows ~4× higher expression in B, and TP53 shows ~4× lower expression in B.
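Steps 1–4 can be chained into a short plain-Python sketch (illustrative only — DESeq2's estimateSizeFactors does this internally on the full count matrix):

```python
import math
from statistics import median

counts = {
    "ACTB":  [95, 100, 105, 140, 150, 160],
    "GAPDH": [190, 200, 210, 280, 300, 320],
    "EEF2":  [76, 80, 84, 112, 120, 128],
    "MYC":   [48, 50, 52, 280, 300, 320],
    "TP53":  [152, 160, 168, 56, 60, 64],
}
samples = ["A1", "A2", "A3", "B1", "B2", "B3"]

# Step 1: geometric mean per gene (pseudo-reference)
geo = {g: math.exp(sum(map(math.log, ys)) / len(ys)) for g, ys in counts.items()}

# Steps 2-3: size factor = per-sample median of count/pseudo-reference ratios
size_factor = {s: median(counts[g][j] / geo[g] for g in counts)
               for j, s in enumerate(samples)}

# Step 4: normalized counts = raw count / size factor
normalized = {g: [y / size_factor[s] for y, s in zip(ys, samples)]
              for g, ys in counts.items()}
```

Because this uses unrounded geometric means, the size factors differ from the hand-rounded table values by less than 0.01.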

Important

DESeq2 does not actually use normalized counts as input to the GLM. Instead, it passes the raw counts plus the size factors as an offset in the model. The normalized counts shown here are useful for visualization only.


Dispersion estimation

Goal: Estimate how much the counts for each gene vary across replicates beyond what Poisson noise alone would predict. This is the dispersion parameter \(\alpha\) in the Negative Binomial model:

\[\text{Var}(Y_{ij}) = \mu_{ij} + \alpha_i\,\mu_{ij}^2\]

where \(Y_{ij}\) = observed count for gene \(i\) in sample \(j\), \(\mu_{ij}\) = expected (mean) count, \(\alpha_i\) = gene-specific dispersion parameter (always \(\geq 0\)).

When \(\alpha_i = 0\), this reduces to Poisson. A larger \(\alpha\) means more variability between replicates.

Why not just use Poisson?

RNA-seq counts are overdispersed: biological variation between replicates makes the variance larger than the mean. Ignoring this (using Poisson) leads to anti-conservative p-values and many false positives. The NB model explicitly accounts for this extra variability.
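To make the overdispersion concrete (illustrative numbers, not output from a real fit):

```python
mu = 100.0       # mean count
alpha = 0.05     # NB dispersion

poisson_var = mu                  # Poisson assumes variance = mean
nb_var = mu + alpha * mu ** 2     # NB adds the alpha * mu^2 term

print(poisson_var, nb_var)        # 100.0 vs 600.0 — 6x more variance
```

Even a modest dispersion of 0.05 makes the variance at μ = 100 six times the Poisson value, which is why Poisson-based tests are badly anti-conservative for RNA-seq.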

The three-step shrinkage procedure:

DESeq2 does not estimate dispersion independently per gene (unreliable with only 3 replicates). Instead it borrows information across all genes:

  1. Gene-wise estimation: compute a raw dispersion estimate for each gene using the method of moments or maximum likelihood.
  2. Trend fitting: fit a smooth curve of dispersion versus mean expression across all genes to capture the overall relationship.
  3. Shrinkage: shrink each gene’s raw dispersion estimate toward the fitted trend value, with the amount of shrinkage depending on the gene’s mean expression and the variability of the raw estimates.

The final (shrunken) dispersion estimates for our example genes:

Gene Mean expression Estimated \(\alpha\) Interpretation
ACTB 122 0.00 Perfectly uniform replicates (ideal)
GAPDH 244 0.00 Perfectly uniform replicates (ideal)
EEF2 98 0.00 Perfectly uniform replicates (ideal)
MYC 61 (A) / 244 (B) 0.04 Low dispersion, high counts
TP53 195 (A) / 49 (B) 0.05 Slightly higher dispersion, lower counts in B
Note

In real data, even housekeeping genes have some dispersion (e.g., α ≈ 0.01–0.05). The perfectly uniform values above are for illustration only.


GLM fitting

Goal: For each gene, fit a Negative Binomial Generalized Linear Model to estimate the log2 fold change (\(\hat\beta_1\)) between conditions.

The model for one gene, one sample:

\[\log(\mu_j) = \beta_0 + \beta_1 x_j + \log(s_j)\]

Term Meaning
\(\mu_j\) Expected raw count in sample \(j\)
\(\beta_0\) Log mean expression in Group A (intercept)
\(\beta_1\) Log fold change (B vs A) — the key parameter
\(x_j\) Group indicator: 0 = sampleA, 1 = sampleB
\(s_j\) Size factor (offset — normalization built into the model)

The term \(\log(s_j)\) is a fixed offset, so the model estimates the biological differences only.

Estimating \(\hat\beta_1\) for MYC (via MLE)

From the MLE score equations, the fitted group means satisfy:

\[e^{\hat\beta_0} = \frac{\text{sum of raw counts}_A}{\text{sum of size factors}_A} = \frac{48+50+52}{0.78+0.82+0.86} = \frac{150}{2.46} = 61.0\]

\[e^{\hat\beta_0+\hat\beta_1} = \frac{\text{sum of raw counts}_B}{\text{sum of size factors}_B} = \frac{280+300+320}{1.15+1.23+1.31} = \frac{900}{3.69} = 243.9\]

\[\hat\beta_1 = \ln\!\left(\frac{243.9}{61.0}\right) = \ln(4.0) = 1.386 \text{ (natural log)}\]

\[\boxed{\text{log2FC} = \frac{\hat\beta_1}{\ln 2} = \frac{1.386}{0.693} = 2.0}\]

A log2FC of 2.0 means MYC is \(2^2 = 4\times\) higher in sampleB. ✓

Estimating \(\hat\beta_1\) for TP53

\[e^{\hat\beta_0} = \frac{152+160+168}{0.78+0.82+0.86} = \frac{480}{2.46} = 195.1\]

\[e^{\hat\beta_0+\hat\beta_1} = \frac{56+60+64}{1.15+1.23+1.31} = \frac{180}{3.69} = 48.8\]

\[\hat\beta_1 = \ln\!\left(\frac{48.8}{195.1}\right) = -1.385 \qquad \text{log2FC} = -2.0\]

A log2FC of −2.0 means TP53 is \(2^2 = 4\times\) lower in sampleB. ✓
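Both fold-change estimates follow directly from the group sums and size factors. A numeric check in plain Python (size factors taken from the normalization step above):

```python
import math

sf   = [0.78, 0.82, 0.86, 1.15, 1.23, 1.31]   # size factors A1..B3
myc  = [48, 50, 52, 280, 300, 320]
tp53 = [152, 160, 168, 56, 60, 64]

def log2fc(counts, sf):
    """log2 fold change (B vs A) from the MLE group-mean equations."""
    rate_a = sum(counts[:3]) / sum(sf[:3])    # e^beta0
    rate_b = sum(counts[3:]) / sum(sf[3:])    # e^(beta0 + beta1)
    return math.log2(rate_b / rate_a)

print(log2fc(myc, sf))    # ≈ 2.0
print(log2fc(tp53, sf))   # ≈ -2.0
```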

Computing SE(\(\hat\beta_1\)) via Fisher Information

The variance of \(\hat\beta_1\) comes from the weighted Fisher information matrix \((\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\), where the NB weight for each sample is:

\[w_j = \frac{\mu_j}{1 + \alpha\,\mu_j}\]

where \(w_j\) = NB weight for sample \(j\) in the Fisher information matrix, \(\mu_j\) = fitted count for sample \(j\) under the estimated model, \(\alpha\) = gene-specific dispersion parameter.

For MYC (\(\alpha = 0.04\)), fitted counts \(\mu_j = s_j \times e^{\hat\beta_0 + \hat\beta_1 x_j}\):

Sample Group \(\mu_j\) \(w_j = \mu_j/(1+0.04\mu_j)\)
A1 A 0.78 × 61 = 48 48/(1+1.92) = 16.4
A2 A 0.82 × 61 = 50 50/(1+2.00) = 16.6
A3 A 0.86 × 61 = 52 52/(1+2.08) = 16.9
B1 B 1.15 × 244 = 281 281/(1+11.24) = 22.9
B2 B 1.23 × 244 = 300 300/(1+12.00) = 23.1
B3 B 1.31 × 244 = 320 320/(1+12.80) = 23.2
\(\Sigma w\) = 119.1 \(\Sigma_{B} w\) = 69.2

\[\text{Var}(\hat\beta_1) = \frac{\Sigma w}{\Sigma w \cdot \Sigma_B w - (\Sigma_B w)^2} = \frac{119.1}{119.1 \times 69.2 - 69.2^2} = \frac{119.1}{69.2 \times 49.9} = 0.0345\]

where \(\Sigma w\) = sum of weights across all 6 samples, \(\Sigma_B w\) = sum of weights for Group B samples only (\(\Sigma_A w = \Sigma w - \Sigma_B w = 49.9\)).

\[\text{SE}(\hat\beta_1) = \sqrt{\text{Var}(\hat\beta_1)} = \sqrt{0.0345} = 0.186\]
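The weight table and SE can be reproduced with the 2×2 Fisher-information algebra for this one-factor design (Python sketch; fitted group means taken from the MLE step above):

```python
import math

alpha = 0.04                                  # MYC dispersion
sf    = [0.78, 0.82, 0.86, 1.15, 1.23, 1.31]  # size factors A1..B3
group = [0, 0, 0, 1, 1, 1]                    # 0 = A, 1 = B
rate_a, rate_b = 61.0, 243.9                  # fitted group means

# Fitted counts mu_j and NB weights w = mu / (1 + alpha * mu)
mu = [s * (rate_b if g else rate_a) for s, g in zip(sf, group)]
w  = [m / (1 + alpha * m) for m in mu]

sum_w  = sum(w)
sum_wb = sum(wi for wi, g in zip(w, group) if g)

# Var(beta1) = sum_w / (sum_wB * sum_wA)
var_b1 = sum_w / (sum_wb * (sum_w - sum_wb))
se_b1  = math.sqrt(var_b1)
print(se_b1)   # ≈ 0.186
```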


Statistical testing

Goal: Test whether \(\hat\beta_1 \neq 0\) for each gene — i.e., is the fold change statistically significant?

Wald Test (default)

\[W = \frac{\hat\beta_1}{\text{SE}(\hat\beta_1)} \sim \mathcal{N}(0,1) \text{ under } H_0\]

\[p = 2 \times \Phi(-|W|)\]

where \(W\) = Wald statistic, \(\hat\beta_1\) = estimated log fold change (natural log scale), \(\text{SE}(\hat\beta_1)\) = standard error of the estimate, \(\mathcal{N}(0,1)\) = standard normal distribution, \(H_0: \beta_1 = 0\) (no difference between groups), \(\Phi(\cdot)\) = standard normal CDF, \(p\) = two-sided p-value.

For MYC:

\[W_{MYC} = \frac{1.386}{0.186} = 7.45\]

\[p_{MYC} = 2 \times \Phi(-7.45) \approx 9.4 \times 10^{-14}\]

For TP53 (with SE = 0.202 from similar calculation with \(\alpha=0.05\)):

\[W_{TP53} = \frac{-1.385}{0.202} = -6.85\]

\[p_{TP53} = 2 \times \Phi(-6.85) \approx 7.5 \times 10^{-12}\]

For non-DE genes (ACTB, GAPDH, EEF2): \(\hat\beta_1 \approx 0\), so \(W \approx 0\) and \(p \approx 1\).
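The Wald statistics and two-sided p-values can be checked with the normal tail via math.erfc (standalone Python; tiny tail probabilities are approximate):

```python
import math

def wald_test(beta1, se):
    """Wald statistic and two-sided p-value under N(0,1)."""
    w = beta1 / se
    p = math.erfc(abs(w) / math.sqrt(2))   # = 2 * Phi(-|w|)
    return w, p

w_myc, p_myc   = wald_test(1.386, 0.186)
w_tp53, p_tp53 = wald_test(-1.385, 0.202)
print(w_myc, p_myc)     # ≈ 7.45, ~9e-14
print(w_tp53, p_tp53)   # ≈ -6.86, ~7e-12
```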

Results (before multiple testing correction):

Gene \(\hat\beta_1\) log2FC SE Wald stat p-value
MYC 1.386 2.00 0.186 7.45 \(9.4 \times 10^{-14}\)
TP53 −1.385 −2.00 0.202 −6.85 \(7.5 \times 10^{-12}\)
ACTB ≈0 0.00 — ≈0 ≈0.90
GAPDH ≈0 0.00 — ≈0 ≈0.95
EEF2 ≈0 0.00 — ≈0 ≈0.99
Note

When to use LRT instead? The Wald test is used for pairwise comparisons (A vs B). The Likelihood Ratio Test (LRT) is better for complex designs, e.g., testing whether a factor explains variance across ≥3 groups, or for comparing nested models (full vs reduced design formula).


Multiple testing correction

Problem: When testing thousands of genes simultaneously, even a threshold of \(p < 0.05\) produces many false positives by chance. With 20,000 genes tested, ~1,000 would appear significant purely by chance.

Solution: Benjamini-Hochberg (BH) procedure controls the False Discovery Rate (FDR) — the expected proportion of significant results that are false positives.

BH procedure (illustrated with our 5 genes):

Step 1 — Sort genes by p-value (ascending):

Rank \(k\) Gene p-value BH threshold \(\frac{k}{m} \times \text{FDR}\) Significant?
1 MYC \(9.4 \times 10^{-14}\) \(\frac{1}{5} \times 0.05 = 0.010\) \(p < 0.010\) → Yes
2 TP53 \(7.5 \times 10^{-12}\) \(\frac{2}{5} \times 0.05 = 0.020\) \(p < 0.020\) → Yes
3 ACTB 0.90 \(\frac{3}{5} \times 0.05 = 0.030\) \(p > 0.030\) → No
4 GAPDH 0.95 \(\frac{4}{5} \times 0.05 = 0.040\) \(p > 0.040\) → No
5 EEF2 0.99 \(\frac{5}{5} \times 0.05 = 0.050\) \(p > 0.050\) → No

Step 2 — Compute adjusted p-values (padj):

\[\text{padj}_k = \min_{j \geq k}\,\min\!\left(1,\; \frac{m}{j} \times p_j\right)\]

where \(k\) = rank of the gene when all genes are sorted by p-value ascending (1 = smallest p-value), \(m\) = total number of genes tested, \(p_j\) = raw p-value of the gene at rank \(j\). Taking the minimum over all ranks \(j \geq k\) (working upward from the largest p-value) enforces that adjusted p-values are non-decreasing in \(k\).

Gene p-value padj DEG?
MYC \(9.4 \times 10^{-14}\) \(\mathbf{4.7 \times 10^{-13}}\) Yes
TP53 \(7.5 \times 10^{-12}\) \(\mathbf{1.9 \times 10^{-11}}\) Yes
ACTB 0.90 0.99 No
GAPDH 0.95 0.99 No
EEF2 0.99 0.99 No
Tip

In practice with thousands of genes, many non-DE genes will have padj between 0 and 1. Genes are typically called DEGs when padj < 0.05 AND |log2FC| ≥ 1 (2-fold change).
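The BH step-up adjustment takes only a few lines (plain Python; p.adjust(p, method = "BH") in R does the same):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adj[i] = running_min
    return adj

# MYC, TP53, ACTB, GAPDH, EEF2
p = [9.4e-14, 7.5e-12, 0.90, 0.95, 0.99]
print(bh_adjust(p))
```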

Final DESeq2 output table (results()):

Gene baseMean log2FC lfcSE stat pvalue padj
MYC 152.5 2.00 0.186 7.45 \(9.4 \times 10^{-14}\) \(4.7 \times 10^{-13}\)
TP53 122.0 −2.00 0.202 −6.85 \(7.5 \times 10^{-12}\) \(1.9 \times 10^{-11}\)
ACTB 122.0 0.00 — ≈0 0.90 0.99
GAPDH 244.0 0.00 — ≈0 0.95 0.99
EEF2 98.0 0.00 — ≈0 0.99 0.99

baseMean is the mean of normalized counts across all six samples, so MYC = \((3 \times 61 + 3 \times 244)/6 = 152.5\) and TP53 = \((3 \times 195 + 3 \times 49)/6 = 122\).

Command

library(DESeq2)

# ── 1. Build DESeqDataSet ────────────────────────────────────────────────────
# count_matrix : genes × samples matrix of raw integer counts
# metadata     : data.frame with rows = samples, columns = covariates
dds <- DESeqDataSetFromMatrix(
    countData = count_matrix,
    colData   = metadata,
    design    = ~ condition      # 'condition' column contains "sampleA"/"sampleB"
)

# ── 2. Pre-filtering: drop genes with sum of counts < 10 ────────────────────
dds <- dds[rowSums(counts(dds)) >= 10, ]

# ── 3. Set reference level (control group first) ────────────────────────────
dds$condition <- relevel(dds$condition, ref = "sampleA")

# ── 4. Run full DESeq2 pipeline in one call ─────────────────────────────────
#   Internally: estimateSizeFactors → estimateDispersions → nbinomWaldTest
dds <- DESeq(dds)

# ── 5. QC: inspect size factors and dispersion ──────────────────────────────
sizeFactors(dds)
plotDispEsts(dds)

# ── 6. Extract results (sampleB vs sampleA) ─────────────────────────────────
res <- results(
    dds,
    contrast = c("condition", "sampleB", "sampleA"),
    alpha    = 0.05          # used for independent filtering threshold
)
summary(res)
# Columns: baseMean | log2FoldChange | lfcSE | stat | pvalue | padj

# ── 7. LFC shrinkage (recommended for visualisation and ranking) ─────────────
#   apeglm shrinkage reduces noise for low-count genes
res_shrunk <- lfcShrink(
    dds,
    coef = "condition_sampleB_vs_sampleA",
    type = "apeglm"
)

# ── 8. Filter significant DEGs ───────────────────────────────────────────────
degs <- subset(res_shrunk, padj < 0.05 & abs(log2FoldChange) >= 1)
degs <- degs[order(degs$padj), ]
nrow(degs)       # number of DEGs

# ── 9. Exploratory plots ─────────────────────────────────────────────────────
# Variance-stabilising transformation for PCA / heatmap
vsd <- vst(dds, blind = TRUE)
plotPCA(vsd, intgroup = "condition")

# MA plot: log2FC vs mean expression; red = significant
plotMA(res_shrunk, ylim = c(-5, 5))

# ── 10. Save results ─────────────────────────────────────────────────────────
write.csv(as.data.frame(res_shrunk), "DESeq2_results.csv", row.names = TRUE)
write.csv(as.data.frame(degs),       "DESeq2_DEGs.csv",     row.names = TRUE)

EdgeR

  • Normalization method: Trimmed Mean of M-values (TMM)
  • Model: Negative Binomial GLM or Exact Test
  • Dispersion: common → trended → tagwise (empirical Bayes shrinkage)
  • Statistical test: Quasi-likelihood F-test (glmQLFTest) or Likelihood Ratio Test (glmLRT)
  • Multiple testing correction: Benjamini-Hochberg (BH)
  • Output: log2 fold change, p-value, FDR
  • Application: best for small sample sizes (n ≥ 2); supports complex multi-factor designs; slightly more powerful than DESeq2 when dispersion is low

Running Example Dataset

Same 5-gene × 6-sample dataset used for DESeq2 (see above).

Gene A1 A2 A3 B1 B2 B3 Expected
ACTB 95 100 105 140 150 160 non-DE
GAPDH 190 200 210 280 300 320 non-DE
EEF2 76 80 84 112 120 128 non-DE
MYC 48 50 52 280 300 320 4× up in B
TP53 152 160 168 56 60 64 4× down in B

Normalization

Goal: Correct for sequencing depth and RNA composition. EdgeR uses TMM (Trimmed Mean of M-values), which corrects for the compositional bias that occurs when a few highly expressed genes dominate the library.

Key idea: Two summary statistics are computed for each gene in each sample (relative to a reference sample):

\[M_g = \log_2\!\left(\frac{y_{gk}/N_k}{y_{gR}/N_R}\right), \qquad A_g = \frac{1}{2}\log_2\!\left(\frac{y_{gk}}{N_k} \cdot \frac{y_{gR}}{N_R}\right)\]

where \(g\) = gene index, \(k\) = sample being normalized, \(R\) = reference sample (chosen as the sample whose 75th percentile count is closest to the mean 75th percentile), \(y_{gk}\) = raw count of gene \(g\) in sample \(k\), \(y_{gR}\) = raw count of gene \(g\) in the reference sample, \(N_k\) = total library size of sample \(k\), \(N_R\) = total library size of reference sample, \(M_g\) = log-ratio (M-value) of expression between sample \(k\) and reference, \(A_g\) = average log-expression (A-value).

TMM trims the top/bottom 30% of M-values and top/bottom 5% of A-values, then computes a weighted mean of the remaining M-values:

\[\log_2(\text{TMM}_k) = \frac{\sum_{g \notin \text{trimmed}} w_g^* M_g}{\sum_{g \notin \text{trimmed}} w_g^*}\]

where \(\text{TMM}_k\) = TMM scaling factor for sample \(k\), and \(w_g^*\) = precision weight for gene \(g\), the inverse of the approximate (delta-method) variance of \(M_g\):

\[w_g^* = \left(\frac{N_k - y_{gk}}{N_k \cdot y_{gk}} + \frac{N_R - y_{gR}}{N_R \cdot y_{gR}}\right)^{-1}\]

Effective library sizes then replace raw library sizes in all downstream calculations:

\[\tilde{N}_k = N_k \times \text{TMM}_k\]

where \(\tilde{N}_k\) = effective library size for sample \(k\), used in place of raw \(N_k\) when computing CPM or fitting the model.

For our example, the composition shift is mild (MYC and TP53 move in opposite directions), so the TMM factors stay close to 1 — sequencing depth is carried by the raw library sizes \(N_k\), and the factor corrects only the residual compositional bias (the DE genes take a slightly larger share of the B libraries):

Sample Raw library size \(N_k\) TMM factor Effective size \(\tilde{N}_k\)
A1 561 ≈1.02 ≈575
A2 590 ≈1.03 ≈605
A3 619 ≈1.03 ≈635
B1 868 ≈0.98 ≈847
B2 930 ≈0.98 ≈907
B3 992 ≈0.98 ≈968
Note

TMM vs median-of-ratios (DESeq2): Both methods account for sequencing depth and RNA composition. TMM trims extreme fold-change genes before computing the normalization factor; DESeq2 uses the median across all genes. In practice they produce very similar results. TMM is slightly more aggressive at removing outlier genes.
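A stripped-down TMM sketch in Python — unweighted, M-trimming only, computed pairwise against A2 as the reference, and without the final centering edgeR applies (calcNormFactors does the full weighted version):

```python
import math

counts = {
    "ACTB":  [95, 100, 105, 140, 150, 160],
    "GAPDH": [190, 200, 210, 280, 300, 320],
    "EEF2":  [76, 80, 84, 112, 120, 128],
    "MYC":   [48, 50, 52, 280, 300, 320],
    "TP53":  [152, 160, 168, 56, 60, 64],
}
lib = [sum(col) for col in zip(*counts.values())]   # raw library sizes

def tmm_factor(k, ref, trim=0.3):
    """Scaling factor for sample k vs ref via a trimmed mean of M-values."""
    m_values = sorted(math.log2((c[k] / lib[k]) / (c[ref] / lib[ref]))
                      for c in counts.values())
    cut = int(len(m_values) * trim)                 # drop extreme M-values
    kept = m_values[cut:len(m_values) - cut]
    return 2 ** (sum(kept) / len(kept))
```

With 5 genes, trimming 30% from each end removes exactly MYC and TP53, so the factor is driven by the housekeeping genes — the point of TMM.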


Dispersion estimation

EdgeR estimates dispersion in three steps, borrowing information across genes:

Step 1 — Common dispersion: A single dispersion value \(\phi_c\) is estimated for all genes jointly using Cox-Reid adjusted profile likelihood. This gives a stable starting estimate:

\[\phi_c = \arg\max_\phi \; \text{CRAPL}(\phi \mid \text{all genes})\]

where \(\phi_c\) = common dispersion (one value for all genes), \(\text{CRAPL}\) = Cox-Reid adjusted profile log-likelihood (accounts for the degrees of freedom used in estimating the mean).

Step 2 — Trended dispersion: A smooth curve is fit relating dispersion to mean expression (higher-expressed genes tend to have lower dispersion):

\[\log\hat\phi_g^{\text{trend}} = f(\log\bar\mu_g)\]

where \(\hat\phi_g^{\text{trend}}\) = trended dispersion estimate for gene \(g\), \(\bar\mu_g\) = mean normalized count of gene \(g\) across samples, \(f(\cdot)\) = locally weighted regression (loess) smooth fit.

Step 3 — Tagwise dispersion: Each gene’s dispersion is shrunk toward the trended value using empirical Bayes, weighted by the number of replicates and the variability of raw estimates:

\[\hat\phi_g^{\text{tag}} = \lambda_g \hat\phi_g^{\text{raw}} + (1-\lambda_g)\hat\phi_g^{\text{trend}}\]

where \(\hat\phi_g^{\text{tag}}\) = final tagwise (per-gene) dispersion, \(\hat\phi_g^{\text{raw}}\) = gene-wise dispersion from its own data only, \(\lambda_g \in [0,1]\) = shrinkage weight (smaller with fewer replicates → more shrinkage toward trend), \(\hat\phi_g^{\text{trend}}\) = trended dispersion at gene \(g\)’s mean expression.
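The shrinkage step is just a precision-weighted compromise between the two estimates (toy Python illustration; the λ values here are invented for demonstration, not edgeR output):

```python
def tagwise(phi_raw, phi_trend, lam):
    """Shrink a gene-wise dispersion toward the trend; lam in [0, 1]."""
    return lam * phi_raw + (1 - lam) * phi_trend

# Few replicates -> small lam -> estimate pulled strongly toward the trend
print(tagwise(phi_raw=0.090, phi_trend=0.030, lam=0.2))   # 0.042
# Many replicates -> large lam -> gene's own estimate dominates
print(tagwise(phi_raw=0.090, phi_trend=0.030, lam=0.8))   # 0.078
```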

For our example genes:

Gene Common \(\phi_c\) Trended \(\hat\phi^{\text{trend}}\) Tagwise \(\hat\phi^{\text{tag}}\)
ACTB 0.04 0.030 0.030
GAPDH 0.04 0.025 0.025
EEF2 0.04 0.032 0.032
MYC 0.04 0.030 0.040
TP53 0.04 0.032 0.048

GLM fitting

EdgeR fits the same Negative Binomial GLM as DESeq2, but uses effective library sizes instead of size factors:

\[\log(\mu_{gj}) = \beta_{g0} + \beta_{g1} x_j + \log(\tilde{N}_j)\]

where \(\mu_{gj}\) = expected count for gene \(g\) in sample \(j\), \(\beta_{g0}\) = log baseline expression in Group A, \(\beta_{g1}\) = log fold change (B vs A) for gene \(g\), \(x_j\) = group indicator (0 = A, 1 = B), \(\tilde{N}_j\) = effective library size (TMM-adjusted), used as an offset.

The \(\hat\beta_{g1}\) estimate and its SE follow the same MLE derivation as shown for DESeq2. For MYC using effective library sizes, the result is again \(\text{log2FC} \approx 2.0\).

Note

EdgeR offers two GLM approaches:

  • glmQLFit + glmQLFTest (recommended): Estimates a quasi-likelihood dispersion at the gene level using a squeeze toward a global value. More robust for small sample sizes — uses an F-test rather than a chi-squared test, giving better control of type I error.
  • glmFit + glmLRT: Classic likelihood ratio test. Slightly more powerful when the NB model assumptions hold perfectly.

Statistical testing

Quasi-likelihood F-test (recommended)

\[F_g = \frac{(D_g^{\text{full}} - D_g^{\text{reduced}}) / df_1}{s_g^2} \sim F(df_1, df_2)\]

where \(F_g\) = quasi-likelihood F-statistic for gene \(g\), \(D_g^{\text{full}}\) = deviance under the full model (with group effect), \(D_g^{\text{reduced}}\) = deviance under the reduced model (intercept only), \(df_1\) = difference in model parameters (= 1 for a two-group comparison), \(s_g^2\) = squeezed quasi-likelihood dispersion for gene \(g\), \(df_2\) = residual degrees of freedom after squeezing.

For our example (sampleB vs sampleA), the F-statistic for MYC gives a p-value comparable to DESeq2’s Wald test.

Results (before multiple testing correction):

Gene log2FC logFC (natural) F-stat p-value Expected
MYC 2.00 1.386 52.1 \(1.1 \times 10^{-12}\) ✅ DE
TP53 −2.00 −1.385 44.7 \(8.2 \times 10^{-11}\) ✅ DE
ACTB 0.00 ≈0 ≈0 ≈0.92 ❌ non-DE
GAPDH 0.00 ≈0 ≈0 ≈0.96 ❌ non-DE
EEF2 0.00 ≈0 ≈0 ≈0.98 ❌ non-DE

Final EdgeR output table (topTags()):

Gene logFC logCPM F PValue FDR
MYC 2.00 7.4 52.1 \(1.1 \times 10^{-12}\) \(5.5 \times 10^{-12}\)
TP53 −2.00 7.0 44.7 \(8.2 \times 10^{-11}\) \(2.1 \times 10^{-10}\)
ACTB 0.00 6.9 ≈0 0.92 1.00
GAPDH 0.00 7.9 ≈0 0.96 1.00
EEF2 0.00 6.6 ≈0 0.98 1.00

where logFC = log2 fold change, logCPM = average log2 counts per million (measure of expression level), F = quasi-likelihood F-statistic, PValue = raw p-value, FDR = Benjamini-Hochberg adjusted p-value.


Command

library(edgeR)

# ── 1. Create DGEList object ─────────────────────────────────────────────────
# counts   : raw count matrix (genes × samples)
# group    : factor vector of sample groups e.g. c("A","A","A","B","B","B")
dge <- DGEList(counts = count_matrix, group = group)

# ── 2. Filter lowly expressed genes ─────────────────────────────────────────
#   filterByExpr keeps genes with sufficient counts (roughly >= 10) in at
#   least as many samples as the smallest group
keep <- filterByExpr(dge)
dge  <- dge[keep, , keep.lib.sizes = FALSE]

# ── 3. Normalise with TMM ────────────────────────────────────────────────────
dge <- calcNormFactors(dge, method = "TMM")
dge$samples   # inspect: norm.factors should be close to 1

# ── 4. Define design matrix ──────────────────────────────────────────────────
design <- model.matrix(~ 0 + group)            # no-intercept (one coef per group)
colnames(design) <- levels(group)              # e.g. "A" and "B"
contrast_mat <- makeContrasts(B - A, levels = design)

# ── 5. Estimate dispersion ───────────────────────────────────────────────────
dge <- estimateDisp(dge, design)               # common → trended → tagwise
plotBCV(dge)                                   # biological coefficient of variation

# ── 6. Fit QL GLM ────────────────────────────────────────────────────────────
fit <- glmQLFit(dge, design)
plotQLDisp(fit)

# ── 7. Test for DE (QL F-test) ───────────────────────────────────────────────
qlf <- glmQLFTest(fit, contrast = contrast_mat)
summary(decideTests(qlf))

# ── 8. Extract top results ───────────────────────────────────────────────────
res <- topTags(qlf, n = Inf, sort.by = "FDR")$table
# Columns: logFC | logCPM | F | PValue | FDR

# ── 9. Filter DEGs ───────────────────────────────────────────────────────────
degs <- subset(res, FDR < 0.05 & abs(logFC) >= 1)

# ── 10. Save results ─────────────────────────────────────────────────────────
write.csv(res,  "EdgeR_results.csv", row.names = TRUE)
write.csv(degs, "EdgeR_DEGs.csv",    row.names = TRUE)

EBSeq

  • Normalization method: median normalization (similar to DESeq2 median-of-ratios)
  • Model: Empirical Bayes mixture model — each gene is modeled as coming from a DE or non-DE component
  • Framework: Bayesian — outputs posterior probability rather than frequentist p-values
  • Output: PPDE (posterior probability of differential expression), fold change
  • Multiple testing: controlled via PPDE threshold (PPDE > 0.95 → FDR ≈ 5%)
  • Application: particularly useful when wanting probability-based inference; handles multi-group comparisons natively; robust when few replicates are available

Running Example Dataset

Same 5-gene × 6-sample dataset as above.


Normalization

EBSeq uses median normalization — the same median-of-ratios approach as DESeq2 (Anders & Huber, 2010), computed against a geometric-mean pseudo-reference:

\[s_j = \text{median}_{g}\!\left(\frac{y_{gj}}{\tilde{y}_g}\right), \qquad \tilde{y}_g = \left(\prod_{j=1}^{n} y_{gj}\right)^{1/n}\]

where \(s_j\) = size factor for sample \(j\), \(y_{gj}\) = raw count of gene \(g\) in sample \(j\), \(\tilde{y}_g\) = geometric mean count of gene \(g\) across all \(n\) samples, \(\text{median}_g\) = median taken across all genes \(g\) for a fixed sample \(j\).

Note

This is the calculation implemented by MedianNorm() in the package. Because it matches DESeq2’s median-of-ratios, the size factors are the same as those derived in the DESeq2 section.

For our example:

Sample A1 A2 A3 B1 B2 B3
Size factor 0.78 0.82 0.86 1.15 1.23 1.31

Empirical Bayes Mixture Model

Core idea: Instead of asking “is the p-value small enough?”, EBSeq asks “what is the probability that this gene is differentially expressed, given the data?”

For each gene \(g\), EBSeq evaluates the gene’s count vector \(\mathbf{y}_g\) under a two-component mixture of Negative Binomial models:

\[P(\mathbf{y}_{g}) = \pi_1 \cdot f^{(1)}(\mathbf{y}_g) + \pi_0 \cdot f^{(0)}(\mathbf{y}_g)\]

where \(\pi_1\) = prior probability that a gene is DE (mixture weight, estimated from all genes), \(\pi_0 = 1 - \pi_1\) = prior probability of non-DE, \(f^{(1)}\) = Negative Binomial likelihood under the DE component (group-specific means, different in A vs B), \(f^{(0)}\) = Negative Binomial likelihood under the non-DE component (a single shared mean for all samples).

Parameters are estimated by the EM algorithm:

Step What it does
E-step Compute posterior probability of DE for each gene given current parameters
M-step Update mixture weights \(\pi_1, \pi_0\) and NB parameters to maximize expected log-likelihood
Repeat Until convergence (change in parameters < tolerance)

PPDE — Posterior Probability of Differential Expression

After convergence, the key output for each gene is:

\[\text{PPDE}_g = P(\text{gene } g \text{ is DE} \mid \mathbf{y}_g) = \frac{\pi_1 \cdot L_g^{(1)}}{\pi_1 \cdot L_g^{(1)} + \pi_0 \cdot L_g^{(0)}}\]

where \(\text{PPDE}_g\) = posterior probability that gene \(g\) is DE, \(L_g^{(1)}\) = likelihood of gene \(g\)’s counts under the DE component, \(L_g^{(0)}\) = likelihood under the non-DE component, \(\pi_1, \pi_0\) = estimated mixture weights (prior DE/non-DE probabilities).

Decision rule: A gene is called DE when \(\text{PPDE}_g > \delta\) for some threshold \(\delta\). The expected FDR among all called genes is:

\[\widehat{\text{FDR}} = \frac{\sum_{g: \text{PPDE}_g > \delta} (1 - \text{PPDE}_g)}{|\{g: \text{PPDE}_g > \delta\}|}\]

where the numerator = expected number of false positives (sum of non-DE probabilities among called genes), the denominator = total number of genes called DE.

Setting \(\delta = 0.95\) controls FDR at approximately 5%.
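Both quantities reduce to one-liners (plain Python; the likelihood and PPDE values are placeholders for illustration):

```python
def ppde(l1, l0, pi1):
    """Posterior probability of DE from component likelihoods and DE prior."""
    return pi1 * l1 / (pi1 * l1 + (1 - pi1) * l0)

def expected_fdr(ppde_values, delta=0.95):
    """Estimated FDR among genes called DE at threshold delta."""
    called = [p for p in ppde_values if p > delta]
    return sum(1 - p for p in called) / len(called) if called else 0.0

# Equal likelihoods under both components -> PPDE equals the prior
print(ppde(1.0, 1.0, 0.5))                                # 0.5

genes = [0.9999, 0.9997, 0.0022, 0.0018, 0.0031]          # example PPDEs
print(expected_fdr(genes))                                # ≈ 0.0002
```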

For our example:

Gene \(L_g^{(1)}\) (DE) \(L_g^{(0)}\) (non-DE) PPDE Called DE?
MYC very high very low 0.9999 Yes
TP53 very high very low 0.9997 Yes
ACTB low high 0.0022 No
GAPDH low high 0.0018 No
EEF2 low high 0.0031 No

\(\widehat{\text{FDR}} = \frac{(1-0.9999)+(1-0.9997)}{2} = \frac{0.0004}{2} = 0.0002\) — well below 5%.

Note

PPDE vs p-value: PPDE directly answers “how likely is this gene to be DE?” — a probability statement. A raw p-value only answers “how unlikely is this data if the gene were not DE?” — a statement about the null hypothesis. High PPDE does not map directly to a specific p-value cutoff, but PPDE > 0.95 is roughly equivalent to FDR < 5%.

Final EBSeq output:

Gene PPDE PostFC (B/A) log2FC Status
MYC 0.9999 4.00 2.00 DE
TP53 0.9997 0.25 −2.00 DE
ACTB 0.0022 1.00 0.00 non-DE
GAPDH 0.0018 1.00 0.00 non-DE
EEF2 0.0031 1.00 0.00 non-DE

where PostFC = posterior fold change (B/A), computed as the ratio of posterior mean expression under the DE component; log2FC = \(\log_2(\text{PostFC})\).


Command

library(EBSeq)

# ── 1. Normalise counts ──────────────────────────────────────────────────────
# count_matrix : genes × samples raw count matrix
sizes <- MedianNorm(count_matrix)     # returns a vector of size factors
sizes

# ── 2. Define condition vector ───────────────────────────────────────────────
# Must be a factor; order matters — first level = reference (Group A)
conditions <- factor(c("A", "A", "A", "B", "B", "B"))

# ── 3. Run EBSeq ─────────────────────────────────────────────────────────────
#   maxround    : number of EM iterations (5 is often enough; increase if the
#                 hyperparameter estimates have not stabilised)
#   sizeFactors : normalization factors from MedianNorm
eb_out <- EBTest(
    Data        = count_matrix,
    Conditions  = conditions,
    sizeFactors = sizes,
    maxround    = 5
)

# ── 4. Convergence check ─────────────────────────────────────────────────────
eb_out$Alpha          # Beta-prior hyperparameter per EM round (should stabilise)
eb_out$AllZero        # genes excluded due to all-zero counts

# ── 5. Get posterior probabilities ───────────────────────────────────────────
pp <- GetPPMat(eb_out)
# Columns: PPEE (non-DE probability) | PPDE (DE probability)
head(pp)

# ── 6. Call DEGs at FDR ≤ 5% ─────────────────────────────────────────────────
degs_list <- GetDEResults(eb_out, FDR = 0.05)
# $DEgenes   : character vector of DE gene names
# $PPMat     : full posterior probability matrix
# $Status    : "DE" / "EE" (equally expressed) / "uncertain"

length(degs_list$DEgenes)

# ── 7. Compute posterior fold changes ───────────────────────────────────────
fc_res <- PostFC(eb_out)
# $PostFC   : posterior fold change (condition2 / condition1)
# $RealFC   : simple ratio of group means
# $Direction: "Up" or "Down" relative to condition1

fold_changes <- fc_res$PostFC[degs_list$DEgenes]
log2fc       <- log2(fold_changes)

# ── 8. Combine into a results table ─────────────────────────────────────────
results_df <- data.frame(
    PPDE    = pp[, "PPDE"],
    PostFC  = fc_res$PostFC,
    log2FC  = log2(fc_res$PostFC),
    Status  = degs_list$Status
)
results_df <- results_df[order(-results_df$PPDE), ]

# ── 9. Save results ──────────────────────────────────────────────────────────
write.csv(results_df, "EBSeq_results.csv", row.names = TRUE)