RNAseq

Overview

Typical RNAseq workflow

Questions:

Quantification: Expression rates

Splicing variants: exon arrangements

Novel transcripts

RNA editing: single nucleotide variations (SNV)

Practical: 1. poly-A RNA extraction (mRNA extraction Kit) 2. reverse transcription into c-DNA 3. shearing (for illumina, not for nanopore) 4. library prep (depends on which seq. technique) 5. sequencing and getting the reads (fastq)

Bioinformatics: 1. Preprocessing fastq files using: Trimmomatic, cutadapt, trimgalore 1. adapter removal 2. trimming of bad quality 2. Mapping 1. no genome: de novo transcriptom construction 2. known genome: align reads against it 1. Quantification: accept only known/described transcripts 2. Splicing variants: allow alternate arrangements of known transcripts 3. Novel transcripts: allow alignment to exons and other regions 3. Mapping QC with RNA-SeQC, discard reads if: 1. cannot be uniquely mapped 2. alignment overlaps with several genes 3. bad alignment quality score 4. paired-end reads (illumina) do not map to the same genes 4. Expression Quantification (normalization of read counts) 1. correct read length bias (longer genes produce more reads) 2. correct Batch effects (labs and preparation time - different coverage between samples) 3. correct technical biases (use of different machines)

Types of normalization

You need to know: How was RNA extracted and the library build Which Sequencing platform single or basepaired ends? How many reads per sample? * QC reports?

Normalization for: library size, gene length, RNA composition (comparison between samples) * Methods: * RPKM/FPKM (size and length) * not recommended if you compare between different samples * Reads per kilo base per million mapped reads (RPKM) * use only to report expression values NOT differential expression * TPM (size and length) * similar to RPKM but uses transcripts instead of reads * the presenter doesn't use this method * TMM and DESeq (size and composition) * TMM considers the tissue type expression variation * suitable for differential expression between sample groups

Data visualization is done with MA-plot or vulcano plot(check the tutorial afterwards)

Figure: Black is not significant, red are significant changes heatmaps is the most common representation for differential expression analysis (R/Bioconducter heatmap and ggplot package) Another way to see how samples cluster is with PCA plots

Function enrichment analysis: GOSeq/topGO/GAGE (R package) DAVID

Differential Expression - practical

For this tutorial we used the test data from this paper. This is a highly comprehensive tutorial paper for RNAseq.

Summary:

  • We have two two commercially available RNA samples:
  • Universal Human Reference (UHR)
    • 10 cancer cell lines
  • Human Brain Reference (HBR)
    • isolated from brains of 23 Caucasians (m & f) mostly 60-80 years old
  • a spike-in control was added to each sample (ERCC ExFold RNA). Spike-in consists of 92 transcripts that are present in known concentrations across a wide abundance range (from very few copies to many copies). This allows to test the degree to which the RNA-seq assay (including all laboratory and analysis steps) accurately reflects the relative abundance of transcript species within a sample

  • two 'mixes' of spike-in control to allow assessment of differential expression output between samples

  • 3 complete experimental replicates for each sample to assess the technical variability of our overall process of producing RNA-seq data in the lab

  • UHR + ERCC Spike-In Mix1, Replicate 1

  • UHR + ERCC Spike-In Mix1, Replicate 2
  • UHR + ERCC Spike-In Mix1, Replicate 3
  • HBR + ERCC Spike-In Mix2, Replicate 1
  • HBR + ERCC Spike-In Mix2, Replicate 2
  • HBR + ERCC Spike-In Mix2, Replicate 3
  • For technical details, like Kits, rRNA removal, sample concentrations etc. read the S2 Data file from the paper.

Data available at:

curl -O -J -L https://osf.io/7zepj/download
tar xzf toy_rna.tar.gz

salmon for indexing and quantification of reads

Quantify reads with salmon

salmon index -t chr22_transcripts.fa -i chr22_index
for i in *_R1.fastq.gz
do
   prefix=$(basename $i _R1.fastq.gz)
   salmon quant -i chr22_index --libType A \
          -1 ${prefix}_R1.fastq.gz -2 ${prefix}_R2.fastq.gz -o quant/${prefix};
done

R-studio analysis

library(tximport)
library(GenomicFeatures)
library(readr)
txdb <- makeTxDbFromGFF("chr22_genes.gtf")
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- select(txdb, keys = k, keytype = "TXNAME", columns = "GENEID")
# we imported a file were each sample name is stored under the column "sample"
    samples <- read.table("samples.txt", header = TRUE)
# we save the path to each quant.sf file using the name of each sample in the samples file
    files <- file.path("quant", samples$sample, "quant.sf")
# we renamed the columns name of each path we extracted
    names(files) <- paste0(samples$sample)
# imports the quant files, txi.salmon is a datalist e.g. salmon$counts for the counts list-entry
    txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene)

Differential expression using DESeq2 + DESeqDataSet is a subclass of RangedSummarizedExperiment, used to store the input values, intermediate calculations and results of an analysis of differential expression.

library(DESeq2) #loading module
dds <- DESeqDataSetFromTximport(txi.salmon, samples, ~condition)
dds <- DESeq(dds)
res <- results(dds) #summary(res) to see the results
plotDispEsts(dds, main="Dispersion plot")

Figure: The red line in the figure plots the estimate for the expected dispersion value for genes of a given expression strength. Each black dot is a gene with an associated mean expression level and maximum likelihood estimation (MLE) of the dispersion.


Heatmap * For clustering and heatmaps, we need to log transform our data:

rld <- rlogTransformation(dds) #log of data from "dds"
#loading librarys for plot
library(RColorBrewer)
library(gplots)
(mycols <- brewer.pal(8, "Dark2")[1:length(unique(samples$condition))])
sampleDists <- as.matrix(dist(t(assay(rld))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "black", "white"),
          ColSideColors=mycols[samples$condition],
          RowSideColors=mycols[samples$condition],
          margin=c(10, 10), main="Sample Distance Matrix")

Figure: Differential Expression Analysis: Comparison between each samples as a heatmap, black is identical, white not identical


PCA Plot

DESeq2::plotPCA(rld, intgroup="condition")


MA plot and the Volcano Plot

table(res$padj<0.05) #extract data out of res
res <- res[order(res$padj), ] #sorting the res variable
#merging data so its ready for plots
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
hist(res$pvalue, breaks=50, col="grey") #a hist plot
DESeq2::plotMA(dds, ylim=c(-1,1), cex=1) #an MA plot
# Volcano plot
with(res, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot", xlim=c(-2.5,2)))
# second one highlighting certain points
with(subset(res, padj<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))

KEGG pathway analysis

KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction, reaction and relation networks for: * We need these library's:

library(AnnotationDbi)
library(org.Hs.eg.db)
library(pathview)
library(gage)
library(gageData)
res$symbol <- mapIds(org.Hs.eg.db, keys=row.names(res), column="SYMBOL", keytype="ENSEMBL", multiVals="first")
res$entrez <- mapIds(org.Hs.eg.db, keys=row.names(res), column="ENTREZID", keytype="ENSEMBL", multiVals="first")
res$name <- mapIds(org.Hs.eg.db, keys=row.names(res), column="GENENAME", keytype="ENSEMBL", multiVals="first")
data(kegg.sets.hs)
data(sigmet.idx.hs)
kegg.sets.hs <- kegg.sets.hs[sigmet.idx.hs]
head(kegg.sets.hs, 3)
foldchanges <- res$log2FoldChange
names(foldchanges) <- res$entrez
keggres <- gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE)
lapply(keggres, head)
library(dplyr)
# Get the pathways
keggrespathways <- data.frame(id=rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number()<=5) %>%
  .$id %>%
  as.character()
keggrespathways
# Get the IDs.
keggresids <- substr(keggrespathways, start=1, stop=8)
# Define plotting function for applying later
    plot_pathway <- function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa", new.signature=FALSE)

# Unload dplyr since it conflicts with the next line
    detach("package:dplyr", unload=T)
# plot multiple pathways (plots saved to disk andreturns a throwaway list object)
    tmp <- sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa"))

The results are shown here (but only for 2 pathways and only the KEGG output):

Another tutorial about this pathway stuff can be found here.