RNA-Seq Analysis

RNA-seq is an approach to estimate transcript abundance by sequencing the transcriptome of a cell type or tissue. It is primarily used to identify genes that are either upregulated or downregulated in relation to a control.

Deliverables

Differential Gene Expression: For each comparison, we provide a spreadsheet containing all the genes from the annotation file. Included in the spreadsheet is the average normalized gene expression value for each gene along with the log fold change, p-value, and FDR-adjusted p-value.

Principal Component Analysis (PCA): PCA is an analytical technique that summarizes each sample's gene expression profile in a few components, so that samples with similar profiles group together. We run a PCA on all samples in the experiment to evaluate how similar biological replicates are to one another. A PCA is also performed for each individual comparison.

Hierarchical Clustering: Similar to a PCA, hierarchical clustering arranges samples on a dendrogram according to their gene expression profiles. This is another effective way to evaluate how similar samples are to one another. Clustering is also performed for each individual comparison.

Heatmap of Differentially Expressed Genes: For each comparison, a heatmap is created containing only differentially expressed genes. Differential expression is determined using an FDR-adjusted p-value less than 0.05. However, if there are fewer than 20 genes based on the FDR-adjusted value, the heatmap will include genes with a regular p-value less than 0.05.
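
The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name select_heatmap_genes, the row keys ("gene", "pvalue", "padj"), and the example genes are hypothetical and do not reflect the core's actual pipeline.

```python
def select_heatmap_genes(rows, alpha=0.05, min_genes=20):
    """Pick genes for the heatmap (hypothetical helper, not the core's code).

    Genes with an FDR-adjusted p-value below alpha are used. If fewer than
    min_genes pass, genes with an unadjusted p-value below alpha are used.
    """
    by_fdr = [r["gene"] for r in rows
              if r["padj"] is not None and r["padj"] < alpha]
    if len(by_fdr) >= min_genes:
        return by_fdr, "FDR-adjusted p-value"
    by_p = [r["gene"] for r in rows
            if r["pvalue"] is not None and r["pvalue"] < alpha]
    return by_p, "unadjusted p-value"


# Made-up example rows; only geneA passes the FDR cutoff, so the
# fallback to the unadjusted p-value applies.
rows = [
    {"gene": "geneA", "pvalue": 0.001, "padj": 0.01},
    {"gene": "geneB", "pvalue": 0.02, "padj": 0.20},
    {"gene": "geneC", "pvalue": 0.60, "padj": 0.90},
]
genes, criterion = select_heatmap_genes(rows)
print(genes, criterion)  # ['geneA', 'geneB'] unadjusted p-value
```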

Volcano Plot: This is a plot of the log2 fold change versus the -log10(p-value). Genes that are significantly upregulated or downregulated are highlighted in red.
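
As a rough sketch, the coordinates of one point on such a plot could be computed as follows. The function name volcano_point is hypothetical, and the code assumes that "significant" means an FDR-adjusted p-value below 0.05, which the description of the plot does not state explicitly.

```python
import math


def volcano_point(lfc, pvalue, padj, alpha=0.05):
    """Return (x, y, highlighted) for one gene on a volcano plot.

    x is the log2 fold change and y is -log10(p-value). The highlight rule
    (padj < alpha) is an assumption made for this illustration.
    """
    # Guard against log10(0) for p-values reported as exactly zero.
    y = -math.log10(max(pvalue, 1e-300))
    highlighted = padj is not None and padj < alpha
    return lfc, y, highlighted


# A gene with a log2 fold change of 2 and a small p-value lands high on
# the right side of the plot and is highlighted.
x, y, highlighted = volcano_point(lfc=2.0, pvalue=0.001, padj=0.01)
print(x, round(y, 2), highlighted)
```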

Optional Analysis

Motif Analysis

Given a list of differentially expressed genes, a motif analysis will extract their known promoters and search them for shared sequence motifs. This predictive analysis may identify transcription factors responsible for the observed changes in gene expression.

Small RNA-Seq / microRNA-Seq

In a small RNA-seq (or microRNA-seq) experiment, short (< 200 nt) transcripts are selected during library construction. The workflow is similar to an mRNA-seq experiment with an additional option. Instead of aligning to the reference genome, reads from a small RNA-seq library may be aligned directly to the mature small RNA sequences, such as those found in miRBase. After alignment, a count file is generated which represents the number of reads that have aligned to each small RNA. From there, differential expression analysis can be performed on the small RNA count files.
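
Conceptually, the count file is a tally of reads per small RNA. The sketch below assumes the aligner output has already been reduced to one reference name per read (None for unaligned reads). The function name count_reads and the example input are made up for illustration and are not the core's actual workflow.

```python
from collections import Counter


def count_reads(aligned_targets):
    """Tally reads per small RNA, skipping unaligned reads (None)."""
    return Counter(t for t in aligned_targets if t is not None)


# One entry per read: the name of the mature small RNA it aligned to.
aligned_targets = [
    "hsa-miR-21-5p",
    "hsa-let-7a-5p",
    "hsa-miR-21-5p",
    None,  # a read that did not align
]
counts = count_reads(aligned_targets)

# Write the tally as tab-delimited lines, one small RNA per line.
for name, n in sorted(counts.items()):
    print(f"{name}\t{n}")
```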

Non-coding RNA-Seq / Total RNA-Seq

Small RNAs are non-coding, but because of their short length, they must be processed differently during library construction. However, other non-coding transcripts can be assessed by RNA-seq, particularly long non-coding RNA. Since many non-coding RNA molecules lack a poly-A tail, they would be lost if mRNA were selected with oligo-dT columns. Instead, ribosomal RNA is removed with capture probes, which retains both polyadenylated and non-polyadenylated transcripts.

What quality checks are performed on RNA before library construction?

Each RNA sample that is submitted to the core undergoes a Bioanalyzer analysis to determine the integrity of the RNA. A RIN (RNA Integrity Number) score is generated for each sample, ranging from 1 to 10. A RIN score of 7 or higher indicates that the RNA sample is of sufficient quality to proceed with library construction.

How much does an RNA-seq project cost?

The billing of most RNA-seq projects can be divided into 3 parts. The first is library construction. The charge for library construction is per sample, and depends on how much RNA is in your sample. The second part is sequencing, which is charged per lane or per flowcell as opposed to per sample. Because the charge is per lane or flowcell, sequencing more samples together in a project lowers the per-sample sequencing cost. The third part is analysis. This is charged by the hour and the cost will depend on the type and extent of analysis requested.
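
The arithmetic behind this three-part structure can be illustrated with a short sketch. Every rate below is a placeholder chosen for easy arithmetic, not an actual NUSeq price, and the function name project_cost is hypothetical.

```python
def project_cost(n_samples, library_per_sample, lane_cost, n_lanes,
                 analysis_hours, hourly_rate):
    """Split a project bill into library, sequencing, and analysis parts."""
    library = n_samples * library_per_sample   # charged per sample
    sequencing = n_lanes * lane_cost           # charged per lane
    analysis = analysis_hours * hourly_rate    # charged per hour
    return {
        "library": library,
        "sequencing": sequencing,
        "analysis": analysis,
        "total": library + sequencing + analysis,
        "sequencing_per_sample": sequencing / n_samples,
    }


# Placeholder rates: the same single lane is shared by 6 or by 12 samples.
small = project_cost(6, 100, 1200, 1, 5, 50)
large = project_cost(12, 100, 1200, 1, 5, 50)
print(small["sequencing_per_sample"])  # 200.0
print(large["sequencing_per_sample"])  # 100.0
```

Doubling the number of samples on the lane halves the sequencing cost per sample, while the library construction cost per sample is unchanged.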

What does each column of the differential expression spreadsheet mean?

Ensembl ID: the gene ID provided by Ensembl.

Gene Symbol: the symbol that is generally attributed to the particular gene.

Entrez ID: the gene number assigned by Entrez, which is part of NCBI.

Description: the gene name, which is more descriptive than the gene symbol.

Location: the chromosomal location of the gene, from beginning to end including introns.

Strand: the strand on which the gene is found (+ or -).

Log2 Fold Change: the log2 fold change (LFC) of the gene in the experimental group relative to the control group. Positive LFC values indicate upregulation relative to control, while negative LFC values indicate downregulation. In naming the spreadsheet, the control group is listed first and the experimental group second, such as “Control_vs_Experimental_Differential_Expression.xlsx”.

It’s important to note that DESeq2 computes LFC values differently than other programs. DESeq2 applies a shrinkage estimation that reduces the initial LFC estimates depending on the expression of the gene. As a result, LFC values for highly expressed genes are reduced slightly, while those for lowly expressed genes are reduced more.

LFC Standard Error: the standard error of the LFC estimate. Because the LFC values are determined through a shrinkage estimation, this value is provided to express the confidence in the LFC calculation.

Wald Statistic: the statistic used to calculate the p-value, much like a t-statistic for t-tests. The farther the Wald statistic is from 0, the lower the p-value.

p Value: a p-value is calculated for every gene. It is the probability of observing a difference at least as large as the one measured if the null hypothesis, that the mean expression of the gene does not differ between the two groups, were true.

FDR Adj p Value: the p-value after correction for the false discovery rate. This is the p-value most often used to determine significance.

Significant: a simple yes/no designation of whether the FDR-adjusted p-value is less than 0.05.

Status: an indication of whether there were sufficient reads to confidently determine the expression of the gene. This field will be “OK” if there are at least ten reads mapping to the gene, “LOW” otherwise.

Base Mean: the mean expression of the gene across both groups.

Control Group Mean: the mean expression of the gene in the control group. The label of the actual control group will be displayed in the spreadsheet.

Experimental Group Mean: the mean expression of the gene in the experimental group. The label of the actual experimental group will be displayed in the spreadsheet.

Individual Expression: the remainder of the columns indicate the gene expression in each sample.
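
To show how several of these columns relate to one another, here is a minimal sketch. It assumes the Wald statistic is the LFC divided by its standard error and that the p-value is two-sided against a standard normal distribution. The function names are hypothetical, and the read count used for Status is assumed to be a single total per gene, which the column description does not specify.

```python
import math


def wald_test(lfc, lfc_se):
    """Return (Wald statistic, two-sided p-value) for one gene."""
    wald = lfc / lfc_se
    # Two-sided tail probability of a standard normal at |wald|.
    pvalue = math.erfc(abs(wald) / math.sqrt(2))
    return wald, pvalue


def significant(padj, alpha=0.05):
    """Yes/no flag based on the FDR-adjusted p-value."""
    return "yes" if padj is not None and padj < alpha else "no"


def status(read_count, min_reads=10):
    """'OK' with at least min_reads reads mapping to the gene, else 'LOW'."""
    return "OK" if read_count >= min_reads else "LOW"


wald, pvalue = wald_test(lfc=1.0, lfc_se=0.5)
print(round(wald, 2), round(pvalue, 4))
print(significant(0.03), status(8))  # yes LOW
```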

What is the difference between the regular p-value and the FDR-adjusted p-value, and which one should I use?

During differential gene expression analysis, the mean expression values of each gene are compared between the groups in a test similar to a t-test. As a result, a p-value is calculated for each test/gene. A p-value cutoff of 0.05 indicates that there is a 5% chance of incorrectly rejecting the null hypothesis. For a single test, this is acceptable. However, an RNA-seq analysis involves several thousand tests, one for each gene. Therefore, as more tests are performed, the likelihood that some null hypotheses are incorrectly rejected increases. For example, imagine a situation in which 1,000 genes with no true difference between the groups are tested at a p-value cutoff of 0.05. On average, we can expect 5%, or 50 genes, to be found significant when they actually are not. Therefore, each p-value is adjusted to control the false discovery rate (FDR), which is analogous to a post-hoc test. The most common cutoff to determine significance is an FDR-adjusted p-value of 0.05 or less.
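
To make the adjustment concrete, the sketch below implements the Benjamini-Hochberg procedure, a widely used FDR correction. The text above does not name the specific method, so treating it as Benjamini-Hochberg is an assumption, and the p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(pvalues)
print([round(p, 4) for p in adjusted])  # [0.04, 0.0533, 0.0533, 0.2]
```

In this example three genes have an unadjusted p-value below 0.05, but only one remains below 0.05 after adjustment.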

Does NUSeq offer a pathway analysis too?

Yes. A pathway analysis will identify common functions or pathways among significant genes. Most pathway analysis tools are free, web-based, and simply require a list of genes. One pathway analysis tool that we often use in NUSeq is Metascape, which is fast and easy to use.

Can NUSeq help me submit my project to GEO (Gene Expression Omnibus)?

Yes. If you wish to submit your RNA-seq data to GEO, first download and complete this spreadsheet up to and including the row labeled "extract protocol". This is row 29 on a blank form, but it may change if additional rows are added above it. For guidance, consult the tab labeled "EXAMPLE 2" on the spreadsheet. The core will complete the rest of the form. If the sequencing was performed in NUSeq, then we already possess the raw data, but if it was sequenced elsewhere, the raw FASTQ files must be provided to the core. After completing the spreadsheet, submit it to the core along with the following:

The GEO user account. Every GEO submission must be associated with a GEO account, which will be designated as the owner of the data. Individuals within NUSeq should not use their own GEO accounts for this purpose, because we do not own the data.
The date on which you would like the data made public. This date should be far enough out to ensure that the manuscript will be published. A good recommendation is between 6 and 9 months. The release date may be changed, so we suggest overestimating and then releasing the data to the public earlier if necessary.

The genome of my organism is not well characterized and the genes are not well annotated. How will that impact the analysis?

Many bioinformatics analyses depend on a well-annotated reference genome. These are usually readily available for model organisms such as human and mouse. If your organism is not well annotated, the results may be difficult to interpret. For example, the final results of an RNA-seq project use the gene names provided in the gene annotation file, so any unfamiliar gene names in that file carry over into the results. For non-model organisms, we suggest becoming familiar with the gene models available and indicating to NUSeq exactly which gene model you would prefer us to use. If you're unsure, please contact us.

Can I use RNA-seq to identify differential isoforms or alternative splicing?

Yes, but there are caveats. NUSeq currently only houses "short read" sequencing instruments. Short read sequencing platforms are suitable for measuring gene expression, but are not ideal for isoform expression. If you require isoform expression, a long read sequencing platform such as Nanopore or PacBio is preferable. However, if only short read platforms are available, we suggest sequencing with PE100 or PE150 with 50 million to 100 million reads per sample. There are bioinformatics workflows available that will accommodate isoform expression, but their reliability is not certain and we recommend validating any interesting results.

Does NUSeq perform RNA extraction too?

No. NUSeq requires total RNA to be submitted for RNA-seq services (with the exception of single cell RNA-seq). However, RNA extraction is offered by the Pathology Core.