1.5. Full-length transcriptome sequencing
1.6. Single-cell transcriptome sequencing
2. Construction of Transcriptome Sequencing Library
3. Transcriptome Data Processing
When transcriptome sequencing data are used to compare gene-level or transcript-level expression between groups, the basic analysis workflow includes raw data preprocessing, read alignment, transcript assembly, new transcript prediction, and analysis of transcript expression levels. Depending on the purpose of the experiment, we can further analyze differences in transcript expression between the experimental group and the control group, cluster gene expression patterns across samples, and perform joint analysis with other omics data.
3.1. Raw data preprocessing
After the raw second-generation sequencing data are obtained, their quality must be evaluated and quality control (QC) performed. The evaluation covers data output, GC content, rRNA content, base quality distribution, and duplicated sequences. Low-quality reads and adapter (linker) sequences are removed, and the resulting clean data are used for subsequent analysis.
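The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a dedicated QC tool: it assumes Phred+33 quality encoding, exact adapter matching, and arbitrary thresholds (`min_mean_q=20`, `min_len=20`). All function names and the adapter string in the usage note are hypothetical and chosen only for this example.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def gc_content(seq):
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def trim_adapter(seq, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, quality
    return seq[:idx], quality[:idx]


def clean_reads(reads, adapter, min_mean_q=20, min_len=20):
    """Yield (name, seq, quality) for reads that survive trimming and filtering.

    The thresholds are illustrative assumptions, not recommended values.
    """
    for name, seq, quality in reads:
        seq, quality = trim_adapter(seq, quality, adapter)
        if len(seq) < min_len:
            continue  # too short after adapter removal
        if mean_phred(quality) < min_mean_q:
            continue  # low average base quality
        yield name, seq, quality
```

For example, `list(clean_reads(reads, adapter="AGATCGGAAGAGC"))` would keep only the reads that are long enough after trimming and have a mean quality of at least 20. Real pipelines usually also trim low-quality read ends and tolerate mismatches in the adapter, which this sketch does not attempt.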
3.2. Read alignment
Transcriptome data are derived mainly from the exon sequences of the genome. When the sequenced transcriptome reads are aligned to the genomic sequence, a read that spans two exons is split by the intervening intron sequence, so the alignment must be able to handle such gaps.
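To illustrate how an intron-split alignment looks in practice, the sketch below reads the CIGAR string of a SAM alignment record, in which the `N` operation marks a skipped region of the reference (typically an intron). The function name `introns_from_cigar` is hypothetical; the operation codes and the set of reference-consuming operations follow the SAM format.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def introns_from_cigar(pos, cigar):
    """Return reference intervals skipped by N operations.

    pos is the 1-based leftmost mapping position of the read.
    Intervals are returned as 1-based inclusive (start, end) tuples.
    """
    introns = []
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in REF_CONSUMING:
            ref += length
    return introns
```

A read aligned at position 100 with CIGAR `50M1000N50M` covers 50 bases of one exon, skips 1000 reference bases, and then covers 50 bases of the next exon, so the function reports a single skipped interval.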
3.3. Transcript assembly
3.4. Transcript prediction
Most genes have multiple splicing forms and can produce multiple transcripts that encode different proteins, so a single gene may have multiple functions. Assembling the transcriptome sequencing data yields not only known transcripts but also new transcript sequences. These new transcripts need to be identified and annotated, especially ncRNA transcripts, which are less well studied.
For species with a reference genome and reference transcript information, transcript structures are determined mainly by aligning the sequenced reads: the reads cover the transcript sequences, and the genome sequence guides their assembly into complete transcripts. For species without a reference genome, the transcript sequences of genes must be assembled de novo. The resulting gene or transcript sequences can be compared with unigene and EST databases of the same or closely related species to judge their reliability. BLAST is commonly used for this comparison because it quickly identifies similarity between sequences. To identify new lncRNAs, transcripts with a total exon length of > 200 nt are first extracted from the transcriptome data, based on the characteristics of lncRNA molecules. Coding potential is then predicted from open reading frames and by comparison with known protein databases, which further separates lncRNA from mRNA.
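The length and open-reading-frame filters used for lncRNA screening can be sketched as follows. This is a simplified illustration under several assumptions: the input is the spliced (exon-only) transcript sequence, only the three forward reading frames are scanned, and the ORF cutoff `max_orf=300` nt is an illustrative value that does not come from the text above. The function names are hypothetical, and the comparison against protein databases is not covered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Length in nt of the longest ATG-to-stop ORF in the three forward frames.

    The stop codon is counted in the length; ORFs without a stop are ignored.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def is_lncrna_candidate(seq, min_len=200, max_orf=300):
    """True if the transcript is longer than min_len nt and lacks a long ORF.

    max_orf is an assumed cutoff used only for illustration.
    """
    return len(seq) > min_len and longest_orf(seq) < max_orf
```

Transcripts that pass this filter are only candidates; in the workflow described above they would still be compared with known protein databases before being classified as lncRNA.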
3.5. Analysis of transcript expression levels
After the reads are aligned to their genomic positions or the transcripts are assembled de novo, the number of reads on each gene or transcript reflects its expression abundance to a certain extent. However, samples may differ significantly in total data output and in the number of expressed genes, genes within a sample differ in length, and even different transcripts of the same gene may be distributed differently. Therefore, the data must be normalized before expression levels are compared between samples.
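One widely used way to correct for both gene length and total data output is TPM (transcripts per million). The text above does not name a specific normalization method, so TPM is shown here only as one common choice. The sketch assumes read counts and gene lengths are given as plain dictionaries keyed by gene name; the function name `tpm` is hypothetical.

```python
def tpm(counts, lengths):
    """Convert raw read counts to TPM.

    counts:  dict mapping gene name to read count
    lengths: dict mapping gene name to gene (or transcript) length in nt
    """
    # Step 1: correct for length (reads per kilobase).
    rates = {gene: counts[gene] / (lengths[gene] / 1000) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Step 2: correct for sequencing depth so that values sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}
```

With equal read counts, a gene of 1000 nt receives twice the TPM of a gene of 2000 nt, because the same number of reads over a shorter gene indicates higher abundance. Since TPM values sum to one million in every sample, they are easier to compare across samples than raw counts.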
To be continued in Part III…