IsoformEx: Isoform level gene expression estimation using non-negative least squares from mRNA-Seq data

(updated on Apr. 22, 2011)

What is it doing?

IsoformEx estimates transcript expression levels and gene expression levels from mRNA-Seq data. Technically speaking, IsoformEx parses bowtie alignment files in a project directory (e.g. ~yourid/isoformex/xxx, where xxx is the project name) and generates two files: (1) xxx/xxx_transcript_1.txt: expression levels of all transcripts, (2) xxx/xxx_gene_1.txt: expression levels of all genes.

System Requirement

Software License

You can use this software for academic study at your own risk. Commercial users should contact the authors. All Rights Reserved.

Simple Installation and Test Step

Tutorial: When you have a FASTQ file

Let's assume you have sampleproj.fastq (project name: sampleproj) obtained from an NGS platform and that Bowtie is already installed. If it is not installed yet, download and install Bowtie first. The other files needed for this tutorial (e.g. Bowtie index files for the hg18 and mm9 genomes and splice junctions) are already included in the tgz file. At this point you should understand the file structure of the IsoformEx package. When you have your own FASTQ file, make an additional project directory (say, xxx), copy the above scripts (run_bowtie_step1, run_bowtie_step2) into xxx, and edit them for your usage. After these scripts have produced the Bowtie output files, copy the IsoformEx execution script (run_isoformex_sampleproj) to run_isoformex_xxx, then edit and execute it. The final result files will be located in your project directory ./xxx.

Tutorial: When you have alignment result files (Bowtie/SAM format)

In the project directory, you need to have a set of Bowtie alignment files or a set of SAM files. The file names should be (xxx.bowtie, xxx_sj.bowtie) or (xxx.sam, xxx_sj.sam), where 'xxx' is both the project name and the project directory name. One file is the genomic alignment, and the other is the splice junction alignment. Let's suppose that the project name is 'mcf7'. You may have alignment files generated by Bowtie.
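The naming rule above can be checked before running IsoformEx. The following is a minimal Python sketch, not part of the IsoformEx package; the function name find_alignment_pair is hypothetical, and it assumes only the naming convention described above (project name equals directory name, with either a .bowtie pair or a .sam pair).

```python
from pathlib import Path


def find_alignment_pair(project_dir):
    """Return (genome_file, junction_file) for a project directory, or None.

    Hypothetical helper: it only checks the file-naming convention
    (xxx.bowtie, xxx_sj.bowtie) or (xxx.sam, xxx_sj.sam).
    """
    project_dir = Path(project_dir).resolve()
    name = project_dir.name  # the project name is the directory name
    for ext in ("bowtie", "sam"):
        genome = project_dir / f"{name}.{ext}"
        junction = project_dir / f"{name}_sj.{ext}"
        # Both files of a pair must exist with the same extension.
        if genome.is_file() and junction.is_file():
            return genome, junction
    return None
```

For a project named mcf7, find_alignment_pair("mcf7") would return the paths mcf7/mcf7.sam and mcf7/mcf7_sj.sam when both exist, and None when either file of a pair is missing.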

Advanced: When you have your own transcript model

First of all, you need to make a tab-delimited transcript model file. Examples can be found in the 'transcript_models' directory. The format of transcript model files is similar to the format of gene definition files in the UCSC database (more specifically, the UCSC format plus two more columns).
Column Specification
col1: transcriptID
col2: chromosome (e.g. chr1, ..., chrX, chrY), but do not include chrM, chr?_random, chr?_*_hap?
col3: strand information (+ or -)
col4: start position (0-based)
col5: end position 
col6: coding start position (0-based; for ncRNA, set col6 to the value of col4)
col7: coding end position (for ncRNA, set col7 to the value of col4)
col8: # of exons
col9: exon start positions (0-based)
col10: exon end positions 
col11: UniProt accession (e.g. Q9BV57) or RefSeq protein ID (NP_xxx)
col12: transcriptID (duplicated info. but, specify it because of convention)
col13: gene symbol (additional information, but it is necessary)
col14: Entrez GeneID (number)
When there is no Entrez GeneID for a transcript, set col14 to 0. Here is an example of a line in the file.
uc002qxp.2	chr2	-	3480696	3502354	3481740	3502262	4	3480696,3483591,3496635,3502142,	3481860,3483771,3496755,3502354,	Q9BV57	uc002qxp.2	ADI1	55256
OK, now copy your transcript model file into the 'transcript_models' directory.
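Before copying the file, it can help to check each line against the column specification. The following Python sketch is only an illustration: the names Transcript and parse_transcript_line are hypothetical, and the checks (14 tab-separated columns, strand is + or -, exon lists match the exon count) are taken from the specification above rather than from IsoformEx itself.

```python
from typing import List, NamedTuple


class Transcript(NamedTuple):
    transcript_id: str      # col1
    chrom: str              # col2
    strand: str             # col3
    start: int              # col4 (0-based)
    end: int                # col5
    cds_start: int          # col6 (0-based)
    cds_end: int            # col7
    exon_starts: List[int]  # col9 (0-based)
    exon_ends: List[int]    # col10
    protein_id: str         # col11
    gene_symbol: str        # col13
    entrez_id: int          # col14 (0 when there is no Entrez GeneID)


def parse_transcript_line(line):
    """Parse one tab-delimited transcript model line into a Transcript."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 14:
        raise ValueError(f"expected 14 columns, found {len(cols)}")
    if cols[2] not in ("+", "-"):
        raise ValueError(f"strand must be + or -, found {cols[2]!r}")
    exon_count = int(cols[7])
    # Exon lists are comma-separated and may end with a trailing comma.
    exon_starts = [int(x) for x in cols[8].split(",") if x]
    exon_ends = [int(x) for x in cols[9].split(",") if x]
    if not (len(exon_starts) == len(exon_ends) == exon_count):
        raise ValueError("exon lists do not match the exon count in col8")
    return Transcript(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      int(cols[5]), int(cols[6]), exon_starts, exon_ends,
                      cols[10], cols[12], int(cols[13]))


# The example line from the text, rebuilt with explicit tab separators.
example = "\t".join([
    "uc002qxp.2", "chr2", "-", "3480696", "3502354", "3481740", "3502262",
    "4", "3480696,3483591,3496635,3502142,",
    "3481860,3483771,3496755,3502354,",
    "Q9BV57", "uc002qxp.2", "ADI1", "55256",
])
t = parse_transcript_line(example)
print(t.gene_symbol, t.entrez_id, len(t.exon_starts))  # ADI1 55256 4
```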

Because you have your own transcript model, you cannot use the standard scripts for generating alignment result files. The mapping to the genome is unaffected, but you need to make your own alignment file for splice junction detection, because the splice junctions change when the exon information changes. The alignment files (Bowtie or SAM format) are the input of IsoformEx, so you need to prepare them before executing IsoformEx. The splice junction locations in the alignment files must follow a rule that IsoformEx expects. Here is a line of a SAM file containing splice junction information.
SRR015274.16    16      chr16:14951437-14951468:14951578-14951609       6       0       32M     *       0       0       CGGGCAGAGGACTACTACAGATGCAAAATCAC       CUYYIYYYYYYOYYRYYMYYYYYTYYYYYYYY        XA:i:0  MD:Z:32 NM:i:0  XM:i:2
As you see, the 3rd column holds a long splice junction identifier instead of a chromosome name. Its format is chromosomeName:a-b:c-d, where 'b' is the end of the first exon, 'c' is the start of the second exon of the junction, and 'a' and 'd' are chosen so that b-a+1 = 32bp and d-c+1 = 32bp.
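The chromosomeName:a-b:c-d format can be split with a short regular expression. This Python sketch is an illustration under the format described above; the names JUNCTION_RE and parse_junction_name are hypothetical and not part of IsoformEx.

```python
import re

# chromosomeName:a-b:c-d, e.g. chr16:14951437-14951468:14951578-14951609
JUNCTION_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<a>\d+)-(?P<b>\d+):(?P<c>\d+)-(?P<d>\d+)$"
)


def parse_junction_name(name):
    """Split a junction identifier into (chrom, a, b, c, d)."""
    m = JUNCTION_RE.match(name)
    if m is None:
        raise ValueError(f"not a splice junction identifier: {name!r}")
    a, b, c, d = (int(m.group(k)) for k in "abcd")
    return m.group("chrom"), a, b, c, d


# 3rd column of the SAM line shown above.
chrom, a, b, c, d = parse_junction_name(
    "chr16:14951437-14951468:14951578-14951609"
)
print(chrom, b - a + 1, d - c + 1)  # chr16 32 32
```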

This is the location of the splice junction. The overlapping length of the left exon is len1 = b-a+1 = 32bp. The overlapping length of the right exon is len2 = d-c+1 = 32bp. When the left exon is shorter than 32bp, 'a' is the start of the first exon and b-a+1 equals the length of the first exon, which is less than 32bp. The tag mapping position is in the 4th column, i.e. 6. Thus, this tag starts at a+6-1 when a+6-1 ≤ b. To get this alignment file, you need to make a splice junction FASTA file whose sequence names use the same format (chromosomeName:a-b:c-d) and whose sequences are the splice junction sequences at those locations. You can then map tags to this FASTA file. Here is an example of a splice junction FASTA file.
>chr1:2059-2090:2476-2507
GAGAGCATCAACTTCTCTCACAACCTAGGCCAGTGTGTGGTGATGCCAGGCATGCCCTTCCCCA
>chr1:2059-2090:3084-3115
GAGAGCATCAACTTCTCTCACAACCTAGGCCAGCAGGGCCATCAGGCACCAAAGGGATTCTGCC
...
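The rule for choosing 'a' and 'd', and for locating a tag, can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, all coordinates are assumed to be inclusive and in the same coordinate system as the junction names, and shortening 'd' for a short right exon is an assumption made by symmetry with the left-exon rule stated above. The exon bounds 1000 and 5000 in the usage line are made-up values.

```python
def junction_name(chrom, left_exon_start, b, c, right_exon_end, flank=32):
    """Build a chromosomeName:a-b:c-d header for one splice junction.

    b is the end of the left exon and c is the start of the right exon.
    """
    # 'a' gives a 32bp left overlap, or the whole exon if it is shorter.
    a = max(left_exon_start, b - flank + 1)
    # Assumption: the right side is shortened the same way for short exons.
    d = min(right_exon_end, c + flank - 1)
    return f"{chrom}:{a}-{b}:{c}-{d}"


def tag_start(a, b, pos):
    """Start of a tag from its 1-based mapping position (4th SAM column).

    Returns a + pos - 1 when that position lies within the left exon;
    returns None otherwise, since the text only covers a + pos - 1 <= b.
    """
    start = a + pos - 1
    return start if start <= b else None


print(junction_name("chr1", 1000, 2090, 2476, 5000))
# chr1:2059-2090:2476-2507
print(tag_start(14951437, 14951468, 6))  # 14951442
```

The first call reproduces the first header of the FASTA example above, and the second applies the a+6-1 rule to the SAM line shown earlier.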
To map tags to these sequences, a new index must be built. If you use Bowtie, see its documentation on building a new index. Once you have alignment files generated by an aligner, you can execute IsoformEx. Let's suppose we have generated a splice junction mapping file for the mcf7 project (mcf7_sj.sam). Let's execute IsoformEx.
mcf7/mcf7.sam
mcf7/mcf7_sj.sam
nohup ./run_isoformex.sh MCR/v713 transcript_models/your_transcript_model.bed mcf7 > mcf7/mcf7.log &
That's It!!!

FAQ


Contact information: Hyunsoo Kim (hkim@wistar.org)