IsoformEx: Isoform level gene expression estimation using non-negative least squares from mRNA-Seq data
(updated on Apr. 22, 2011)
What is it doing?
IsoformEx estimates transcript expression levels and gene expression levels from mRNA-Seq data. Technically speaking, IsoformEx parses bowtie alignment files in a project directory (e.g. ~yourid/isoformex/xxx, where xxx is the project name) and generates two files: (1) xxx/xxx_transcript_1.txt: expression levels of all transcripts, (2) xxx/xxx_gene_1.txt: expression levels of all genes.
System Requirements
- CPU: Intel 64bit (Recommended)
- RAM: 8GB or above
- HDD: 10GB or above free disk space
- Operating System: Linux 64bit
- IsoformEx was tested on the following systems:
- Intel Xeon X5460, 36GB RAM, Redhat Linux x86_64
- Intel Xeon E5530, 28GB RAM, Redhat Linux x86_64
Software License
You can use this software for academic study at your own risk.
Commercial users should contact the authors. All Rights Reserved.
Simple Installation and Test Steps
- cd ~yourid
  Let's work in your home directory.
- Download isoformex_linux64.tgz to the current directory. (File size: ~7GB. Download time depends on your location and current network status; expect about 10-20 minutes.)
- tar xvfz isoformex_linux64.tgz
  cd isoformex
- ./run_isoformex
  or
  nohup ./run_isoformex.sh MCR/v713 transcript_models/hg18_ucsc_transcript_model.bed mcf7 > mcf7/isoformex.log &
  That's it!!! IsoformEx is now running. Take a coffee break and check the results in the subdirectory ./mcf7 after about an hour. This command launches the software, reads a transcript model file (transcript_models/hg18_ucsc_transcript_model.bed) and two alignment files (the bowtie output files mcf7/mcf7.bowtie and mcf7/mcf7_sj.bowtie), and generates two files (mcf7/mcf7_transcript_1.txt for transcript expression levels, and mcf7/mcf7_gene_1.txt for gene expression levels).
- If you run into any problem in the steps above, try installing the MCR with MCRInstaller.bin and then repeat the steps. Download MCRInstaller.bin to the ~yourid/isoformex directory and install it into ~yourid/isoformex/MCR (the destination directory) by following the guide below. If the problem persists, please contact Hyunsoo Kim (hkim@wistar.org) and include the log file (mcf7/isoformex.log) and a description of the problem.
cd ~yourid/isoformex
./MCRInstaller.bin -console
The InstallShield Wizard will install MATLAB(R) Compiler Runtime 7.13 on your computer.
To continue choose Next.
Press 1 for Next.
Please specify MCR as the destination directory so that the MCR is installed under the current directory.
Destination Directory [/opt/MATLAB/MATLAB_Compiler_Runtime] MCR
Press 1 for Next
MATLAB(R) Compiler Runtime 7.13 will be installed in the following location: ./MCR
Press 1 to Next
Installing MATLAB(R) Compiler Runtime 7.13. Please wait...
Press 3 to Finish
Tutorial: When you have a FASTQ file
Let's assume we have sampleproj.fastq (project name: sampleproj) obtained from an NGS platform and that you have already installed Bowtie. If you have not installed it yet, download and install Bowtie first. The other files needed for this tutorial (e.g. Bowtie index files for the hg18 and mm9 genomes and splice junctions) are already included in the tgz file.
Now you should understand the file structure of the IsoformEx package. When you have your own FASTQ file, make an additional project directory (let's say xxx), copy the scripts (run_bowtie_step1, run_bowtie_step2) to xxx, and edit them for your data. After these scripts have produced the bowtie output files, copy the IsoformEx execution script (run_isoformex_sampleproj) to run_isoformex_xxx, then edit and execute it. The final result files will be located in your project directory ./xxx.
Tutorial: When you have alignment result files (Bowtie/SAM format)
In the project directory, you need to have a set of bowtie alignment files or a set of SAM files. The file names should be (xxx.bowtie, xxx_sj.bowtie) or (xxx.sam, xxx_sj.sam), where 'xxx' is both the project name and the project directory name. One file holds the genomic alignment, and the other holds the splice junction alignment. Let's suppose that the project name is 'mcf7'. You may have alignment files generated by bowtie.
mcf7/mcf7.bowtie
mcf7/mcf7_sj.bowtie
Or, you may have SAM files generated by an aligner.
mcf7/mcf7.sam
mcf7/mcf7_sj.sam
Note that the file names must contain the project name. If both bowtie files and SAM files are present, IsoformEx uses the bowtie files.
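The naming rule and the bowtie-over-SAM precedence described above can be sketched as a small pre-flight check to run before launching IsoformEx. This is only an illustration: find_alignment_inputs is a hypothetical helper, not part of IsoformEx, and it assumes that a set counts only when both the genomic file and the splice junction file are present.

```python
from pathlib import Path


def find_alignment_inputs(project_dir):
    """Return (format, genomic_file, junction_file) for a project directory.

    Hypothetical helper, not part of IsoformEx. It mirrors the documented
    rule: file names carry the project name, and bowtie files are preferred
    over SAM files. Requiring both files of a set is an assumption.
    """
    project_dir = Path(project_dir).resolve()
    name = project_dir.name  # project name == project directory name
    # bowtie is checked first, so it wins when both sets exist
    for ext in ("bowtie", "sam"):
        genomic = project_dir / f"{name}.{ext}"
        junction = project_dir / f"{name}_sj.{ext}"
        if genomic.is_file() and junction.is_file():
            return ext, genomic, junction
    raise FileNotFoundError(
        f"no complete alignment set ({name}.bowtie + {name}_sj.bowtie, "
        f"or {name}.sam + {name}_sj.sam) found in {project_dir}"
    )
```

For example, find_alignment_inputs("mcf7") would report which pair of files in ./mcf7 matches the convention, or raise an error if neither pair is complete.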
Now, you have alignment files, and you can execute IsoformEx.
nohup ./run_isoformex.sh MCR/v713 transcript_models/hg18_ucsc_transcript_model.bed mcf7 > mcf7/mcf7.log &
That's It!!!
Advanced: When you have your own transcript model
First of all, you need to make a tab-delimited transcript model file. Examples can be found in the 'transcript_models' directory. The format of transcript model files is similar to the format of gene definition files in the UCSC database (more specifically, the UCSC format plus two additional columns).
Column Specification
col1: transcriptID
col2: chromosome (e.g. chr1, ..., chrX, chrY), but do not include chrM, chr?_random, chr?_*_hap?
col3: strand information (+ or -)
col4: start position (0-based)
col5: end position
col6: coding start position (0-based; for ncRNA, set col6 to the value of col4)
col7: coding end position (for ncRNA, set col7 to the value of col4)
col8: # of exons
col9: exon start positions (0-based)
col10: exon end positions
col11: UniProt accession (e.g. Q9BV57) or RefSeq protein ID (NP_xxx)
col12: transcriptID (duplicates col1, but specify it by convention)
col13: gene symbol (additional column; it is required)
col14: Entrez GeneID (number)
When there is no Entrez GeneID for a transcript, set col14 to 0. Here is an example of a line in the file.
uc002qxp.2 chr2 - 3480696 3502354 3481740 3502262 4 3480696,3483591,3496635,3502142, 3481860,3483771,3496755,3502354, Q9BV57 uc002qxp.2 ADI1 55256
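Before copying a custom model into place, it can help to check each line against the column specification above. The following is a minimal sketch of such a check, not IsoformEx's own parser: TranscriptModel and parse_transcript_line are hypothetical names, and the consistency checks (exon count matches col8, col12 equals col1) are assumptions derived from the column descriptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TranscriptModel:
    transcript_id: str
    chrom: str
    strand: str
    start: int            # col4, 0-based
    end: int              # col5
    cds_start: int        # col6, 0-based
    cds_end: int          # col7
    exon_starts: List[int]  # col9, 0-based
    exon_ends: List[int]    # col10
    protein_id: str       # col11
    gene_symbol: str      # col13
    entrez_gene_id: int   # col14, 0 when no Entrez GeneID exists


def parse_transcript_line(line):
    """Parse one tab-delimited line of a transcript model file (sketch)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 14:
        raise ValueError(f"expected 14 tab-delimited columns, got {len(cols)}")
    if cols[2] not in ("+", "-"):
        raise ValueError(f"strand must be + or -, got {cols[2]!r}")
    # The example line ends each exon list with a comma, so skip empty items.
    exon_starts = [int(x) for x in cols[8].split(",") if x]
    exon_ends = [int(x) for x in cols[9].split(",") if x]
    if not (len(exon_starts) == len(exon_ends) == int(cols[7])):
        raise ValueError("exon lists do not match the exon count in col8")
    if cols[11] != cols[0]:
        raise ValueError("col12 should repeat the transcriptID from col1")
    return TranscriptModel(
        transcript_id=cols[0], chrom=cols[1], strand=cols[2],
        start=int(cols[3]), end=int(cols[4]),
        cds_start=int(cols[5]), cds_end=int(cols[6]),
        exon_starts=exon_starts, exon_ends=exon_ends,
        protein_id=cols[10], gene_symbol=cols[12],
        entrez_gene_id=int(cols[13]),
    )


# The example line from this page, rebuilt with explicit tab separators.
EXAMPLE_FIELDS = [
    "uc002qxp.2", "chr2", "-", "3480696", "3502354", "3481740", "3502262",
    "4", "3480696,3483591,3496635,3502142,",
    "3481860,3483771,3496755,3502354,", "Q9BV57", "uc002qxp.2", "ADI1", "55256",
]
model = parse_transcript_line("\t".join(EXAMPLE_FIELDS))
print(model.gene_symbol, len(model.exon_starts), model.entrez_gene_id)
```

Running the parser over every line of a new model file is a cheap way to catch a missing column or a mismatched exon count before starting a long IsoformEx run.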
Ok, now you need to copy your transcript model file into the directory of 'transcript_models'.
Because you have your own transcript model, you cannot use the standard scripts for generating the alignment result files. The mapping to the genome is unaffected, but you need to make your own alignment file for splice junction detection, because the splice junctions change when the exon information changes.
The alignment files (bowtie or SAM format) are the input of IsoformEx, so you need to prepare them before executing IsoformEx. The splice junction locations in the alignment files must follow the naming rule that IsoformEx expects.
Here is a line of the SAM file having splice junction information.
SRR015274.16 16 chr16:14951437-14951468:14951578-14951609 6 0 32M * 0 0 CGGGCAGAGGACTACTACAGATGCAAAATCAC CUYYIYYYYYYOYYRYYMYYYYYTYYYYYYYY XA:i:0 MD:Z:32 NM:i:0 XM:i:2
As you can see, the 3rd column holds a long splice junction descriptor instead of a chromosome name. Its format is chromosomeName:a-b:c-d, where 'b' is the end of the first exon, 'c' is the start of the second exon of the junction, and 'a' and 'd' are chosen so that b-a+1=32bp and d-c+1=32bp.
This is the location of the splice junction. The overlapping length on the left exon is len1=b-a+1=32bp. The overlapping length on the right exon is len2=d-c+1=32bp.
When the left exon is shorter than 32bp, 'a' is the start of the first exon and b-a+1 equals the length of the first exon, which is less than 32bp.
The tag mapping position is in the 4th column, i.e. 6. Thus, this tag starts at a+6-1 when a+6-1 ≤ b.
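The arithmetic above can be checked with a short sketch that decodes the 3rd and 4th SAM columns of the example line. The function names are hypothetical and this is not IsoformEx code. The case a+pos-1 > b is not described in the text, so the sketch returns None for the tag start there rather than guessing.

```python
def parse_junction_name(rname):
    """Split 'chromosomeName:a-b:c-d' into (chrom, a, b, c, d)."""
    chrom, left, right = rname.rsplit(":", 2)
    a, b = (int(x) for x in left.split("-"))
    c, d = (int(x) for x in right.split("-"))
    return chrom, a, b, c, d


def describe_junction_hit(rname, pos):
    """Compute the overlap lengths and the tag start for one SAM record.

    rname is the 3rd SAM column and pos the 4th (1-based offset into the
    junction sequence).
    """
    chrom, a, b, c, d = parse_junction_name(rname)
    len1 = b - a + 1  # overlapping length on the left exon
    len2 = d - c + 1  # overlapping length on the right exon
    tag_start = a + pos - 1
    if tag_start > b:
        tag_start = None  # not covered by the description above
    return {"chrom": chrom, "len1": len1, "len2": len2, "tag_start": tag_start}


hit = describe_junction_hit("chr16:14951437-14951468:14951578-14951609", 6)
print(hit)
# {'chrom': 'chr16', 'len1': 32, 'len2': 32, 'tag_start': 14951442}
```

For the example record, both overlaps are 32bp and the tag starts at 14951437+6-1 = 14951442, which lies within the first exon because it does not exceed b = 14951468.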
In order to get this alignment file, you need to make a splice junction FASTA file whose record names use the same format (chromosomeName:a-b:c-d) and whose sequences are the splice junction sequences at those locations. You can then map tags to this FASTA file.
Here is an example of a splice junction FASTA file.
>chr1:2059-2090:2476-2507
GAGAGCATCAACTTCTCTCACAACCTAGGCCAGTGTGTGGTGATGCCAGGCATGCCCTTCCCCA
>chr1:2059-2090:3084-3115
GAGAGCATCAACTTCTCTCACAACCTAGGCCAGCAGGGCCATCAGGCACCAAAGGGATTCTGCC
...
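A record of this shape could be assembled from a pair of exons roughly as follows. This is a sketch under stated assumptions, not the script used to build the files shipped with IsoformEx: junction_record is a hypothetical function, coordinates are taken to be 1-based and inclusive (consistent with b-a+1 being a length), and clipping 'd' for a short right exon is assumed by analogy with the short left exon rule. The chromosome sequence is a toy string.

```python
def junction_record(chrom, chrom_seq, exon1, exon2, flank=32):
    """Build one FASTA record named 'chrom:a-b:c-d' for a splice junction.

    exon1 and exon2 are (start, end) pairs, assumed 1-based and inclusive.
    """
    exon1_start, b = exon1  # b: end of the first exon
    c, exon2_end = exon2    # c: start of the second exon
    # Take up to `flank` bases on each side, clipped to the exon boundaries.
    a = max(exon1_start, b - flank + 1)
    d = min(exon2_end, c + flank - 1)
    # Convert 1-based inclusive coordinates to Python slices.
    sequence = chrom_seq[a - 1:b] + chrom_seq[c - 1:d]
    return f">{chrom}:{a}-{b}:{c}-{d}\n{sequence}"


# Toy chromosome: 10 A, 40 C (exon 1), 50 G (intron), 40 T (exon 2).
toy_chr1 = "A" * 10 + "C" * 40 + "G" * 50 + "T" * 40
print(junction_record("chr1", toy_chr1, (11, 50), (101, 140)))
```

With these toy inputs the record is named chr1:19-50:101-132 and its sequence is 32 C followed by 32 T, i.e. 64 bases, the same length as the sequences in the example above.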
In order to map tags to these sequences, a new index must be built. If you use bowtie, see its documentation on building a new index.
Once you have alignment files generated by an aligner, you can execute IsoformEx. Let's suppose we have generated a splice junction mapping file for the mcf7 project (mcf7_sj.sam). Let's execute IsoformEx.
mcf7/mcf7.sam
mcf7/mcf7_sj.sam
nohup ./run_isoformex.sh MCR/v713 transcript_models/your_transcript_model.bed mcf7 > mcf7/mcf7.log &
That's It!!!
FAQ
Contact information: Hyunsoo Kim (hkim@wistar.org)