Frequently Asked Questions – Illumina
Illumina Proposal and sample submission questions
What services does the Sequencing Facility provide?
Who can order services through the Sequencing Facility?
How do I submit a sequencing proposal?
How do I obtain a CCR-SF LIMS account?
How to I submit an order to the CCR-SF LIMS?
How do I submit samples?
What are the quality requirements for submitted samples?
What happens after my sample is submitted?
How can I check the status of my order/sample?
Illumina Bioinformatics questions
What analyses does the SF perform?
What software programs does the SF use to perform analysis?
What types of file data/formats will I receive from SF?
How can I view the sequence data in the sorted BAM files?
What are some resources for viewing mapped data?
How do I convert BAM to FastQ files?
What type of files will I receive for exome sequencing variant calling from SF?
What are the options used in the variant calling process for exome sequencing?
What type of filtering is performed during the variant calling process for exome sequencing?
How large are the files?
How is the file data delivered?
How do I analyze the data?
How long is the data available?
Please see the services page for a detailed list of projects we support. If your project design is not listed, please contact Bao Tran, the sequencing facility director, to discuss the feasibility of a custom project.
Contact Bao Tran to find out if your research lab is eligible.
Please complete a sequencing proposal form (available under Protocols and Resources) and submit it to firstname.lastname@example.org. You may also contract Bao Tran to discuss the available platforms and best choice for your project.
The PI heading the project should email email@example.com with the names of the people needing to be added to your lab group.
The LIMS user guide, downloadable from the Protocols and Resources page, contains step-by-step instructions for order submission. If this is your first time placing an order, you must also email the bioinformatics group through firstname.lastname@example.org to set up your user account and arrange for a live demonstration in your laboratory.
Before submitting samples, ensure that the sequencing proposal has been approved and the CSAS request submitted. You may then submit samples by delivering them to Jyoti Shetty at the ATRF Room D3040. Be sure to include a sample manifest with your submission.
- ChIP DNA Sequencing: minimum DNA/RNA 10ng, 20ng optimal. maximum sample volume 30μL. DNA should be as intact as possible with no contamination.
- gDNA Sequencing: minimum DNA/RNA 1μg, 3μg optimal. maximum sample volume 30μL. DNA should be as intact as possible with no contamination.
- mRNA Sequencing: minimum DNA/RNA 1μg, 3μg optimal. maximum sample volume 50μL. RIN should be > 8.0.
- miRNA Sequencing: minimum DNA/RNA 5μg, 10μg optimal. maximum sample volume 50μL. RIN should be > 8.0.
- Mate-pair Sequencing: minimum DNA/RNA 10μg, 15-20μg optimal. maximum sample volume 70μL. DNA should be as intact as possible with no contamination.
Before sequencing, we will perform an internal QC to confirm the information in the sample manifest and notify you if any samples do not meet minimum sequencing requirements. You will then be able to choose whether to resubmit those samples or continue and sequence them. You will be notified again when the analysis on each sample is completed and available for download.
Currently, we are performing the primary analysis including QC, basecalling and alignment of each sequence to its reference. For exome sequencing projects only, we also perform variant calling using illumina CASAVA software. We also ensure that each sequence delivered meets minimum standards for yield, basecall quality, percent aligned, and errors allowed.
We currently use vendor provided CASAVA 1.8.2 for sequence alignment and variant calling (for exome-seq only). In addition, we use Bowtie software for projects that require additional mapping. We also use Picard, fastx toolkit, fastqc and in-house scripts for quality assessment.
You will receive the illumina pass filtered sequence data in fastq format archived in to one file and alignment data BAM files. The BAM file contains basecalls and quality score information for all pass filtered reads as well as alignment information for reads that have mapped to the reference sequence.
Since the BAM files are in binary, to view the raw sequence file in the command window you will need to use freeware programs such as Samtools or Picard. These are very good for doing basic manipulation of BAM files. Once the program is installed, you can use a command such as
username@host:$ samtools view <path-to-file.bam> | less
to view the read data. Each line of the file represents one read and will be in tab-delimited SAM format, like the following example:
HWI-ST165_283:5:2304:12571:70958 80 chr1 135119 12 100M * 0 0 GGAGGCCGCAGTGAGGCGAGAGCTAGCTGGGCGTGGAGAGTCCGCTGTGAGGCCGAGGCCGAGGCTGGGCCCGTGCAGGCCTTCGAGACGCAGGAGGCCG ################################CCCCCC=6BB>?@3;@>8BD<:;@-(<F@@B8=BGIIIIDIHDDDEBE=GGCB:F@@3HDDB?DD@?@ BC:Z:0 XD:Z:100 SM:i:12
The following table describes the Illumina-specific contents of each field:
|1||QNAME||Query name or Query paired name, using the following format:|
|<Machine>_<Run number>:<Lane>:<Tile>:<X coordinate of cluster>:<Y coordinate of cluster>|
|2||FLAG||Bitwise flag describing the orientation, mapping status, and pair status of the read. The number that appears is cumulative for several potential flags; the Picard site has a useful tool for interpreting this number.|
|3||RNAME||Name of the matching reference sequence or chromosome|
|4||POS||1-based leftmost POSition/coordinate of clipped sequence|
|5||MAPQ||MAPping Quality/Q-score (Phred-scaled, indicates probability that the mapping position is incorrect)|
|6||CIGAR||Extended CIGAR string (describes location of alignment to the reference and location of any indels)|
|7||MRNM||Mate Reference sequence NaMe (‘=’ if same as RNAME)|
|8||MPOS||1-based leftmost Mate POSistion|
|9||ISIZE||Inferred insert size|
|10||SEQ||query SEQuence on the same strand as the reference|
|11||QUAL||query QUALity string (ASCII-33 gives the Phred base quality)|
|12+||OPT||OPTional fields for each read. The fields are formatted as <TAG>:<ValueType>:<Value> and may list alignment scores, mismatch positions, and unmapped read information|
The UCSC genome browser and Integrative Genomics Viewer are two popular tools to view mapped sequence data. To view the data, you will need both the sequence data file (sorted.bam) and the index file (sorted.bam.bai). You will only need the reference file (sorted.bam.fa.gz) if you are using a custom reference genome.
For labs that desire to return the BAM data to FastQ format, here are two examples of file conversion using open source software tools:
A. Using the Picard java program:
1. Download Picard software tools from http://sourceforge.net/projects/picard/files/picard-tools/
2. Unzip the compressed source file: unzip picard-tools-version-num.zip
Please note, you need to download and install the Java Run Time Environment if you don’t have it in your server. Type java –version from your terminal can tell you whether the Java Runtime Environment is installed or not.
3. There are several jar files in the Picard java package. Each jar file is a separate command line tool. Find the jar file named as SamToFastq.jar. Run the SamToFastq.jar converter:
4. To extract all reads for single read run:
java –Xmx2g –jar picard-tools-version-num/SamToFastq.jar INPUT=input.bam FASTQ=read1.fastq INCLUDE_NON_PF_READS=true 2>run.err
5. To extract pass filter reads for paired end run:
java –Xmx2g –jar picard-tools-version-num/SamToFastq.jar INPUT=input.bam F=read1.fastq F2=read2.fastq 2>run.err
6. To extract all reads for paired end run:
java –Xmx2g –jar picard-tools-version-num/SamToFastq.jar INPUT=input.bam F=read1.fastq F2=read2.fastq INLCUDE_NON_PF_READS=true 2>run.err
For more information about Picard software tool, please visit the software website at: http://picard.sourceforge.net/index.shtml
B. Using the bam2fastq command line software tool:
1. Download bam2fastq software tools at http://www.hudsonalpha.org/gsl/software/bam2fastq.php
2. Download and install the prerequisite library such as zlib.
3. Unzip and install the source code: tar –xvzf bam2fastq-version-num.tgz
4. cd to the bam2fastq-version-num directory and type “make”
5. Run the converter:
6. To extract all reads for single read run:
bam2fastq -o output.fastq input.bam –no-filtered 2>run.err
7. To extract all reads for paired-end run:
Bam2fastq –o output_read#.fastq input.bam –no-filtered 2>run.err
8. To extract pass filtered reads and aligned reads only for paired reads:
bam2fastq -o output_read#.fastq input.bam –aligned 2>run.err
For more details about each command options, please check http://www.hudsonalpha.org/gsl/software/bam2fastq.php for information.
- Summary statistics files: snps.summary.txt, indels.summary.txt, dupCount.summary.txt, coverage.summary.txt
- snps.txt file per sample
- indels.txt per sample
The following table describes the contents of snps.summary.txt file
|2||hom||Number of homozygous snps|
|3||het||Number of heterozygous snps|
|4||het_nonref||Numer of heterozygous non reference snps|
The following table describes the contents of indels.summary.txt file
|2||insertion||Number of insertions|
|3||deletion||Number of deletions|
|4||breakpoints||Number of breakpoints *|
|5||het||Number of heterozyhous indels|
|6||hom||Number of homozygous indels|
* Breakpoint calls correspond to one of the two “ends” of a large indel or structural variant, for which the complete variant is either unknown or cannot be represented by the small variant caller. Breakpoint events are reported as either left or right breakpoints.
A left breakpoint corresponds to a haplotype which can be mapped on the left side of the breakpoint location but not on the right. A right breakpoint indicates that a haplotype can be mapped to the right of the breakpoint location but not to the left. If a simple insertion or deletion were represented as two breakpoint calls, then they would occur on the forward strand as a left breakpoint followed by a right breakpoint.
The following table describes the contents of dupCount.summary.txt file
|2||nonduplicate_pairs||nonduplicate reads pairs aligned to the given chromosome|
|3||duplicate_pairs||duplicate pairs aligned to the given chromosome|
The following table describes the contents of snps.txt file
|1||seq_name||Reference sequence label|
|2||pos||Sequence position of the site/snp|
|3||bcalls_used||Basecalls used to make the genotype call for this site|
|4||bcalls_filt||Basecalls mapped to the site but filtered out before genotype calling|
|6||Q(snp)||A Q-value expressing the probability of the homozygous reference genotype, subject to the expected rate of haplotype difference as expressed by the (Watterson) theta parameter *|
|7||max_gt||The most likely genotype (subject to theta, as above).|
|8||Q(max_gt)||A Q-value expressing the probability that the genotype is not the most likely genotype above (subject to theta).|
|9||max_gt|poly_site||The most likely genotype assuming this site is polymorphic with an expected allele frequency of 0.5 (theta is still used to calculate the probability of a third allele — i.e. the chance of observing two non-reference alleles).|
|10||Q(max_gt|poly_site)||A Q-value expressing the probability that the genotype is not the most likely genotype above assuming this site is polymorphic.|
|11||A_used||‘A’ basecalls used|
|12||C_used||‘C’ basecalls used|
|13||G_used||‘G’ basecalls used|
|14||T_used||‘T’ basecalls used|
* The parameter theta as used in the variant calling model refers to the expected proportion of differing sites between two chromosomes sampled from the population. For site-genotyping, it is set by default to 1/1000, a value appropriate for human resequencing. Raising this value, to e.g. 1/100, would have the effect of increasing the prior expectation of a non-reference genotype and increase Q(snp) values. The parameter theta for indels is set to a default value of 1/10,000.
The following table describes the contents of indels.txt file
|1||seq_name||Reference sequence label|
|2||pos||Except for right-side breakpoints, the reported start position of the indel is the first (left-most) reference position following the indel breakpoint. For right-side breakpoints the reported position is the right-most position preceding the breakpoint. Also note that wherever the same indel could be represented in a range of locations, the caller attempts to report it in the left-most position possible.|
|3||type||String summarizing the indel type. One of:
• nI—Insertion of length n (e.g. 10I is a 10 base insertion)
• nD—Deletion of length n (e.g. 10D is a 10 base deletion)
• BP_LEFT—Left-side breakpoint
• BP_RIGHT—Right-side breakpoint
|4||ref_upstream||Segment of the reference sequence 5’ of the indel event. For right-side breakpoints this field is set to the value ‘N/A’.|
|5||ref/indel||Equal length sequences corresponding to the reference and indel alleles which span the indel event. The character ‘-‘ indicates a gap sequence of the reference or the indel allele.|
|6||ref_downstream||Segment of the reference sequence 3’ of the indel event. For left-side breakpoints this field is set to the value ‘N/A’.|
|7||Q(indel)||Phred scaled quality score of the indel, which refers to probability that this indel does not exist at the given position. The Q-values given only reflect those error conditions which can be represented in the indel calling model, which is not comprehensive. By default the variant caller reports all indels with Q(indel) > 0.|
|8||max_gtype||Most probable indel genotype. The indel genotype categories are as follows:
hom refers to a homozygous indel.
het refers to a heterozygous indel.
ref refers to no indel at this position.
Note that these do refer to true genotypes where indels overlap because the model is not capable of jointly calling overlapping indels. In the case of overlapping indels, max_gtype refers to the most likely copy-number of the indel. Note that indel calls where ref is the most-likely genotype will be reported. These correspond to indels with very low Q(indel) values.
|9||Q(max_gtype)||Phred scaled quality score of the most probable indel genotype, which refers to the probability that the genotype of the indel is not that given as “max_gtype”. The Q-values given only reflect those error conditions which can be represented in the indel calling model, which is not comprehensive.|
|10||depth||Except for right-side breakpoints, this field reports the depth of the position preceding the left-most indel breakpoint. For right-side breakpoints this is the depth of the position following the breakpoint.|
|11||alt_reads||Number of reads strongly supporting either the reference path or an alternate indel path.|
|12||indel_reads||Number of reads strongly supporting the indel path.|
|13||other_reads||Number of reads intersecting the indel, but not strongly supporting either the reference or any one indel path.|
|14||repeat_unit||The smallest repeating sequence unit within the inserted or deleted sequence. For breakpoints this field is set to the value ‘N/A’.|
|15||ref_repeat_count||Number of times the repeat_unit sequence is contiguously repeated starting from the indel start position in the reference case.|
|16||indel_repeat_count||Number of times the repeat_unit sequence is contiguously repeated starting from the indel start position in the indel case.|
- Coverage cutoff parameter is not applied for both SNP call and indel calls
- All SNPs/indels detected are reported in the snps.txt/indels.txt file with following cutoffs Q(snp)>0 or Q(indels) > 0 and basecall quality Q(score) >0
- Summary Statistics reporting SNPs and indel cutoff Q(snp)>20 or Q(indels) >20
- PCR or optical duplicates are removed
- Reads with low alignment score are filtered (paired-end alignment score less than 90).
- Any reads which are not marked as being part of a ‘proper pair’ are removed from consideration. This is intended to remove any reads from chimeric pairs, with unmapped mates or with an anomalous pair insert size.
- ‘N’ bases on the end are trimmed
- Mismatch limited as 2 bases within the 20bases of the snp call.
- Remove SNP calls near centromeres to prevent spurious SNP calls.
How large are the files?
Because next-gen sequencing is still a rapidly evolving field, this answer changes regularly. Please contact the bioinformatics group for current data delivery file size information.
Sequence files will initially be made available by download through the CCR-SF LIMS. You will need to arrange an account for each lab member planning to log in. Data delivery via hard drive is no longer the default method, please be sure to specifically request HD delivery when submitting the CSAS if you wish to receive one.
For labs with the appropriate resources, we can quickly and directly transfer the files via ftp. Please contact the bioinformatics group to discuss your options.
The CCRIFX Bioinformatics core is a new on-site core resource providing bioinformatics analysis assistance to investigators of the Center for Cancer Research at NCI. If you need assistance for data analysis, please check CCRIFX core web site for the services and office hours they provide.
The original sequence and analysis files are available for download through the LIMS for two weeks after you receive an email confirming delivery of the data. It is the responsibility of the investigator to ensure they have retrieved their data within the time frame. To maintain sufficient data storage for upcoming runs, the analysis files are then archived and stored for an additional four weeks.
If your data is no longer available for download, please contact the bioinformatics group and we can re-run the data processing and alignment as necessary. However, please note that it may take longer to receive the re-analyzed data because of resource conflicts with current production runs. Whenever possible it is best to download the data in a timely manner after received the data delivery notice.
For further questions please contact us directly.
For ordering and pricing information, head back to Resources.