Journal of Leisure 閒記

It is indeed very hard to keep a homepage up-to-date. I couldn't count how many homepages I have created and then given up. As a result, I decided to write a blog, because it is much easier to manage, so I presume I will update it more often. But keeping a not-so-updated blog just means no one is going to read it. So I also decided to write something here that describes my daily life best, just to keep it not-so-outdated :-)
It's been a hard day's night
And I've been working like a dog
It's been a hard day's night
I should be sleeping like a log

Tuesday, March 13, 2012

CG vs. Illumina (Sensitivity)

Came across MJ's post in response to CG's post about our sequencing platform paper in Nature Biotechnology.


MJ made a good point that our small set of Sanger sequencing data was only suggestive. Here are my thoughts.

A confidence level of 95% and a margin of error of 5% for each of the platform-specific call sets requires a minimum sample size of ~380. Any further estimate based on a set too small to be statistically significant is inconclusive. That's why we went on to SureSelect at a larger scale, which gives us a statistically significant result.
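As a rough check on that sample size, here is a minimal sketch of the standard formula for estimating a proportion, n = z² · p(1 − p) / e². It assumes the most conservative case p = 0.5. The function name and the optional `population` argument (a finite population correction) are my own illustration, not something from the paper.

```python
import math
from statistics import NormalDist


def min_sample_size(confidence=0.95, margin=0.05, p=0.5, population=None):
    """Minimum sample size to estimate a proportion within +/- margin.

    p=0.5 is the worst case (largest variance). If `population` is given,
    a finite population correction is applied (hypothetical helper).
    """
    # Two-sided z-score, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


print(min_sample_size())  # 385
```

Without the correction this gives 385. Correcting for a finite call set lowers it only slightly, so a figure of around 380 is in the right range.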

As mentioned in the paper, the SureSelect validation may be biased, since it was followed by Illumina sequencing. But if there were a strong bias towards Illumina due to systematic errors, the invalidation rate for Illumina itself probably wouldn't be as high as that for Complete.

Let's take the existing Sanger numbers and recalculate them with their possible errors. At the same 95% confidence level mentioned above, the best possible validation rate for Illumina is 30% and the worst for Complete is 83%, which convert into 104K and 83K true positives in their specific call sets, respectively. That said, Illumina still has a higher sensitivity, whereas Complete is more accurate (lower FDR).
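To show how such bounds can be obtained and extrapolated, here is a minimal sketch using the normal-approximation (Wald) interval for a proportion. That choice of interval is an assumption on my part, and the counts and call-set size below are hypothetical placeholders, not the actual Sanger numbers.

```python
import math
from statistics import NormalDist


def wald_interval(validated, tested, confidence=0.95):
    """Normal-approximation interval for a validation rate, clipped to [0, 1]."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    rate = validated / tested
    half_width = z * math.sqrt(rate * (1 - rate) / tested)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


def extrapolate(rate, call_set_size):
    """Estimated true positives if `rate` held across the whole call set."""
    return rate * call_set_size


# Hypothetical inputs: the same 50% observed rate at two sample sizes.
CALL_SET_SIZE = 100_000  # placeholder, not a number from the paper
for validated, tested in [(10, 20), (200, 400)]:
    low, high = wald_interval(validated, tested)
    print(tested, round(extrapolate(low, CALL_SET_SIZE)),
          round(extrapolate(high, CALL_SET_SIZE)))
```

With only 20 tested calls, the extrapolated range spans tens of thousands of variants. With 400 it is far narrower, which is the whole argument for validating at a larger scale.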

If it looks unfair, that's the problem with extrapolating from a set with big error bars. What we can do is run Sanger sequencing at a larger scale on the platform-specific calls, which would give us a better and less controversial sense of the potential ground truth.

Until then, we gotta believe that they both have their pros and cons, and that both performed very well overall.

Detecting and annotating genetic variations using the HugeSeq pipeline

Deciphering genome sequences is important for the mapping of genetic diseases and prediction of their risks. Advances in high-throughput DNA sequencing technologies using short read lengths have enabled rapid sequencing of entire human genomes and unlocked the potential for comprehensive identification of their underlying genetic variations. Various computational algorithms for identifying and characterizing variants have been developed; however, most of these computational methods are neither integrated nor interoperable, making it difficult for biologists to extract all the genetic information from billions of sequences generated by these sequencing technologies. We developed HugeSeq, an integrated computational pipeline to fully automate the process of variant detection from alignment of these genomic sequences to detection and annotation of all types of genetic variations (single nucleotide polymorphisms (SNPs), short insertions or deletions (indels) and larger structural variations (SVs)).

Reference: Nature Biotechnology. 2012 Mar 7;30(3):226-9.



Performance comparison of whole-genome sequencing platforms

Whole-genome sequencing is becoming commonplace, but the accuracy and completeness of variant calling by the most widely used platforms from Illumina and Complete Genomics have not been reported. We sequenced the genome of an individual with both technologies to a high average coverage of ~76X, and compared their performance with respect to sequence coverage and calling of single-nucleotide variants (SNVs), insertions and deletions (indels). Although 88.1% of the ∼3.7 million unique SNVs were concordant between platforms, there were tens of thousands of platform-specific calls located in genes and other genomic regions...