MJ made a good point that our small set of Sanger sequencing data is only suggestive. Here are my thoughts.
Estimating a validation rate for each platform-specific call set at a 95% confidence level with a 5% margin of error requires a minimum sample size of ~385. Any estimate extrapolated from a sample far below that is inconclusive. That's why we went on to SureSelect at a larger scale, which gives us a statistically meaningful result.
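For reference, here is the arithmetic behind that figure, a minimal sketch using the standard normal approximation for a proportion (the function name is mine):

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(confidence=0.95, margin=0.05, p=0.5):
    # Two-sided critical value; ~1.96 for 95% confidence.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    # Normal-approximation sample size for estimating a proportion;
    # p = 0.5 is the worst case (maximum variance), so this is the
    # most conservative n.
    return ceil(z * z * p * (1 - p) / margin ** 2)

print(min_sample_size())  # -> 385, the ~385 figure cited above
```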
As mentioned in the paper, the SureSelect validation may carry a bias, since the capture was followed by Illumina sequencing. But if systematic errors created a strong bias toward Illumina, we would expect the invalidation rate for Illumina itself to be much lower than Complete's.
Let's take the existing Sanger numbers and redo the calculation with their possible errors. At the same 95% confidence level, the best plausible validation rate for the Illumina-specific calls is 30% and the worst for the Complete-specific calls is 83%, which convert into ~104K and ~83K true positives in their respective call sets. Even in this scenario, which is tilted in Illumina's favor, Illumina still has the higher sensitivity, whereas Complete is more precise (lower FDR).
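To make the conversion explicit, here is a quick sketch; note that the platform-specific call-set sizes (~347K and ~100K) are back-solved from the figures above (104K / 0.30 and 83K / 0.83) and are my assumptions for illustration, not numbers taken from the paper:

```python
# Call-set sizes are back-solved assumptions, not figures from the paper.
specific_calls = {"Illumina": 347_000, "Complete": 100_000}
rate_bound = {"Illumina": 0.30,   # best-case Illumina validation rate
              "Complete": 0.83}   # worst-case Complete validation rate

for platform, n_calls in specific_calls.items():
    tp = n_calls * rate_bound[platform]
    print(f"{platform}: ~{tp / 1000:.0f}K estimated true positives")
# Illumina: ~104K  (more surviving calls -> higher sensitivity)
# Complete: ~83K   (smaller but cleaner call set -> lower FDR)
```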
If that looks unfair, it is the inherent problem of extrapolating from a set with large error bars. What we can actually do is Sanger sequencing at a larger scale on the platform-specific calls; that would give us a better sense of the ground truth and be less controversial.
Until then, we have to accept that both platforms have their strengths and weaknesses, and that both performed very well overall.