False Negatives Are a Significant Feature of Next Generation Sequencing Callsets

Created on 18th October 2016

Dean Bobo; Mikhail Lipatov; Juan L. Rodriguez-Flores; Adam Auton; Brenna M. Henn

Short-read, next-generation sequencing (NGS) is now broadly used to identify rare or de novo mutations in population samples and disease cohorts. However, NGS data are known to be error-prone, and post-processing pipelines have primarily focused on removing spurious mutations, or "false positives", from downstream genome datasets. Less attention has been paid to characterizing the fraction of missing mutations, or "false negatives" (FN). Here we interrogate several publicly available human NGS autosomal variant datasets, using corresponding Sanger sequencing as a truth set. We examine both low-coverage Illumina and high-coverage Complete Genomics genomes. We show that the FN rate varies between 3% and 18% and that false-positive rates are considerably lower (<3%) for publicly available human genome callsets such as 1000 Genomes. The FN rate is strongly dependent on calling pipeline parameters as well as read coverage. Our results demonstrate that missing mutations are a significant feature of genomic datasets and imply that additional fine-tuning of bioinformatics pipelines is needed. To address this, we design a phylogeny-aware tool, PhyloFaN, which can be used to quantify the FN rate for haploid genomic experiments without generating additional validation data. Using PhyloFaN on ultra-high-coverage NGS data from both Illumina HiSeq and Complete Genomics platforms derived from the 1000 Genomes Project, we characterize the false negative rate in human mtDNA genomes. The false negative rate for the publicly available mtDNA callsets is 17-20%, even for extremely high-coverage haploid data.
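
To make the comparison against a truth set concrete, the following is a minimal sketch of how FN and false-positive rates might be computed from two sets of variants. It is not the authors' pipeline: the `Variant` tuple layout, the `error_rates` function, and the example variants are hypothetical, and the definitions used (FN rate as missed truth variants over all truth variants, false-positive rate as unsupported calls over all calls) are assumptions, since the abstract does not state the exact denominators. The sketch also assumes both sets are already restricted to the regions covered by Sanger sequencing.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


class ErrorRates(NamedTuple):
    true_positives: int
    false_negatives: int
    false_positives: int
    fn_rate: float  # assumed definition: FN / (TP + FN)
    fp_rate: float  # assumed definition: FP / (TP + FP)


def error_rates(truth: Set[Variant], calls: Set[Variant]) -> ErrorRates:
    """Compare an NGS callset with a truth set over the same regions."""
    tp = len(truth & calls)   # variants present in both sets
    fn = len(truth - calls)   # truth variants missing from the callset
    fp = len(calls - truth)   # called variants absent from the truth set
    fn_rate = fn / (tp + fn) if truth else 0.0
    fp_rate = fp / (tp + fp) if calls else 0.0
    return ErrorRates(tp, fn, fp, fn_rate, fp_rate)


# Made-up example data, for illustration only.
truth = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 1500, "G", "A"),
    ("chr2", 3000, "T", "C"),
    ("chr3", 500, "A", "T"),
}
calls = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 1500, "G", "A"),
    ("chr2", 3000, "T", "C"),
    ("chr3", 900, "G", "C"),  # not in the truth set: counted as a false positive
}

rates = error_rates(truth, calls)
print(f"FN rate: {rates.fn_rate:.2f}, FP rate: {rates.fp_rate:.2f}")
```

Matching on exact position and alleles is the simplest possible choice; a real comparison would likely need to normalize variant representation before taking set differences.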

Review Summary

# Status Date