Review for "Scaffolding and Completing Genome Assemblies in Real-time with Nanopore Sequencing"

Completed on 5 Oct 2016

Comments to author

Author response.

I read the manuscript by Cao et al., titled “Scaffolding and completing genome assemblies in real-time with nanopore sequencing”, with great interest. The authors describe a method for de novo assembly of small genomes that combines data generated with the Oxford Nanopore and Illumina sequencing technologies. Moreover, the authors take advantage of the “real-time” capability of the MinION device and propose to improve an existing assembly in real time. The manuscript describes the impact of sequencing coverage on assembly contiguity particularly well. The npScarf software is of broad interest to nanopore users. The package is available online and is easy to install and use.

1. The authors compared their method to existing tools that are based on two different strategies: using long reads to scaffold assemblies, or error-correcting long reads with short reads before the assembly step. However, there is a third category based on long-read-only assembly (followed by polishing of the consensus). The tools that belong to this category (Canu, Miniasm, Smart2Novo, for example) provide encouraging results (Castro-Wallace et al, bioRxiv, 2016 and Istace et al, bioRxiv, 2016) but are entirely missing from the manuscript.

We have now included this category (de novo assembly + polish with short reads) in our manuscript [Page 1, column 2; Page 8 column 1]. We also included Canu and Miniasm in our comparison in this revision. As these are de novo assembly methods, they require significantly more long-read data than our method; evidently, they failed to produce a decent assembly for the bacterial data sets (low coverage). On the yeast data set, they used twice as much data as npScarf, and yet their assemblies are less complete and less contiguous [Page 5; Page 7 column 2; Table 3].

The two references mentioned by the reviewer were posted on bioRxiv after the submission of our paper. We have now included them in this revision [Page 2, column 1].

2. The authors claimed that existing methods “have not made use of the real-time sequencing potential of the MinION”. This is particularly true for methods that need large computational resources, but tools like Miniasm, for example, are able to assemble a bacterial genome in a few seconds or minutes. Furthermore, before using npScarf one needs to produce a short-read assembly from an Illumina sequencing experiment, which cannot produce data in real time. In my opinion, a true “real-time assembler” should use nanopore data only.

As in the response to Reviewer 1, point 1, we agree that a true “real-time assembler” should use only nanopore data. In that sense, the npScarf pipeline is not real-time, as it requires a prior short-read assembly. We have reframed our algorithm as a real-time scaffolder that can be used to control resources during MinION sequencing.

3. The comparison with existing tools is not fair. Indeed, the authors used 50X of Illumina reads to error-correct long reads with Nanocorr and NaS, but they used the complete dataset (250X) with other tools. A lower coverage will mainly result in uncovered regions, due to the sequencing technology bias, and lead to assembly fragmentation. The authors of the Nanocorr and NaS tools reported near-perfect assemblies of the E. coli genome (Goodwin et al, Genome Research, 2005 and, even with a low coverage of nanopore data. The results reported here are conflicting; is this due to a difficulty in setting the parameters?

We used only 50X of Illumina reads for error correction, following the recommendations of these tools (the accuracy of these methods does not increase with more than 30X of Illumina data). Even with only 50X coverage of Illumina data, these methods took a very long time (over 7000 CPU-hours on the yeast data set), and the running time grows (linearly) with the size of the Illumina data.

The E. coli data set for which most tools reported near-perfect assemblies came from three flowcells reported by Quick et al 2015. The total coverage from the three flowcells is 147-fold. We used data from only one flowcell (the R7.3 flowcell), which gave 67-fold coverage.

The coverage reported on the website was the coverage of NaS assembly reads, that is, reads that have been corrected. NaS used only the 2D reads from the 147-fold coverage of the mentioned E. coli dataset (these make up only 25% of the dataset). These reads are then corrected, and only reads that can be corrected are selected and reported. These coverage statistics therefore do not reflect the amount of raw nanopore data. In our manuscript, we reported the coverage based on all raw nanopore data (both 1D and 2D).
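
To make the distinction between raw and subset coverage concrete, the arithmetic can be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes that the 25% figure refers to the fraction of sequenced bases (rather than the fraction of reads), which is not stated explicitly above.

```python
def fold_coverage(read_lengths, genome_size):
    """Fold coverage = total sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def subset_coverage(raw_coverage, fraction_of_bases):
    """Coverage contributed by a subset of the data (e.g. 2D reads only).

    Assumption: fraction_of_bases is the share of sequenced bases in the
    subset, not the share of reads.
    """
    if not 0.0 <= fraction_of_bases <= 1.0:
        raise ValueError("fraction_of_bases must be between 0 and 1")
    return raw_coverage * fraction_of_bases


# Figures quoted in the response: 147-fold raw coverage, 25% of it 2D reads.
raw = 147.0
two_d = subset_coverage(raw, 0.25)
print(f"raw: {raw:.2f}-fold, 2D subset before correction: {two_d:.2f}-fold")
```

Under that assumption, the 2D subset is an upper bound on the input to correction; the coverage of corrected reads that are finally reported can only be lower still, which is why it is not comparable to a raw-data coverage figure.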

The assembly results from Nanocorr and NaS are very sensitive to the parameters (during assembly with Celera Assembler). We ran the multiple specification files used by NaS and Nanocorr on each data set, and reported the best results. Our results from running these two methods on the yeast data set (the same data) were consistent with what was reported in their publications. The results on the E. coli assemblies were different because we used only a subset of their data set, as described above.

4. I wonder if the gap-filling approach is efficient, notably in tandemly duplicated regions. For instance, yeast genomes contain tandemly duplicated genes. Is the number of copies observed in the npScarf assembly consistent with the reference genome?

npScarf fills in the gaps by aligning the short contigs to the bridges in the gaps created by long reads. One thing to note is that the size of the gap between non-repetitive contigs (and hence the number of repeat units) is determined by the nanopore reads. It is difficult to draw firm conclusions on the accuracy of repeat unit typing from the yeast sample, as the reference strain (S288C) is not the same as the sequenced strain (W303). We feel that a comparison of the repeat length typing accuracy of npScarf with that of other assemblers is beyond the scope of the current paper.

5. The W303 npScarf assembly contains several chimeric contigs. Are they due to the presence of transposable elements? If so, this result should be discussed in the manuscript, as it may indicate the limits of the method in dealing with complex genomes.

The reference strain (S288C) is different from the strain from which the sequence data was obtained (W303), and hence we cannot easily distinguish rearrangements between these two strains from genuine assembly errors. However, we can compare the number of apparent assembly errors between npScarf and other assemblers, and we see from this that npScarf is highly competitive. We have included dot plots of the Canu and Miniasm assemblies for comparative purposes (Supp fig 3&4). We have inserted the following text to acknowledge the challenges posed by interspersed repeats to npScarf as well as other assemblers (page 7 col 1):

"We found these mis-assemblies were due to the presence of interspersed repeat
elements which are known for being problematic in assembly
analysis~\cite{TreangenS2012}. The assemblies produced by Canu and Miniasm also
presented several mis-assemblies fusing different chromosomes together, emphasising the challenges posed by interspersed repeats in assembling complex genomes
(See Supplementary Figures 4 and 5)."

Minor revisions:

1. The method first determines the multiplicity of each contig. It would be interesting to test the method under hard conditions, for instance with single-cell amplified or aneuploid genomes, where the coverage per base is heterogeneous.

This is an interesting suggestion. We will investigate this in further research.

2. Naming of ATCC 13883 scaffolds is different from the manuscript.

3. Legend of Sup Figure 3: S2880C instead of S288C

We have revised the manuscript to make sure the naming is correct and consistent.