Abstract. Background: Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies. Principal Findings: We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in bacteria and above kbp in more complex organisms.
|Published (Last):||3 March 2008|
Abstract Motivation: Many de novo genome assemblers have been proposed recently. Most existing methods are based on the de Bruijn graph: a complex graph structure that attempts to encompass the entire genome. Such graphs can be prohibitively large, may fail to capture subtle information and are difficult to parallelize.
Our results show that it is able to obtain assemblies that are more contiguous, complete and less error prone compared with existing methods. Alternatively it is available from authors upon request.
Supplementary information: Supplementary data are available at Bioinformatics online. The short length of the sequences coupled with high coverage and high level of noise has transformed de novo assembly to a tractable yet challenging proposition. The ease with which paired-end read libraries can be generated on these platforms is an added advantage. A number of methods have been proposed to assemble short reads.
Such arbitrary criteria result in substandard assemblies that are often a compromise between contiguity and error rate. Furthermore, these approaches do not scale to medium or large genomes; therefore, their use is restricted to assembling BAC clones or small bacterial genomes.
They were also not designed to make use of paired-end reads, which greatly limits their usefulness in assembling high-throughput data. The more practical approaches for assembling high-throughput short reads are based on the de Bruijn graph approach.
Velvet (Zerbino and Birney) is perhaps the most widely used method for de novo genome assembly today. It is very fast in execution, fairly memory efficient and produces reasonably accurate assemblies.
Like all other methods based on the de Bruijn graph, Velvet requires the entire genome to be stored in a graph structure. In the presence of noise, the graph may be too large to be stored in system memory. Furthermore, the assembly generated by Velvet tends to contain many errors at small repeat regions. However, in practice, we noted that Velvet produces more contiguous and complete assemblies in comparison with Euler-USR.
One of the major shortcomings of de Bruijn graph approaches is the inability to parallelize the assembly process. This is a critical requirement, as many powerful computers use multiple processors on which numerous threads can run in parallel. However, we noticed that when executed in parallel on a single multi-core computer, Abyss does not offer any advantage over Velvet in terms of execution time or memory usage.
Using Abyss efficiently requires a multi-node computing cluster, which may seem a disadvantage in an era where computers are increasingly made faster by adding more cores within a single CPU. SOAPdenovo Li et al. It introduces an interesting hybrid approach where the genome is still stored as a large graph; however, the graph is separated into different segments and the assembly of these segments can be carried out independently.
This makes it possible to run some stages of the Allpaths algorithm in parallel. The high accuracy of Allpaths stems from the fact that it tries all possible ways to assemble every segment; however, this comes at a tremendous cost in terms of time and memory usage, and therefore it will not scale well to larger genomes. We propose the method PE-Assembler, which is capable of handling large datasets and produces highly contiguous and accurate assemblies within reasonable time.
However, it improves upon such early approaches in multiple ways. The extensive use of paired-end reads ensures that the dataset is localized within the region. Hence, our method can be run in parallel to greatly speed up the execution while staying within reasonable system requirements. Ambiguities are resolved using a multiple path extension approach, which takes into account sequence coverage, support from multiple paired libraries and more subtle information such as the span distribution of the paired-end reads.
The length of the fragment is referred to as the insert size. For every paired-end read, its two reads are called the mates of each other. The length of each read is denoted as ReadLength. It could be of any length from 25 to bp. The insert size is not exact.
It may vary from MinSpan to MaxSpan. Our program is called PE-Assembler, which aims to reconstruct the sample genome from a paired-end read library. PE-Assembler can also accept multiple paired-end read libraries of different insert sizes, which can help resolve ambiguities that cannot be conclusively resolved using a single paired-end read library.
The procedure is illustrated in Figure 1. Given a sequence, PE-Assembler extracts all reads whose prefix aligns with the suffix of the sequence. We define this as an overlap. The base that each overlapping read places immediately after the sequence is a feasible extension. If there is a clear consensus for a single base, then that base is appended to the end of the sequence and the process is iterated. Multiple feasible extensions are handled differently in various stages of the algorithm and are described in the following sections.
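The overlap extension described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PE-Assembler's implementation: the function names, the exact-match overlap test, the `min_overlap` parameter and the `min_fraction` consensus threshold are all hypothetical choices made for the example.

```python
from collections import Counter

def feasible_extensions(sequence, reads, min_overlap=3, min_fraction=0.9):
    """Collect the next-base votes of all reads overlapping the end of `sequence`.

    Returns (consensus_base_or_None, votes). `min_overlap` and `min_fraction`
    are illustrative assumptions, not values taken from the paper.
    """
    votes = Counter()
    for read in reads:
        # The read must extend past the sequence by at least one base,
        # so the overlap can cover at most len(read) - 1 bases.
        max_len = min(len(read) - 1, len(sequence))
        for olen in range(max_len, min_overlap - 1, -1):
            if sequence.endswith(read[:olen]):   # read prefix == sequence suffix
                votes[read[olen]] += 1           # base right after the overlap
                break                            # use the longest overlap only
    if not votes:
        return None, votes
    base, count = votes.most_common(1)[0]
    consensus = base if count / sum(votes.values()) >= min_fraction else None
    return consensus, votes

def extend(sequence, reads, max_steps=1000, **kwargs):
    """Append consensus bases until there is no clear single extension."""
    for _ in range(max_steps):
        base, _ = feasible_extensions(sequence, reads, **kwargs)
        if base is None:
            break
        sequence += base
    return sequence

print(extend("ACGT", ["CGTTG", "GTTGC", "TTGCA"]))
print(feasible_extensions("ACGT", ["CGTT", "CGTG"]))
```

In the second call, two reads disagree on the next base, so both bases are reported as feasible extensions and no consensus is returned; how such ties are broken depends on the stage of the algorithm.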
Both t and g are feasible extensions. PE-Assembler is implemented as a series of five steps, which are briefly described as follows (see also Supplementary Fig.). This step specifically avoids reads containing sequencing errors and reads occurring in repeat regions in the genome. Those successfully extended regions are called seeds. Seeds are long enough for extension using paired-end reads. The third step, called contig extension, tries to extend all these seeds using paired-end reads. The resulting sequences are called contigs.
The fourth step links those contigs using paired-end reads to form scaffolds i. Finally, the last step tries to fill-in the gaps in between scaffolded contigs.
Below, we will detail the five steps. While it is generally effective in detecting and fixing random sequencing errors, it treats each read as a single read and therefore fails to utilize the pairing information. This may result in overcorrecting the reads coming from low coverage regions as the actual location of the paired-end read is not taken into account. Our approach does not perform error correction. However, we require a pool of error-free and non-repetitive reads as starting points for the seed building step Section 2.
These reads are isolated by carrying out a read screening step. The idea behind the screening step is similar to the kmer frequency based error correction method proposed by Pevzner et al. Its details are as follows. A kmer is a length k DNA sequence.
Provided the genome is sampled at a high coverage, a kmer that occurs in the genome is likely to occur multiple times in the input reads. Suppose a particular kmer occurs once or very sparingly in the input reads, such kmer is unlikely to occur in the target genome and is likely to be a result of a sequencing error.
Similarly, if a kmer occurs at a higher frequency than expected, we can conclude that it may have originated from a repeat region in the genome. To classify each kmer as either a solid kmer or a repeat kmer, we scan the entire dataset of reads to extract the set of kmers and their frequencies.
A kmer frequency histogram is plotted. Then, we identify the solid kmer threshold and the repeat kmer threshold from the troughs on either side of the main peak Fig. Only solid reads are chosen as the start points for the next step.
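The screening idea can be illustrated with a short Python sketch. It is a hedged example rather than the authors' code: the helper names, the simple "walk downhill to the nearest trough" heuristic, the half-open interval used for solid kmers, and the rule that a read is solid only if all of its kmers are solid are assumptions made here for illustration.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def histogram(counts):
    """hist[f] = number of distinct kmers that occur exactly f times."""
    hist = [0] * (max(counts.values()) + 1)
    for c in counts.values():
        hist[c] += 1
    return hist

def trough_thresholds(hist):
    """Locate the troughs on either side of the main peak (assumed heuristic)."""
    f = 1
    while f + 1 < len(hist) and hist[f + 1] < hist[f]:
        f += 1                      # descend from the low-frequency error spike
    solid = f
    peak = max(range(solid, len(hist)), key=lambda i: hist[i])
    f = peak
    while f + 1 < len(hist) and hist[f + 1] < hist[f]:
        f += 1                      # descend the right flank of the main peak
    return solid, f                 # (solid cutoff, repeat cutoff)

def is_solid_read(read, counts, k, solid, repeat):
    """Assumption: a read is solid if every kmer in it has solid frequency."""
    if len(read) < k:
        return False
    return all(solid <= counts.get(read[i:i + k], 0) < repeat
               for i in range(len(read) - k + 1))

example_hist = [0, 50, 10, 2, 5, 20, 40, 22, 6, 1, 3, 2]
print(trough_thresholds(example_hist))
```

Real histograms are noisy, so a practical version would probably smooth the histogram before searching for troughs; the sketch omits that step.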
Note that this stage does not discard or correct any data. The entire dataset is used in the assembly as it is.
Figure caption: The kmer frequency histogram. We can determine the solid kmer cutoff and repeat kmer cutoff from the two troughs.
Ambiguities arising due to repeats can be resolved with the help of paired-end reads. Throughout the seed assembly, we maintain a pool of reads whose mates map on to the current seed.
In case of any ambiguity, for every read overlapping with the seed, we check if its mate overlaps with any reads in the maintained pool Fig. Those without overlap support are assumed to be noise and thus discarded. The above method cannot resolve ambiguities arising due to sequencing errors. In such a case, we extend every candidate base up to a distance of ReadLength. Any extension path arising due to sequencing errors is likely to be terminated prematurely.
If only one candidate path can reach the full distance, then that path is assumed to be the correct extension. At any stage, if there is no candidate for extension (likely due to low sequencing coverage) or there are multiple candidates for extension (possibly due to longer repeats), the extension is terminated.
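The look-ahead rule for sequencing errors can be sketched as follows. This is an illustrative Python sketch under assumptions: the function names and the `min_overlap` parameter are hypothetical, overlaps are exact matches, and a path counts as reaching the full distance if any chain of overlapping reads carries it that far.

```python
def next_bases(sequence, reads, min_overlap):
    """Bases proposed by reads whose prefix matches the suffix of `sequence`."""
    bases = set()
    for read in reads:
        max_len = min(len(read) - 1, len(sequence))
        for olen in range(max_len, min_overlap - 1, -1):
            if sequence.endswith(read[:olen]):
                bases.add(read[olen])
                break
    return bases

def can_reach(sequence, reads, distance, min_overlap):
    """True if `sequence` can be extended by `distance` more bases."""
    if distance == 0:
        return True
    return any(can_reach(sequence + b, reads, distance - 1, min_overlap)
               for b in next_bases(sequence, reads, min_overlap))

def resolve_by_lookahead(sequence, reads, read_length, min_overlap):
    """Return the single candidate base whose path spans ReadLength, else None."""
    survivors = [b for b in sorted(next_bases(sequence, reads, min_overlap))
                 if can_reach(sequence + b, reads, read_length - 1, min_overlap)]
    # No survivor (low coverage) or several survivors (repeat): terminate.
    return survivors[0] if len(survivors) == 1 else None

# Error-free 6 bp reads tiling a toy genome, plus one read with an error (G -> A).
genome = "ACGTTGCAAGGCTCAT"
reads = [genome[i:i + 6] for i in range(len(genome) - 5)] + ["CGTTAC"]
print(resolve_by_lookahead("ACGTT", reads, read_length=6, min_overlap=4))
```

In the toy data, the erroneous read proposes `A` after `ACGTT`, but that path dies after two bases because no other read overlaps it, so only the `G` path spans the full read length. The recursion is exponential in the worst case and is meant only to make the rule concrete.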
The seed is then extended from the other side. For every successfully terminated seed, a seed verification step is performed to ensure that the seed represents a contiguous region in the target genome. All verified seeds are immediately subjected to the contig extension step Section 2.
Seeds that fail the verification step are discarded. Again, this step relies on overlap extension to elongate the current contig, but with some differences. Since a contig is longer than MaxSpan, instead of using single reads to extend the contig, we try to identify feasible extensions from paired-end reads that overlap with the contig.
Moreover, when no paired-end read is found overlapping with the contig, we identify feasible extensions from overlapping reads instead. If a clear consensus is found among the feasible extensions, then that base is appended to the end of the contig and the process is repeated.
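A rough sketch of this paired-first, single-read-fallback voting is given below. Several details are assumptions made for illustration and are not taken from the text: the function names, the exact-match overlap test, the requirement that the mate's implied fragment length lies within [MinSpan, MaxSpan], and the simplification that each mate is already given in the same orientation as the contig.

```python
from collections import Counter

def overlap_len(contig, read, min_overlap):
    """Length of the longest read prefix matching the contig suffix (0 if none)."""
    max_len = min(len(read) - 1, len(contig))
    for olen in range(max_len, min_overlap - 1, -1):
        if contig.endswith(read[:olen]):
            return olen
    return 0

def extension_votes(contig, pairs, min_span, max_span, min_overlap=3):
    """Votes for the next base: mate-supported reads first, else any overlapping read.

    `pairs` is a list of (read, mate) strings; orientation handling is omitted.
    """
    paired, single = Counter(), Counter()
    for read, mate in pairs:
        olen = overlap_len(contig, read, min_overlap)
        if olen == 0:
            continue
        base = read[olen]
        single[base] += 1
        mate_start = contig.rfind(mate)          # last placement of the mate
        if mate_start < 0:
            continue
        # Implied fragment length: from the mate's start to the read's end.
        span = len(contig) - olen + len(read) - mate_start
        if min_span <= span <= max_span:
            paired[base] += 1
    return paired if paired else single

contig = "AAACCCGGGTTTACGT"
pairs = [("CGTAC", "AAACC"), ("CGTGG", "TTTTT")]
print(extension_votes(contig, pairs, min_span=15, max_span=20))
```

Here only the first pair has a mate that maps onto the contig at a plausible distance, so its vote is the one returned; if no pair qualifies, the votes of all overlapping reads are returned instead.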
Occasionally, there are multiple feasible candidates to extend the contig. Such a scenario may arise for three reasons. The first reason is sequencing errors. These errors can be dealt with in the same way as in the seed building step. The second reason is short tandem repeat regions. In such a case, we stop the extension and try to estimate the correct number of tandem repeats during the gap filling step.
The third reason is long repeats. In such a case, we also terminate the extension.
A novel locally guided genome reassembling technique using an artificial ant system
Table S5: Selected regions for Sanger sequencing, coordinates and primer details. Genome sequencing is a principal application of this technology, where the ultimate goal is the full and complete sequence of the organism of interest. Due to the nature of the raw data produced by these technologies, a full genomic sequence attained without the aid of Sanger sequencing has yet to be demonstrated. We have successfully developed a four-phase strategy for using only next-generation sequencing technologies Illumina and to assemble a complete microbial genome de novo.
Application of 'next-generation' sequencing technologies to microbial genetics
Palsson2, Derek R. Lovley3, Christian L. Craig Venter Institute, Rockville, Maryland, United States of America Abstract State-of-the-art DNA sequencing technologies are transforming the life sciences due to their ability to generate nucleotide sequence information with a speed and quantity that is unapproachable with traditional Sanger sequencing.