What can unassembled reads tell us?
Abstract
Genome annotation is a difficult computational challenge that often relies on the observation of previously discovered genes, both putative and predicted. Statistical analysis of these genes and their host genomes is used to build representative models that describe their characteristics. Genome assembly, whether for cultured isolates or environmental DNA, introduces further complexities, particularly for metagenomic samples. As sequencing depth and affordability have caught up with, and even surpassed, computational capabilities, it is now common for large metagenomic assembly projects to leave a large proportion of their reads, often up to half, unincorporated. While tools for studying unassembled reads have had some success in characterising function and taxonomy, they most often rely on aligning the entire read to a precomputed database. This approach does not allow for the investigation of genes without database similarities, nor for the later reconstruction of the full gene product. Predicting gene content directly from unassembled reads can help avoid confounding factors such as assembly error and can reduce computational complexity. We therefore provide a full evaluation framework for the prediction of genes, both fragmented and whole, directly from reads. We demonstrate that the additional insights into annotation correctness provided by this framework mean that previous tools need to be comprehensively re-evaluated. Furthermore, we developed two read annotation approaches, one utilising a Convolutional Neural Network and another purposefully built on naive assumptions, and found that their performance was similar to, and sometimes better than, that of contemporary state-of-the-art methods.