Publications

This is a selection of publications. For a complete list, please see my Google Scholar profile.

StORF-Reporter: finding genes between genes

Published October 28, 2023

Large regions of prokaryotic genomes are currently without any annotation, in part due to well-established limitations of annotation tools. For example, it is routine for genes using alternative start codons to be misreported or completely omitted. Therefore, we present StORF-Reporter, a tool that takes an annotated genome and returns regions that may contain missing CDS genes from unannotated regions. StORF-Reporter consists of two parts. The first begins with the extraction of unannotated regions from an annotated genome. Next, Stop-ORFs (StORFs) are identified in these unannotated regions. StORFs are open reading frames that are delimited by stop codons and thus can capture those genes most often missing in genome annotations. We show this methodology recovers genes missing from canonical genome annotations. We inspect the results of the genomes of model organisms, the pangenome of Escherichia coli, and a set of 5109 prokaryotic genomes of 247 genera from the Ensembl Bacteria database. StORF-Reporter extended the core, soft-core and accessory gene collections, identified novel gene families and extended families into additional genera. The high levels of sequence conservation observed between genera suggest that many of these StORFs are likely to be functional genes that should now be considered for inclusion in canonical annotations.
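The stop-to-stop idea described above can be illustrated with a minimal sketch. This is not StORF-Reporter's implementation: it assumes a single forward-strand DNA string, the three standard stop codons, 0-based end-exclusive coordinates, and a hypothetical `min_len` cutoff. The function names `unannotated_regions` and `find_storfs` are illustrative only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard stop codons only


def unannotated_regions(genome_len, annotated):
    """Return (start, end) gaps not covered by annotated (start, end) intervals.

    Coordinates are 0-based and end-exclusive; overlapping annotations are merged.
    """
    gaps, cursor = [], 0
    for start, end in sorted(annotated):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < genome_len:
        gaps.append((cursor, genome_len))
    return gaps


def find_storfs(seq, min_len=30):
    """Find stop-to-stop ORFs on the forward strand in all three frames.

    Each hit is (start, end): start is the first base after an in-frame stop
    codon, end is the position just past the next in-frame stop codon.
    No start codon is required, which is the point of the approach.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        prev_stop_end = None
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if prev_stop_end is not None and i + 3 - prev_stop_end >= min_len:
                    hits.append((prev_stop_end, i + 3))
                prev_stop_end = i + 3
    return sorted(hits)


# Usage: scan each unannotated gap of a toy genome for StORFs.
genome = "TAA" + "GCC" * 10 + "TAG"
for gap_start, gap_end in unannotated_regions(len(genome), annotated=[]):
    print(find_storfs(genome[gap_start:gap_end]))
```

A real analysis would also scan the reverse complement and map the hits back to genome coordinates; both steps are omitted here for brevity.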

DOI: https://doi.org/10.1093/nar/gkad814

Lineage-specific microbial protein prediction enables large-scale exploration of protein ecology within the human gut

Featured

Published April 3, 2025

Microbes use a range of genetic codes and gene structures, yet these are often ignored during metagenomic analysis. This causes spurious protein predictions and prevents functional assignment, which limits our understanding of ecosystems. To resolve this, we developed a lineage-specific gene prediction approach that uses the correct genetic code based on the taxonomic assignment of genetic fragments, removes incomplete protein predictions, and optimises prediction of small proteins. Applied to 9634 metagenomes and 3594 genomes from the human gut, this approach increased the landscape of captured expressed microbial proteins by 78.9%, including previously hidden functional groups. Optimised small protein prediction captured 3,772,658 small protein clusters, which form an improved microbial protein catalogue of the human gut (MiProGut). To enable the ecological study of a protein’s prevalence and association with host parameters, we developed InvestiGUT, a tool which integrates both the protein sequences and sample metadata. Accurate prediction of proteins is critical to providing a functional understanding of microbiomes, enhancing our ability to study interactions between microbes and hosts.
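Why the genetic code matters can be shown with a small sketch. It is not the paper's pipeline: the lineage lookup `LINEAGE_TO_CODE` and the function names are hypothetical, and only three NCBI translation tables are modelled (11, 4 and 25), which differ here only in how they read the codon TGA.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard code, in NCBI order (TTT, TTC, TTA, TTG, TCT, ...).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = dict(zip(("".join(c) for c in product(BASES, repeat=3)), STANDARD_AAS))

# Differences from the standard amino-acid assignments, keyed by NCBI table id.
CODE_OVERRIDES = {
    11: {},            # bacterial/archaeal code: TGA is a stop
    4: {"TGA": "W"},   # Mycoplasma/Spiroplasma code: TGA encodes Trp
    25: {"TGA": "G"},  # SR1/Gracilibacteria code: TGA encodes Gly
}

# Hypothetical lookup from a taxonomic assignment to a translation table.
LINEAGE_TO_CODE = {"Mycoplasma": 4, "Gracilibacteria": 25}


def genetic_code_for(lineage, default=11):
    """Pick a translation table for a lineage, falling back to table 11."""
    return LINEAGE_TO_CODE.get(lineage, default)


def translate(seq, table_id):
    """Translate in frame 0 until the first stop codon under the given table."""
    table = {**STANDARD, **CODE_OVERRIDES[table_id]}
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = table.get(seq[i:i + 3].upper(), "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Usage: the same fragment yields a truncated or a longer protein
# depending on the lineage it is assigned to.
fragment = "ATGAAATGAGGCTAA"
for lineage in ("Escherichia", "Mycoplasma", "Gracilibacteria"):
    print(lineage, translate(fragment, genetic_code_for(lineage)))
```

Under table 11 the fragment is cut short at TGA, whereas tables 4 and 25 read through it, which is how ignoring the lineage can produce spurious or truncated predictions.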

DOI: https://doi.org/10.1038/s41467-025-58442-w

No one tool to rule them all: prokaryotic gene prediction tool annotations are highly dependent on the organism of study

Featured

Published March 1, 2022

The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any CDS prediction tool and allow them to choose the right tool for their analysis. We present an evaluation framework (ORForise) based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of CDS prediction tools. This makes it possible to identify which performs better for specific use-cases. We use this to assess 15 ab initio- and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections, which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations.
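The kind of comparison such a framework performs can be sketched as follows. This is an illustrative example only, not ORForise's code or its metric definitions: genes are assumed to be `(start, end, strand)` tuples with 1-based inclusive coordinates, and the three metrics and the `min_cov` coverage cutoff are assumptions chosen for the example.

```python
def overlap(a, b):
    """Number of shared positions between two (start, end, strand) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def compare(reference, predicted, min_cov=0.75):
    """Score a tool's predicted CDSs against a reference annotation.

    A reference gene counts as detected if a same-strand prediction covers at
    least `min_cov` of its length, and as a perfect match if a prediction has
    identical start, end and strand.
    """
    pred_set = set(predicted)
    perfect = sum(1 for gene in reference if gene in pred_set)
    detected = 0
    for gene in reference:
        length = gene[1] - gene[0] + 1
        if any(p[2] == gene[2] and overlap(gene, p) / length >= min_cov
               for p in predicted):
            detected += 1
    n = len(reference)
    return {
        "pct_genes_detected": 100 * detected / n,
        "pct_perfect_matches": 100 * perfect / n,
        "pred_to_ref_ratio": len(predicted) / n,
    }


# Usage with toy coordinates: one exact hit, one truncated hit,
# one wrong-strand call, one missed gene and two extra predictions.
reference = [(1, 100, "+"), (201, 300, "-"), (401, 500, "+"), (601, 700, "+")]
predicted = [(1, 100, "+"), (211, 300, "-"), (401, 500, "-"),
             (900, 1000, "+"), (950, 1100, "-")]
print(compare(reference, predicted))
```

Running the same comparison per tool and per genome gives a table of scores from which tool choice can be made on evidence rather than habit.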

DOI: https://doi.org/10.1093/bioinformatics/btab827

PyamilySeq: Exposing the fragility of conventional gene (re)clustering and pangenomic inference methods

Preprint

2025

Nicholas J. Dimonaco

Pangenomics, the identification of shared genes across a taxonomic range, is essential for understanding microbial genetic diversity. Yet, gene clustering and pangenome tools often operate as one-size-fits-all black boxes whose outputs are difficult to validate and interpret. This study introduces PyamilySeq, not as a replacement for these tools but as a diagnostic framework to expose and quantify their hidden limitations and guide the development of more robust methodologies. By evaluating widely used gene clustering and pangenome tools, we observe how clustering thresholds (often hard-coded and provided without explanation) and paralog handling impact gene family composition. Parameters unrelated to clustering thresholds, such as decimal precision (0.8 vs. 0.80), output selection, and even CPU and memory allocation, are demonstrated to alter gene family assignments, in striking contrast to the standard assumption that broadly the same parameter values will yield consistent results. Additionally, tools often fail to report biologically meaningful or representative sequences for gene families, leading to misguided downstream analyses. This work highlights key limitations in current gene clustering and pangenome methodologies, demonstrating their potential to influence biological interpretations. To advance the field, we must prioritise adaptable and transparent approaches and move beyond rigid, one-size-fits-all tools and parameter choices.
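The sensitivity of gene families to the clustering threshold can be illustrated with a toy sketch. It is not PyamilySeq's method, nor that of any tool evaluated in the preprint: single-linkage clustering stands in for the real algorithms, and the gene names and pairwise identities are invented for the example.

```python
def cluster(genes, identities, threshold):
    """Group genes into families by single-linkage at an identity threshold.

    `identities` maps (gene_a, gene_b) pairs to a sequence identity in [0, 1].
    Two genes share a family if a chain of pairs at or above `threshold`
    connects them.
    """
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), identity in identities.items():
        if identity >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for g in genes:
        families.setdefault(find(g), set()).add(g)
    return sorted(sorted(members) for members in families.values())


# Usage: the same four genes form three families at 0.90 but two at 0.80.
genes = ["a1", "b1", "c1", "d1"]
identities = {("a1", "b1"): 0.95, ("b1", "c1"): 0.85, ("c1", "d1"): 0.60}
for threshold in (0.90, 0.80):
    print(threshold, cluster(genes, identities, threshold))
```

Even in this toy case, a modest change in the cutoff moves a gene between families, which then changes any core or accessory counts derived from them.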

DOI: https://doi.org/10.1101/2025.05.30.657108
