Talks & Presentations

Sharing research insights and findings at conferences, workshops, and seminars worldwide.

PyamilySeq: Transparent and interpretable gene (re)clustering and pangenomic inference highlights the fragility of conventional methods.

🎤Conference

Genome Science UK: 2025

July 10, 2025

Nottingham, UK

Understanding the genetic diversity and functional potential of microorganisms depends on grouping their genes by sequence similarity, function or structure. This is particularly important in highly interactive microbiomes, such as the human gut or rumen, where microbial communities rely on intricate metabolic exchanges and interdependencies. However, commonly used gene clustering and pangenomic inference methods often lack transparency in their methodology and offer limited flexibility in parameterisation. These limitations introduce biases, potentially distorting the representation of microbial communities and their functional dynamics. This study introduces PyamilySeq, a flexible and transparent framework designed to systematically identify challenges in gene clustering and pangenomic analysis, and to support the development of practical solutions. PyamilySeq enables users to replicate core components of widely used pangenome tools while providing greater control over parameters and clearer insights into their impact. Through comparative analysis, I demonstrate how changes in parameters, such as sequence identity cut-offs, length thresholds, and even computational settings like CPU and memory allocation, can significantly alter gene family composition. PyamilySeq thereby highlights that many gene families reported by contemporary tools are artefacts of the algorithms themselves, rather than biologically meaningful groupings, and reveals the fragility and tool-driven biases inherent to contemporary gene clustering and pangenome inference that too often go unnoticed. This work aims for more biologically grounded insights into microbial diversity and function, which are critical for applications in antimicrobial resistance, pathogen surveillance, and microbial ecology. It highlights the importance of reducing reliance on rigid, one-size-fits-all tools to ensure that this diversity and function are represented more precisely.
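To make the parameter sensitivity described in the abstract concrete, here is a minimal toy sketch. It is not PyamilySeq's implementation and does not reproduce any real pangenome tool: the sequences, the function names (`identity`, `length_ratio`, `greedy_cluster`), the ungapped identity measure, and the longest-first greedy scheme are all assumptions chosen for illustration. It only shows how the same four sequences can yield different "gene families" when the identity cut-off or length threshold changes.

```python
# Toy sequences (hypothetical, for illustration only).
TOY_GENES = {
    "geneA": "ATGGCTAAGCTTGGATCCTAA",  # reference-like gene
    "geneB": "ATGGCTAAGCTAGGATCGTAA",  # geneA with two substitutions
    "geneC": "ATGGCTAAGCTT",           # truncated copy of geneA
    "geneD": "ATGCCCGGGTTTAAACCCTAA",  # unrelated gene
}


def identity(a, b):
    """Ungapped, position-wise identity over the shorter sequence (a simplification)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n


def length_ratio(a, b):
    """Length of the shorter sequence divided by the length of the longer one."""
    return min(len(a), len(b)) / max(len(a), len(b))


def greedy_cluster(seqs, min_identity, min_length_ratio):
    """Greedy incremental clustering: longest sequences become representatives first."""
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    clusters = {}  # representative name -> list of member names
    for name in order:
        for rep in clusters:
            if (length_ratio(seqs[name], seqs[rep]) >= min_length_ratio
                    and identity(seqs[name], seqs[rep]) >= min_identity):
                clusters[rep].append(name)
                break
        else:
            # No existing representative passed both thresholds: start a new family.
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    for ident, ratio in [(0.90, 0.5), (0.95, 0.5), (0.90, 0.8), (0.95, 0.8)]:
        families = greedy_cluster(TOY_GENES, ident, ratio)
        print(f"identity>={ident}, length ratio>={ratio}: {len(families)} families")
```

With these toy inputs, tightening either threshold splits a family that the looser setting kept together, so the family count depends on the parameters as much as on the sequences.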

What can unassembled reads tell us?

🎤Conference

ISME19

August 24, 2024

Cape Town, South Africa

Genome annotation is a difficult computational challenge that often relies on the observation of previously discovered genes, both putative and predicted. Statistical analysis conducted on these genes and their host genomes is used to build representative models for describing their characteristics. The numerous challenges associated with genome assembly, whether for cultured isolates or environmental DNA, introduce a host of additional complexities, particularly when dealing with metagenomic samples. As sequencing depth and costs have caught up with, and even surpassed, computational capabilities, it is now common for large metagenomic assembly projects to fail to incorporate large proportions, often up to half, of their read collection. While tools to study unassembled reads have been somewhat successful in studying function and taxonomy, they most often rely on alignments of the entire read to a precomputed database. This does not allow for the investigation of genes without database similarities or for the future reconstruction of the full gene product. Predicting gene content directly from unassembled reads can help overcome several confounding variables, such as assembly error, and reduce computational complexity. Therefore, we provide a full evaluation framework for the prediction of genes, both fragmented and whole, directly from reads. We demonstrate that the additional insights into annotation correctness provided by this framework mean that previous tools need to be comprehensively re-evaluated. Furthermore, we developed two read annotation approaches, one utilising a Convolutional Neural Network and another purposefully built using naive assumptions, and found that their performance was similar to, and sometimes better than, that of contemporary state-of-the-art methods.