Abstract
There is a need to identify microbial sequences that may form part of transmission chains, or that may represent importations across national boundaries, amidst large numbers of SARS-CoV-2 and other bacterial or viral sequences. Reference-based compression is a sequence analysis technique that allows both compact storage of sequence data and comparisons between sequences. Published implementations of the approach are being challenged by the large sample collections now being generated. Our aim was to develop fast software that detects highly similar sequences in large collections of microbial genomes, including millions of SARS-CoV-2 genomes. To do so, we developed Catwalk, a tool that bypasses bottlenecks in the generation, comparison and in-memory storage of microbial genomes generated by reference mapping. It is a compiled solution, coded in Nim to increase performance. It can be accessed via command line, REST API or web server interfaces. We tested Catwalk using both SARS-CoV-2 and Mycobacterium tuberculosis genomes generated by prospective public-health sequencing programmes. Pairwise sequence comparisons, using clinically relevant similarity cut-offs, took about 0.39 and 0.66 μs, respectively; in 1 s, between 1 and 2 million sequences can be searched. Catwalk operates about 1700 times faster than, and uses about 8 % of the RAM of, a Python reference-based compression and comparison tool in current use for outbreak detection. Catwalk can rapidly identify close relatives of a SARS-CoV-2 or M. tuberculosis genome amidst millions of samples.
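The idea of reference-based compression described in the abstract can be illustrated with a minimal Python sketch. This is not Catwalk's implementation or API: the function names `compress` and `snp_distance`, the per-base set layout, and the choice to ignore positions that are ambiguous in either sample are all assumptions made for illustration. Each sample is stored only as the positions at which it differs from the reference, and two samples are compared by set operations on those positions.

```python
from typing import Dict, Optional, Set

BASES = "ACGT"

def compress(reference: str, sample: str) -> Dict[str, Set[int]]:
    """Store only the positions where a reference-mapped sample differs.

    Hypothetical helper: one set of positions per variant base, plus an
    "N" set for any uncertain or non-ACGT call.
    """
    if len(reference) != len(sample):
        raise ValueError("sample must be mapped to the full reference length")
    diffs: Dict[str, Set[int]] = {key: set() for key in BASES + "N"}
    for pos, (ref_base, base) in enumerate(zip(reference, sample)):
        if base != ref_base:
            diffs[base if base in BASES else "N"].add(pos)
    return diffs

def snp_distance(a: Dict[str, Set[int]], b: Dict[str, Set[int]],
                 cutoff: Optional[int] = None) -> Optional[int]:
    """Count differing positions between two compressed samples.

    Assumption: positions that are uncertain in either sample are ignored.
    Returns None when the distance exceeds the cut-off.
    """
    differing: Set[int] = set()
    for base in BASES:
        # Positions where only one of the two samples carries this variant.
        differing |= a[base] ^ b[base]
    differing -= a["N"] | b["N"]
    distance = len(differing)
    if cutoff is not None and distance > cutoff:
        return None
    return distance

# Toy data, invented for illustration only.
reference = "ACGTACGTAC"
s1 = compress(reference, "ACGTACGTAC")  # identical to the reference
s2 = compress(reference, "ACTTACGTAC")  # one variant at position 2
s3 = compress(reference, "ACTTANGTAA")  # variants at 2 and 9, uncertain at 5

print(snp_distance(s1, s2))            # distance between s1 and s2
print(snp_distance(s1, s3, cutoff=1))  # None if the pair exceeds the cut-off
```

Because samples that are close to the reference produce small sets, both the memory footprint and the cost of each comparison scale with the number of variant positions rather than with genome length, which is the property the abstract relies on.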
Original language | English |
---|---|
Article number | 000850 |
Journal | Microbial Genomics |
Volume | 8 |
Issue number | 6 |
DOIs | |
Publication status | Published - 30 Jun 2022 |
Bibliographical note
Funding Information: This work is supported by the Wellcome Trust (Scalable Pathogen Pipeline for Turning Next Generation Sequencing Pathogen Data into Clinical Results) (215800/Z/19/Z); the National Institute for Health Research (NIHR) Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance (NIHR200915), a partnership between the UKHSA and the University of Oxford; the NIHR Health Protection Research Unit in Genomics and Enabling Data (NIHR200892), a partnership between the UKHSA and the University of Warwick. The views expressed are those of the authors and not necessarily those of the NIHR, UKHSA or the Department of Health and Social Care (UK).
Acknowledgements: We are grateful to the COG-UK Consortium for the public release of SARS-CoV-2 genome sequences used for software testing.
Publisher Copyright:
© 2022 The Authors.
Keywords
- bacterial genomics
- microbial relatedness
- outbreak detection