What is BigBacter?

BigBacter is a Nextflow pipeline for routine bacterial genomic surveillance. It accepts raw reads or assemblies, clusters samples by genomic similarity, constructs core-genome alignments, and produces phylogenetic trees and pairwise distance matrices, all within an iterative, database-backed workflow designed to grow with your dataset over time.

Key Features

🧬 Iterative clustering - cluster assignments stay consistent across runs using a per-sample sourmash database that expands automatically with each new submission
🧬 Soft-core phylogenomics - retains substantially more phylogenetic signal than strict-core approaches by tolerating a configurable level of missing data
🧬 Automated reference selection - selects the most representative assembly per cluster using k-mer containment and assembly quality scoring; reuses the same reference on subsequent runs for consistent SNP distances
🧬 Dual distance metrics - reports core-genome SNP distances alongside whole-genome containment scores, capturing SNP-level as well as accessory-genome variation
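
To make the soft-core and dual-metric ideas above concrete, here is a minimal Python sketch. It is an illustration only, not BigBacter's actual code: the function names, the `max_missing` parameter, and the toy sequences are all hypothetical, and the containment function uses exact k-mer sets rather than the sketch-based estimates that sourmash computes.

```python
# Toy illustration (hypothetical names; not BigBacter's implementation).

MISSING = {"N", "-"}  # characters treated as missing data in the alignment


def soft_core_columns(alignment, max_missing=0.05):
    """Return indices of columns whose fraction of missing calls is <= max_missing.

    max_missing=0 reproduces a strict core; larger values give a soft core.
    """
    seqs = list(alignment.values())
    n_samples = len(seqs)
    keep = []
    for i in range(len(seqs[0])):
        n_missing = sum(1 for s in seqs if s[i] in MISSING)
        if n_missing / n_samples <= max_missing:
            keep.append(i)
    return keep


def snp_distance(a, b, columns):
    """Count retained columns where both samples are called and the bases differ."""
    return sum(
        1
        for i in columns
        if a[i] not in MISSING and b[i] not in MISSING and a[i] != b[i]
    )


def kmer_containment(query, subject, k=4):
    """Fraction of the query's distinct k-mers that also occur in the subject."""
    q = {query[i:i + k] for i in range(len(query) - k + 1)}
    s = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    return len(q & s) / len(q) if q else 0.0


# Four aligned toy samples; sample B has a missing call at column 4.
alignment = {
    "A": "ACGTAC",
    "B": "ACGTNC",
    "C": "ATGTGC",
    "D": "ACGAAC",
}

strict = soft_core_columns(alignment, max_missing=0.0)
soft = soft_core_columns(alignment, max_missing=0.25)

print("strict core columns:", strict)
print("soft core columns:  ", soft)
print("A vs C, strict:", snp_distance(alignment["A"], alignment["C"], strict))
print("A vs C, soft:  ", snp_distance(alignment["A"], alignment["C"], soft))
print("containment:", kmer_containment("ACGTACGT", "ACGTACGTTTGA"))
```

In this toy alignment the strict core drops column 4 because a single sample is missing a call there, which hides a real difference between samples A and C; the soft core keeps the column and skips the missing call only in comparisons that involve sample B. Containment is asymmetric: a short sequence can be fully contained in a longer one while the reverse is only partially true, which is why it is useful for capturing accessory-genome differences.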

Check out the documentation page to learn more!
