Skip to content

Pipeline structure

A DAG schema is included with this code base

Some of the key things that happen inside GABI: - Perform quality control of the read data, and merge libraries across lanes - Downsample reads to the global default (= 100X; can be adjusted) - Group read data by sample id and check which assembly tool is appropriate based on the types of sequencing data we have available - Assemble reads with the optimal tool (chain) - Determine the species from the assembled genome sequence - Perform quality checks on the assembly, including taxonomic composition as contamination check - Perform genomic serotyping, if we have suitable tools available for that species - Perform MLST typing on the assembly, if we have a pre-configured database for that species - Annotate gene models in our assembled genome - Predict antimicrobial resistance genes from the assembly - Call variants from the assembly (mostly for QC purposes, since there shouldn't be any!) - Make a "pretty" QC report

Does GABI distinguish between chromosomal assembly and "full" assembly, and what does that even mean?

Yes, GABI does make that distinction - meaning that some parts of the pipeline see the assembly with all the contigs, and some others only see the bits of the assembly that are likely to represent the bacterial chromosomes (without plasmids).

Pipeline section Assembly used
AMR profiling All contigs
MLST profiling All contigs
Annotation All contigs
Serotyping All contigs
Reference genome Chromosomes
Variant calling Chromosomes
Coverages All contigs
Assembly QC All contigs

Why are we doing this? Basically, steps like idenfying the best reference genome match should not include plasmids, because these could throw off the algorithm. Likewise, if we want to know how complete our assembly is, this should ideally only include the chromosomes since the plasmids are generally "fluid" and not part of the "core" genome. That said, for most analysis steps, the entire assembly is used since the "biochemistry" of an isolate is determined by all the genes in the cell, not just those on the chromosomes.