This repository contains a Nextflow-based pipeline for whole-genome sequencing (WGS) analysis and genetic variant calling, optimized for Illumina sequencing data from bacterial genomes. It is designed to provide an automated, reproducible, and scalable solution for processing large-scale genomic data in clinical microbiology research.
All modes in the pipeline include the following steps:
- Quality Control: The quality of the raw sequencing data is assessed with FastQC. Low-quality bases and adapter sequences are removed with FastP, followed by a second round of FastQC.
- At this stage, the pipeline offers two modes that differ in the input reference genome. Variant calling can be performed against either a de novo assembled reference strain or an existing reference genome. In de novo mode, preliminary steps are performed to assemble the desired reference genome:
-
Reference and De-novo
Once the reference genome is available (de novo assembled or an existing reference), the pipeline follows the same steps in both modes:
- Aggregation of quality reports: A summary report is generated with MultiQC, incorporating the FastQC reports and, depending on the mode, the QUAST genome quality report.
- Alignment: Reads are aligned against the selected reference genome with BWA-MEM, followed by processing with Samtools.
- Variant calling and filtering: Multiple steps are designed to identify, filter and annotate variants.
  - Variant Identification: Detection of single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) using PicardTools, GATK and/or FreeBayes.
  - Variant Filtering: Filters are applied to obtain high-confidence variant calls (see Parameters).
  - Genetic variant annotation: The toolbox SnpEff is used to annotate and predict the functional effects of genetic variants on genes and proteins.
- Post-assembly Analyses:
Assemble
With --mode assemble, a simplified pipeline is run:
- Post-assembly Analyses:
  - Mass screening of contigs for antimicrobial resistance or virulence genes using ABRIcate.
  - Identification of antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences using AMRFinder.
  - MLST analysis: ARIBA performs a fast MLST analysis from the raw FASTQ data, and MLST performs a slower analysis from the genome assembly.
  - Staphylococcus aureus: If --mrsa is true, the spaTyper and sccmec analyses are also run.
Note
The pipeline includes a script to download the reads from the database using an accession list (Acc_List.txt):
bash ./workflow/bin/download_reads.sh
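The format of Acc_List.txt is not described here. As a hedged sketch, assuming it holds one run accession per line in the SRR/ERR/DRR style, the list could be sanity-checked before the download script is started. The helper name `valid_accessions` is hypothetical and is not part of the pipeline.

```shell
# Hypothetical helper (assumption: one SRR/ERR/DRR run accession per line).
# Prints the well-formed accessions from a list file, one per line;
# blank lines, malformed entries and Windows line endings are dropped.
valid_accessions() {
    tr -d '\r' < "$1" | grep -E '^[SED]RR[0-9]+$'
}
```

Comparing the number of lines printed by `valid_accessions Acc_List.txt` with `wc -l < Acc_List.txt` shows whether any entry was rejected.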
Prerequisites to run the pipeline:
- Install Nextflow.
- Install Docker or Singularity for container support.
- Ensure Java 8 or a later version is installed.
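The prerequisites above can be checked from the shell before the first run. This is a minimal sketch, not part of the pipeline; `missing_tools` is a hypothetical helper that only tests whether a command is on PATH and does not check versions.

```shell
# Hypothetical helper: print the name of every listed tool
# that cannot be found on PATH (prints nothing if all are present).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call (commented out so that sourcing this block has no side effects):
# missing_tools nextflow java
```

Because only one of Docker or Singularity is needed, those two are best checked separately, for example with `missing_tools docker` or `missing_tools singularity`.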
Clone the Repository:
# Clone the workflow repository
git clone https://github.com/AMRmicrobiology/WGS-Analysis-VariantCalling.git
# Move inside the main directory
cd WGS-Analysis-VariantCalling
To create a local conda environment, run the following commands:
conda env create -n WGS -f enviromentWGS.yaml
conda activate WGS
Run the pipeline using the following commands, adjusting the parameters as needed:
ASSEMBLE
nextflow run main.nf --mode assemble --input "/path/to/data/*_{1,2}.fastq.gz" --mrsa <true> -profile <docker/singularity/conda>
REFERENCE GENOME
nextflow run main.nf --mode reference --input "/path/to/data/*_{1,2}.fastq.gz" --personal_ref "/path/to/bacterial_genome.fasta" -profile <docker/singularity/conda>
DE NOVO
nextflow run main.nf --mode novo --input "/path/to/data/*_{1,2}.fastq.gz" --wildtype_code "Pa01WT" --genome_name_db "Acinetobacter_baumanii_clinical" -profile <docker/singularity/conda>
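The `--input` pattern used in the commands above (`*_{1,2}.fastq.gz`) implies that every sample has both a `_1` and a `_2` file. A small pre-flight check, sketched here under that assumption, lists forward-read files whose mate is missing. The function name `unpaired_reads` is hypothetical.

```shell
# Hypothetical helper: list every *_1.fastq.gz in a directory
# that has no matching *_2.fastq.gz next to it.
unpaired_reads() {
    dir=$1
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"   # expected mate file name
        [ -e "$r2" ] || printf '%s\n' "$r1"
    done
}
```

Running `unpaired_reads /path/to/data` prints nothing when every forward-read file has its mate.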
--mode: Analysis mode (assemble, reference or novo).
--input: Path to the paired-end FASTQ files generated by Illumina sequencing (file format: .fastq.gz).
--outdir: Directory where the results will be stored (default: out).
-profile: Specifies the execution profile (docker, singularity or conda).
--mrsa (only for --mode assemble): Specific to Staphylococcus aureus genome assemblies. Runs the spaTyper and sccmec analyses (default: false).
--genome_name_db (only for --mode novo): Organism name used to name the database in SnpEff.
--wildtype_code (only for --mode novo): Defines the sample that will be taken as reference.
--personal_ref (only for --mode reference): Path to the bacterial reference genome FASTA file.
-w: Path to the temporary work directory where files will be stored (default: ./work).
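The mode-specific options above can be summarized in a small lookup. This sketch assumes that the options shown in the example commands are required for their mode; the parameter list only marks them as mode-specific. `required_options` is a hypothetical helper, not something the pipeline provides.

```shell
# Hypothetical helper: print the mode-specific options for a given --mode.
# Assumption: these options are mandatory for their mode.
required_options() {
    case "$1" in
        assemble)  ;;                                          # --mrsa is optional
        reference) echo "--personal_ref" ;;
        novo)      echo "--wildtype_code --genome_name_db" ;;
        *)         echo "unknown mode: $1" >&2; return 1 ;;
    esac
}
```

A wrapper script could call `required_options "$mode"` and refuse to launch `nextflow run` if any of the printed options is absent from its arguments.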
--cut_front: Move a sliding window from front (5') to tail and drop the bases in the window if their mean quality is below the threshold; stop otherwise. Default: 15.
--cut_tail: Move a sliding window from tail (3') to front and drop the bases in the window if their mean quality is below the threshold; stop otherwise. Default: 20.
--cut_mean_quality: Mean quality threshold shared by cut_front, cut_tail and cut_sliding. Range: 1-36. Default: 20.
--length_required: Reads shorter than length_required are discarded. Default: 50.
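To make the sliding-window description concrete, here is a simplified sketch of front trimming on a list of per-base Phred scores. The one-base step and the explicit window size argument are assumptions made for illustration. FastP's own implementation may differ in details, and `front_trim_count` is a hypothetical name.

```shell
# Simplified illustration (not FastP's code): count how many leading bases
# a front-to-tail sliding window would drop.
# usage: front_trim_count THRESHOLD WINDOW Q1 Q2 ...   (integer Phred scores)
front_trim_count() {
    threshold=$1; window=$2; shift 2
    dropped=0
    while [ "$#" -ge "$window" ]; do
        # Sum the scores of the first WINDOW remaining bases.
        sum=0; i=0
        for q in "$@"; do
            [ "$i" -lt "$window" ] || break
            sum=$((sum + q)); i=$((i + 1))
        done
        # mean < threshold  <=>  sum < threshold * window (integer arithmetic)
        [ "$sum" -lt $((threshold * window)) ] || break
        shift                      # drop the leading base, slide by one
        dropped=$((dropped + 1))
    done
    echo "$dropped"
}
```

For example, `front_trim_count 20 4 10 10 10 10 30 30 30 30` prints 2: the first two windows have means of 10 and 15, and the third window reaches a mean of 20, so trimming stops there.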
--qual_snp: One or more expressions used with INFO fields to quality-filter SNPs. Default: "QUAL < 50.0 || MQ < 25.0 || DP < 30".
--qual_indel: One or more expressions used with INFO fields to quality filter INDELs. Default: "QUAL < 200.0 || MQ < 25.0 || DP < 30".
Note
QUAL: A confidence measure of the variant; MQ: Mapping quality; DP: Filtered reads that support each of the reported alleles (depth). More info here.
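The default SNP expression combines three conditions with a logical OR, so a variant is affected as soon as any one of them holds. The sketch below evaluates that logic for a single variant, assuming that a record matching the expression is the one that gets filtered out (as with GATK VariantFiltration's filter expressions). `snp_filter_status` is a hypothetical helper and does not read VCF files.

```shell
# Hypothetical helper: report whether a SNP matches the default
# --qual_snp expression "QUAL < 50.0 || MQ < 25.0 || DP < 30".
# usage: snp_filter_status QUAL MQ DP
snp_filter_status() {
    awk -v qual="$1" -v mq="$2" -v dp="$3" 'BEGIN {
        if (qual < 50.0 || mq < 25.0 || dp < 30)
            print "FAIL"    # at least one condition is true
        else
            print "PASS"    # no condition is true
    }'
}
```

With these defaults, `snp_filter_status 60 40 35` prints PASS, while `snp_filter_status 60 40 10` prints FAIL because the depth is below 30.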