
Nextflow pipeline for whole-genome sequencing (WGS) analysis and variant calling in bacterial genomes using Illumina data, supporting de novo assembly and reference-based analysis.

AMRmicrobiology/WGS-Analysis-VariantCalling

WGS-Analysis-VariantCalling

Introduction

This repository contains a Nextflow-based pipeline for whole-genome sequencing (WGS) analysis and genetic variant calling, optimized for Illumina sequencing data from bacterial genomes. It is designed to provide an automated, reproducible and scalable way to process large-scale genomic data in clinical microbiology research.

Current pipeline of the project

Pipeline summary:

All modes of the pipeline start with the following step:

  1. Quality Control: Quality of raw sequencing data is assessed using FastQC. Low-quality bases and adapter sequences are removed with FastP, followed by another round of FastQC.

    • At this stage, the pipeline offers two modes that differ in the reference genome used as input. Variant calling can be performed against either a de novo assembled reference strain or an existing reference genome. In de novo mode, the following preliminary steps assemble the reference genome:

      • Assembly: Following quality control, de novo assembly is performed using SPAdes.
      • Genome QC: Structural quality metrics are evaluated with QUAST, while genome completeness is assessed using BUSCO.
      • Annotation: Genome annotation is carried out with Prokka and Bakta.

--mode reference and --mode novo

Reference and De-novo
Once the reference genome is available (de novo assembled or an existing reference), the pipeline follows the same steps in both modes:

  1. Aggregation of quality reports: A summary report is generated with MultiQC, incorporating the FastQC reports and, depending on the mode, the QUAST genome quality report.
  2. Alignment: Reads are aligned against the selected reference genome with BWA-MEM, followed by processing with Samtools.
  3. Variant calling and filtering: Multiple steps are designed to identify, filter and annotate variants.
    • Variant Identification: Detection of single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) using PicardTools, GATK and/or FreeBayes.
    • Variant Filtering: Filters are applied to obtain high-confidence variant calls (see Parameters).
    • Genetic variant annotation: The toolbox SnpEff is used to annotate and predict the functional effects of genetic variants on genes and proteins.
  4. Post-assembly Analyses:
    • Mass screening of contigs for antimicrobial resistance or virulence genes using ABRIcate.
    • Identification of antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences using AMRFinder.
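As a rough illustration of the alignment step listed above, the sketch below prints the kind of commands involved for one sample. It is a minimal sketch under assumptions, not the pipeline's own process definition: the function name `align_plan`, the `_1`/`_2` read naming and the `.sorted.bam` output name are hypothetical, and the pipeline may pass additional options to BWA-MEM and Samtools.

```shell
#!/bin/sh
# Hypothetical helper (not part of the pipeline): print the commands that
# would index a reference, align one paired-end sample and index the BAM.
align_plan() {
    ref=$1      # reference genome FASTA
    sample=$2   # sample prefix; reads are assumed to be <prefix>_1/_2.fastq.gz
    printf '%s\n' "bwa index $ref"
    printf '%s\n' "bwa mem $ref ${sample}_1.fastq.gz ${sample}_2.fastq.gz | samtools sort -o ${sample}.sorted.bam -"
    printf '%s\n' "samtools index ${sample}.sorted.bam"
}
```

Calling `align_plan ref.fasta sampleA` only prints the three commands; piping the output to `sh` would execute them if BWA and Samtools are installed.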

--mode assemble

Assemble
With --mode assemble, a simplified pipeline is run:

  1. Post-assembly Analyses:
    • Mass screening of contigs for antimicrobial resistance or virulence genes using ABRIcate.
    • Identification of antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences using AMRFinder.
    • MLST analysis: ARIBA performs a fast MLST analysis directly from the raw FASTQ data, while MLST performs a slower analysis from the genome assembly.
    • Staphylococcus aureus: If --mrsa is true, the spaTyper and sccmec analyses are also performed.

Note

The pipeline includes a script to download reads from a database using an accession list (Acc_List.txt):
bash ./workflow/bin/download_reads.sh

Installation

Prerequisites to run the pipeline:

Clone the Repository:

# Clone the workflow repository
git clone https://github.com/AMRmicrobiology/WGS-Analysis-VariantCalling.git

# Move inside the main directory
cd WGS-Analysis-VariantCalling

Local (conda)

To create a local conda environment type the following commands:

conda env create -n WGS -f enviromentWGS.yaml
conda activate WGS

How to use it?

Run the pipeline using the following commands, adjusting the parameters as needed:

ASSEMBLE

nextflow run main.nf --mode assemble --input "/path/to/data/*_{1,2}.fastq.gz" --mrsa <true> -profile <docker/singularity/conda>

REFERENCE GENOME

nextflow run main.nf --mode reference --input "/path/to/data/*_{1,2}.fastq.gz" --personal_ref "/path/to/bacterial_genome.fasta" -profile <docker/singularity/conda>

DE NOVO

nextflow run main.nf --mode novo --input "/path/to/data/*_{1,2}.fastq.gz" --wildtype_code "Pa01WT" --genome_name_db "Acinetobacter_baumanii_clinical" -profile <docker/singularity/conda>
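All three commands expect paired-end files matching `*_{1,2}.fastq.gz`. Before launching a run, it can help to confirm that every forward read file has its mate. The following is a minimal sketch of such a check; `check_pairs` is a hypothetical helper and not part of the pipeline.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of the pipeline): report every
# *_1.fastq.gz file in a directory that has no matching *_2.fastq.gz mate.
check_pairs() {
    dir=$1
    missing=0
    for r1 in "$dir"/*_1.fastq.gz; do
        # If the glob matched nothing, the pattern itself is returned.
        [ -e "$r1" ] || { echo "no *_1.fastq.gz files in $dir"; return 1; }
        r2="${r1%_1.fastq.gz}_2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $(basename "$r1")"
            missing=1
        fi
    done
    # Exit status 0 only when every forward file has a mate.
    return "$missing"
}
```

For example, `check_pairs /path/to/data && nextflow run main.nf ...` starts the pipeline only if all pairs are complete.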

Parameters

--mode: Analysis mode to run: assemble, reference or novo.

--input: Path to input FASTQ paired-end files generated by Illumina sequencing (file format: .fastq.gz).

--outdir: Directory where the results will be stored (default: out).

-profile: Specifies the execution profile (docker, singularity or conda).

--mrsa (only for --mode assemble): Specific to Staphylococcus aureus genome assemblies. Runs the spaTyper and sccmec analyses (default: false).

--genome_name_db (only for --mode novo): Organism name used to name the SnpEff database.

--wildtype_code (only for --mode novo): Defines the sample that will be taken as reference.

--personal_ref (only for --mode reference): Path to the bacterial reference genome FASTA file.

Optional parameters

-w: Path to the temporary work directory where files will be stored (default: ./work).

Trimming

--cut_front: move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise. Default: 15

--cut_tail: move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise. Default: 20

--cut_mean_quality: the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1 to 36. Default: 20

--length_required: reads shorter than length_required will be discarded. Default: 50.
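The trimming options above are passed on the `nextflow run` command line like any other parameter. As a minimal sketch, the hypothetical helper below (not part of the pipeline) validates two of them against the documented constraints and prints the flags to append; the values 25 and 75 in the usage line are illustrative only, not recommendations.

```shell
#!/bin/sh
# Hypothetical helper (not part of the pipeline): validate trimming values
# and print the corresponding command-line flags.
trim_args() {
    q=$1     # value for --cut_mean_quality (documented range: 1 to 36)
    len=$2   # value for --length_required (minimum read length)
    case $q in
        ''|*[!0-9]*) echo "cut_mean_quality must be an integer" >&2; return 1 ;;
    esac
    case $len in
        ''|*[!0-9]*) echo "length_required must be an integer" >&2; return 1 ;;
    esac
    if [ "$q" -lt 1 ] || [ "$q" -gt 36 ]; then
        echo "cut_mean_quality must be between 1 and 36" >&2
        return 1
    fi
    echo "--cut_mean_quality $q --length_required $len"
}
```

Usage: `nextflow run main.nf --mode assemble --input "/path/to/data/*_{1,2}.fastq.gz" $(trim_args 25 75) -profile docker`.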

Filter

--qual_snp: One or more expressions used with INFO fields to quality filter SNPs. Default: "QUAL < 50.0 || MQ < 25.0 || DP < 30".

--qual_indel: One or more expressions used with INFO fields to quality filter INDELs. Default: "QUAL < 200.0 || MQ < 25.0 || DP < 30".

Note

QUAL: confidence measure of the variant call; MQ: mapping quality; DP: depth, that is, the filtered reads supporting each of the reported alleles.
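To make the default SNP expression concrete, the sketch below applies the same logic (`QUAL < 50.0 || MQ < 25.0 || DP < 30`) to VCF records with awk. It is an illustration under assumptions, not how the pipeline filters: it assumes records matching the expression are flagged rather than removed, `LowQual` is a hypothetical filter label, and a missing MQ or DP annotation is treated as not failing.

```shell
#!/bin/sh
# Hypothetical illustration (not part of the pipeline): mark each VCF record
# as PASS or LowQual using the default SNP thresholds from --qual_snp.
flag_snps() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { print; next }    # header lines pass through unchanged
        {
            mq = ""; dp = ""
            n = split($8, info, ";")    # column 8 is the INFO field
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^MQ=/) mq = substr(info[i], 4)
                if (info[i] ~ /^DP=/) dp = substr(info[i], 4)
            }
            # A record fails if any one condition is true. A QUAL of "."
            # evaluates to 0 here and therefore fails.
            fail = ($6 + 0 < 50.0) || (mq != "" && mq + 0 < 25.0) || (dp != "" && dp + 0 < 30)
            $7 = fail ? "LowQual" : "PASS"    # column 7 is FILTER
            print
        }
    ' "$@"
}
```

Usage: `flag_snps sample.vcf`, or pipe an uncompressed VCF into `flag_snps`. A record with QUAL 60, MQ 60 and DP 10 is flagged because its depth is below 30, even though the other two values are acceptable.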

References:

In Silico Evaluation of Variant Calling Methods for Bacterial Whole-Genome Sequencing Assays

Recommendations for clinical interpretation of variants found in non-coding regions of the genome

An ANI gap within bacterial species that advances the definitions of intra-species units

Evaluation of serverless computing for scalable execution of a joint variant calling workflow

GATK hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data

Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing
