WGS-Analysis-VariantCalling

Introduction

This repository contains a Nextflow-based pipeline for whole-genome sequencing (WGS) analysis and genetic variant calling, optimized for Illumina sequencing data from bacterial genomes. It provides an automated, reproducible, and scalable way to process large-scale genomic data in clinical microbiology research.

Pipeline summary:

All modes of the pipeline include the following steps:

Quality Control: The quality of the raw sequencing data is assessed with FastQC. Low-quality bases and adapter sequences are removed with fastp, followed by a second round of FastQC.
- At this stage, the pipeline offers two modes that differ in the reference genome they use. Variant calling can be performed against either a de novo assembled reference strain or an existing reference genome. In de novo mode, preliminary steps are performed to assemble the reference genome:
  - Assembly: Following quality control, de novo assembly is performed using SPAdes.
  - Genome QC: Structural quality metrics are evaluated with QUAST, while genome completeness is assessed using BUSCO.
  - Annotation: Genome annotation is carried out with Prokka and Bakta.

--mode reference and --mode novo

Reference and De-novo
Once a reference genome is available (de novo assembled or existing), the pipeline follows the same steps in both modes:

Aggregation of quality reports: A summary report is generated with MultiQC, incorporating the FastQC reports and, depending on the mode, the QUAST genome quality report.

Alignment: Reads are aligned against the selected reference genome with BWA-MEM, followed by processing with Samtools.

Variant calling and filtering: Multiple steps are designed to identify, filter and annotate variants.

Variant Identification: Detection of single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) using PicardTools, GATK and/or FreeBayes.

Variant Filtering: Filters are applied to obtain high-confidence variant calls (see Parameters).

Genetic variant annotation: The toolbox SnpEff is used to annotate and predict the functional effects of genetic variants on genes and proteins.

Post-assembly Analyses:

Mass screening of contigs for antimicrobial resistance or virulence genes using ABRIcate.

Identification of antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences using AMRFinder.

--mode assemble

Assemble
With --mode assemble, a simplified pipeline is run:

Post-assembly Analyses:

Mass screening of contigs for antimicrobial resistance or virulence genes using ABRIcate.

Identification of antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences using AMRFinder.

MLST analysis: ARIBA performs a fast MLST analysis on the raw FASTQ reads, while MLST performs a slower analysis on the genome assembly.

Staphylococcus aureus: If --mrsa is true, the spaTyper and sccmec analyses are also performed.

Note

The pipeline includes a script to download the reads from the DB using an Acc_List.txt file:
bash ./workflow/bin/download_reads.sh

Installation

Prerequisites to run the pipeline:

Install Nextflow.
Install Docker or Singularity for container support.
Ensure Java 8 or a later version is installed.

Clone the Repository:

# Clone the workflow repository
git clone https://github.com/AMRmicrobiology/WGS-Analysis-VariantCalling.git

# Move inside the main directory
cd WGS-Analysis-VariantCalling

Local (conda)

To create a local conda environment, type the following commands:

conda env create -n WGS -f enviromentWGS.yaml
conda activate WGS

How to use it?

Run the pipeline using the following commands, adjusting the parameters as needed:

ASSEMBLE

nextflow run main.nf --mode assemble --input "/path/to/data/*_{1,2}.fastq.gz" --mrsa <true> -profile <docker/singularity/conda>

REFERENCE GENOME

nextflow run main.nf --mode reference --input "/path/to/data/*_{1,2}.fastq.gz" --personal_ref "/path/to/bacterial_genome.fasta" -profile <docker/singularity/conda>

DE NOVO

nextflow run main.nf --mode novo --input "/path/to/data/*_{1,2}.fastq.gz" --wildtype_code "Pa01WT" --genome_name_db "Acinetobacter_baumanii_clinical" -profile <docker/singularity/conda>

Parameters

--mode: Analysis to run: assemble, reference or novo.

--input: Path to input FASTQ paired-end files generated by Illumina sequencing (file format: .fastq.gz).
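To make the `*_{1,2}.fastq.gz` naming convention concrete, the following minimal Python sketch groups file names into forward/reverse pairs by sample. It only illustrates the convention; `pair_reads` and the example file names are hypothetical and are not part of the pipeline, which resolves the glob itself.

```python
import re
from collections import defaultdict

# Hypothetical helper: mirrors the "*_{1,2}.fastq.gz" naming convention only.
PAIR_RE = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.fastq\.gz$")

def pair_reads(filenames):
    """Group file names into {sample: (mate1, mate2)} and list incomplete samples."""
    mates = defaultdict(dict)
    for name in filenames:
        match = PAIR_RE.match(name)
        if match:  # files that do not follow the convention are ignored
            mates[match["sample"]][match["mate"]] = name
    pairs = {s: (m["1"], m["2"]) for s, m in mates.items() if len(m) == 2}
    unpaired = sorted(s for s, m in mates.items() if len(m) != 2)
    return pairs, unpaired

# Made-up file names for illustration.
files = ["iso01_1.fastq.gz", "iso01_2.fastq.gz", "iso02_1.fastq.gz", "notes.txt"]
pairs, unpaired = pair_reads(files)
print(pairs)     # complete pairs, keyed by sample name
print(unpaired)  # samples with only one mate present
```

A check like this can be run before launching the pipeline to spot samples whose second mate is missing.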

--outdir: Directory where the results will be stored (default: out).

-profile: Specifies the execution profile (docker, singularity or conda).

--mrsa (only for --mode assemble): Specific to Staphylococcus aureus genome assemblies. Runs the spaTyper and sccmec analyses (default: false).

--genome_name_db (only for --mode novo): Organism name used to name the database in SnpEff.

--wildtype_code (only for --mode novo): Code of the sample that will be used as the reference.

--personal_ref (only for --mode reference): Path to the bacterial reference genome FASTA file.
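The mode-specific options above can be summarized as a small pre-flight check. This is a sketch under assumptions: `build_command` is a hypothetical wrapper, not part of the repository, and it treats `--personal_ref`, `--wildtype_code` and `--genome_name_db` as required for their modes because the usage examples always pass them.

```python
# Mode-specific options as listed in the Parameters section.
MODE_OPTIONS = {
    "assemble": {"mrsa"},
    "reference": {"personal_ref"},
    "novo": {"wildtype_code", "genome_name_db"},
}
# Assumption: these are mandatory for their mode; --mrsa is optional.
REQUIRED = {
    "assemble": set(),
    "reference": {"personal_ref"},
    "novo": {"wildtype_code", "genome_name_db"},
}

def build_command(mode, input_glob, profile, **options):
    """Return the nextflow argument list, or raise ValueError on a bad option set."""
    if mode not in MODE_OPTIONS:
        raise ValueError(f"unknown mode: {mode}")
    missing = sorted(REQUIRED[mode] - set(options))
    if missing:
        raise ValueError(f"--mode {mode} requires: {', '.join(missing)}")
    other_modes = set().union(*MODE_OPTIONS.values()) - MODE_OPTIONS[mode]
    misplaced = sorted(set(options) & other_modes)
    if misplaced:
        raise ValueError(f"not valid with --mode {mode}: {', '.join(misplaced)}")
    cmd = ["nextflow", "run", "main.nf", "--mode", mode, "--input", input_glob]
    for key, value in options.items():
        # Booleans are written in lowercase, as in the usage examples.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        cmd += [f"--{key}", text]
    return cmd + ["-profile", profile]

cmd = build_command("reference", "/path/to/data/*_{1,2}.fastq.gz", "docker",
                    personal_ref="/path/to/bacterial_genome.fasta")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` without shell quoting, which keeps the `*_{1,2}` glob intact for Nextflow.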

Optional parameters

-w: Path to the temporary work directory where files will be stored (default: ./work).

Trimming

--cut_front: Move a sliding window from the front (5') to the tail; drop the bases in the window if their mean quality is below the threshold, and stop otherwise. Default: 15.

--cut_tail: Move a sliding window from the tail (3') to the front; drop the bases in the window if their mean quality is below the threshold, and stop otherwise. Default: 20.

--cut_mean_quality: Mean quality requirement shared by cut_front, cut_tail and cut_sliding. Range: 1-36. Default: 20.

--length_required: Reads shorter than length_required are discarded. Default: 50.
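The sliding-window behaviour described for the trimming options can be illustrated with a simplified Python sketch. It is not the trimmer's actual implementation: the window size of 4 is an assumption (this README does not state it), both ends are evaluated independently on the original read, and `trim_read` is a hypothetical name.

```python
def first_passing_window(quals, threshold, window):
    """Index of the first window whose mean quality reaches the threshold."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window >= threshold:
            return start
    return len(quals)  # no window passes: every base is dropped

def trim_read(seq, quals, front_q=15, tail_q=20, window=4, length_required=50):
    """Return the trimmed (seq, quals), or None if the read ends up too short."""
    start = first_passing_window(quals, front_q, window)            # 5' end
    dropped_tail = first_passing_window(quals[::-1], tail_q, window)  # 3' end
    end = len(quals) - dropped_tail
    if end - start < length_required:
        return None  # discarded, as with --length_required
    return seq[start:end], quals[start:end]

# Made-up 12-base read with poor quality at both ends.
seq = "ACGTACGTACGT"
quals = [2, 2, 10, 30, 30, 30, 30, 30, 30, 5, 5, 2]
print(trim_read(seq, quals, length_required=5))
print(trim_read(seq, quals))  # too short for the default length of 50
```

Because the decision is made on the window mean, a few low-quality bases can survive next to high-quality ones; raising the threshold makes trimming more aggressive.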

Filter

--qual_snp: One or more expressions used with INFO fields to quality filter SNPs. Default: "QUAL < 50.0 || MQ < 25.0 || DP < 30".

--qual_indel: One or more expressions used with INFO fields to quality filter INDELs. Default: "QUAL < 200.0 || MQ < 25.0 || DP < 30".

Note

QUAL: confidence measure of the variant call; MQ: mapping quality; DP: depth, i.e. the number of filtered reads that support each of the reported alleles.
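The default filter expressions flag a variant as soon as any one of the `||`-joined conditions is true. The Python sketch below applies the same thresholds to a few made-up records to show the logic; it does not reproduce how the pipeline evaluates the expressions, and the record values are invented for illustration.

```python
# Thresholds copied from the default --qual_snp and --qual_indel expressions.
SNP_LIMITS = {"QUAL": 50.0, "MQ": 25.0, "DP": 30}
INDEL_LIMITS = {"QUAL": 200.0, "MQ": 25.0, "DP": 30}

def failed_filters(variant, limits):
    """Return the fields whose value falls below its limit."""
    return [field for field, minimum in limits.items() if variant[field] < minimum]

def passes(variant):
    """A variant is kept only if none of the '||'-joined conditions is true."""
    limits = SNP_LIMITS if variant["type"] == "SNP" else INDEL_LIMITS
    return not failed_filters(variant, limits)

# Made-up records for illustration.
calls = [
    {"id": "snp_a", "type": "SNP", "QUAL": 812.4, "MQ": 60.0, "DP": 95},
    {"id": "snp_b", "type": "SNP", "QUAL": 64.0, "MQ": 22.5, "DP": 41},
    {"id": "indel_a", "type": "INDEL", "QUAL": 150.0, "MQ": 60.0, "DP": 52},
]
for call in calls:
    print(call["id"], "PASS" if passes(call) else "FILTERED")
```

Note that `indel_a` would pass the SNP thresholds but is filtered under the stricter INDEL QUAL limit of 200.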
