Step-by-step Installation guide
Forword
This document aims to be a step-by-step handy guide to your first installation, configuration, and usage of FooDMe2. It does not replace the workflow documentation taht can be found following the link in the left navigation panel. Following this guide alone will only bring you as far as setting up FooDMe2 for basic analyses on your system, do read the documentation to know how to properly use the workflow!
You can copy the code from each block and paste it in your shell to follow along. Click on the question marks in the code blocks to know more. You can also check the links to jump to the relevant documentation section.
Should anything be unclear or not work as expected, we ask for your understanding, maintaining this guide along the pipeline is considerable work and we may have forgotten something, or some information may be outdated. Please tell us about it by submitting an issue in the FooDMe2 Github repository.
Requirements
To follow along you will need:
- A UNIX system: any Linux distribution, MacOS, or Windows Subsystem for Linux.
- An working Internet connection, depending on your local setup the proxy settings.
- For the installation of Nextflow (compulsary) and Docker or Singularity/Apptainer (optional), you may need root (
sudo) access.
We assume basic knowledge of the UNIX shell and the examples used will rely solely on the analysis of paired-end illumina data of meat (mammals and birds) samples as a practical example.
Note
We provide some degree of explanations for the installation of Nextflow and dependency managers here, but we provide only very limited support for these. If you experience problems with the installation of nextflow and/or depedency managers, you should get in touch with your system administrator or IT support.
Installation
Installing Nextflow
The first step is to install the workflow manager Nextflow, which is going to take care of the workflow execution planning.
First check wether nextflow is installed on your system:
FooDMe should run with nextflow >= 24.10.5. If nextflow is installed with an older version run:
If nextflow is not installed, you will need to follow the step-by-step instructions from the Nextflow documentation. Note that you might need sudo rights!
Dependency managers
Nexflow relies on dependency managers for the deployement of software dependencies in self contained environments or containers. We always recommend that you run FooDMe2 with a container engine, such as Apptainer, Singularity, or Docker. Containers are faster to install and a more reliable way to ensure pipeline reproducibility over time. If containers are not an option in your settings or you are experiencing troubles getting them to work, you can also use the Conda/Mamba package manager.
Warning
Conda environments, though usually easy to deploy are not guaranteed to reproducibly install all dependencies in the same versions. For this reason we recommend that FooDMe2 be run with a container-based manager such as Apptainer/Singularity or Docker.
For a discussion on conda reproducibility, see PMID29953862.
Check if installed:
If it is not installed, check the official installation guide, or choose another dependency manager.
Check if installed:
If it is not installed, check the official installation guide, or choose another dependency manager.
Check if installed:
Ensure that your user is part of the Docker group ('docker'), or that Docker is otherwise configured to allow you to execute it.
If it is not installed, check the official installation guide, or choose another manager.
Check if installed:
If it is, great! Make sure it is configured properly to find and install the required packages. The conda-forge channel should be on top of
the channel list, followed by bioconda and channel priority should be set to strict.
If it is not the case you can run:
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
If it is not installed, we recommend you to install the miniforge distribution following the official guide.
When you are done, configure the channels as above. You will need to close you shell and reopen it after the installation for the conda command to be available.
Nextflow configuration
In order to set some of the execution options for nextflow, like ressource allocation, dependency manager, or where to look for the locally installed databases, it is possible to configure Nextflow using one of these options:
- remote configuration (recommended) on the central
bio-raum/nf-configsgithub repository, allows to fully pre-configure FooDMe2 in a way that is specific to your setting with a singleswitch. - alternatively, one can use a local configuration file, although this limits the scope of possible configurations.
- no configuration (not recommended) is also possible, but you will have to manually provide parameters at each execution.
Note
Setting up a remote profile can take a little bit of time. If possible, plan ahead or use an alternative method in the meantime.
Warning
Working behing a proxy may require some extra configuration.
A few examples are given below; these can be extended as described in the Nextflow documentation.
Warning
All examples below assume that you run FooDMe2 on a local system (executor = local).
If you are planning to use this pipeline on a distributed compute system ("cluster"), or the cloud,
please refer to the Nextflow documentation to learn about alternative execution profiles.
An example for a Slurm cluster can be found here.
To upload your own configuration to the central bio-raum/nf-configs repository, open an issue
with your planned configuration or directly send a pull-request. Alternatively, you can use on of the existing configuration if it fits your needs.
Warning
Before using the examples below, make sure you set the resource limits of your compute environment (for example: 8 CPU cores, 30 Gb of Ram),
and adjust the paths that will be used to store the databases and dependencies on your system (reference_base and cacheDir).
If you run on a local system with a single user, you may install the references into your home directory
($HOME, /home/youruser). In a shared setting, particularly on distributed compute infrastructures, the references should be installed
to a shared directory where all users can access them.
For the following it will be assumed that the configuration file is saved under $HOME/nextflow as local.config.
!!! Warning {.callout-warning}
Before using the examples below, make sure you modify the ressources (max_cpus and max_memory) so that they fit your system,
and adjust the paths that will be used to store the dependencies on your system (cacheDir).
If you run on a local system with a single user, you may install the references into your home directory
($HOME, /home/youruser). In a shared setting, particularly on distributed compute infrastructures, the references should be installed
to a shared directory where all users can access them.
Installing the databases
Now that everything is installed and Nextflow is configured, we can install the databases that will be used for the analysis. Several different databases for different sources such as NCBI, MIDORI, or UNITE can be installed automatically and used in a standardized way by FooDMe2. You can also provide your own dataases.
Note
You will see several parameters used in the commands below that are not immediatly important. These will be explained further in tis document or you can check them up in the online documentation.
nextflow run bio-raum/FooDMe2 \
-profile remote \ # (1)!
-r 1.4.0 \
--build_references \
--run_name build \
--skip_genbank # (2)!
remoteis the name of your file in thebio-raum/nf-configsrepository, without the.configextension.- The Genbank core database can be installed by omitting this line, note that this will take up several hundred Gb and may require a substantial amount of RAM to use.
nextflow run bio-raum/FooDMe2 \
-c $HOME/nextflow/local.config \ # (1)!
-r 1.4.0 \
--build_references \
--reference_base $HOME/nextflow/refs \ # (2)!
--run_name build \
--skip_genbank # (3)!
- This should be the path to your local configuration file
- This is the path to the folder to which the databases will be downloaded.
- The Genbank core database can be installed by omitting this line, note that this will take up several hundred Gb and may require a substantial amount of RAM to use.
After the command above ran, the $HOME/nextflow/refs folder should contain a folder called foodme2 that itself contains several versioned databases.
Testing the installation
When the database installation is finished, it is time for the last installation step before we dive in the analysis: testing the workflow.
Using the test profile as below will download a couple of sequencing data from ENA and run a standard analysis workflow on them.
remoteis the name of your file in thebio-raum/nf-configsrepository, without the.configextension.
Nextflow should keep you informed of what is happening and after the workflow completed, you can check the results folder of
the current working directory to have a look at the results.
Tip
Congrats! We are done with the installation, and hopefully everything looks good.
Be sure to check the results/reports folder in particular contains the analysis result and useful QC metrics to evaluate your samples.
If you encountered an Error, check if you scuccesfully performed all the steps above.
Feel free to get in touch for help troobleshooting
or to report a problem.
Workflow management
The next steps will push further into real-life usage. But first let's talk about a few important features of Nextflow. In addition to running the workflow, Nextflow provides several utilities that are worth knowing (and using!).
Clean temporary data
Each step of the workflow is executed in a folder generated and managed by Nextflow under the work folder in the current directory.
Nescessary files are copied to the results folder after workflow completion.
Although Nextflow relies a lot on symlinks, the work folder can quickly grow full of old and unesscessary data.
It is generally a good practice to keep separate folders for your various analyses so that you don't have to deal with a balooning work directory. Still, at times you may want to clean up your pipeline runs:
- It is possible to simply delete the
workfolder manually using standard Unix commands.
- Nextflow also provides a
cleanutility that removes (most) temporary data, which you can incorporate in your workflow (see examples at the end of this document).
Update workflows
Nextflow natively handles workflow versionning, to get the last version of the workflow from the online repository, run:
Versioning
FooDMe2 is semantically versioned, using a MAJOR.MINOR.PATCH model where:
- MAJOR is a major version, possibly introducing breaking changes
- MINOR is a minor version, introducing changes in workflows that are backwards compatible
- PATCH is mainly used for bugfixes and documentation
Each new MAJOR or MINOR release requires a re-verification of the analytics. MAJOR changes can potentially require a re-validation.
Nextflow enables you you decide which version of the pipeline you use by providing a release tag to the -r argument:
To use the latest release use -r main, and to use the lastest development version use -r dev.
Tip
In order to ensure reproducibility, you should pin the workflow version for routine analysis.
Bug
While we do our best to release bug-free versions, the dev branch is under active development and may contain bugs or incomplete features.
Use it at your own risk and report any issues you may encounter.
Running an analysis
In order to run the workflow on your own data we will have to modify the above example by providing
your data as input and an analysis method instead of the test profile.
All options are listed in the online documentation.
There are two ways to provide data, assuming that the read files are saved under /home/user/metabarcoding/rawdata/ (replace user with your user name):
- Provide a sample-sheet linking sample names to the files as a tab-separated table (recommended):
sample fq1 fq2
S1 ~/home/user/metabarcoding/rawdata/S1_R1.fastq.gz ~/home/user/metabarcoding/rawdata/S1_R2.fastq.gz
Tip
The bio-raum github page contains a workflow that can automatically generate a sample-sheet for you!
See the example in the next section.
- Provide a path wildcard linking both read files. Note that you cannot provide sample names and FooDMe2 will not be able to deal with multi-lane libraries this way.
Where the * represent the variable part of the paths (usually the sample name, Illumina sample and Lane numbers) and {1,2} the numbering of the read pairs.
Variables will not be expanded in the wildcards so you have to provide a full path and use single quotes.
For example:
nextflow run bio-raum/FooDMe2 \
-profile remote \ # (1)!
-r 1.4.0 \ # (2)!
--run_name first_run \ # (3)!
--input $HOME/metabarcoding/rawdata/samples.tsv \ # (4)!
--primer_set 16S_ILM_ASU184_meat # (5)!
remoteis the name of your file in thebio-raum/nf-configsrepository, without the.configextension.- Specifies the version of the FooDMe2 workflow to use.
- Sets the name of the run to "first_run". This name will be used to label the output files.
- Specifies the path to the sample-sheet containing the sample information.
- Specifies the standard method to use for the analysis. List the available methods with
--list_primers.
nextflow run bio-raum/FooDMe2 \
-profile remote \ # (1)!
-r 1.4.0 \ # (2)!
--run_name first_run \ # (3)!
--reads '/home/user/metabarcoding/rawdata/*_R{1,2}.fastq.gz' \ # (4)!
--primer_set 16S_ILM_ASU184_meat # (5)!
remoteis the name of your file in thebio-raum/nf-configsrepository, without the.configextension.- Specifies the version of the FooDMe2 workflow to use.
- Sets the name of the run to "first_run". This name will be used to label the output files.
- Specifies the path to the paired read-files using a wildcard. Note the use of full path and single quotes!
- Specifies the standard method to use for the analysis. List the available methods with
--list_primers
nextflow run bio-raum/FooDMe2 \
-c $HOME/nextflow/local.config \ # (1)!
-r 1.4.0 \ # (2)!
--run_name first_run \ # (3)!
--input $HOME/metabarcoding/rawdata/samples.tsv \ # (4)!
--primer_set 16S_ILM_ASU184_meat \ # (5)!
--reference_base $HOME/nextflow/refs # (6)!
- Path to your local configuration file.
- Specifies the version of the FooDMe2 workflow to use.
- Sets the name of the run to "first_run". This name will be used to label the output files.
- Specifies the path to the paired read-files using a wildcard. Note the use of single quotes!
- Specifies the standard method to use for the analysis. List the available methods with
--list_primers - Path to the folder you used to install the references.
nextflow run bio-raum/FooDMe2 \
-c $HOME/nextflow/local.config \ # (1)!
-r 1.4.0 \ # (2)!
--run_name first_run \ # (3)!
--reads '/home/user/metabarcoding/rawdata/*_R{1,2}.fastq.gz' \ # (4)!
--primer_set 16S_ILM_ASU184_meat \ # (5)!
--reference_base $HOME/nextflow/refs # (6)!
- Path to your local configuration file.
- Specifies the version of the FooDMe2 workflow to use.
- Sets the name of the run to "first_run". This name will be used to label the output files.
- Specifies the path to the paired read-files using a wildcard. Note the use of full path and single quotes!
- Specifies the standard method to use for the analysis. List the available methods with
--list_primers - Path to the folder you used to install the references.
Tip
You can additionaly use the --output parameter to specify a path for the results folder.
For example adding the following parameter to the above command will write the results to
$HOME/metabarcoding/results/first_run instead of the results folder in the current directory.
Going further
Automating things
In a diagnostic laboratory setting, it is usefull automate things as much as possible to improve reproducibility and save time. It particular, it is possible to automate workflow execution using simple scripts.
In the following paragraphs we will set up a small script that allows to run the workflow on data present locally in a folder. Bear in mind that this is just a simple example that can be modified and expanded as will.
Folder structure
We already established an folder structure throughout this document. We will add a scripts folder in which we will save scripts, and an archive folder
to store rawdata after analysis.
It should look like this:
The idea here is that:
- users can copy their sequencing files into the
rawdatafolder - running the script will ensure the data is analysed and output into the
resultsfolder in meanigfully named subfolder - raw data files are moved in the
archivefolder - the
rawdatafolder is cleaned up and ready for new data
Creating the sample-sheet
For this example we will use the samplesheet workflow from the bio-raum Github collection.
The good news is that it is a standardized bio-raum workflow that works very similarly to FooDMe2. Assuming you adopted the folder structure above, you can create the sample sheet in the raw data folder with:
Output and run naming
We can automate run naming by using the execution date and a name for the run:
run_name="ilovefoodme"
run_id=$(date --iso-8601) # (1)!
mkdir -p $HOME/metabarcoding/results/${run_id}_${run_name} # (2)!
mkdir -p $HOME/metabarcoding/archive/${run_id}_${run_name} # (3)!
- Generate today's date as ISO8601 format 'YYYY-MM-DD'
- Create output folders with the date as filenames. The
-poptions allows reuse of existing folders. - Create output folders with the date as filenames. The
-poptions allows reuse of existing folders.
Tip
When writting a script, we can also use command line arguments to create the folder (and run) name:
run_name="${1:-foodme2}"
run_id=$(date --iso-8601)!!
mkdir -p $HOME/metabarcoding/results/${run_id}_${run_name}
mkdir -p $HOME/metabarcoding/archive/${run_id}_${run_name}
The snippet above will take any variable given after the script call and use it as run name, reverting to foodme2 as default.
See it used in the next section.
Putting it together
We can now put everything together in a neat little script:
#!/usr/bin/env bash
set -Eeuo pipefail # (1)!
VERSION='1.4.0' # (2)!
# create samplesheet
nextflow run bio-raum/samplesheet \
-r 0.1 \
-profile remote \ # (3)!
--input $HOME/metabarcoding/rawdata \
--outdir $HOME/metabarcoding/rawdata \
&& nextflow clean
# create output dirs
run_name="${1:-foodme2}"
run_id=$(date --iso-8601)
mkdir -p $HOME/metabarcoding/results/${run_id}_${run_name}
mkdir -p $HOME/metabarcoding/archive/${run_id}_${run_name}
# run workflow
nextflow run bio-raum/FooDMe2 \
-profile remote \ # (6)!
-r $VERSION \
--run_name ${run_id}_${run_name} \
--input $HOME/metabarcoding/rawdata/samples.tsv \
--primer_set 16S_ILM_ASU184_meat \
--outdir $HOME/metabarcoding/results/${run_id}_${run_name} \
&& mv $HOME/metabarcoding/rawdata/* $HOME/metabarcoding/archive/${run_id}_${run_name} \ # (4)!
&& nextflow clean # (5)!
- The first two lines are simply there to ensure that the correct shell is used and eventual errors are correctly reported.
- Putting the version at the top of the file so it is easy to trace.
- This should be the name of your remote configuration, without the
.configextension. - If nextflow finished succesfully, move all the content of the
rawdatafolder to a dated folder underarchive. - If nextflow finished succesfully, clean the work folder and cache.
- This should be the name of your remote configuration, without the
.configextension.
#!/usr/bin/env bash
set -Eeuo pipefail # (1)!
VERSION='1.4.0' # (2)!
# create samplesheet
nextflow run bio-raum/samplesheet \
-r 0.1 \
-c $HOME/nextflow/local.config \
--input $HOME/metabarcoding/rawdata \
--outdir $HOME/metabarcoding/rawdata \
&& nextflow clean
# create output dirs
run_name="${1:-foodme2}"
run_id=$(date --iso-8601)
mkdir -p $HOME/metabarcoding/results/${run_id}_${run_name}
mkdir -p $HOME/metabarcoding/archive/${run_id}_${run_name}
# run workflow
nextflow run bio-raum/FooDMe2 \
-c $HOME/nextflow/local.config \
-r $VERSION \
--run_name ${run_id}_${run_name} \
--input $HOME/metabarcoding/rawdata/samples.tsv \
--primer_set 16S_ILM_ASU184_meat \
--reference_base $HOME/nextflow/refs \
--outdir $HOME/metabarcoding/results/${run_id}_${run_name} \
&& mv $HOME/metabarcoding/rawdata/* $HOME/metabarcoding/archive/${run_id}_${run_name} \ # (3)!
&& nextflow clean # (4)!
- The first two lines are simply there to ensure that the correct shell is used and eventual errors are correctly reported.
- Putting the version at the top of the file so it is easy to trace.
- If nextflow finished succesfully, move all the content of the
rawdatafolder to a dated folder underarchive. - If nextflow finished succesfully, clean the work folder and cache.
Now make sure that the script is executable and run it:
chmod +x $HOME/metabarcoding/scripts/run_foodme2.sh # (1)!
bash $HOME/metabarcoding/scripts/run_foodme2.sh run_name # (2)!
- This ensures that the script is executable and is only required once.
run_nameis optional here, but allows you to specify a custom name for your run.
Validation
Like we saw for the test, FooDMe2 packs a profile for quick validation of the mammals and birds methods. Details of the execution and results can be found in the online documentation.
The validation can be executed with:
This will trigger the download of the dataset from the ENA and run FooDMe2 with the 16S_ILM_ASU184_meat primer sets configuration.
In addtion to normal runs, a confusion matrix will be produced, allowing to calculate metrics on precision and accuracy for the analysis.
It is possible that the connection with ENA breaks down, blocking the execution of the workflow. If restarting the command doesnÄt work, try to download the dataset manually from ENA under project number PRJEB57117.
You can then start the validation manually by providing the read files to the workflow as explained above, and additionally providing the ground truth table supplied with the workflow. You can see the table here. Either save the content to a new file or use:
wget -O $HOME/metabarcoding/dobrovolny_benchmark_groundtruth.csv https://raw.githubusercontent.com/bio-raum/FooDMe2/main/assets/validation/dobrovolny_benchmark_groundtruth.csv
You can then provide the ground-truth table to the workflow with:
nextflow run bio-raum/FooDMe2 \
-c $HOME/nextflow/local.config \
-r 1.4.0 \
--run_name validation \
--input $HOME/metabarcoding/rawdata/samples.tsv \
--primer_set 16S_ILM_ASU184_meat \
--reference_base $HOME/nextflow/refs \
--outdir $HOME/metabarcoding/results/validation \
--ground_truth $HOME/metabarcoding/dobrovolny_benchmark_groundtruth.csv
You can also provide you own truth table to perform validation using your own samples. To know how check the "Benchmarking" part of the Usage documentation.
Other methods
FooDMe2 is of course not limited to the birds and mammals method. Several other methods are currently being worked on and will be listed in the documentation when available.
You can analyze any kind of metabarcoding data without a standard method, this will require that you modify several analysis parameters. Which parameters can be accessed and what they do is described in the documentation.
Tip
If you followed until here, congratulations! You are all set to start using FooDMe2. In case of doubts, remember to read the documentation!