Frequently Asked Questions


MetaWorks FAQs

Conda is an open source package and environment management system. Miniconda is a lightweight distribution that contains only conda, Python, and their dependencies. Using conda with the environment.yml file provided here gathers most of the programs needed to run this pipeline in one place. If you install miniconda and activate the conda environment provided, you will also get the correct versions of the open source programs used in this pipeline, including Snakemake.

Install miniconda as follows:

# Download miniconda3
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Install miniconda3 and initialize
sh Miniconda3-latest-Linux-x86_64.sh

# Add conda to your PATH, e.g. via a symbolic link in ~/bin
cd ~/bin
ln -s ~/miniconda3/bin/conda conda

# Activate conda method 1 (working in a container)
source ~/miniconda3/bin/activate MetaWorks_v1.13.0

# Activate conda method 2
conda activate MetaWorks_v1.13.0

Ensure the program versions in the environment are being used.

# Create conda environment from file. Only need to do this once.
conda env create -f environment.yml

# activate the environment
conda activate MetaWorks_v1.13.0

# list all programs available in the environment at once
conda list > programs.list

# you can check that key programs in the conda environment are being used (not locally installed versions)
which SeqPrep
which cutadapt
which vsearch
which perl

# you can also check their version numbers one at a time instead of running 'conda list'
cutadapt --version
vsearch --version

If you have an older version of GLIBC (e.g., on CentOS 6), then you may be missing the libc.so.6 file that ORFfinder needs. Ideally, you would contact your system administrator to make a newer version of GLIBC available for ORFfinder. Not ideal, but in a pinch you can use conda to install a slightly newer version of GLIBC that will work, and create some links so ORFfinder knows where to find it. I've tested this solution and it works, but it does affect the versions of cutadapt and VSEARCH compatible with this GLIBC version in your conda environment (use 'conda list' to check program versions).

Use conda to install a slightly newer version of GLIBC:

conda install -c pwwang glibc214

Create a symbolic link to the library:

cd ~/miniconda3/envs/MetaWorks_v1.13.0/lib
ln -s ../glibc-2.14/lib/libc.so.6 libc.so.6

Create the shell script file LD_PATH.sh in the following location to set the environment variable: ~/miniconda3/envs/MetaWorks_v1.13.0/etc/conda/activate.d/LD_PATH.sh
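
The conda activate.d and deactivate.d hook directories may not exist yet in the environment; if not, create them first:

mkdir -p ~/miniconda3/envs/MetaWorks_v1.13.0/etc/conda/activate.d
mkdir -p ~/miniconda3/envs/MetaWorks_v1.13.0/etc/conda/deactivate.d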

Put the following text in the LD_PATH.sh file:

export LD_LIBRARY_PATH_CONDA_BACKUP=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

Create the file LD_PATH.sh in the following location to restore the environment variable on deactivation: ~/miniconda3/envs/MetaWorks_v1.13.0/etc/conda/deactivate.d/LD_PATH.sh

Put the following text in the LD_PATH.sh file:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH_CONDA_BACKUP

Deactivate then reactivate the environment so the change takes effect:
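
conda deactivate
conda activate MetaWorks_v1.13.0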

Test ORFfinder:

ORFfinder

On CentOS 7, ORFfinder may fail because of a missing libnghttp2 library. The missing library can be installed using conda, and the LD_LIBRARY_PATH needs to be updated as follows:

Install libnghttp2 using conda

conda install -c conda-forge libnghttp2

Create the shell script file LD_PATH.sh in the following location to set the environment variable: ~/miniconda3/envs/MetaWorks_v1.13.0/etc/conda/activate.d/LD_PATH.sh

Put the following text in the LD_PATH.sh file:

export LD_LIBRARY_PATH_CONDA_BACKUP=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

Create the file LD_PATH.sh in the following location to restore the environment variable on deactivation: ~/miniconda3/envs/MetaWorks_v1.13.0/etc/conda/deactivate.d/LD_PATH.sh

Put the following text in the LD_PATH.sh file:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH_CONDA_BACKUP

Deactivate then reactivate the environment.

Test ORFfinder:

ORFfinder

If ORFfinder fails because of a missing libdw.so.1 library, it can be installed using conda, and the LD_LIBRARY_PATH needs to be updated as follows:

Install elfutils (which provides libdw.so.1) using conda

conda install -c conda-forge elfutils

Create the shell script file LD_PATH.sh in the following location to set the environment variable: ~/miniconda3/envs/MetaWorks_v1.13.0/etc/conda/activate.d/LD_PATH.sh

Put the following text in the LD_PATH.sh file:

export LD_LIBRARY_PATH_CONDA_BACKUP=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

Create the file LD_PATH.sh in the following location to restore the environment variable on deactivation: ~/miniconda3/envs/MetaWorks_v1.13.0/etc/conda/deactivate.d/LD_PATH.sh

Put the following text in the LD_PATH.sh file:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH_CONDA_BACKUP

Deactivate then reactivate the environment.

Test ORFfinder:

ORFfinder

Ensure that bold.hmm as well as the indexed files (bold.hmm.h3p, bold.hmm.h3m, bold.hmm.h3i, and bold.hmm.h3f) are available in the same directory as your snakefile.

Sometimes it is necessary to rename raw data files in batches. I use the Perl rename (Gergely, 2018) available at https://github.com/subogero/rename, not the Linux rename, because the Perl implementation lets you easily use regular expressions. I first run the command with the -n flag to review the changes without making any actual changes. If you're happy with the results, re-run without the -n flag. For example (the search/replace pattern below is hypothetical):
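
# preview the renames without changing anything (hypothetical pattern)
rename -n 's/oldPrefix/newPrefix/' *.fastq.gz

# apply the renames once you are happy with the preview
rename 's/oldPrefix/newPrefix/' *.fastq.gz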

Symbolic links are like shortcuts or aliases that can be placed in your ~/bin directory and point to files or programs that reside elsewhere on your system. So long as those scripts are executable (e.g., chmod 755 script.plx), the shortcut will also be executable without you having to type out the complete path or copy and paste the script into the current directory.

ln -s /path/to/target/directory shortcutName
ln -s /path/to/target/file fileName
ln -s /path/to/script/script.sh commandName

Large jobs with many fastq.gz files can take a while to run. To prevent unexpected network connection interruptions from stopping the MetaWorks pipeline, set the job up to keep running even if you disconnect from your session:

1. You can use nohup (no hangup).

nohup snakemake --jobs 24 --snakefile snakefile --configfile config.yaml

2. You can use screen.

# to start a screen session
screen
# create a new window: press Ctrl+a, then c
ctrl+a+c
conda activate MetaWorks_v1.13.0
snakemake --jobs 24 --snakefile snakefile --configfile config.yaml
# detach from the session, leaving the job running: press Ctrl+a, then d
ctrl+a+d

# to list all screen session ids; look at the top of the list
screen -ls

# to re-enter a screen session to watch job progress or view error messages
screen -r session_id

3. You can submit your job to the queuing system if you use one.

If you are targeting a broad group, such as Metazoa using COI primers, you can still filter out pseudogenes using removal method 1, which uses ORFfinder. This needs to be done in two steps because vertebrates and invertebrates use different mitochondrial genetic codes.

1. Edit the config_ESV.yaml file as follows to process vertebrates first:

Set the grep search type to 1 and taxon1 to '-e Chordata' to target the phylum Chordata (the vertebrates and their relatives).

Set pseudogene removal method 1 (only uses ORFfinder).

Set genetic code to '2' to use the vertebrate mitochondrial genetic code for translation.

Run snakemake.

When done, create a new directory called Chordata ('mkdir Chordata').

Use 'ls -lhrt' to list files. Move each file created AFTER rdp.out.tmp into the Chordata directory so they do not get over-written in the next step: table.log, taxon.zotus, chimera.denoised.nonchimeras.taxon, orf.fasta.nt, longest.orfs.fasta, taxonomy.csv, ESV.table, results.csv. For example:
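
mv table.log taxon.zotus chimera.denoised.nonchimeras.taxon orf.fasta.nt longest.orfs.fasta taxonomy.csv ESV.table results.csv Chordata/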

2. Edit the config_ESV.yaml file to process invertebrates next:

Set the grep search type to 2, taxon1 to '-e Metazoa', and taxon2 to '-v Chordata' to process all metazoan taxa, excluding Chordata (already processed above).

Keep pseudogene removal method 1.

Set genetic code to '5' to use the invertebrate mitochondrial genetic code for translation.

Run snakemake.

When done, create a new directory called Invertebrates ('mkdir Invertebrates').

Use 'ls -lhrt' to list files. Move each file created AFTER rdp.out.tmp into the Invertebrates directory, as above: table.log, taxon.zotus, chimera.denoised.nonchimeras.taxon, orf.fasta.nt, longest.orfs.fasta, taxonomy.csv, ESV.table, results.csv.

The Chordata and Invertebrates results.csv files (or the component taxonomy.csv and ESV.table files) can then be combined in R during data analysis using 'rbind'.
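
For example, in R (a sketch; assumes both runs produced results.csv files with identical columns, moved into the directories created above):

# read in the results from each run
chordata <- read.csv("Chordata/results.csv", header=TRUE, stringsAsFactors=FALSE)
invertebrates <- read.csv("Invertebrates/results.csv", header=TRUE, stringsAsFactors=FALSE)

# combine them into a single table
results <- rbind(chordata, invertebrates)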

If you get a warning and see that the last file created was rdp.csv.tmp but not the expected results.csv, then you are probably trying to process a very large number of sequence files (thousands) and the job ran into memory problems that prevented the creation of the final results.csv file. You have a couple of options here:

a) Re-run the pipeline where there is more memory available so that the final outfile can be created. On the GPSC, you will need to create an interactive container, activate the conda environment, then unlock the snakemake directory before submitting another job with more memory on a large compute node.

b) Stop here. You can still use the ESV.table file together with the taxonomy.csv file for further processing in R. Both files are indexed by the same ESV ids.
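
A minimal sketch in R (the shared id column is assumed here to be called 'GlobalESV', and ESV.table is assumed to be tab-separated; check your own files):

# read in the two files
taxonomy <- read.csv("taxonomy.csv", header=TRUE, stringsAsFactors=FALSE)
esv <- read.delim("ESV.table", header=TRUE, stringsAsFactors=FALSE)

# join them on their shared ESV ids
merged <- merge(esv, taxonomy, by="GlobalESV")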

A new MetaWorks workflow is being developed that handles analyses with very large numbers of samples more efficiently (in progress).

When snakemake begins, it locks the working directory to prevent another snakemake instance from overwriting files. If a snakemake run was interrupted, for example if you closed your laptop or lost power, the directory remains locked and won't let you re-start a run. To fix this, enter your usual snakemake command with --unlock appended at the end. Once that is done, run your usual snakemake command, for example:
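
# unlock the working directory after an interrupted run
snakemake --jobs 24 --snakefile snakefile --configfile config.yaml --unlock

# then re-start the run as usual
snakemake --jobs 24 --snakefile snakefile --configfile config.yaml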

Because the COI Classifier v5 contains so many more taxa than v4, it needs more memory to run. Try increasing the amount of memory available from 8g to 16g: in the configuration file, under 'Taxonomic Assignment' -> 'RDP' -> 'memory', change -Xmx8g to -Xmx16g.
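
A sketch of the relevant configuration section (the exact key names and layout follow your own config.yaml):

RDP:
    memory: -Xmx16g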

It takes a lot of memory to produce the final file that incorporates both the taxonomic assignments and the ESV/OTU table. In the configuration file, consider selecting output option 2, which produces a separate ESV/OTU table and taxonomy file, instead of option 1.


Data Analysis FAQs

The results.csv file is the final file in the MetaWorks pipeline. An ESV table (also known as an OTU table or pivot table) can be created in Excel or R. In R, reshaping the results.csv file from a longer to a wider tabular format can be done using different libraries.

# R using reshape2 library
ESVtable <- reshape2::dcast(df, SampleName ~ GlobalESV, value.var = "ESVsize", fun.aggregate = sum)
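
The same table can also be produced with the tidyr library (a sketch; the column names SampleName, GlobalESV, and ESVsize are taken from the reshape2 example above):

# R using tidyr library
ESVtable <- tidyr::pivot_wider(df, id_cols = SampleName, names_from = GlobalESV,
  values_from = ESVsize, values_fn = sum, values_fill = 0)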

First, know that each custom-trained classifier has been screened for putative bacterial or human contaminants. These mislabelled records have been reported to GenBank and removed from the reference set. This data curation, done for you, should help reduce false-positive assignments caused by putative contaminants that were labelled as something else.

Second, if you are working with a reference set that is built in to the RDP classifier (16S prokaryote, fungal ITS, fungal LSU), use the cutoffs recommended by the RDP classifier authors: usually 0.8, unless your query sequence is < 250 bp long, in which case use a 0.5 bootstrap support cutoff. Note that the expected accuracy resulting from these cutoffs will vary across reference sets. See the online documentation to get an idea of the expected range of accuracy. Note that this all assumes that all your query sequences are represented in the reference set. This assumption may not be valid for species from understudied regions of the world or for poorly studied, cryptic, or highly diverse species groups.

If you are working with one of our custom-trained classifiers, you can filter your results for different levels of expected accuracy (80-99%). If you choose a higher expected accuracy, fewer of your assignments will pass the bootstrap support thresholds; the lower the expected accuracy you can tolerate, the more assignments will pass. On each GitHub classifier page (see the list of custom classifiers available on the main page), you will find the bootstrap support cutoffs that ensure a certain level of expected accuracy; these are unique to each classifier, approximate query length, and taxonomic rank.

The best way to filter your results is during data analysis, for example in R. The example below removes assignments that do not meet the bootstrap proportion cutoffs recommended for a particular level of taxonomic assignment accuracy by replacing them with an empty string (""). Another option would be to recode low-confidence assignments to "Unknown", as shown after the example.

#read in MetaWorks results
COI <- read.csv("results.csv", header=TRUE, stringsAsFactors=FALSE)

# filter COI taxonomic assignments for 95% correct species, 99% correct genus & up
# See https://github.com/terrimporter/CO1Classifier for cutoffs
COI$Species <- ifelse(COI$sBP >= 0.7, COI$Species, "")
COI$Genus <- ifelse(COI$gBP >= 0.3, COI$Genus, "")
COI$Family <- ifelse(COI$fBP >= 0.2, COI$Family, "")
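
For the second option, the same test can assign "Unknown" instead of an empty string:

# recode low-confidence species assignments to "Unknown" instead of blanking them
COI$Species <- ifelse(COI$sBP >= 0.7, COI$Species, "Unknown")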

The ESV x sample table is generated before ITS extraction or pseudogene filtering, so you may see a number of GlobalESV ids associated with the same trimmed ITS or ORF sequence: the sequences become identical after trimming, but the original denoised ESVs were unique. For convenience, the extracted ITS and good ORF sequences are provided in the results.csv file (this may change in future versions). If you wish to conduct further analyses with the unique ESVs (the full-length sequences before ITS or pseudogene trimming), you can 'look up' the GlobalESV ids from the results.csv file to find the matching sequences in the cat.denoised.nonchimeras FASTA file. Note that you should not use the cat.denoised.nonchimeras file as-is, because it will contain sequences with no taxonomic assignment, either because ITSx filtering failed or because they represent pseudogenes.
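
One way to do the look-up at the command line (a sketch; assumes the FASTA headers begin with the GlobalESV id, 'Zotu123' here is hypothetical, and each sequence sits on a single line):

# pull one ESV's header and full-length sequence
grep -A 1 ">Zotu123" cat.denoised.nonchimeras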

This site is maintained by Teresita M. Porter (terrimporter AT gmail DOT com) and Artin Mashayekhi.
Documentation License: CC-BY 4.0