Exercises for Week 5

Recap job submission on Cerberus

Before moving forward, I will go over once more how to install tools and run jobs on Cerberus, the remote computer you have been connecting to in order to run some of the more computationally intensive jobs.

Important things to remember

  • When running computational tasks on cerberus, do not just copy-paste the command examples (you need to use an sbatch job submission script to submit jobs)

  • To create an sbatch script, you have a few options:

  1. Connect to cerberus remotely and create the sbatch script using a command-line text editor such as vim, emacs, nano, or pico. This is the preferred way of creating an sbatch script.

  2. Create the sbatch script on your laptop using one of the text editors mentioned above and upload it to your exercises folder on cerberus. Beware: files created on a Windows computer may not be compatible with Linux systems and may cause problems when you try to submit the job on cerberus. If you create a file on a Windows computer, you may have to convert it to a Unix-compatible format using the dos2unix command. This command may or may not be present on the remote computer; if it is not, you will have to install it using conda, as sketched below.
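
A minimal sketch of the conversion, assuming a hypothetical script name run_trim.sh (substitute your own file name):

conda install dos2unix    # only if dos2unix is not already present
dos2unix run_trim.sh      # converts Windows (CRLF) line endings to Unix (LF) in place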

Install tools needed for today’s exercises

You need to install a few tools for today’s exercises. They are not already installed on Cerberus, so you cannot use the module load command to load them in your sbatch script. Instead, install them with the conda command and then run them. Once they are installed through conda, you can run them just as you would run tools on your laptop; you do not need a module load line for tools installed using conda.

The list of tools you should install on cerberus using the conda command is:

- megahit
- unicycler
- bbmap
- samtools
- sra-tools
- seqtk
- filtlong
- metabat2
- checkm-genome

Search for them and install them using conda (see the note after this list if conda cannot find a package):

conda install megahit
conda install unicycler
conda install bbmap
conda install samtools
conda install sra-tools
conda install seqtk
conda install filtlong
conda install metabat2
conda install checkm-genome
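
If conda reports that it cannot find one of these packages, your conda setup may not have the bioconda channel configured. These tools are distributed through bioconda, so you can name the channels explicitly on the command line, for example:

conda install -c bioconda -c conda-forge megahit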

Run metagenome assemblies on cerberus

You ran SPAdes on your laptop and on a remote computer last week. This week, you will use SPAdes, and specifically metaSPAdes, to run a metagenome assembly on cerberus. SPAdes is a genome assembler that was originally designed to assemble single-cell genomic data but has since evolved into a multipurpose assembler able to handle a wide variety of sequencing data. The authors of the tool have written a version of SPAdes known as “metaSPAdes” that you will be using today in class. See here for the paper describing metaSPAdes: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411777/.

The metagenomic data you downloaded last week (and should have uploaded to cerberus last week) is here: https://www.ncbi.nlm.nih.gov/sra/SRX4741377. The SRA accession number to download this data is SRR7905025. Make sure that you are using this data to run the metagenomic assembly today.

Ideally, it should be stored in your data folder on cerberus. For assignment 3, I asked you to run fastqc to check the sequence quality; if you haven’t already done so, please do that now. Next, you will trim your raw metagenomic reads using the bbduk.sh tool you used a few weeks ago, both to trim away adapter regions and to improve sequence quality. Remember to use the sbatch command to submit jobs on cerberus rather than typing the commands directly into the terminal.

Trimming raw metagenomic reads

An example sbatch script to trim the metagenomic reads is provided below, but you must edit it so that it runs in your own “exercises” directory. Again, do not copy-paste this script into your terminal window after connecting to cerberus. Create an sbatch file using vim or a similar editor, then paste the commands into the newly created text file. Make sure that all the files in this example are present in the folders where this sbatch script will be run; otherwise, edit the file paths accordingly.

#!/bin/bash
#SBATCH -o trim.%j.out
#SBATCH -e trim.%j.err
#SBATCH -N 1
#SBATCH -J TRIM
#SBATCH -t 2:00:00
#SBATCH -p defq

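# ktrim=r trims adapter sequence from the right (3') end of reads using
# 21-mers (k=21; mink=11 allows shorter matches at read tips) against
# ref=~/adapters.fa; qtrim=rl trimq=20 quality-trims both ends to Q20;
# minlen=50 discards reads shorter than 50 bp after trimming.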
bbduk.sh ktrim=r ordered minlen=50 mink=11 rcomp=f k=21 ow=t zl=4 \
    qtrim=rl trimq=20 \
    in1=SRR7905025_1.fastq.gz \
    in2=SRR7905025_2.fastq.gz \
    ref=~/adapters.fa \
    out1=SRR7905025_1.trimmed.fastq.gz \
    out2=SRR7905025_2.trimmed.fastq.gz

Run this script in the folder where the raw metagenomic data are located. For example, if the script is named run_trim.sh and it sits in your data folder along with the fastq files, submit it from there as sketched below. Once bbduk.sh is done, you can use the trimmed fastq files directly as input for the metagenome assemblies.
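
A minimal sketch of submitting and monitoring the job, assuming the script and the raw fastq files both sit in a data folder under your home directory (adjust the path to your own layout):

cd ~/data
sbatch run_trim.sh
squeue -u $USER    # check that your job is queued or running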

Running metagenomic assembly using Megahit

We will be using a metagenome assembly tool known as megahit to perform a metagenome assembly today. It runs much faster and needs less memory than metaSPAdes, but is generally considered less accurate. Because it finishes quickly, you will be able to inspect the assembly results during class. metaSPAdes runs longer, so submit that job after you have submitted the Megahit run and let it continue after class is over.

The paper describing Megahit is available here (https://pubmed.ncbi.nlm.nih.gov/25609793/) and the github page can be found here (https://github.com/voutcn/megahit). You can also see how to run this assembler at this wiki page (https://sites.google.com/site/wiki4metagenomics/tools/assembly/megahit).

An example command to run a metagenome assembly looks like this:

megahit -1 SAMPLE_1.fastq  -2 SAMPLE_2.fastq  -m 0.5  -t 12  -o megahit_result

Here, -m 0.5 tells the tool to use 50% of the available memory and -t 12 tells it to use 12 CPU cores for the assembly. On cerberus, you should run megahit through an sbatch script. An example is shown below.

#!/bin/bash
#SBATCH -o mega.%j.out
#SBATCH -e mega.%j.err
#SBATCH -D /home/jsaw/exercises
#SBATCH -J MEGA
#SBATCH -N 1
#SBATCH -t 4:00:00
#SBATCH -p defq

megahit \
    -1 ../data/SRR7905025_1.trimmed.fastq.gz \
    -2 ../data/SRR7905025_2.trimmed.fastq.gz \
    -t 16 \
    -o megahit_assembly

Here, I am running the assembly in my exercises folder, and the Megahit results will be stored in the subfolder megahit_assembly. Make sure the sbatch script is customized for your own folders and file names. The assembly should take less than 30 minutes; you can follow its progress as shown below, and while the job is running, continue with the metaSPAdes assembly run.
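
Megahit prints its progress messages to standard error, so you can follow along by tailing the sbatch error file while the job runs (the job ID in the file name below is hypothetical; use the one sbatch reports):

tail -f mega.123456.err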

Running metagenomic assembly using metaSPAdes

You will use the same trimmed fastq files as input to metaSPAdes. An example sbatch script is shown below but, again, make sure to change the relevant paths and names so that it runs under your own account on cerberus:

#!/bin/bash
#SBATCH -o asm_meta.%j.out
#SBATCH -e asm_meta.%j.err
#SBATCH -D /home/jsaw/exercises
#SBATCH -J ASM_META
#SBATCH -N 1
#SBATCH -t 4:00:00
#SBATCH -p defq

module load spades/3.14.1

spades.py \
    --meta \
    -m 300 \
    -t 16 \
    --pe1-1 ../data/SRR7905025_1.trimmed.fastq.gz \
    --pe1-2 ../data/SRR7905025_2.trimmed.fastq.gz \
    -k 21,33,55,77 \
    -o spades_meta

You will notice that the commands are very similar to those you used to assemble the Salmonella enterica genome. The major difference is the --meta flag, which indicates that you are running a metagenomic assembly. This flag is crucial: it makes the difference between assembling a single genome and assembling multiple genomes from a community (i.e., a metagenome). I am also specifying -m 300 to set a 300 GB memory limit, matching the compute nodes on cerberus.

Warning: metaSPAdes will take a while to complete due to the large size of the metagenomic data. In my test run, it took about 3 hours. You will therefore not be able to inspect the results until after class is over, but you should still submit the job to put it in the queue. If the job does not finish within the 4 hours allowed, there is an option to continue the assembly, as shown below: SPAdes can resume from the last checkpoint based on the log and intermediate files produced in the previous run.
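
SPAdes exposes this through its --continue flag, which resumes from the last checkpoint in the existing output folder; when continuing, you only pass -o, and SPAdes reuses the options from the interrupted run. A minimal sketch of the command to put in a new sbatch script:

module load spades/3.14.1

spades.py --continue -o spades_meta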

Checking assembly metrics using Quast

You should be able to inspect the metagenome assembly produced by megahit during class, but you will most likely have to inspect the metaSPAdes assembly results after class. Remember the quast.py command you used last week to check assembly metrics? Use the same type of command here. You can either run Quast on cerberus or first download the assembly results to your laptop and run Quast locally; it is up to you. Either way, you will still need to download the results to your laptop to view some of the files (such as PDF files or html files showing interactive plots).

I will not display the rsync command to download files here because some of you have not completed assignment 3 and I do not want to give away the answers. I want you to figure out how to download files to your laptop on your own.

Checking megahit assembly results

I ran a megahit assembly on cerberus and it produced a folder named megahit_assembly. If you look into the folder, you will see the following files:

[Screenshot: file listing of the megahit_assembly folder]

You will need to run Quast on the file named final.contigs.fa.

Checking metaSPAdes assembly results

As for the metaSPAdes assembly, look into the output folder produced by metaSPAdes and run Quast on the contigs.fasta file. Make sure you run Quast on both the Megahit and metaSPAdes assemblies, then compare the assembly metrics produced by the two assemblers (an example command is sketched below). What do you see?
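
Quast accepts several assemblies in a single run and reports their metrics side by side. A sketch, assuming both output folders sit in your exercises directory (adjust the paths to your own layout):

quast.py \
    -o quast_comparison \
    megahit_assembly/final.contigs.fa \
    spades_meta/contigs.fasta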

Download data for unicycler assembly

Today, in addition to running metagenomic assemblies, you will learn how to run hybrid assemblies using both short-read Illumina and long-read Nanopore sequences. The tool we will be using for this hybrid assembly is known as “Unicycler”. The paper describing the tool is found here: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005595, and the Github page, where you can learn more about the tool, is here: https://github.com/rrwick/Unicycler.

This tool was specifically written to handle long Nanopore reads, which can reach well over several thousand bases, and in some cases a million or more. In conjunction with long-read Nanopore data, it makes it possible to assemble complex genomes containing highly repetitive regions that would otherwise interfere with correct genome assembly. It also makes it possible to assemble near-complete genomes, because long reads can span many of the gap regions that usually remain in ordinary short-read assemblies.

Today, we will use publicly available hybrid data from a recent study that assembled the complete genome of Janthinobacterium lividum EIF2, published in Genome Biology and Evolution: https://academic.oup.com/gbe/advance-article/doi/10.1093/gbe/evaa148/5870831

The raw data have been deposited to NCBI at the following links:

https://www.ncbi.nlm.nih.gov/sra/SRR11070452

https://www.ncbi.nlm.nih.gov/sra/SRR11070453

Download those two datasets to your data folder on cerberus using the fasterq-dump command, which will be available after you install sra-tools through conda.

Go to your data folder and create an sbatch script that contains these commands:

fasterq-dump --split-3 SRR11070452

fasterq-dump --split-3 SRR11070453

Then submit the sbatch script to run the fasterq-dump tool. At the end of the download process, you should have two fastq files from Illumina sequencing and one fastq file from Nanopore sequencing. You will still need to perform read trimming on those reads. Use the bbduk.sh example I showed for trimming metagenomic reads (scroll up) and write an sbatch script to trim the Illumina reads with bbduk.sh.
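
Note that fasterq-dump writes uncompressed .fastq files, while the example commands further down assume gzipped input. You can compress the downloaded files once the job finishes, for example:

gzip SRR11070453_1.fastq SRR11070453_2.fastq SRR11070452.fastq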

To perform read trimming of Nanopore reads by quality, we use the tool filtlong. See here: https://github.com/rrwick/Filtlong

An example command to trim Nanopore reads is shown below:

filtlong --min_length 1000 --keep_percent 90 reads.fastq.gz | gzip > reads.trimmed.fastq.gz

Here, --min_length 1000 discards reads shorter than 1,000 bases and --keep_percent 90 keeps the best 90% of the remaining read data by quality. Note that filtlong writes uncompressed fastq to standard output, hence the pipe through gzip. Submit an sbatch script containing a similar command to trim your Nanopore reads; a sketch is shown below.
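
A minimal sketch of such a script, assuming the Nanopore reads were gzipped as suggested above and that the script is submitted from the folder containing them (edit the file names, paths, and partition for your own setup):

#!/bin/bash
#SBATCH -o filtlong.%j.out
#SBATCH -e filtlong.%j.err
#SBATCH -p defq
#SBATCH -N 1
#SBATCH -J FILTLONG
#SBATCH -t 2:00:00

filtlong --min_length 1000 --keep_percent 90 SRR11070452.fastq.gz | gzip > SRR11070452.trimmed.fastq.gz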

Run unicycler assembly on cerberus

After downloading the data and performing read trimming, you should be able to run unicycler on cerberus. Again, make sure to submit the job using an sbatch script. An example script is shown below.

#!/bin/bash
#SBATCH -o assm.%j.out
#SBATCH -e assm.%j.err
#SBATCH -p defq
#SBATCH -N 1
#SBATCH -D /home/jsaw/exercises
#SBATCH -J ASSM
#SBATCH -t 4:00:00

unicycler \
    -1 ../data/hybrid/SRR11070453_1.fastq.gz \
    -2 ../data/hybrid/SRR11070453_2.fastq.gz \
    -l ../data/hybrid/SRR11070452.fastq.gz \
    -t 24 \
    -o unicycler_assembly

The -1 and -2 flags specify the Illumina read pairs and -l specifies the Nanopore reads. The -t 24 option tells the assembler to use 24 CPU cores, and -o unicycler_assembly writes the output to that folder. Submit an sbatch script similar to this one; it will run for a few hours. Unicycler uses SPAdes internally to assemble the Illumina reads and then uses the long Nanopore reads to bridge the contigs from that short-read assembly. Let it run and inspect the results after class is over.

Run Metabat2 on your own after class

Metabat2 is a high-throughput tool that performs metagenomic binning using a differential coverage approach. Ideally it requires multiple metagenomic samples to accurately bin (separate) the contigs produced by a metagenome assembler into bins that should, in theory, each belong to an individual organism. You can read the paper about the tool here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6662567/

And the code repository where they explain how to use it is located here: https://bitbucket.org/berkeleylab/metabat/src/master/

After the metaSPAdes assembly is finished, you will use Metabat2 to bin the contigs produced by the assembler. There are several steps involved in binning metagenomes, and I have written them down as a minimal example in the sbatch script below. Use this script as a template to run Metabat2 on your own (do this on cerberus). Again, please do not copy-paste these commands into your terminal window on Cerberus; create an sbatch script to submit the work as a proper computational job on the remote computer.

#!/bin/bash
#SBATCH -o binning.%j.out
#SBATCH -e binning.%j.err
#SBATCH -p defq
#SBATCH -N 1
#SBATCH -D /home/jsaw/exercises
#SBATCH -J binning
#SBATCH -t 4:00:00

## extract contigs
cd spades_meta
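## seqtk comp prints one line per sequence (name in column 1, length in
## column 2); the awk filter keeps the names of contigs >= 1 kb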
seqtk comp contigs.fasta | awk '{if($2 >= 1000) print $1}' > contigs_gt1kb.list
seqtk subseq contigs.fasta contigs_gt1kb.list > contigs_gt1kb.fasta

## run mapping and sorting bam files
mkdir binning
cd binning
bbwrap.sh \
    ref=../contigs_gt1kb.fasta \
    in1=../../../data/SRR7905025_1.trimmed.fastq.gz \
    in2=../../../data/SRR7905025_2.trimmed.fastq.gz \
    out=SRR7905025.bam \
    t=16 \
    nodisk

samtools sort -O bam -@ 16 SRR7905025.bam > SRR7905025.sorted.bam
rm SRR7905025.bam

## summarize bam files
jgi_summarize_bam_contig_depths \
    --outputDepth depth_min1500.txt \
    --pairedContigs paired_min1500.txt \
    --minContigLength 1500 \
    --minContigDepth 2 *.sorted.bam

## run metabat
metabat2 -i ../contigs_gt1kb.fasta -a depth_min1500.txt -o bins/bin -t 24

As you can see from the example, this script uses a number of tools: seqtk, bbwrap.sh, samtools, jgi_summarize_bam_contig_depths, and metabat2 are commands you may not be familiar with. Before they are available at your command-line prompt on Cerberus, make sure you have installed the following tools using the conda command (bbwrap.sh is part of the bbmap package, and jgi_summarize_bam_contig_depths is distributed with metabat2):

  • seqtk

  • bbmap

  • samtools

  • metabat2

More instructions will be posted later on interpreting the binning results.

Run CheckM on your own after class

CheckM is a commonly used tool to check genome completeness and contamination levels mostly in metagenome-assembled genomes (MAGs). The paper describing the tool and its usage can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484387/.

Briefly, the tool checks for conserved single-copy marker genes present in microbial (bacterial and archaeal) genomes to determine how complete a genome assembly is and whether there are any signs of contamination. Contamination is an especially important factor to consider for metagenome-assembled genomes because binning tools are not perfect: they can sometimes put contigs originating from two organisms into the same bin, creating an artificial genome construct that does not correspond to a real organism. This causes problems downstream, when contigs that don’t belong show up in your analyses and throw off your interpretation of the gene content or other features.

Therefore, it is imperative that you carefully check for signs of contamination and inspect MAGs that may have been contaminated during the binning process. CheckM is useful for quickly assessing both completeness and contamination of assembled genomes. If you have not already installed it, type conda install checkm-genome to install it through conda.

After this step, you need to download and configure the reference database CheckM consults when analyzing the MAGs you have obtained. The reference data can be downloaded here:

https://data.ace.uq.edu.au/public/CheckM_databases/

On cerberus, I downloaded the file into a folder named checkm_data by typing:

wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz

Then, to extract and uncompress the data from this tar archive, I typed:

tar -xzf checkm_data_2015_01_16.tar.gz

This command unpacks a set of files that checkm needs in order to run correctly. After the tar command, I typed the following to set this folder as the root folder containing the reference data:

checkm data setRoot .

That is all it takes to install and set up checkm. Before running it, make sure you have results from your metagenome binning work available to analyze; that is, first confirm that Metabat2 ran correctly, and then run checkm on its results.

Checking completeness and contamination of MAGs

To check completeness and contamination of MAGs obtained by Metabat2 run, go into the directory where the results are located. For example, I would go into the directory where I ran the metaSPAdes assembly, which is located here:

cd ~/exercises/spades_meta

There is a subfolder named binning that was created by the Metabat2 run I just did. Inside this binning folder there should be another subfolder named bins, containing the metagenome-assembled genome bins with the extension “.fa”. Create an sbatch script in the spades_meta folder, for example one that looks like this:

#!/bin/bash
#SBATCH -o checkm.%j.out
#SBATCH -e checkm.%j.err
#SBATCH -p defq
#SBATCH -N 1
#SBATCH -D /home/jsaw/exercises/spades_meta
#SBATCH -J CHECKM
#SBATCH -t 4:00:00

checkm lineage_wf -f CheckM.txt -t 24 -x fa binning/bins checkm_out

This sbatch script directs checkm to look into the binning/bins subfolder where the MAGs are located. The -x fa option means to look for all files with the extension “.fa”, and -t 24 tells it to run on 24 CPU cores. It takes a while to finish, but at the end of the process you will have a text file named CheckM.txt containing completeness and contamination estimates for the MAGs. You can print the file to your terminal by typing cat CheckM.txt.

Run RefineM on your own after class

The research group that developed CheckM also developed a tool known as RefineM to automatically detect and remove contigs in metagenome-assembled genomes (MAGs) that might otherwise require manual intervention. This is very helpful if you have hundreds or thousands of MAGs to analyze, because it automates the process. However, manual bin refinement is still important and should not be overlooked, especially for evolutionarily significant lineages that require careful analysis. Install the RefineM tool using conda if you have not already done so. An example sbatch script showing how to run refinem on cerberus is provided below.

#!/bin/bash
#SBATCH -o refinem.%j.out
#SBATCH -e refinem.%j.err
#SBATCH -p defq
#SBATCH -N 1
#SBATCH -D /home/jsaw/exercises/spades_meta
#SBATCH -J REFINEM
#SBATCH -t 4:00:00

cd binning/
samtools index -@ 4 SRR7905025.sorted.bam
cd ../

refinem scaffold_stats -c 4 contigs_gt1kb.fasta binning/bins refinem_out binning/*.sorted.bam --genome_ext fa
refinem outliers refinem_out/scaffold_stats.tsv refinem_out
refinem filter_bins binning/bins refinem_out/outliers.tsv filtered_bins --genome_ext fa

This script assumes that you have already run metagenomic binning using Metabat2; the files refinem expects were produced by that earlier run. At the end of this process, you will have a folder named filtered_bins containing refined bins/MAGs (the same number as produced by Metabat2) that should be cleaner, with the most likely contaminants removed. To test whether the refinement actually worked, you should ideally run checkm again on these filtered bins and see whether the metrics improved, as sketched below.
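
For example, a command like the following could go into another sbatch script submitted from the spades_meta folder (the output file and folder names here are assumptions; pick your own):

checkm lineage_wf -f CheckM_filtered.txt -t 24 -x fa filtered_bins checkm_filtered_out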

Going further

Now you know how to run metagenome assemblies, binning, refinement, and bin quality checks. Run metagenomic binning on the contigs produced by the megahit assembly, then run checkm to see how good those bins are compared to the ones assembled with metaSPAdes.

Things for you to find out after you complete the exercises

At the end of all these exercises, which may take several days to complete, you should find out the following.

  1. How many contigs did your metaSPAdes metagenome assembly produce?

  2. How many contigs did your megahit metagenome assembly produce?

  3. What is the size of the largest contig in each of these assemblies?

  4. What did the genome completeness and contamination results from CheckM tell you about these 2 assemblers?

  5. Which assembler is faster to run a metagenome assembly?