{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises for Week 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Recap job submission on Cerberus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before moving forward, I will be going over with you again on how to install tools and run jobs on Cerberus, a remote computer you have been connecting to run some of the more computationally intensive jobs.\n",
    "\n",
    "**Important things to remember**\n",
    "\n",
    "- When running computational tasks on `cerberus`, do not just copy-paste the command examples (you need to use an sbatch job submission script to submit jobs)\n",
    "- To create an sbatch script, you have a few options:\n",
    "  1. Connect to `cerberus` remotely and create the sbatch script using a command line text editor such as `vim`, `emacs`, `nano`, `pico`, etc. This is a preferred way of creating an sbatch script.\n",
    "  2. Create the sbatch script on your laptop computer using one of the text editors mentioned above and uploading the sbatch script to your exercises folder on `cerberus`. Beware, files created using Windows computer may not be compatible with Linux systems and they may cause problems when you try to submit the job on `cerberus`. Sometimes, if you create a file on a Windows computer, you may have to convert it to Unix-compatible format using the `dos2unix` command. This command may or may not be present on the remote computer. If it is not present, then you will have to install it using `conda`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install tools needed for today's exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You need to install a few tools that will be needed for today's exercises. They are not already installed on Cerberus and therefore you cannot use the `module load` command to load them in your sbatch script. So you need to install them through the `conda` command and then run them. Once they are installed using `conda`, you should be able to run them just like you would run tools on your laptop. You do not need to specify `module load` for tools installed using `conda`.\n",
    "\n",
    "The list of tools you should install on `cerberus` using the `conda` command are:\n",
    "\n",
    "    - megahit\n",
    "    - unicycler\n",
    "    - bbmap\n",
    "    - samtools\n",
    "    - sra-tools\n",
    "    - seqtk\n",
    "    - filtlong\n",
    "    - metabat2\n",
    "    - checkm-genome\n",
    " \n",
    "Search for them and install them using `conda`:\n",
    " \n",
    "```bash\n",
    "conda install megahit\n",
    "conda install unicycler\n",
    "conda install bbmap\n",
    "conda install samtools\n",
    "conda install sra-tools\n",
    "conda install seqtk\n",
    "conda install filtlong\n",
    "conda install metabat2\n",
    "conda install checkm-genome \n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run metagenome assemblies on `cerberus`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You have run SPAdes on your laptop and on a remote computer last week. This week, you will be using SPAdes and specifically metaSPAdes to run a metagenome assembly on `cerberus`. SPAdes is a genome assembler that was originally design to assemble single-cell genomic data but has since evolved to become a multipurpose assembler able to handle a wide variety of sequencing data. The authors who developed the tool have written a specific version of SPAdes known as \"**metaSPAdes**\" that you will be using today in class. See here for the paper describing metaSPAdes: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411777/.\n",
    "\n",
    "The metagenomic data you downloaded last week (and should have uploaded to `cerberus` last week) is here: https://www.ncbi.nlm.nih.gov/sra/SRX4741377. The SRA accession number to download this data is SRR7905025. Make sure that you are using this data to run the metagenomic assembly today. \n",
    "\n",
    "It should be stored in your `data` folder on `cerberus` ideally. If your assignment number 3, I have asked you to run `fastqc` to check the sequence quality. If you haven't already done so, please do that now. Next, you will be trimming your raw metagenomic reads using `bbduk.sh` tool you used a few weeks ago to trim away adapter regions and to improve sequence quality. You should be using the `sbatch` command to submit jobs on `cerberus` and not directly typing the commands in the terminal.\n",
    "\n",
    "#### Trimming raw metagenomic reads\n",
    "\n",
    "An example sbatch script to submit jobs to trim the metagenomic reads is provided below but you must edit them to make sure it runs on your own \"exercise\" directory. *Again*, **do not** copy-paste this copy into your terminal window after connecting to `cerberus`. You need to create an sbatch file using `vim` or something similar, then paste the command into the newly created text file. Make sure that all the files in this example are present in the folders where this sbatch script will be run. Otherwise, edit the file path accordingly.\n",
    "\n",
    "```bash\n",
    "#!/bin/bash\n",
    "#SBATCH -o trim.%j.out\n",
    "#SBATCH -e trim.%j.err\n",
    "#SBATCH -N 1\n",
    "#SBATCH -J TRIM\n",
    "#SBATCH -t 2:00:00\n",
    "#SBATCH -p defq\n",
    "\n",
    "bbduk.sh ktrim=r ordered minlen=50 mink=11 rcomp=f k=21 ow=t zl=4 \\\n",
    "    qtrim=rl trimq=20 \\\n",
    "    in1=SRR7905025_1.fastq.gz \\\n",
    "    in2=SRR7905025_2.fastq.gz \\\n",
    "    ref=~/adapters.fa \\\n",
    "    out1=SRR7905025_1.trimmed.fastq.gz \\\n",
    "    out2=SRR7905025_2.trimmed.fastq.gz\n",
    "```    \n",
    "\n",
    "Depending on where this sbatch script is residing, you will run this script directly in the folder where the raw metagenomic data are located. Let's say if it is named as `run_trim.sh`, you will then type `sbatch run_trim.sh` in the `data` folder if the files are there. After `bbduk.sh` tool is done, you can directly use the trimmed fastq files as input for metagenome assemblies.\n",
    "\n",
    "#### Running metagenomic assembly using Megahit\n",
    "We will be using a metagenome assembly tool known as `megahit` to perform a metagenome assembly today. It is a tool that runs a lot faster with less memory requirements than metaSPAdes but known to be less accurate. Because it takes shorter amount of time, you will be able to inspect the assembly results during class. metaSPAdes runs longer so you would submit the job after you have submitted Megahit runs so it will continue to run after class is over.\n",
    "\n",
    "The paper describing Megahit is available here (https://pubmed.ncbi.nlm.nih.gov/25609793/) and the github page can be found here (https://github.com/voutcn/megahit). You can also see how to run this assembler at this wiki page (https://sites.google.com/site/wiki4metagenomics/tools/assembly/megahit).\n",
    "\n",
    "An example command to run a metagenome assembly looks like this:\n",
    "\n",
    "```bash\n",
    "megahit -1 SAMPLE_1.fastq  -2 SAMPLE_2.fastq  -m 0.5  -t 12  -o megahit_result\n",
    "```\n",
    "\n",
    "Where `-m 0.5` indicates that the tool should use 50% of the available computer memory and `-t 12` is to use 12 CPUs cores to run this assembly. On `cerberus`, you should run `megahit` through the use of an sbatch script. An example is shown below.\n",
    "\n",
    "```bash\n",
    "#!/bin/bash\n",
    "#SBATCH -o mega.%j.out\n",
    "#SBATCH -e mega.%j.err\n",
    "#SBATCH -D /home/jsaw/exercises\n",
    "#SBATCH -J MEGA\n",
    "#SBATCH -N 1\n",
    "#SBATCH -t 4:00:00\n",
    "#SBATCH -p defq\n",
    "\n",
    "megahit \\\n",
    "    -1 ../data/SRR7905025_1.trimmed.fastq.gz \\\n",
    "    -2 ../data/SRR7905025_2.trimmed.fastq.gz \\\n",
    "    -t 16 \\\n",
    "    -o megahit_assembly\n",
    "```\n",
    "\n",
    "Here, I am running this assembly in my `exercises` folder where megahit assembly results will be stored in the subfolder `megahit_assembly`. You should make sure the sbatch script is customized for your folders and file names. It will take less than 30 minutes to run this assembly. While the job is running, continue with the metaSPAdes assembly run.\n",
    "\n",
    "#### Running metagenomic assembly using metaSPAdes\n",
    "\n",
    "You will use the same trimmed fastq files as input to run metaSPAdes. An example sbatch script is shown below but again, you should make sure to change a number of things to make sure it runs on `cerberus`:\n",
    "\n",
    "```bash\n",
    "#!/bin/bash\n",
    "#SBATCH -o asm_meta.%j.out\n",
    "#SBATCH -e asm_meta.%j.err\n",
    "#SBATCH -D /home/jsaw/exercises\n",
    "#SBATCH -J ASM_META\n",
    "#SBATCH -N 1\n",
    "#SBATCH -t 4:00:00\n",
    "#SBATCH -p defq\n",
    "\n",
    "module load spades/3.14.1\n",
    "\n",
    "spades.py \\\n",
    "    --meta \\\n",
    "    -m 300 \\\n",
    "    -t 16 \\\n",
    "    --pe1-1 ../data/SRR7905025_1.trimmed.fastq.gz \\\n",
    "    --pe1-2 ../data/SRR7905025_2.trimmed.fastq.gz \\\n",
    "    -k 21,33,55,77 \\\n",
    "    -o spades_meta\n",
    "```\n",
    "\n",
    "You will notice that the commands are actually very similar to what you used to assemble *Salmonella enteria* genome. The only major difference here is that you are specifying the `--meta` flag to indicate that you are running a metagenomic assembly. This flag is crucial and would mean a big difference between assembling a single genome versus multiple genomes from a community (i.e., metagenomes). I am also specifying the `-m 300` to indicate that there is 300 GB memory limit with the compute nodes on `cerberus`. \n",
    "\n",
    "**Warning**: It will take a while to run and complete metaSPAdes due to the large metagenomic data size. In my test run, it took about 3 hours to complete. So you will not be able to inspect the results until after the class is over but you should still submit the job to put it in the job queue. If the job did not finish within the 4 hours allowed, there is an option to continue the assembly. `SPAdes` can continue from the last check point based on the log and intermediate files it produced in a previous run.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Checking assembly metrics using Quast"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should be able to inspect the metagenome assembly produced by `megahit` during class but you would have to inspect the metaSPAdes assembly results most likely after the class. Remember the `quast.py` command you used last week to check assembly metrics? Use the same type of commands. You can either run Quast on `cerberus` or you can first download the assembly results to your laptop and then run Quast on your local computer. It is up to you. However, you will still need to download the results to your laptop to view some of the files (such as PDF files or html files showing interactive plots).\n",
    "\n",
    "I will not display the `rsync` command to download files now because some of you have not completed the assignment 3 and I do not want to give away the answers here. I want you to first figure out how to download files to your laptop. \n",
    "\n",
    "#### Checking megahit assembly results\n",
    "\n",
    "I ran a `megahit` assembly on `cerberus` and it produced a folder name `megahit_assembly`. If you look into the folder, you will see the following files:\n",
    "\n",
    "![Megahit result](images/megahit_res.png)\n",
    "\n",
    "You will need to run Quast on the files named `final.contigs.fa` file.\n",
    "\n",
    "#### Checking metaSPAdes assembly results\n",
    "\n",
    "As for metaSPAdes assembly, you should look into the output folder produced by metaSPAdes and run Quast on the `contigs.fasta` file. Make sure you run Quast on both Megahit and metaSPAdes assemblies. Then compare assembly metrics produced by the two different assemblers. What do you see?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download data for unicycler assembly"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Today, in addition to running metagenomic assemblies, you will learn how to run hybrid assemblies using both short-read Illumina and long-read Nanopore sequences. The tool we will be using for this hybrid assembly is known as \"**Unicycler**\" and the paper describing the tool is found here: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005595 and the Github wiki page to learn more about the tool is found here: https://github.com/rrwick/Unicycler. \n",
    "\n",
    "This tool was specifically written to be able to handle long-read Nanopore sequences that can reach well over several thousands or million of bases in some cases. This tool, in conjunction with the long-read Nanopore data, makes it possible to assemble complex genomes with highly repetitive sequence regions that would otherwise interfere with correct genome assembly. It also makes it possible to assemble near-complete genomes due to long reads that can help to span multiple gap regions that usualy exist in normal short-read assemblies.\n",
    "\n",
    "Today, we will use publicly available hybrid data used in a recent study to assemble the complete genome of *Janthinobacterium lividum* EIF2 published in this journal: https://academic.oup.com/gbe/advance-article/doi/10.1093/gbe/evaa148/5870831\n",
    "\n",
    "The raw data have been deposited to NCBI at the following links:\n",
    "\n",
    "https://www.ncbi.nlm.nih.gov/sra/SRR11070452\n",
    "\n",
    "https://www.ncbi.nlm.nih.gov/sra/SRR11070453\n",
    "\n",
    "Download those two files to your `data` folder on `cerberus`. Use the `fasterq-dump` command which will be available after installing the `sra-tools` through `conda`.\n",
    "\n",
    "Go to your `data` folder and create an sbatch script that contains these commands:\n",
    "\n",
    "```bash\n",
    "fasterq-dump --split-3 SRR11070452\n",
    "\n",
    "fasterq-dump --split-3 SRR11070453\n",
    "```\n",
    "\n",
    "Then submit the sbatch script to run the `fasterq-dump` tool. At the end of the download process, you should have 2 fastq files that are from Illumina sequencing and 1 fastq file from Nanopore sequencing. You would still need to perform read trimming on those reads. Use the `bbduk.sh` example I have shown for trimming metagenomic reads (scroll up) and write an sbatch script to run read trimming of Illumina read using `bbduk.sh`. \n",
    "\n",
    "To perform read trimming of Nanopore reads by quality, we use the tool `filtlong`. See here: https://github.com/rrwick/Filtlong\n",
    "\n",
    "An example command to trim Nanopore reads is shown below:\n",
    "\n",
    "```bash\n",
    "filtlong --min_length 1000 --keep_percent 90 reads.fastq.gz > reads.trimmed.fastq.gz\n",
    "```\n",
    "\n",
    "Submit an sbatch script containing a command similar to above to trim your Nanopore reads.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run unicycler assembly on cerberus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After downloading the data and performing read trimming, you should be able to run `unicycler` on `cerberus`. Again, make sure to submit the job using an sbatch script. An example script is shown below.\n",
    "\n",
    "```bash\n",
    "#!/bin/bash\n",
    "#SBATCH -o assm.%j.out\n",
    "#SBATCH -e assm.%j.err\n",
    "#SBATCH -p defq\n",
    "#SBATCH -N 1\n",
    "#SBATCH -D /home/jsaw/exercises\n",
    "#SBATCH -J ASSM\n",
    "#SBATCH -t 4:00:00\n",
    "\n",
    "unicycler \\\n",
    "    -1 ../data/hybrid/SRR11070453_1.fastq.gz \\\n",
    "    -2 ../data/hybrid/SRR11070453_2.fastq.gz \\\n",
    "    -l ../data/hybrid/SRR11070452.fastq.gz \\\n",
    "    -t 24 \\\n",
    "    -o unicycler_assembly\n",
    "```\n",
    "\n",
    "The `-1` and `-2` are to specify the Illumina read pairs and the `-l` is to specify the Nanopore read. The `-t 24` is telling the assembler to use 24 CPU cores and the `-o unicycler_assembly` flag is to redirect output to that folder. Submit an sbatch script similar to this one and it will run for a few hours. Unicycler also uses SPAdes during certain stages to correct errors usually present in Nanopore reads using the Illumina data. Let it run and you can inspect the results after the class is over.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run Metabat2 on your own after class"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Metabat2 is a high-throughput tool to perform metagenomic binning using differential coverage binning approach. It ideally requires multiple metagenomic samples to accurately bin (separate) contigs produced by metagenome assemblers into respective bins that should theoretically belong to individual organisms. You can read the paper about the tool here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6662567/\n",
    "\n",
    "And the code repository where they explain how to use it is located here: https://bitbucket.org/berkeleylab/metabat/src/master/\n",
    "\n",
    "After metaSPAdes assembly is finished, you will use Metabat2 to perform binning of contigs produced by the assembler. There are several steps involved in binning metagenomes and I have written down these steps in a minimal example as shown in the **sbatch script** below. You will use this sbatch script as an example to run Metabat2 run on your own (do this on `cerberus`). **Again**, please do not copy-paste this command on your terminal window on Cerberus. You will need to create an sbatch script to submit a job to run a proper computational job on the remote computer.\n",
    "\n",
    "```bash\n",
    "#!/bin/sh\n",
    "#SBATCH -o binning.%j.out\n",
    "#SBATCH -e binning.%j.err\n",
    "#SBATCH -p defq\n",
    "#SBATCH -N 1\n",
    "#SBATCH -D /home/jsaw/exercises\n",
    "#SBATCH -J binning\n",
    "#SBATCH -t 4:00:00\n",
    "\n",
    "## extract contigs\n",
    "cd spades_meta\n",
    "seqtk comp contigs.fasta | awk '{if($2 >= 1000) print $1}' > contigs_gt1kb.list\n",
    "seqtk subseq contigs.fasta contigs_gt1kb.list > contigs_gt1kb.fasta\n",
    "\n",
    "## run mapping and sorting bam files\n",
    "mkdir binning\n",
    "cd binning\n",
    "bbwrap.sh \\\n",
    "    ref=../contigs_gt1kb.fasta \\\n",
    "    in1=../../../data/SRR7905025_1.trimmed.fastq.gz \\\n",
    "    in2=../../../data/SRR7905025_2.trimmed.fastq.gz \\\n",
    "    out=SRR7905025.bam \\\n",
    "    t=16 \\\n",
    "    nodisk\n",
    "\n",
    "samtools sort -O bam -@ 16 SRR7905025.bam > SRR7905025.sorted.bam\n",
    "rm SRR7905025.bam\n",
    "\n",
    "## summarize bam files\n",
    "jgi_summarize_bam_contig_depths \\\n",
    "    --outputDepth depth_min1500.txt \\\n",
    "    --pairedContigs paired_min1500.txt \\\n",
    "    --minContigLength 1500 \\\n",
    "    --minContigDepth 2 *.sorted.bam\n",
    "\n",
    "## run metabat\n",
    "metabat2 -i ../contigs_gt1kb.fasta -a depth_min1500.txt -o bins/bin -t 24\n",
    "\n",
    "```\n",
    "\n",
    "As you can see in the example commands, you need to use a number of tools here. For example, `seqtk`, `bbwrap.sh`, `samtools`, `jgi_summarize_bam_contig_depths`, and `metabat2` are new commands you are not familiar with. Before they are avaiable in your commandline prompt on Cerberus, you will need to make sure you install the following tools using the `conda` command.\n",
    "\n",
    "  - `seqtk`\n",
    "  - `bbmap`\n",
    "  - `samtools`\n",
    "  - `metabat2`\n",
    "\n",
    "More instructions will be posted later on interpreting the binning results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run CheckM on your own after class"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "CheckM is a commonly used tool to check genome completeness and contamination levels mostly in metagenome-assembled genomes (MAGs). The paper describing the tool and its usage can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484387/.\n",
    "\n",
    "Briefly, the tool checks for conserved single-copy marker genes present in microbial (bacterial and archaeal) genomes to determine how complete a genome assembly is and if there are any signs of contamination. Contamination is especially important factor to consider in metagenome-assembled genomes because binning tools are not perfect and they can sometimes put contigs originating from two organisms into the same bin, thereby creating an artifical genome construct that is not real. This will cause a lot of problems downstream when you are analyzing these genomes and discover that some contigs that don't belong are showing up in your analysis and thus throwing of your interpretation of the gene contents or other things.\n",
    "\n",
    "Therefore, it is imperative that you carefully check for signs of contamination and to inspect MAGs that might have been contaminated through the binning process. CheckM is therefore useful for quickly assessing both completeness and contamination levels of assembled genomes. To install `checkm`, you can type `conda install checkm-genome` first to install it through `conda`. \n",
    "\n",
    "After this step, you need to download and configure the reference database it needs to look up when anaylzing the MAGs you have obtained. The reference data can be downloaded here:\n",
    "\n",
    "https://data.ace.uq.edu.au/public/CheckM_databases/\n",
    "\n",
    "On `cerberus`, I downloaded the file in a folder named as `checkm_data` by typing:\n",
    "\n",
    "```bash\n",
    "wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz\n",
    "```\n",
    "\n",
    "Then to extract and uncompress data from this zip archive file, I typed:\n",
    "\n",
    "```bash\n",
    "tar -xzf checkm_data_2015_01_16.tar.gz\n",
    "```\n",
    "\n",
    "This command will release a bunch of files that `checkm` needs to correctly run. After typing the `tar` command, I type the following to set this folder as the root folder to contain all the reference data.\n",
    "\n",
    "```bash\n",
    "checkm data setRoot .\n",
    "```\n",
    "\n",
    "This should be it to correctly install and set up `checkm` to run. You should first make sure you have results available to analyze from your metagenome binning work. This means you need to first make sure that Metabat2 run correctly and then you can run `checkm` on the results.\n",
    "\n",
    "#### Checking completeness and contamination of MAGs\n",
    "\n",
    "To check completeness and contamination of MAGs obtained by Metabat2 run, go into the directory where the results are located. For example, I would go into the directory where I ran the metaSPAdes assembly, which is located here:\n",
    "\n",
    "```bash\n",
    "cd ~/exercises/spades_meta\n",
    "```\n",
    "\n",
    "There is a subfolder named `binning` that was created as a result of the Metabat2 run I just did. In this `binning` folder, there should be another subfolder named `bins`. This folder should contain metagenome-assembled genome bins with extension \".fa\". You should create an sbatch script in the `spades_meta` folder, for example that looks like this:\n",
    "\n",
    "```bash\n",
    "#!/bin/bash\n",
    "#SBATCH -o checkm.%j.out\n",
    "#SBATCH -e checkm.%j.err\n",
    "#SBATCH -p defq\n",
    "#SBATCH -N 1\n",
    "#SBATCH -D /home/jsaw/exercises/spades_meta\n",
    "#SBATCH -J CHECKM\n",
    "#SBATCH -t 4:00:00\n",
    "\n",
    "checkm lineage_wf -f CheckM.txt -t 24 -x fa binning/bins checkm_out\n",
    "```\n",
    "\n",
    "So this sbatch script is directing `checkm` to look into the `binning/bins` subfolder where the MAGs are located. The `-x fa` just means to look for all files with extension \".fa\". It will run using 24 CPU cores as indicated by the `-t 24` option. It takes a while to get results in but at the end of the process, you wil have a text file name `CheckM.txt` which contains completeness and contamination estimates for the MAGs. You can see the contents of this text file by typing `cat CheckM.txt` to print file content to your terminal. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run RefineM on your own after class"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The research group developed CheckM also developed a tool known as RefineM to automatically detect and remove contigs in metagenome-assembled genomes (MAGs) that may otherwise require manual intervention. This is very helpful if you have hundreds or thousands of MAGs to analyze. This tool automates the process. However, manual bin refinement is still important and should not be overlooked, especially for evolutionarily significant lineages that require careful analysis. You should have already installed RefineM tool using `conda`. I have provided an example sbatch script below on how to run `refinem` on `cerberus`.\n",
    "\n",
    "```bash\n",
    "#!/bin/bash\n",
    "#SBATCH -o refinem.%j.out\n",
    "#SBATCH -e refinem.%j.err\n",
    "#SBATCH -p defq\n",
    "#SBATCH -N 1\n",
    "#SBATCH -D /home/jsaw/exercises/spades_meta\n",
    "#SBATCH -J ASSM\n",
    "#SBATCH -t 4:00:00\n",
    "\n",
    "cd binning/\n",
    "samtools index -@ 4 SRR7905025.sorted.bam\n",
    "cd ../\n",
    "\n",
    "refinem scaffold_stats -c 4 contigs_gt1kb.fasta binning/bins refinem_out binning/*.sorted.bam --genome_ext fa\n",
    "refinem outliers refinem_out/scaffold_stats.tsv refinem_out\n",
    "refinem filter_bins binning/bins refinem_out/outliers.tsv filtered_bins --genome_ext fa\n",
    "```\n",
    "\n",
    "This script assumes that you have already run metagenomic binning using Metabat2. The files that `refinem` are expecting have been produced as a result of `metabat2` run that was previously performed. At the end of this process, you will have a folder named `filtered_bins` which will contain refined bins/MAGs (same number as produced by Metabat2) that should have cleaner bins with most likely contaminants removed from them. To really test whether or not this bin refinement worked, you should ideally run `checkm` again on these filtered bins to see if the metrics improved or not."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Going further"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that you know how to run metagenome assemblies, binning, refinement and checking of bin qualities. Run metagenomic binning on contigs produced by the `megahit` assembly. Then perform `checkm` to see how good the bins are in comparison to those assembled with metaSPAdes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Things for you to find out after you complete the exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At the end of all these exercises, which may take over a few days to complete, you should find out the following.\n",
    "\n",
    "1. How many contigs did your metaSPAdes metagenome assembly produce?\n",
    "\n",
    "2. How many contigs did your megahit metagenome assembly produce?\n",
    "\n",
    "3. What is the size of the largest contig in each of these assemblies?\n",
    "\n",
    "4. What did the genome completeness and contamination results from CheckM tell you about these 2 assemblers?\n",
    "\n",
    "5. Which assembler is faster to run a metagenome assembly?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}