{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises for today" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will be doing 3 exercises to become familar with some of the bioinformatic tools that are used by researchers today. These tools are currently being used actively by researchers who are studying microbial genomes. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. sra-tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first is sra-tools, which are a suite of tools that you can use to download data from NCBI sequence read archives (SRA) (see here: https://www.ncbi.nlm.nih.gov/sra). This database is essential to a lot of researchers. This is where they will deposit raw sequences that came out of sequencing instruments. You can search for research project focusing on a certain topic and download the data they have deposited to SRA. It is publicly accessible. \n", "\n", "For example, if you type \"Yellowstone hot spring\" in the search box, you will see a list of accessions with links to the data. An example is here:\n", "\n", "https://www.ncbi.nlm.nih.gov/sra/SRX9071323\n", "\n", "If you click on that link, you will see that the study is titled \"Seasonal hydrologic and geologic forcing drive hot spring geochemistry and microbial biodiversity\". And there is a link below that says \"Run, # of spots, # of Bases\" etc. To download this sequence data, you will need to click on the link that starts with \"SRR\", click on the \"Data access\" tab, then click on the link provided next to the \"run\". \n", "\n", "So it becomes very cumbersome and difficult to navigate these pages to access and download the data. This is where the sra-tools becomes very useful. You can download the raw sequence reads deposited to SRA by just typing the following command with the accession number next to the `[accn]`.\n", "\n", "```\n", "fasterq-dump --split-3 SRR12584454\n", "```\n", "\n", "The \"SRR12584454\" is the actual run accession number for that particular study. This is actually the number you need to enter when trying to download these raw sequences from NCBI SRA.\n", "\n", "First, you might want to explore what `fasterq-dump` means by typing `fasterq-dump -h`. `-h` just means you are trying to bring up help menu to list the options available with this command." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Usage:\n", " fasterq-dump [options]\n", "\n", "Options:\n", " -o|--outfile output-file \n", " -O|--outdir output-dir \n", " -b|--bufsize size of file-buffer dflt=1MB \n", " -c|--curcache size of cursor-cache dflt=10MB \n", " -m|--mem memory limit for sorting dflt=100MB \n", " -t|--temp where to put temp. files dflt=curr dir \n", " -e|--threads how many thread dflt=6 \n", " -p|--progress show progress \n", " -x|--details print details \n", " -s|--split-spot split spots into reads \n", " -S|--split-files write reads into different files \n", " -3|--split-3 writes single reads in special file \n", " --concatenate-reads writes whole spots into one file \n", " -Z|--stdout print output to stdout \n", " -f|--force force to overwrite existing file(s) \n", " -N|--rowid-as-name use row-id as name \n", " --skip-technical skip technical reads \n", " --include-technical include technical reads \n", " -P|--print-read-nr print read-numbers \n", " -M|--min-read-len filter by sequence-len \n", " --table which seq-table to use in case of pacbio \n", " --strict terminate on invalid read \n", " -B|--bases filter by bases \n", " -A|--append append to output-file \n", " -h|--help Output brief explanation for the program. \n", " -V|--version Display the version of the program then \n", " quit. \n", " -L|--log-level Logging level as number or enum string. One \n", " of (fatal|sys|int|err|warn|info|debug) or \n", " (0-6) Current/default is warn \n", " -v|--verbose Increase the verbosity of the program \n", " status messages. Use multiple times for more \n", " verbosity. Negates quiet. \n", " -q|--quiet Turn off all status messages for the \n", " program. Negated by verbose. \n", " --option-file Read more options and parameters from the \n", " file. \n", "\n", "/Users/jsaw/miniconda3/bin/fasterq-dump : 2.10.1\n", "\n" ] } ], "source": [ "%%bash\n", "fasterq-dump -h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, there are multiple options available with this tool and you can even specify things like memory limitation for this command to run and the number of threads you want to use with this tool. Before actually typing the command, I want you to make sure that you change your directory to \"exercises\" directory that I made you create last week. I am giving you an example of where I created the folders and organize them. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MicrobialGenomicsLab\n", "├── data\n", "├── exercises\n", "├── repositories\n", "└── tools\n", "\n", "4 directories, 0 files\n" ] } ], "source": [ "%%bash\n", "cd ~/workspace/\n", "tree MicrobialGenomicsLab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, you will see that I created the 4 folders inside \"MicrobialGenomicsLab\" folder which resides under \"workspace\". Navigate into the exercises folder and run the `fasterq-dump` tool with the example command that I wrote earlier.\n", "\n", "```bash\n", "fasterq-dump --split-3 SRR12584454\n", "```\n", "\n", "The command will start downloading the raw sequence files from SRA to your \"exercises\" folder. It may take a few minutes and you will see the report printed to the screen once it's done. Usually it will not tell you what is going on but you can increase the verbosity of the screen output by typing like this:\n", "\n", "```bash\n", "fasterq-dump -v --split-3 SRR12584454\n", "```\n", "\n", "Once it's done, you can type `ls -la` to see what files the tool produced." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 165888\n", "drwxr-xr-x 4 jsaw 982768932 128 Sep 9 09:05 .\n", "drwxr-xr-x 6 jsaw 982768932 192 Sep 9 08:58 ..\n", "-rw-r--r-- 1 jsaw 982768932 70168140 Sep 9 09:05 SRR12584454_1.fastq\n", "-rw-r--r-- 1 jsaw 982768932 70168140 Sep 9 09:05 SRR12584454_2.fastq\n" ] } ], "source": [ "%%bash\n", "cd ~/workspace/MicrobialGenomicsLab/exercises/\n", "ls -la" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, you should see 2 files being produced by the `fasterq-dump` tool. Both files end with a \".fastq\" file extension. These are fastq-formatted files that can be observed/analyzed with tools like `fastqc` or `bbmap`. Now, try to see what the contents of these fastq files look like." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@SRR12584454.1 1 length=301\n", "GTGCCAGCCGCCGCGGTAATACCAGCCCCGCGAGTGGTCGGGACTCTTACTGGGCCTAAAGCGCCCGTAGCCGGCCCGACAAGTCACTCCTTAAAGACCCCGGCTCAACCGGGGGAATGGGGGTGATACTGTCGGGCTAGGGGGCGGAAGAGGCCAGCGGTACTCCCGGAGTAGGGGCGAAATCCTCAGATCCCGGGAGGACCACCAGTGGCGAAAGCGGCTGGCTAGAACGCGCCCGACGGTGGGGGGCGAAAGGCGGGGCAGAGAAAGGGATTAGAAAACCCTTGAGGTCAGATGGGAA\n", "+SRR12584454.1 1 length=301\n", "CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG@FCFGGGGGGGGGGGGGGGGGGGAFGGGGGGGEGGCCGGGGGCCGGGGGGGGGECGCFFGGGGGGGEFGGGGGGGGC6:CGGCGEEGGGGGGGGGG=:8EEC6C:C:?C*2<:++@F3DFGG,=3FG9,@FG:3CF>FGGGGG,FG,?FCGC5:D3C>1:/3>DDF@118?C*7;C?DFB?EAE5\n" ] } ], "source": [ "%%bash\n", "cd ~/workspace/MicrobialGenomicsLab/exercises/\n", "head -8 SRR12584454_1.fastq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I typed `head -8 SRR12584454_1.fastq` to print the first 8 lines of the fastq file to the terminal screen. As you can see, the file stores a string of letters representing the bases of DNA and other weird characters. Try to see if you can see a pattern here. Each record for a single fragment of DNA is represented by 4 lines. \n", "\n", "The first line starts with a '@' followed by a string of alphabets and numbers. This should be unique to each sequence record. The actual DNA sequence is on the 2nd line. The third line starts with '+' and same identifier as the first line. The 4th line contains characters that represent sequence quality for each of the DNA bases shown on the 2nd line. This is very important information for tools like `fastqc` or `bbmap` that relies on this information to assess sequence quality.\n", "\n", "To understand more about how fastq files are encoded, see here: https://en.wikipedia.org/wiki/FASTQ_format" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You see 2 fastq files that end with \"\\_1\" and \"\\_2\" in their file names. The reason for that is Illumina sequences that produced these files is usually run to sequence two ends of a single DNA fragment and for each DNA fragment that was sequenced, you have two sequences that you know are physically linked and therefore we call them **\"read pairs\"**. Pairing information is very useful for genome assembly and mapping. We will come to that in later labs.\n", "\n", "Now that you have 2 fastq files, what do we do with them? First, what would you do if you want to know how many sequences are in these files? You can type something like this:\n", "\n", "```bash\n", "grep -c \"@SRR12584454\" SRR12584454_1.fastq\n", "grep -c \"@SRR12584454\" SRR12584454_2.fastq\n", "```\n", "\n", "And both should return the same numbers. I will show you the example below. These are relatively small files by Illumina sequencing standards (only about 67 Mbp) so it is ok to run this `grep` command. I would not recommend this with files larger than several gigabytes large. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "103842\n", "103842\n" ] } ], "source": [ "%%bash\n", "cd ~/workspace/MicrobialGenomicsLab/exercises/\n", "grep -c \"@SRR12584454\" SRR12584454_1.fastq\n", "grep -c \"@SRR12584454\" SRR12584454_2.fastq\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, I am searching for the pattern \"@SRR12584454\" in each fastq file because I know that this identifier is present in each of the header line of the sequences. `grep -c` just counts the occurrences of these identifiers instead of printing the matches to screen. I see that both fastq files contain 103842 sequences. This is very small number. Usually you will have millions of sequences per fastq file. So there is no way for you to manually inspect each and every one of these sequences for their quality. This is where the FastQC tool comes in." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. FastQC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "FastQC tool can be found here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ but also through `conda` and you should have installed it through `conda` by now.\n", "\n", "It is a tool to analyze high-throughput sequencing data such as those produced by Illumina sequencing technology. It exists in both graphical and command line modes. After installation, if you just type `fastqc` without any parameters, it will bring up the graphical interface. For bioinformaticians, however, we like to work in command line mode as much as possible due to large number of files that we usually need to process in automated fashion. Today, we will use fastqc to inspect the sequence quality of these 2 files you just downloaded from SRA.\n", "\n", "In Unix environment, using this command on multiple sequences becomes easier because you can use wild card characters. For example, if you want to run fastqc and create reports that can be viewed, you can just type like this:\n", "\n", "```bash\n", "fastqc *.fastq\n", "```\n", "\n", "This means I am telling the `fastqc` tool to process any files in a given directory that ends with \".fastq\" extension. I will show this below." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Analysis complete for SRR12584454_1.fastq\n", "Analysis complete for SRR12584454_2.fastq\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Started analysis of SRR12584454_1.fastq\n", "Approx 5% complete for SRR12584454_1.fastq\n", "Approx 10% complete for SRR12584454_1.fastq\n", "Approx 15% complete for SRR12584454_1.fastq\n", "Approx 20% complete for SRR12584454_1.fastq\n", "Approx 25% complete for SRR12584454_1.fastq\n", "Approx 30% complete for SRR12584454_1.fastq\n", "Approx 35% complete for SRR12584454_1.fastq\n", "Approx 40% complete for SRR12584454_1.fastq\n", "Approx 45% complete for SRR12584454_1.fastq\n", "Approx 50% complete for SRR12584454_1.fastq\n", "Approx 55% complete for SRR12584454_1.fastq\n", "Approx 60% complete for SRR12584454_1.fastq\n", "Approx 65% complete for SRR12584454_1.fastq\n", "Approx 70% complete for SRR12584454_1.fastq\n", "Approx 75% complete for SRR12584454_1.fastq\n", "Approx 80% complete for SRR12584454_1.fastq\n", "Approx 85% complete for SRR12584454_1.fastq\n", "Approx 90% complete for SRR12584454_1.fastq\n", "Approx 95% complete for SRR12584454_1.fastq\n", "Started analysis of SRR12584454_2.fastq\n", "Approx 5% complete for SRR12584454_2.fastq\n", "Approx 10% complete for SRR12584454_2.fastq\n", "Approx 15% complete for SRR12584454_2.fastq\n", "Approx 20% complete for SRR12584454_2.fastq\n", "Approx 25% complete for SRR12584454_2.fastq\n", "Approx 30% complete for SRR12584454_2.fastq\n", "Approx 35% complete for SRR12584454_2.fastq\n", "Approx 40% complete for SRR12584454_2.fastq\n", "Approx 45% complete for SRR12584454_2.fastq\n", "Approx 50% complete for SRR12584454_2.fastq\n", "Approx 55% complete for SRR12584454_2.fastq\n", "Approx 60% complete for SRR12584454_2.fastq\n", "Approx 65% complete for SRR12584454_2.fastq\n", "Approx 70% complete for SRR12584454_2.fastq\n", "Approx 75% complete for SRR12584454_2.fastq\n", "Approx 80% complete for SRR12584454_2.fastq\n", "Approx 85% complete for SRR12584454_2.fastq\n", "Approx 90% complete for SRR12584454_2.fastq\n", "Approx 95% complete for SRR12584454_2.fastq\n" ] } ], "source": [ "%%bash\n", "cd ~/workspace/MicrobialGenomicsLab/exercises/\n", "fastqc *.fastq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is what you would see if you type the commands in your terminal. You can now inspect what files are being produced after the `fastqc` command was run." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 169600\n", "drwxr-xr-x 8 jsaw 982768932 256 Sep 9 09:41 .\n", "drwxr-xr-x 6 jsaw 982768932 192 Sep 9 08:58 ..\n", "-rw-r--r-- 1 jsaw 982768932 70168140 Sep 9 09:05 SRR12584454_1.fastq\n", "-rw-r--r-- 1 jsaw 982768932 974550 Sep 9 09:41 SRR12584454_1_fastqc.html\n", "-rw-r--r-- 1 jsaw 982768932 881128 Sep 9 09:41 SRR12584454_1_fastqc.zip\n", "-rw-r--r-- 1 jsaw 982768932 70168140 Sep 9 09:05 SRR12584454_2.fastq\n", "-rw-r--r-- 1 jsaw 982768932 973629 Sep 9 09:41 SRR12584454_2_fastqc.html\n", "-rw-r--r-- 1 jsaw 982768932 867986 Sep 9 09:41 SRR12584454_2_fastqc.zip\n" ] } ], "source": [ "%%bash\n", "cd ~/workspace/MicrobialGenomicsLab/exercises/\n", "ls -la" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the command produced an \"html\" and a \"zip\" file for each fastq. Now, open the html files using your web browser. You should see something similar to this:\n", "\n", "![fastqc](images/fqc.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is just a screenshot of the very beginning of the report. As you scroll down, you will see more information. Basically, the check marks under \"Summary\" will tell you whether the sequences are good or bad. If you see multiple red crosses, that means the sequences are of poor or bad quality and shouldn't be used right away until further processing is done. An example of a good and a bad sequencing run are shown here:\n", "\n", "https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html\n", "\n", "https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html\n", "\n", "A few things to note about the histograms in the first plot. You can see that each sequence is 301 basepair long and the histogram is basically trying to depict average sequence quality within a given window of sequence. You will notice that average sequence quality starts to drop as you go towards the end of the sequence. This is due to the nature of Illumina sequencing technology. It uses fluorescent molecules to record unique bases (A, C, G, T), and the fluorescence signal fades towards the later cycles of sequencing. This makes it difficult to confidently assign correct letters of DNA towards the end and the instrument records lower read qualities near the end.\n", "\n", "Another thing you want to watch out for is the presence of \"adapter\" sequences (near the bottom of the report). See an example of bad sequence report. If the adapter sequences (which are artificial DNA constructs to facilitate sequencing) are left in the sequences for whatever reason, `fastqc` will detect it. In this bad sequence example, you will notice that the adapter content histogram starts to increase near the end of the sequences. If you see something like this, you will need to remove the adapter sequences before the sequences can be used in genome assemblies or other analyses.\n", "\n", "This is where the `bbmap` tool comes in.\n", "\n", "A very useful website on how to read and interpret sequencing quality assessments: https://sequencing.qcfail.com/articles/?report=reader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. BBTools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "BBTools is a collection of tools written by scientists at the Joint Genome Institute (JGI) in Berkeley, CA. See here: https://jgi.doe.gov/data-and-tools/bbtools/\n", "\n", "It contains a number of tools that can perform tasks such as interconversion of file formats, processing of raw sequencing files into usable ones, mapping of sequence reads to reference genomes, etc. The tool we will be using today is `bbduk`, which is meant for filtering and trimming of reads for adapter, contaminants, and quality using k-mers. Before you can actually use this tool, you need a reference file containing all known adapter sequences. `bbduk` can then look up for these sequences to know if it can find these contaminants in your sequences.\n", "\n", "First, download this adapter file here:\n", "\n", "https://www.dropbox.com/s/f5mydteoupt8ugb/adapters.fa?dl=0\n", "\n", "I suggest you put this \"adapter.fa\" file somewhere safe where there is no likelihood of it being deleted by accident.\n", "\n", "Now, you can use `bbduk` to perform quality trimming and contaminant removal. To see what options are available with `bbduk`, type:\n", "\n", "`bbduk.sh -h`" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Written by Brian Bushnell\n", "Last modified March 24, 2020\n", "\n", "Description: Compares reads to the kmers in a reference dataset, optionally \n", "allowing an edit distance. Splits the reads into two outputs - those that \n", "match the reference, and those that don't. Can also trim (remove) the matching \n", "parts of the reads rather than binning the reads.\n", "Please read bbmap/docs/guides/BBDukGuide.txt for more information.\n", "\n", "Usage: bbduk.sh in= out= ref=\n", "\n", "Input may be stdin or a fasta or fastq file, compressed or uncompressed.\n", "If you pipe via stdin/stdout, please include the file type; e.g. for gzipped \n", "fasta input, set in=stdin.fa.gz\n", "\n", "Input parameters:\n", "in= Main input. in=stdin.fq will pipe from stdin.\n", "in2= Input for 2nd read of pairs in a different file.\n", "ref= Comma-delimited list of reference files.\n", " In addition to filenames, you may also use the keywords:\n", " adapters, artifacts, phix, lambda, pjet, mtst, kapa\n", "literal= Comma-delimited list of literal reference sequences.\n", "touppercase=f (tuc) Change all bases upper-case.\n", "interleaved=auto (int) t/f overrides interleaved autodetection.\n", "qin=auto Input quality offset: 33 (Sanger), 64, or auto.\n", "reads=-1 If positive, quit after processing X reads or pairs.\n", "copyundefined=f (cu) Process non-AGCT IUPAC reference bases by making all\n", " possible unambiguous copies. Intended for short motifs\n", " or adapter barcodes, as time/memory use is exponential.\n", "samplerate=1 Set lower to only process a fraction of input reads.\n", "samref= Optional reference fasta for processing sam files.\n", "\n", "Output parameters:\n", "out= (outnonmatch) Write reads here that do not contain \n", " kmers matching the database. 'out=stdout.fq' will pipe \n", " to standard out.\n", "out2= (outnonmatch2) Use this to write 2nd read of pairs to a \n", " different file.\n", "outm= (outmatch) Write reads here that fail filters. In default\n", " kfilter mode, this means any read with a matching kmer.\n", " In any mode, it also includes reads that fail filters such\n", " as minlength, mingc, maxgc, entropy, etc. In other words,\n", " it includes all reads that do not go to 'out'.\n", "outm2= (outmatch2) Use this to write 2nd read of pairs to a \n", " different file.\n", "outs= (outsingle) Use this to write singleton reads whose mate \n", " was trimmed shorter than minlen.\n", "stats= Write statistics about which contamininants were detected.\n", "refstats= Write statistics on a per-reference-file basis.\n", "rpkm= Write RPKM for each reference sequence (for RNA-seq).\n", "dump= Dump kmer tables to a file, in fasta format.\n", "duk= Write statistics in duk's format. *DEPRECATED*\n", "nzo=t Only write statistics about ref sequences with nonzero hits.\n", "overwrite=t (ow) Grant permission to overwrite files.\n", "showspeed=t (ss) 'f' suppresses display of processing speed.\n", "ziplevel=2 (zl) Compression level; 1 (min) through 9 (max).\n", "fastawrap=70 Length of lines in fasta output.\n", "qout=auto Output quality offset: 33 (Sanger), 64, or auto.\n", "statscolumns=3 (cols) Number of columns for stats output, 3 or 5.\n", " 5 includes base counts.\n", "rename=f Rename reads to indicate which sequences they matched.\n", "refnames=f Use names of reference files rather than scaffold IDs.\n", "trd=f Truncate read and ref names at the first whitespace.\n", "ordered=f Set to true to output reads in same order as input.\n", "maxbasesout=-1 If positive, quit after writing approximately this many\n", " bases to out (outu/outnonmatch).\n", "maxbasesoutm=-1 If positive, quit after writing approximately this many\n", " bases to outm (outmatch).\n", "json=f Print to screen in json format.\n", "\n", "Histogram output parameters:\n", "bhist= Base composition histogram by position.\n", "qhist= Quality histogram by position.\n", "qchist= Count of bases with each quality value.\n", "aqhist= Histogram of average read quality.\n", "bqhist= Quality histogram designed for box plots.\n", "lhist= Read length histogram.\n", "phist= Polymer length histogram.\n", "gchist= Read GC content histogram.\n", "enthist= Read entropy histogram.\n", "ihist= Insert size histogram, for paired reads in mapped sam.\n", "gcbins=100 Number gchist bins. Set to 'auto' to use read length.\n", "maxhistlen=6000 Set an upper bound for histogram lengths; higher uses \n", " more memory. The default is 6000 for some histograms\n", " and 80000 for others.\n", "\n", "Histograms for mapped sam/bam files only:\n", "histbefore=t Calculate histograms from reads before processing.\n", "ehist= Errors-per-read histogram.\n", "qahist= Quality accuracy histogram of error rates versus quality \n", " score.\n", "indelhist= Indel length histogram.\n", "mhist= Histogram of match, sub, del, and ins rates by position.\n", "idhist= Histogram of read count versus percent identity.\n", "idbins=100 Number idhist bins. Set to 'auto' to use read length.\n", "varfile= Ignore substitution errors listed in this file when \n", " calculating error rates. Can be generated with\n", " CallVariants.\n", "vcf= Ignore substitution errors listed in this VCF file \n", " when calculating error rates.\n", "ignorevcfindels=t Also ignore indels listed in the VCF.\n", "\n", "Processing parameters:\n", "k=27 Kmer length used for finding contaminants. Contaminants \n", " shorter than k will not be found. k must be at least 1.\n", "rcomp=t Look for reverse-complements of kmers in addition to \n", " forward kmers.\n", "maskmiddle=t (mm) Treat the middle base of a kmer as a wildcard, to \n", " increase sensitivity in the presence of errors.\n", "minkmerhits=1 (mkh) Reads need at least this many matching kmers \n", " to be considered as matching the reference.\n", "minkmerfraction=0.0 (mkf) A reads needs at least this fraction of its total\n", " kmers to hit a ref, in order to be considered a match.\n", " If this and minkmerhits are set, the greater is used.\n", "mincovfraction=0.0 (mcf) A reads needs at least this fraction of its total\n", " bases to be covered by ref kmers to be considered a match.\n", " If specified, mcf overrides mkh and mkf.\n", "hammingdistance=0 (hdist) Maximum Hamming distance for ref kmers (subs only).\n", " Memory use is proportional to (3*K)^hdist.\n", "qhdist=0 Hamming distance for query kmers; impacts speed, not memory.\n", "editdistance=0 (edist) Maximum edit distance from ref kmers (subs \n", " and indels). Memory use is proportional to (8*K)^edist.\n", "hammingdistance2=0 (hdist2) Sets hdist for short kmers, when using mink.\n", "qhdist2=0 Sets qhdist for short kmers, when using mink.\n", "editdistance2=0 (edist2) Sets edist for short kmers, when using mink.\n", "forbidn=f (fn) Forbids matching of read kmers containing N.\n", " By default, these will match a reference 'A' if \n", " hdist>0 or edist>0, to increase sensitivity.\n", "removeifeitherbad=t (rieb) Paired reads get sent to 'outmatch' if either is \n", " match (or either is trimmed shorter than minlen). \n", " Set to false to require both.\n", "trimfailures=f Instead of discarding failed reads, trim them to 1bp.\n", " This makes the statistics a bit odd.\n", "findbestmatch=f (fbm) If multiple matches, associate read with sequence \n", " sharing most kmers. Reduces speed.\n", "skipr1=f Don't do kmer-based operations on read 1.\n", "skipr2=f Don't do kmer-based operations on read 2.\n", "ecco=f For overlapping paired reads only. Performs error-\n", " correction with BBMerge prior to kmer operations.\n", "recalibrate=f (recal) Recalibrate quality scores. Requires calibration\n", " matrices generated by CalcTrueQuality.\n", "sam= If recalibration is desired, and matrices have not already\n", " been generated, BBDuk will create them from the sam file.\n", "amino=f Run in amino acid mode. Some features have not been\n", " tested, but kmer-matching works fine. Maximum k is 12.\n", "\n", "Speed and Memory parameters:\n", "threads=auto (t) Set number of threads to use; default is number of \n", " logical processors.\n", "prealloc=f Preallocate memory in table. Allows faster table loading \n", " and more efficient memory usage, for a large reference.\n", "monitor=f Kill this process if it crashes. monitor=600,0.01 would \n", " kill after 600 seconds under 1% usage.\n", "minrskip=1 (mns) Force minimal skip interval when indexing reference \n", " kmers. 1 means use all, 2 means use every other kmer, etc.\n", "maxrskip=1 (mxs) Restrict maximal skip interval when indexing \n", " reference kmers. Normally all are used for scaffolds<100kb, \n", " but with longer scaffolds, up to maxrskip-1 are skipped.\n", "rskip= Set both minrskip and maxrskip to the same value.\n", " If not set, rskip will vary based on sequence length.\n", "qskip=1 Skip query kmers to increase speed. 1 means use all.\n", "speed=0 Ignore this fraction of kmer space (0-15 out of 16) in both\n", " reads and reference. Increases speed and reduces memory.\n", "Note: Do not use more than one of 'speed', 'qskip', and 'rskip'.\n", "\n", "Trimming/Filtering/Masking parameters:\n", "Note - if ktrim, kmask, and ksplit are unset, the default behavior is kfilter.\n", "All kmer processing modes are mutually exclusive.\n", "Reads only get sent to 'outm' purely based on kmer matches in kfilter mode.\n", "\n", "ktrim=f Trim reads to remove bases matching reference kmers.\n", " Values: \n", " f (don't trim), \n", " r (trim to the right), \n", " l (trim to the left)\n", "kmask= Replace bases matching ref kmers with another symbol.\n", " Allows any non-whitespace character, and processes short\n", " kmers on both ends if mink is set. 'kmask=lc' will\n", " convert masked bases to lowercase.\n", "maskfullycovered=f (mfc) Only mask bases that are fully covered by kmers.\n", "ksplit=f For single-ended reads only. Reads will be split into\n", " pairs around the kmer. If the kmer is at the end of the\n", " read, it will be trimmed instead. Singletons will go to\n", " out, and pairs will go to outm. Do not use ksplit with\n", " other operations such as quality-trimming or filtering.\n", "mink=0 Look for shorter kmers at read tips down to this length, \n", " when k-trimming or masking. 0 means disabled. Enabling\n", " this will disable maskmiddle.\n", "qtrim=f Trim read ends to remove bases with quality below trimq.\n", " Performed AFTER looking for kmers. Values: \n", " rl (trim both ends), \n", " f (neither end), \n", " r (right end only), \n", " l (left end only),\n", " w (sliding window).\n", "trimq=6 Regions with average quality BELOW this will be trimmed,\n", " if qtrim is set to something other than f. Can be a \n", " floating-point number like 7.3.\n", "trimclip=f Trim soft-clipped bases from sam files.\n", "minlength=10 (ml) Reads shorter than this after trimming will be \n", " discarded. Pairs will be discarded if both are shorter.\n", "mlf=0 (minlengthfraction) Reads shorter than this fraction of \n", " original length after trimming will be discarded.\n", "maxlength= Reads longer than this after trimming will be discarded.\n", "minavgquality=0 (maq) Reads with average quality (after trimming) below \n", " this will be discarded.\n", "maqb=0 If positive, calculate maq from this many initial bases.\n", "minbasequality=0 (mbq) Reads with any base below this quality (after \n", " trimming) will be discarded.\n", "maxns=-1 If non-negative, reads with more Ns than this \n", " (after trimming) will be discarded.\n", "mcb=0 (minconsecutivebases) Discard reads without at least \n", " this many consecutive called bases.\n", "ottm=f (outputtrimmedtomatch) Output reads trimmed to shorter \n", " than minlength to outm rather than discarding.\n", "tp=0 (trimpad) Trim this much extra around matching kmers.\n", "tbo=f (trimbyoverlap) Trim adapters based on where paired \n", " reads overlap.\n", "strictoverlap=t Adjust sensitivity for trimbyoverlap mode.\n", "minoverlap=14 Require this many bases of overlap for detection.\n", "mininsert=40 Require insert size of at least this for overlap.\n", " Should be reduced to 16 for small RNA sequencing.\n", "tpe=f (trimpairsevenly) When kmer right-trimming, trim both \n", " reads to the minimum length of either.\n", "forcetrimleft=0 (ftl) If positive, trim bases to the left of this position\n", " (exclusive, 0-based).\n", "forcetrimright=0 (ftr) If positive, trim bases to the right of this position\n", " (exclusive, 0-based).\n", "forcetrimright2=0 (ftr2) If positive, trim this many bases on the right end.\n", "forcetrimmod=0 (ftm) If positive, right-trim length to be equal to zero,\n", " modulo this number.\n", "restrictleft=0 If positive, only look for kmer matches in the \n", " leftmost X bases.\n", "restrictright=0 If positive, only look for kmer matches in the \n", " rightmost X bases.\n", "mingc=0 Discard reads with GC content below this.\n", "maxgc=1 Discard reads with GC content above this.\n", "gcpairs=t Use average GC of paired reads.\n", " Also affects gchist.\n", "tossjunk=f Discard reads with invalid characters as bases.\n", "swift=f Trim Swift sequences: Trailing C/T/N R1, leading G/A/N R2.\n", "\n", "Header-parsing parameters - these require Illumina headers:\n", "chastityfilter=f (cf) Discard reads with id containing ' 1:Y:' or ' 2:Y:'.\n", "barcodefilter=f Remove reads with unexpected barcodes if barcodes is set,\n", " or barcodes containing 'N' otherwise. A barcode must be\n", " the last part of the read header. Values:\n", " t: Remove reads with bad barcodes.\n", " f: Ignore barcodes.\n", " crash: Crash upon encountering bad barcodes.\n", "barcodes= Comma-delimited list of barcodes or files of barcodes.\n", "xmin=-1 If positive, discard reads with a lesser X coordinate.\n", "ymin=-1 If positive, discard reads with a lesser Y coordinate.\n", "xmax=-1 If positive, discard reads with a greater X coordinate.\n", "ymax=-1 If positive, discard reads with a greater Y coordinate.\n", "\n", "Polymer trimming:\n", "trimpolya=0 If greater than 0, trim poly-A or poly-T tails of\n", " at least this length on either end of reads.\n", "trimpolygleft=0 If greater than 0, trim poly-G prefixes of at least this\n", " length on the left end of reads. Does not trim poly-C.\n", "trimpolygright=0 If greater than 0, trim poly-G tails of at least this \n", " length on the right end of reads. Does not trim poly-C.\n", "trimpolyg=0 This sets both left and right at once.\n", "filterpolyg=0 If greater than 0, remove reads with a poly-G prefix of\n", " at least this length (on the left).\n", "Note: there are also equivalent poly-C flags.\n", "\n", "Polymer tracking:\n", "pratio=base,base 'pratio=G,C' will print the ratio of G to C polymers.\n", "plen=20 Length of homopolymers to count.\n", "\n", "Entropy/Complexity parameters:\n", "entropy=-1 Set between 0 and 1 to filter reads with entropy below\n", " that value. Higher is more stringent.\n", "entropywindow=50 Calculate entropy using a sliding window of this length.\n", "entropyk=5 Calculate entropy using kmers of this length.\n", "minbasefrequency=0 Discard reads with a minimum base frequency below this.\n", "entropytrim=f Values:\n", " f: (false) Do not entropy-trim.\n", " r: (right) Trim low entropy on the right end only.\n", " l: (left) Trim low entropy on the left end only.\n", " rl: (both) Trim low entropy on both ends.\n", "entropymask=f Values:\n", " f: (filter) Discard low-entropy sequences.\n", " t: (true) Mask low-entropy parts of sequences with N.\n", " lc: Change low-entropy parts of sequences to lowercase.\n", "entropymark=f Mark each base with its entropy value. This is on a scale\n", " of 0-41 and is reported as quality scores, so the output\n", " should be fastq or fasta+qual.\n", "NOTE: If set, entropytrim overrides entropymask.\n", "\n", "Cardinality estimation:\n", "cardinality=f (loglog) Count unique kmers using the LogLog algorithm.\n", "cardinalityout=f (loglogout) Count unique kmers in output reads.\n", "loglogk=31 Use this kmer length for counting.\n", "loglogbuckets=2048 Use this many buckets for counting.\n", "khist= Kmer frequency histogram; plots number of kmers versus\n", " kmer depth. This is approximate.\n", "khistout= Kmer frequency histogram for output reads.\n", "\n", "Java Parameters:\n", "\n", "-Xmx This will set Java's memory usage, overriding autodetection.\n", " -Xmx20g will \n", " specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. \n", " The max is typically 85% of physical memory.\n", "-eoom This flag will cause the process to exit if an \n", " out-of-memory exception occurs. Requires Java 8u92+.\n", "-da Disable assertions.\n", "\n", "Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.\n", "\n" ] } ], "source": [ "%%bash\n", "bbduk.sh -h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the tool was written by Brian Bushnell and there are a lot of options you can specify with this tool. It can be pretty overwhelming to read and understand each of the parameter available for you to use. So I am giving you an example command with parameters I use for routing processing of raw sequence files.\n", "\n", "```bash\n", "bbduk.sh ktrim=r ordered minlen=50 mink=11 tbo rcomp=f k=21 ow=t ftm=5 zl=4 \\\n", " qtrim=rl trimq=20 \\\n", " in1=SRR12584454_1.fastq \\\n", " in2=SRR12584454_2.fastq \\\n", " ref=adapters.fa \\\n", " out1=SRR12584454_1.trimmed.fastq \\\n", " out2=SRR12584454_2.trimmed.fastq\n", "```\n", "\n", "Here you will notice a few things that are important in this example. First, I am telling `bbduk` to only keep sequences longer than 50 bases and quality higher than 20. And I am specifying the adapter file in `ref=adapters.fa` flag. `in1` and `in2` are where you specify input fastq files and `out1` and `out2` are where you specify output files, which should be named differently so that the program doesn't overwrite the original files. I will show what it looks like when you type this command." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/jsaw/miniconda3/opt/bbmap-38.86-0//calcmem.sh: line 75: [: -v: unary operator expected\n", "Max memory cannot be determined. Attempting to use 1400 MB.\n", "If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command, \n", "or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.\n", "java -ea -Xmx1400m -Xms1400m -cp /Users/jsaw/miniconda3/opt/bbmap-38.86-0/current/ jgi.BBDuk ktrim=r ordered minlen=50 mink=11 tbo rcomp=f k=21 ow=t ftm=5 zl=4 qtrim=rl trimq=20 in1=SRR12584454_1.fastq in2=SRR12584454_2.fastq ref=adapters.fa out1=SRR12584454_1.trimmed.fastq out2=SRR12584454_2.trimmed.fastq\n", "Executing jgi.BBDuk [ktrim=r, ordered, minlen=50, mink=11, tbo, rcomp=f, k=21, ow=t, ftm=5, zl=4, qtrim=rl, trimq=20, in1=SRR12584454_1.fastq, in2=SRR12584454_2.fastq, ref=adapters.fa, out1=SRR12584454_1.trimmed.fastq, out2=SRR12584454_2.trimmed.fastq]\n", "Version 38.86\n", "\n", "Set ORDERED to true\n", "maskMiddle was disabled because useShortKmers=true\n", "0.038 seconds.\n", "Initial:\n", "Memory: max=1468m, total=1468m, free=1439m, used=29m\n", "\n", "Added 3225 kmers; time: \t0.038 seconds.\n", "Memory: max=1468m, total=1468m, free=1433m, used=35m\n", "\n", "Input is being processed as paired\n", "Started output streams:\t0.034 seconds.\n", "Processing time: \t\t6.863 seconds.\n", "\n", "Input: \t207684 reads \t\t62512884 bases.\n", "QTrimmed: \t155680 reads (74.96%) \t14257391 bases (22.81%)\n", "FTrimmed: \t207684 reads (100.00%) \t207684 bases (0.33%)\n", "KTrimmed: \t128 reads (0.06%) \t25083 bases (0.04%)\n", "Trimmed by overlap: \t167802 reads (80.80%) \t1345557 bases (2.15%)\n", "Total Removed: \t12800 reads (6.16%) \t15835715 bases (25.33%)\n", "Result: \t194884 reads (93.84%) \t46677169 bases (74.67%)\n", "\n", "Time: \t6.937 seconds.\n", "Reads Processed: 207k \t29.94k reads/sec\n", "Bases Processed: 62512k \t9.01m bases/sec\n" ] } ], "source": [ "%%bash\n", "cd ~/workspace/MicrobialGenomicsLab/exercises/\n", "bbduk.sh ktrim=r ordered minlen=50 mink=11 tbo rcomp=f k=21 ow=t ftm=5 zl=4 \\\n", " qtrim=rl trimq=20 \\\n", " in1=SRR12584454_1.fastq \\\n", " in2=SRR12584454_2.fastq \\\n", " ref=adapters.fa \\\n", " out1=SRR12584454_1.trimmed.fastq \\\n", " out2=SRR12584454_2.trimmed.fastq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the tool finishes this task (which is pretty fast), you will see that it found 207,684 reads (if you combine the 2 fastq files) and removed 12,800 reads that do not meet the criteria I have set. And it retained 93.84% of the reads. The whole process only took less than 7 seconds. Now, if you look into the folder, you will notice 2 new files being produced that end with \"`.trimmed.fastq`\"." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 238M\n", "drwxr-xr-x 11 jsaw 982768932 352 Sep 9 10:29 .\n", "drwxr-xr-x 6 jsaw 982768932 192 Sep 9 08:58 ..\n", "-rw-r--r-- 1 jsaw 982768932 67M Sep 9 09:05 SRR12584454_1.fastq\n", "-rw-r--r-- 1 jsaw 982768932 53M Sep 9 10:29 SRR12584454_1.trimmed.fastq\n", "-rw-r--r-- 1 jsaw 982768932 952K Sep 9 09:41 SRR12584454_1_fastqc.html\n", "-rw-r--r-- 1 jsaw 982768932 861K Sep 9 09:41 SRR12584454_1_fastqc.zip\n", "-rw-r--r-- 1 jsaw 982768932 67M Sep 9 09:05 SRR12584454_2.fastq\n", "-rw-r--r-- 1 jsaw 982768932 44M Sep 9 10:29 SRR12584454_2.trimmed.fastq\n", "-rw-r--r-- 1 jsaw 982768932 951K Sep 9 09:41 SRR12584454_2_fastqc.html\n", "-rw-r--r-- 1 jsaw 982768932 848K Sep 9 09:41 SRR12584454_2_fastqc.zip\n", "-rwxr-x--- 1 jsaw 982768932 14K Sep 9 10:28 adapters.fa\n" ] } ], "source": [ "%%bash\n", "cd ~/workspace/MicrobialGenomicsLab/exercises/\n", "ls -lah" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step after this is to check if sequence qualities have improved after you have run `bbduk` tool. To do that, you run `fastqc` tool again but on the trimmed fastq files. Type:\n", "\n", "```bash\n", "fastqc *.trimmed.fastq\n", "```\n", "\n", "And inspect the sequence qualities by opening the html files produced by `fastqc`. Noticed any differences? You can open the `fastqc` report of the original file in another tab to see how different things are. You should see something like this:\n", "\n", "![trimmed](images/fqct.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you might notice, the histogram denoting sequence quality have drastically improved (no more columns dipping into the red zone). Now, the sequences are good to be used in downstream processes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Preparations for next week's lab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be doing genome and metagenome assemblies next week. For that, you will download a few publicly available datasets and also familiarize yourself with using `ssh` and `rsync` tools to remotely connect to and transfer files to other computers. You can look at this page for some examples of how to use `ssh`.\n", "\n", "https://phoenixnap.com/kb/linux-ssh-commands\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another tool that will be crucial for you to use next week will be this tool known as `rsync`. It is a command line tool that can be used to transfer files between two remote computers or even for syncing directories inside the same computer. You can see some examples of how to use `rsync` here:\n", "\n", "https://www.tecmint.com/rsync-local-remote-file-synchronization-commands/\n", "\n", "and here:\n", "\n", "https://phoenixnap.com/kb/rsync-command-linux-examples\n", "\n", "Try to play around with a few examples shown in these pages to become familar with how `rsync` works." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Download data for next week" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be using publicly available sequence data from NCBI SRA and see if you can replicate their results. The first one to download is here:\n", "\n", "https://www.ncbi.nlm.nih.gov/sra/SRX9094324\n", "\n", "These are Illumina MiSeq sequences they have used to reconstruct the genome of a *Salmonella enteria* strain. This is a cultured isolate and represents a single organism. Use `fasterq-dump` tool to download this data. I recommend you put it in `data` folder on your computer. These are fairly large files (each about 450 MB). In order to save space, you can compress them after they are downloaded. You can type like this:\n", "\n", "```bash\n", "gzip *.fastq\n", "```\n", "\n", "And this will compress your text files into much smaller zipped files with extension \"*.gz\". The file sizes are much smaller after being compressed (about 110-130 MB).\n", "\n", "For metagenome assemblies, we will be downloading this file:\n", "\n", "https://www.ncbi.nlm.nih.gov/sra/SRX4741377\n", "\n", "Make sure you use `fasterq-dump` to download it. Again, I recommend you put it in `data` folder on your computer and also use `gzip` to compress the fastq files as these are very large files (each about 4.5 GB, a total of 9 GB). These compressed files can be used directly by both `fastqc` and `bbduk`.\n", "\n", "Now you are ready for next week's lab." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }