Computational lab exercises

Install R and RStudio

You should install both R and RStudio on your laptop.

Go to R website here: https://www.r-project.org/ Download and install the latest version of R, which is 4.1.3 (2022-03-10). They have versions for Mac, Linux, and Windows operating systems. Those of you using Windows OS but has Ubuntu should install the Windows version.

Download and install RStudio here after R was installed first: https://rstudio.com/

Choose the “RStudio Desktop” Free version.

Start RStudio and install required packages

After starting RStudio, you will see three panels:

the left panel is the R console

the top right panel is the environment, where you will be able to see the data that you import and all the R objects that will be generated while you run your script

the bottom right panel where you have the help and where you can see plots that are generated when you run your script

Start by clicking

File –> New File –> R script (instead, in Windows, you can use the shortcut Ctrl+Shift+N)

Now you will see a new panel coming up on the top left.

This is very handy because here you can type your commands, you can correct them and change them as many times as you want before you actually run them.

Type the following commands in your new R script, select all of them and click the Run button.

Alternatively, you can place your cursor at the beginning or at the end of each line and click run. This way, you will see the command running in the R console, the cursor will move to the next line on the R script, you will click again run etc etc. Running commands line by line makes it easier to identify and correct errors as you go.

#install the dada2 package
install.packages("devtools")
library("devtools")
devtools::install_github("benjjneb/dada2")
#the latest version is 1.22.0
#you can run
#devtools::install_github("benjjneb/dada2", ref="v1.22")
#if you want to specify this (or another) particular version

#install the phyloseq package
source("http://bioconductor.org/biocLite.R")
biocLite("phyloseq")

Now, DADA2 and phyloseq should be installed on your R/RStudio environment.

If you encounter errors, you can read into more detail the installation instructions

for DADA2: https://benjjneb.github.io/dada2/dada-installation.html

for phyloseq: https://joey711.github.io/phyloseq/install.html

You can save your R scipt, so that you can use it again in the future, or go back to it when you want. You can also check the section Documenting your workflow in R below.

Some resources for learning R

Now, the DADA2 is an R package meant for people who have some basic understanding of the R programming language. It has pretty steep learning curve and I won’t be teaching you how to use R in this course in detail. I will leave with you some resources here that you might want to look up outside of classroom hours to learn more about R.

The first set of resources that you might find useful are so-called “cheat sheets”. People have designed and created these “cheat sheets” that are easy to look up some commonly used R commands and packages to analyze data and to plot graphs. Here are a few:

https://rstudio.com/resources/cheatsheets/

As you can see, the list of cheat sheets can be quite overwhelming and they can be for beginners or to advanced users who are developing packages. Some of the important ones you should at least take a look at are listed below:

Install reference data files needed to run DADA2

You will need to download some reference data files before you can start running DADA2 on your laptop. These are the files you need to download and save somewhere on your laptop computer. These are files you will be using when you will need to assign taxonomy to your sequences. More information can be found here: https://benjjneb.github.io/dada2/training.html

We will be trying to analyze our data with both SILVA (version 138.1) and GTBD (Version 86), so the files that you need to download are:

- silva_nr99_v138.1_train_set.fa.gz

- silva_species_assignment_v138.1.fa.gz

- GTDB_bac-arc_ssu_r86.fa.gz

- GTDB_dada2_assignment_species.fa.gz

I suggest you download them and move them in a ~/data/taxonomy folder.

Run an example dataset using DADA2 tutorial

DADA2 tutorial is found here: https://benjjneb.github.io/dada2/tutorial.html

You can go through the tutorial as described on the web page except the part on using “DECIPHER”. Skip that part. You should be able to perform the tutorial from start to finish without any problems if you follow the instructions carefully. You should download this example dataset first, as instructed in the tutorial: https://mothur.s3.us-east-2.amazonaws.com/wiki/miseqsopdata.zip

And unzip the file using either unzip miseqsopdata.zip command or by any unzip utilities that might already be installed on your computer. For my case, I downloaded it to data folder and after unzipping, there will be a new folder named ~/data/MiSeq_SOP so I would set my path in the tutorial as path <- "~/data/MiSeq_SOP" when I follow the DADA2 tutorials.

Follow the instructions on the DADA2 tutorial page from start to finish but please make sure you substitute a few file names in some parts. The first one is under the “Assign Taxonomy” part. First thing to do when you use R is to make sure that you are at the right working directory. To do that, you type like this in R.

#First check in which working directory you are now (this command its like pwd in UNIX)
getwd()

##Change the working directory (like cd in UNIX)
setwd("~/data/MiSeq_SOP/")

#Check again if you have changed the working directory successfully
getwd()

Pay attention in the assign taxonomy step, you will need to specify the correct path and the correct database, i.e. the one you have downloaded. It should be, e.g.,

taxa <- assignTaxonomy(seqtab.nochim, "~/data/taxonomy/silva_nr99_v138.1_train_set.fa.gz", multithread=FALSE)

In addition, you should be careful at the next step, at the add species function. Again, it should be, e.g.

taxa <- addSpecies(taxa, "~/data/taxonomy/silva_species_assignment_v138.1.fa.gz")

Make sure to read Considerations for your own data parts carefully. This is where you have to pay attention when you are trying to adapt the DADA2 pipeline to your own data, which will have different file names or sequences to begin with. Read each part carefully and try to make sense of what they are trying to do. It might be a bit difficult to make sense of in the beginning but we can go over the parts together carefully to see what’s going on.

For your assistance,

you can download my version of the DADA2 tutorial from here:

https://github.com/cpavloud/16S_analysis/blob/main/DADA2_16S_forSpring2022.R

and of the phyloseq tutorial from here:

https://github.com/cpavloud/16S_analysis/blob/main/phyloseq_16S_forSpring2022.R

You can copy the scripts and paste them in a new R script in Rstudio, run them or modify them as you want, provided that your modifications makes sense.

Run DADA2 on 16S amplicon data from publicly available datasets

Now you will run DADA2 on publicly available data. Use fasterq-dump tool to download these from NCBI SRA database. I suggest you download them to a subfolder under your data folder. I would suggest naming it sra so it would be something like ~/data/sra/.

- SRR11477574
- SRR11477573
- SRR11477577
- SRR11477575

These are relatively small files (compared to metagenomes) and should not take too long to download them all.

You will notice that the files end with .fastq format (there should be 8 files).

After all the files have been downloaded, you should first zip the files by typing: gzip *.fastq

Then try to run DADA2 pipeline using these files but make sure to adapt the original instructions in the page to fit the file names you have from these downloads.

You need to make a few modifications to make sure that the names are matching. First, set your path to correct folder. The few important adaptations we will have to do are making sure the name patterns in your input files are matching. Here are the fist few things I have to make changes, for example:

path <- "~/workspace/data/sra"
fnFs <- sort(list.files(path, pattern="_1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_2.fastq.gz", full.names = TRUE))

After these are set, then you should be able to proceed with the same set of commands mostly.

Try to run from start to finish but skip these two parts: - on using DECIPHER - on “Evaluate accuracy”

When you reach the part on phyloseq, then you will need to adapt a few things. The example shown were for a study on human microbiomes in different people and how the microbial communities shift or change under certain conditions.

Here is their original study: https://aem.asm.org/content/79/17/5112

You can see more detail about how this study is being used in benchmarking tools here: https://mothur.org/wiki/miseq_sop/

This study has been used as an example dataset for people to test the performance of their tools or to teach students how to use 16S amplicon data to survey and compare microbial communities. However, the data you just downloaded and tried to run DADA2 on is from hot springs. They came from two different hot springs (2 samples each). So here, you will try to calculate Alpha Diversity and microbial community composition in these springs. Their original sample names (along with their SRA accession numbers are:

  • SRR11477573 (RC3)

  • SRR11477574 (RC1)

  • SRR11477575 (BP3)

  • SRR11477577 (BP1)

So for the part on phyloseq, you will need to make a few changes before proceeding to adapt your workflow to match the study names and conditions. Because we do not have similar conditions as the example study used in the tutorial, we need to change a few things here. I have modified the workflow and shown you below what you need to do for your downloaded samples. Go through them and see if they work for you.

samples.out <- rownames(seqtab.nochim)

subject <- sapply(strsplit(samples.out, "D"), `[`, 1)

pool <- c("RC3", "RC1", "BP3", "BP1")

samdf <- data.frame(Subject=subject, Pool=pool)

rownames(samdf) <- samples.out

ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE),
               sample_data(samdf),
               tax_table(taxa))

dna <- Biostrings::DNAStringSet(taxa_names(ps))

names(dna) <- taxa_names(ps)

ps <- merge_phyloseq(ps, dna)

taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))

ps

plot_richness(ps, x="Pool", measures=c("Shannon", "Simpson"))

ps.prop <- transform_sample_counts(ps, function(otu) otu/sum(otu))

ord.nmds.bray <- ordinate(ps.prop, method="NMDS", distance="bray")

plot_ordination(ps.prop, ord.nmds.bray, color="Pool", title="Bray NMDS")

top20 <- names(sort(taxa_sums(ps), decreasing=TRUE))[1:20]

ps.top20 <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU))

ps.top20 <- prune_taxa(top20, ps.top20)

plot_bar(ps.top20, x="Pool", fill="Family")

## Note that this only plots the top 20 most abundant taxa at the Family level.
## If you want to plot all the taxa identified at the phylum level, try this below:

plot_bar(ps, x="Pool", fill="Phylum")

## What happens now?

We will go over the results in a little bit after everyone is done with the workflow. At the end of the plots generated at the end, what can you tell from these 4 samples? Which sample appears to be the most diverse?

One thing you will note is that after the workflow you went through, no files (except the filtered fastq files) were being produced when you inspect the work folders using your terminal. Everything is contained within the R environment. If you want to save the plots being produced, then you need to click on “Export” option on the plotting panel and save it as PDF or image file.

Documenting your workflow in R

One thing that is common for people to do in R is to save the commands you typed from start to finish in a script file that ends with an extension .R. This way, if you ever needed to go back and check what you did for a study or to produce a figure, you can do that. And provided you kept all the files in the right place, you can re-run your analysis from start to finish (even if you have to modify things slightly).

In RStudio, you can do that by clicking on the “+” sign near top left of the menu bar, click on “R Script”, then copy and paste all the commands you have just typed (starting with library()).

Create a new R script and paste all your commands typed in there (for both the example dataset and the public data you downloaded from SRA). Save the script as DADA2_analysis.R in your exercises folder.

Some important terminology for you to learn and remember

Microbial community analysis draws some concepts, practices, and terminology from the general field of Ecology. A lot of the terms used in microbial community analyses overlaps with those used in ecology of macro-organisms. Some of them are listed and explained below:

  • OTU (Operational Taxonomic Unit, group of related individuals); see here:

https://en.wikipedia.org/wiki/Operational_taxonomic_unit

  • Alpha Diversity (A measure of species diversity in a given habitat); see here:

https://en.wikipedia.org/wiki/Alpha_diversity

  • Shannon Index (A Diversity index); see here:

https://en.wikipedia.org/wiki/Diversity_index

  • ASV (Amplicon Sequence Variant); see here:

https://en.wikipedia.org/wiki/Amplicon_sequence_variant

  • NMDS (Non-metric Multi-Dimensional Scaling); see here:

https://en.wikipedia.org/wiki/Multidimensional_scaling

  • Bray-Curtis dissimilarity; see here:

https://en.wikipedia.org/wiki/Bray%E2%80%93Curtis_dissimilarity