Week 6: Classifying taxonomy of short reads with mothur


Today we’re going to learn how to use sequence data to assess diversity across samples by focusing on just the 16S rRNA gene. You’ll recall that 16S rRNA is like a “barcode” gene that we can use to compare and classify who is there. It’s often used in microbiome studies. Today we’ll learn how to:

  • classify the distribution of different types (taxa) of microbes in different datasets (i.e. what % are Procholorococcus, what % are SAR11, what % are Thaumarchaeota, and so on)
  • calculate diversity metrics to determine which sample sites had higher richness or evenness.

Before you start, think about which datasets you’d like to compare to your own. Perhaps you want to compare a bunch of surface samples from different regions, or perhaps you want to compare the three depths from your region. This could also potentially feed right into your final project. Choose 3-5 samples to compare (including yours).

Using mothur to profile the taxonomy of your dataset

1. Log in

Once you have decided on what samples to compare, log on to baross using ssh on your Terminal. Then make a new directory and change directory into it.

mkdir ~/project_directory/taxonomy
cd project_directory/taxonomy

2. Copy over data

The Tara Ocean people have already identified all of the reads that matched 16S ribosomal RNA from their metagenomes, and they made separate FASTA files with just those reads. (Thanks, French scientists!) I copied those FASTA files into the directories listed in /usr/local/data/Tara_datasets/. Copy over all of the relevant 16S rRNA files that you will need into your new taxonomy folder.

cp /usr/local/data/Tara_datasets/[project sample site of interest]/[16S rRNA data file of interest] .

For example, you might do something like this:

cp /usr/local/data/Tara_datasets/Arabian_Sea/ERR598966_MERGED_FASTQ_16SrRNA_10000.fasta .
cp /usr/local/data/Tara_datasets/coastal_South_Africa/ERR598972_MERGED_FASTQ_16SrRNA.fasta .
cp /usr/local/data/Tara_datasets/North_Pacific/ERR598995_MERGED_FASTQ_16SrRNA_10000.fasta .

Hot tip! If you want to copy over all of the 16S rRNA datasets from one folder, you could use the asterisk (wildcard). For example: cp /usr/local/data/Tara_datasets/North_Pacific/*16S* . That would copy over all of the files containing 16S in their title in the North Pacific folder.

3. Open mothur

Now, we’re going to use a program called mothur to analyze these sequences. mothur is a bit different from other programs that we’ve used in that we can enter a mothur interface and type commands that are specific to mothur. To enter the mothur program, simply type this:


4. Create groups file

Right now, the 16S rRNA-matching reads from each sample site are in separate files. Pretty soon we are going to merge all of your FASTA files together in order to compare them. Before we do that, we need to tell mothur how to tell them apart once they are merged. We do that with the make.group command. Replace the file names below with your 16S rRNA fasta file names. Substitute the group names with a suitable name for your sample so that you can recognize it, like NPacific_DCM or Arabian_surface.

Note: any time you see something in [brackets] in a command, it means you have to substitute that with your own data. Don’t include the brackets in your command.

make.group(fasta=[file1.fasta]-[file2.fasta]-[file3.fasta], groups=[group1]-[group2]-[group3])

For example:

make.group(fasta=ERR598944_MERGED_FASTQ_16SrRNA_10000.fasta-ERR599001_MERGED_FASTQ_16SrRNA_10000.fasta-ERR599078_MERGED_FASTQ_16SrRNA_10000.fasta, groups=meso-transect-surface)

5. Look at the groups file

This command should generate a file that is called either groups or merge.groups. Take a look at it with less.

Hot tip #2! You can use the command system() if you want to use Unix commands while you are using mothur.

You will see that each sequence name is linked up with the group name that you provided. That way mothur can combine all of the sequences together into one file, but you can still keep track of which one belongs to which sample. This file will be essential for allowing mothur to compare your samples later on.

system(less groups)

6. Merge FASTA files together

Now you can merge all of your FASTA files together, and the .groups file will record which sequences came from which file. The output here will be a file called merged.fa. Again, substitute “file-1.fa” and so on with the names of your 16S rRNA fasta files.

Hot tip #3! use the up arrow on your keyboard to call up the last command you typed, and then edit that command instead of retyping all of the filenames again.

merge.files(input=[file1.fa]-[file2.fa]-[file3.fa], output=merged.fa)

For example:

merge.files(input=ERR598944_MERGED_FASTQ_16SrRNA_10000.fasta-ERR599001_MERGED_FASTQ_16SrRNA_10000.fasta-ERR599078_MERGED_FASTQ_16SrRNA_10000.fasta, output=merged.fa)

7. Classify your sequences

Everything we’ve done so far is basically record-keeping. Now it’s time to do SCIENCE!

We will classify our sequences by comparing them to a reference database. We will the SILVA database to compare these sequences. (It’s a very good, well-curated 16S rRNA database.)

Note! You will see lots of warnings along the lines of: “[WARNING]: xxx could not be classified.” We are going to have to leave these sequences out of the analysis! This means all of the unknowns will be grouped together even though they most likely represent many different species, so they will be missing from our diversity analyses later on. (In class, we’ll talk about why we would ideally use something like operational taxonomic units, or OTUs, to do this analysis. Unfortunately, our metagenomic data is too messy to be able to make nice OTUs.)

classify.seqs(fasta=merged.fa, group=groups, reference=/usr/local/data/silva_databases/silva.seed_v119.align, taxonomy=/usr/local/data/silva_databases/silva.seed_v119.tax)

8. Open classified sequences

Use scp to transfer your files over to your local desktop. Open the file that is called merged.seed_v119.wang.tax.summary in Excel. (You may have to change the name so the file ends in ‘.txt’ or Excel won’t recognize it as a valid file to open.) Here is the definition of the columns, from left to right:

  • Taxonomic level is in the farthest left-hand column. The lower the number, the larger the phylogenetic classification, starting with domain, then phylum, class, order, family, genus, species. For example, Archaea, Bacteria, Eukarya, and ‘unknown’ are taxonomic level 1. Taxonomic level 2 classifies different phyla of Archaea, Bacteria, and Eukaryotes. Taxonomic level 3 classifies different classes of those phyla, and so on.
  • The rankID provides a means of keeping track of where that particular organism falls. For example, the SAR_11 clade is rankID, which means it is a clade within the Alphaproteobacteria (, which are a clade within the Proteobacteria (rankID 0.2.17), which is a clade within the Bacteria (rankID 0.2).
  • The taxon column tells you the name of the taxon.
  • The daughter level tells you how may levels down you are in the phylogeny.
  • The ‘total’ tells you how many total sequences are within that taxonomic category.
  • Each of the following columns gives you the taxonomic breakdown for that sample.

I recommend sorting your Excel spreadsheet by taxlevel. Make either a pie chart or stacked bar chart for each sample depth based on the taxonomic distribution at taxonomic level 2 to compare the taxonomic breakdown between each of your samples. Save these charts as Figure 1 and provide a figure caption for this figure.

9. Share your data

Your classmates may wish to use your taxonomy data for their project datasets. Please rename your taxonomy files and share them on the class_shared directory.

mv merged.seed_v119.wang.tax.summary [newname]
cp [newname] /Accounts/Genomics_Bioinformatics_shared/taxonomy

For example:

mv merged.seed_v119.wang.tax.summary rikas_taxonomy.summary
cp rikas_taxonomy.summary /Accounts/Genomics_Bioinformatics_shared/taxonomy

10. Calculate the diversity of your sample

Now we’re going to calculate the diversity at each of your sample sites. These will be pure calculations in Excel rather than on the command-line. mothur can do this using OTUs, but our data is too messy to create nice OTUs!

We’ll calculate diversity using two measures: species richness (r) and the Shannon-Weiner Index (H’). The Shannon-Weiner Index (H’) is meant to take into account both the taxon richness (i.e. how many different taxa there are) and evenness (i.e. does one taxon dominate, or are they evenly distributed?)

We are going to calculate the diversity of each of your sample sites at taxonomic level 2. This means that each entry at level 2 (i.e. ‘Euryarchaeota,’ ‘Thaumarchaeota,’ ‘Proteobacteria’) will count as one taxon. Calculate the richness and the Shannon-Weiner index for each of your samples.

Richness = r = number of taxa in your sample

The equation for the Shannon-Weiner index is:

H’ = -Σ (Pi ln(Pi))

H’ = index of taxonomic diversity, the Shannon-Weiner Index

Pi = proportion (percent) of total sample belonging to the ith taxon

ln = natural log (log base e = not the same as log!)

This index takes into account not just the number of taxa (richness) in a sample, but also how evenly distributed the taxa are (evenness) within the sample.  The index increases by having more richness and/or by having greater evenness.

Hint: -Σ (Pi ln(Pi)) = -((Ptaxon1*ln(Ptaxon1)) + (Ptaxon2*ln(Ptaxon2)) + (Ptaxon3*ln(Ptaxon3)) + …)

I suggest that you start by calculating the total number of sequences for each sample site for all of taxonomic level 2. Then, for each taxon within taxlevel2, calculate the total proportion of sequences belonging to that taxon (Pi). Then calculate H’.

Excel command for natural log: LN() Note that in Excel, LN(0) = error, so skip the cells with a value of 0. Pay attention to your use of parentheses!

Please do these calculations in a way that is clear so that I can track your calculations. You will be submitting these Excel spreadsheets as part of your lab assignment for this week. Please compile the total number of sequences, the species richness, and the Shannon-Weiner index for each of your sample depths in a table. Save this as Table 1 and provide a caption.

11. Compare diversity with metadata

In your Excel spreadsheet, make 5 columns containing metadata for each of your samples (1 column each for temperature, chlorophyll, nitrate, oxygen, and salinity). Make a scatterplot for each of the metadata versus your Shannon-Weiner index (H’) for each of the samples. For example, one plot will have temperature on the x axis, and H’ on the y axis, one will have chlorophyll on the x axis, and H’ on the y axis, and so on. For each plot, plot a trendline through the data and include the equation and the R-squared value. (Note that since in most cases you won’t have more than, say, 3 samples, we’re not going to do a formal statistical test on these datasets. Therefore, based on what you’ve done here, we can’t yet conclude whether the relationships are significant.) Save these 5 plots as Figure 2(abcde).

If you’re working synchronously and in gather.town, please discuss the Google Doc questions with other folks at your table to make sure you’re understanding the basic concepts.

Postlab assignment question Look at your results. What kinds of trends do you see? What were you expecting to see, and do your results support your initial expectations? Write 1-2 paragraphs explaining your results in light of what we have discussed in class (and perhaps based on what you have seen in your previous postlabs) about diversity in various ocean basins and at different depths.

Summary of what to turn in next week: For this week’s post-lab assignment, please submit the following (should all be in the same document):

  • Figure 1
  • Figure 2(abcde)
  • Table 1
  • The response to the Postlab question
  • Your Excel spreadsheet so that I can check your calculations (submitted separately)