Week 5: Binning genomes with anvi’o¶
This week in lab we’ll learn how disentangle individual microbial genomes from your mess of metagenomic contigs. We aren’t going to use a toy dataset this week—-we’re going straight into analysis with your metagenome datasets for your projects.
Genomes are disentangled from metagenomes by clustering contigs according to two properties: coverage and tetranucleotide frequency. Basically, if contigs have similar coverage patterns between datasets, they are clustered together; and if contigs have similar kmers, they will cluster together. When we cluster contigs together like this, we get a collection of contigs that are thought to represent a reconstruction of a genome from your metagenomic sample. We call these ‘genome bins,’ or ‘metagenome-asembled genomes’ (MAGs). We will talk more about this in class.
There is a lot of discussion in the field about which software packages are the best for making these genome bins. And of course, the one you choose will depend a lot on your dataset, what you’re trying to accomplish, and personal preference. I chose anvi’o because it is a nice visualization tool that builds in many handy features.
I am drawing a lot of information for this tutorial from the anvi’o website. If you’d like to learn more, see this link.
Preparing your contigs database for anvi’o¶
1. ssh tunnel¶
As always, please log into the Carleton VPN, log into gather.town (if working synchronously), and open up your Terminal application. But don’t ssh the normal way! Anvio requires visualization through the server, so this week we have to create what is called an “ssh tunnel” to log into the server in a specific way. Substitute “username” below with your own Carleton login name.
NOTE: Each of you will be assigned a different port number (i.e. 8080, 8081, 8082, etc.). I’ll put that in the Slack channel for this week. Substitute your assigned port number for 8080 shown below.
ssh -L 8080:localhost:8080 email@example.com
2. Make new folder¶
Make a new directory called “anvio” inside your project folder, then change into that directory.
cd project_directory mkdir anvio cd anvio
3. Copying data¶
Last week, we learned how to make BAM files. This week, we’ll need them to make our bins, because they give us coverage information. You need BAM files that show the coverage of your own reads mapped against your own contigs, as well as other BAM files showing other people’s reads mapped against your contigs.
Fortunately for you, I made all of these BAM files for you ahead of time. They are in this folder:
Choose three mapping files from that folder. The ones you choose should all be from your own dataset. The names are like this, as an example:
Where ERR599142 is the reference, and reads from ERR598972 (which are from coastal South Africa) were mapped to it. You want only the bam files with your own sample number as the reference.
You should also include the bam file of your own reads mapped against your own contigs.
cp [path to sorted .bam files] . cp [path to sorted .bai files] .
cp /workspace/data/Genomics_Bioinformatics_shared/Tara_mappings/ERR599142_assembly_vs_North_Pacific_ERR599142_sorted.bam . cp /workspace/data/Genomics_Bioinformatics_shared/Tara_mappings/ERR599142_assembly_vs_North_Pacific_ERR599142_sorted.bam.bai . cp /workspace/data/Genomics_Bioinformatics_shared/Tara_mappings/ERR599142_assembly_vs_Southern_Ocean_ERR599008_sorted.bam . cp /workspace/data/Genomics_Bioinformatics_shared/Tara_mappings/ERR599142_assembly_vs_Southern_Ocean_ERR599008_sorted.bam.bai . cp /workspace/data/Genomics_Bioinformatics_shared/Tara_mappings/ERR599142_assembly_vs_Southern_Ocean_ERR599090_sorted.bam . cp /workspace/data/Genomics_Bioinformatics_shared/Tara_mappings/ERR599142_assembly_vs_Southern_Ocean_ERR599090_sorted.bam.bai .
Finally, copy your contigs over as well. They are probably in your assembly directory. For example:
cp ../ERR598983_assembly/ERR598983_assembly_reformatted.fa .
4. Get gene calls and annotations from Prokka¶
The first thing you have to do is make contigs database, which contains the sequences of your contigs, plus lots of information about those contigs. This includes information from Prokka– so we have to take the information from Prokka and put it in our contigs database.
First, you have to run a script on your Prokka files to convert them into a text file that we can import into anvi’o. Navigate to wherever your Prokka results are for your project assembly, and run a script to extract information from that file. Then copy it to your anvi’o folder and change directory to your anvio folder. For example:
cd prokka_project gff_parser.py PROKKA-10072018.gff --gene-calls prokka-gene-calls.txt --annotation prokka-gene-annot.txt cp prokka-gene-calls.txt ../anvio cp prokka-gene-annot.txt ../anvio cd ../anvio
5. Make the contigs database¶
Now, make the contigs database. It will have lots of information about… well… your contigs.
anvi-gen-contigs-databaseis the anvi’o script that makes the contigs database.
–fis the fasta file with your contigs that you have already assembled and fixed.
–oprovides the name of your new contigs database.
external_gene_callsprovides the name of the Prokka file you just made so you can import the Prokka calls into your contigs database
--ignore-internal-stop-codonswill ignore any internal stop codons in your gene calls. Sometimes these will get included in your Prokka results by accident, but for our purposes we can ignore them.
anvi-gen-contigs-database -f [your formatted, assembled contigs] -o contigs.db --external-gene-calls prokka-gene-calls.txt --ignore-internal-stop-codons
anvi-gen-contigs-database -f ERR599142_assembly_reformatted.fa -o contigs.db --external-gene-calls prokka-gene-calls.txt --ignore-internal-stop-codons
6. Import the Prokka annotations¶
Import your prokka results like this:
anvi-import-functions -c contigs.db -i prokka-gene-annot.txt
7. Search for single copy universal genes¶
Now we will search our contigs for archaeal and bacterial single-copy core genes. This will be useful later on because when we try to disentangle genomes from this metagenome, these single-copy core genes can be good markers for how complete your genome is.
This process is slow, so we’re going to run it on 5 CPUs rather than just 1. You can run it on screen in the background while you move forward with step 8. It should take a little under 10 minutes.
screen -S anvio anvi-run-hmms -c contigs.db -T 5
8. Determine taxonomy using Centrifuge¶
Now we are going to figure out the taxonomy of our contigs using a program called centrifuge. Centrifuge is a program that compares your contigs to a sequence database in order to assign taxonomy to different sequences within your metagenome. We’re going to use it first to classify your contigs.
First, export your genes from anvi’o.
anvi-get-sequences-for-gene-calls -c contigs.db -o anvio-gene-calls.fa
9. Run Centrifuge¶
centrifuge -f -x /usr/local/CENTRIFUGE/p_compressed anvio-gene-calls.fa -S centrifuge_hits.tsv
10. Import Centrifuge data¶
Now import those centrifuge results for your contigs back in to anvi’o. Anvi’o can automatically read and import centrifuge output.
anvi-import-taxonomy-for-genes -c contigs.db -i centrifuge_report.tsv centrifuge_hits.tsv -p centrifuge
11. Run single copy gene taxonomy¶
Now we’re going to run one more thing to help us better determine the taxonomy of your MAGs.
anvi-run-scg-taxonomy -c contigs.db --scgs-taxonomy-data-dir /Accounts/Space_Hogs_shared/anvio-scg-databases/
Incorporating mapping data¶
We’ve just decorated our contigs database with all kinds of things. Now, we incorporate all of the BAM files.
12. Import mapping files into anvi’o with anvi-profile¶
Now anvi’o needs to combine all of this information (your mapping, your contigs, your open reading frames, and your taxonomy) together. To do this, use anvi-profile.
anvi-profileis the name of the program that combines the info together
–iflag provides the name of the sorted bam file that you copied in the step above.
-Tflag sets the number of CPUs. There are ~13 of you, and 96 to spare. For now, let’s set it to 5 so we don’t blow up the server.
-Mflag sets a minimum contig length. In a project for publication, you’d want to use at least 1000, because the clustering of contigs is dependent on calculating their tetranucleotide frequencies (searching for patterns of kmers). You need to have a long enough contig to calculate these frequencies accurately. But for our purposes, let’s use 500 so you can use as many contigs as possible.
anvi-profile -i [your sorted bam file] -c contigs.db -T 5 -M 500
anvi-profile -i ERR599142_assembly_vs_North_Pacific_ERR599142_sorted.bam -c contigs.db -T 5 -M 500
Do this for each of your sorted bam files.
13. Merge them together with anvi-merge¶
Now merge all of these profiles together using a program called anvi-merge. You have to merge together files in directories that were created by the previous profiling step. The asterisk * is a wildcard that tells the computer, ‘take all of the folders called ‘PROFILE.db’ from all of the directories and merge them together.’
We’re also going to tell the computer not to bin these contigs automatically (called ‘unsupervised’ binning), we want to bin them by hand (‘supervised’ binning). So we use the –skip-concoct-binning flag.
This step will take a couple minutes.
anvi-merge */PROFILE.db -o SAMPLES-MERGED -c contigs.db
Visualizing and making your bins¶
Now the fun part with pretty pictures! Type this to open up the visualization of your contigs (of course, change the port number to the one you were assigned):
anvi-interactive -p SAMPLES-MERGED/PROFILE.db -c contigs.db -P 8080
15. Visualize in browser¶
Now, open up a browser (Chrome works well) and type this into the browser window to stream the results directly from the server to your browser. NOTE that each of you will be assigned a different port number (i.e. 8080, 8081, 8082, etc); use the one you logged in with in the first step.
Click ‘Draw’ to see your results! You should see something like this: anvio screenshot
What you are looking at:
- the tree on the inside shows the clustering of your contigs. Contigs that were more similar according to k-mer frequency and coverage clustered together.
- the rings on the outside show your samples. Each ring is a different sample. This is a visualization of the mapping that you did last week, but now we can see the mapping across the whole sample, and for all samples at once. There is one black line per contig. The taller the black line, the more mapping for that contig.
- the ‘ribosomal proteins’ ring shows contigs with hits to known ribosomal proteins (useful for some applications)
- the ‘taxonomy’ ring shows the Centrifuge designation for the taxonomy of that particular contig.
- the ‘GC content’ ring shows the average percent of bases that were G or C as opposed to A or T for that contig.
16. Make bins¶
Once you get to this step, come find Rika in gather.town (in groups would be nice) or Zoom and I’ll show you how this works. (If you’re working asynchronously, please either wait until lab time or– if that’s not possible– ping Rika and we’ll find a time when I can show you how it works.)
Because your datasets are fairly small, your bins are also going to be very small. Your percent completeness will be very low. Try to identify ~3-5 bins according to patterns in the mapping of the datasets as well as the GC content.
When you are done making your bins, be sure to click on ‘Store bin collection’, give it a name (‘my_bins’ works), and then click on ‘Generate a static summary page,’ click on the name of your bin collection (e.g. “my_bins”), and then click on the link it gives you. It will provide lots of information about your bins. In the boxes under the heading ‘taxonomy,’ you can click on the box to get a percentage rundown of how many contigs in your bin matched specific taxa according to centrifuge, if any matched.
Once you have completed your binning process, take a screenshot of your anvi’o visualization and save it as ‘Figure 1.’ Write a figure caption explaining what your project dataset is, and which datasets you mapped to your sample.
17. Finding bin information¶
You will find your new bin FASTA files in the directory called
~/project_directory/anvio/SAMPLES-MERGED/SUMMARY_my_bins. I’ll describe all this information below for reference; it may come in handy if you decide to use this for your final project.
bins_summary.txtprovides just that, with information about the taxonomy, total length, number of contigs, N50, GC content, percent complete, and percent redundancy of each of your bins. This is reflected in the summary html page you generated earlier when you clicked ‘Generate a static summary page.’
If you go to the directory
bins_across_samples, you will find information about all of your bins across all samples, such as:
mean_coverage.txt, which gives the average coverage of your bins across all samples
variability.txt, which gives you the number of single nucleotide variants (SNVs) per bin in each sample
If you want to know what the rest of these files mean, look here.
If you go to the directory
bin_by_bin, you will find a series of directories, one for each bin you made. Inside each directory is a wealth of information about each bin. This includes (among other things):
- a FASTA file containing all of the contigs that comprise your bin (i.e.
- a file with all of the gene sequences and gene annotations in your bin (i.e.
- mean coverage of your bin across all of your samples (i.e.
- files containing copies of single-copy, universal genes found in your contigs (i.e.
- information about single nucleotide variability in your bin– the number of SNVs per kilobase pair. (i.e.
Analyzing your bins¶
OK, now the fun part! There are LOTS of things you can do with all of this information now, but for the purposes of this lab, let’s do two things.
18. Make a heatmap showing the relative coverage of the bins in each sample¶
If we want to visualize how abundant each of these bins is across samples, for example to say something about which genomes might be more abundant in different places, type:
make_bin_coverage_heatmap.py [path to mean_coverage.txt file]
The script will spit out a PDF file called
bin_covg_heatmap.pdf. Copy it over to your local computer using
scp to take a look. You should see something that looks like this:
The darker the color, the higher the coverage of that bin in that particular sample. The bins are clustered according to how similar their abundance patterns are (see the dendrogram at the top to see the clustering). The figure should give you a sense of how abundant these genomes are in different sites across the ocean.
Important side note: Most of the Python scripts you’ve been using are scripts that I wrote to parse and visualize these data files. I designed this class to be as open as possible and therefore I have not required you to come in with coding skills. However, for future reference, if you are thinking of going into a career in which bioinformatics is likely to be a prominent part of your daily life (as we should all strive to do!), then I strongly recommend taking a CS or stats class to formally learn something like Python or R to help with data parsing and visualization. You won’t regret it!
19. Visualize the functional potential of each of these genomes¶
If you’re wondering why these genomes are abundant in different areas– i.e., what makes these microbes different from each other that enables some to survive better in some regions? What types of metabolism do they encode? Do they have viruses inside them?– you could look at the files in
SAMPLES-MERGED/SUMMARY_my_bins/bin_by_bin/BinX/Bin_X-gene_calls.txt, which tells you what types of genes were in that bin. If you’re focusing on this for your final project, this file is where you should go for a detailed look at the gene content.
You can also get a more bird’s-eye-view of gene content with a visual summary of those gene calls by doing this:
get_bin_functions.py [full path to the anvi'o 'bin_by_bin' directory]
This will give you an output file called
for_heatmap.txt which you’ll use in the next command. Type this:
This will create a figure called
bin_function_heatmap.pdf that should look like this:
You’ve created a presence-absence grid that shows your bins at the bottom, and along the side are a bunch of gene categories. If your bin contained a gene in that category, the box for that category is colored dark green. As before, the bins are clustered according to the similarity of their functional patterns. Comparing this functions grid with your coverage heatmap might provide you with some insights (or maybe just more questions!) as to why some bins/genomes were more abundant in some regions than others.
While looking at your grid, keep in mind that these bins may not be very complete, so absence of evidence is not necessarily evidence of absence.
For this week’s post-lab assignment, you won’t do a mini-research question because the quality of your bins may vary, through no fault of your own. (That’s real data for you…) So you get a bit of a break this week. Please submit three figures:
Figure 1, the
bin_function_heatmap.pdf, each with figure captions, to Moodle by the start of lab next week.
Final step: sharing data¶
Some of you might want access to each others’ anvio data for your final projects. So let’s share it on in our shared folder.
First, make a directory with your name on it, and then put your contigs database and SAMPLES_MERGED directory in there. For example:
mkdir /Accounts/Genomics_Bioinformatics_shared/anvio_stuff/username cp contigs.db /Accounts/Genomics_Bioinformatics_shared/anvio_stuff/username cp -r SAMPLES_MERGED /Accounts/Genomics_Bioinformatics_shared/anvio_stuff/username