Week 4: Mapping with bowtie2¶
1. Log in to the remote server¶
Log onto the Carleton VPN if you haven’t already. Open up your Terminal window and ssh in to baross. If you’re working synchronously, head over to gather.town and hang out at one of the tables.
Mapping with a toy dataset¶
2. Make new directory¶
We’re going to start by mapping the sequencing reads from a genome sequence of a type of archaeon (Sulfolobus acidocaldarius) against a scaffold from a very closely related species.
The sequencing reads from from one strain of Sulfolobus acidocaldarius, and the reference sequence that they are mapping to is from a very closely related strain of Sulfolobus acidocaldarius.
In your home directory, make a new directory called “mapping,” then change into that directory.
cd ~ mkdir mapping cd mapping
3. Copy data¶
Copy this week’s toy datasets into your directory. You’re going to copy:
- the reference sequence (toy_dataset_contig_for_mapping.fasta)
- the reads that you will map to the reference (toy_dataset_reads_for_mapping.fasta)
cp /usr/local/data/toy_datasets/toy_dataset_reads_for_mapping.fasta . cp /usr/local/data/toy_datasets/toy_dataset_contig_for_mapping.fasta .
4. Build index¶
You’re going to be mapping short reads to a longer reference sequence. The first thing you have to do is prepare an index of your reference so that the mapping software can map to it.
bowtie2-build toy_dataset_contig_for_mapping.fasta toy_dataset_contig_for_mapping.btindex
bowtie2-buildis the program that indexes your reference.
- The first argument gives the reference dataset name.
- The second argument provides the name you want to give to the index.
Now, map! This will take a few steps.
First, you make what is called a SAM file. It’s a human-readable version of a BAM file, which we read about in the Zimmer “Game of Genomes” articles.
bowtie2is the name of the mapping program.
-xis the flag that provides the name of the index you just made.
-fmeans that the reads you are mapping are in fasta, not fastq, format.
-Umeans that the reads are not paired. (They aren’t in this dataset.)
-Sprovides the name of your output file, which is in SAM format.
bowtie2 -x toy_dataset_contig_for_mapping.btindex -f -U toy_dataset_reads_for_mapping.fasta -S toy_dataset_mapped_species1.sam
6. Look at SAM file output¶
If you look at the output file with
less, you can see that it is human-readable (sort of). It tells you exactly which reads mapped, and where they mapped on the reference, and what the differences were between the reference and the mapped reads. This can sometimes be useful if you want to parse it yourself with your own scripts– but there’s a whole suite of tools in a package called
samtools that we’ll rely on to do that next.
Now you will use a package called
samtools to convert the SAM file into a non-human-readable BAM file. Carl Zimmer spent a lot of time trying to obtain his BAM files in that “Game of Genomes” article– now you get to make one yourself! (But, uh, not of your own genome, obviously…)
samtools view -bS toy_dataset_mapped_species1.sam > toy_dataset_mapped_species1.bam
samtoolsis a package used to manipulate and work with mapping files. samtools view is one program within the whole samtools package.
- The flag
-bSis not BS! It tells samtools to convert a sam file to a bam file. (Bioinformatics jokes = still not very funny.)
8. samtools sort¶
And because this is so fun, we get to do some more bookkeeping. Sort your bam file so that later programs have an easier time parsing it:
samtools sort toy_dataset_mapped_species1.bam -o toy_dataset_mapped_species1_sorted.bam
samtools sortis the name of the program used for sorting
- The first argument provides the name of the bam file you want to sort
-oflag gives the name of the output file you want.
9. Index the reference with samtools¶
In order to visualize your mapping, you have to index your reference. Because indexing is SO MUCH FUN. This time we are indexing with samtools instead of bowtie2.
(NOTE!!: you only need to do this step if you’re going to visualize your mapping with IGV, as we’re about to do now. In the future, if you don’t intend to visualize your mapping, then you don’t need to bother with this step.)
samtools faidx toy_dataset_contig_for_mapping.fasta
samtools faidxis the name of the program that indexes the reference.
- The first argument provides the name of the index, which should be your reference file.
10. Index the bam file¶
Almost there! Now you index the bam file that you just made because WE LOVE INDEXING.
samtools index toy_dataset_mapped_species1_sorted.bam
samtools indexis the name of the program that indexes the bam files.
- The first argument provides the name of a sorted bam file.
11. Copy to local computer¶
Now we’re going to visualize this. Copy the entire “mapping” folder over to your local computer using scp (or Filezilla if you prefer.)
scp -r firstname.lastname@example.org:~/mapping/ .
Remember, to use scp, you should open a new Terminal window that is NOT logged in to baross. The above command copies the folder at toy_dataset_directory/mapping to your local computer in whatever folder you happen to be in. You could tell the computer to put it on the Desktop or wherever you like by simply providing the path instead of a period in the second part of the command.
12. Visualize in IGV¶
Find the IGV Viewer on your local computer and open it. First, click ‘Genomes’ –> ‘Load Genome from File’ and find your reference file (
toy_dataset_contig_for_mapping.fasta). Next, click ‘File’ –> ‘Load from File’ and open your sorted bam file (
toy_dataset_mapped_species1_sorted.bam). You should be able to visualize the mapping. Along the top, you’ll see the coordinates of your reference sequence. Below that, you’ll see a graph showing the coverage of each base pair along your reference sequence. Below that, you’ll see each read mapped to each position. The arrows indicate the direction of the read; white reads are reads that mapped to two different locations in your reference. Single nucleotide variants in the reads are marked with colored letters; insertions are marked with a purple bracket, and deletions are marked with a horizontal black line. More information can be found at the link below.
Link to IGV viewer
13. Compare mappings¶
We’re going to compare and contrast this mapping with another one. Now we’re use the sequencing reads from a third very closely related strain of Sulfolobus acidocaldarius, and we’re going to map those reads to the original reference sequence so that we can compare the mapping.
All of the commands are listed below. First, you will copy the second file to your directory (see below). You have already indexed the reference file. Then you map, convert SAM to BAM, sort, and then index.
cd ~/mapping/ cp /usr/local/data/toy_datasets/toy_dataset_reads_for_mapping_species2.fasta . bowtie2 -x toy_dataset_contig_for_mapping.btindex -f -U toy_dataset_reads_for_mapping_species2.fasta -S toy_dataset_mapped_species2.sam samtools view -bS toy_dataset_mapped_species2.sam > toy_dataset_mapped_species2.bam samtools sort toy_dataset_mapped_species2.bam -o toy_dataset_mapped_species2_sorted.bam samtools index toy_dataset_mapped_species2_sorted.bam
14. Visualize new mapping¶
Copy the new data files to your local computer using scp like in step 11. Visualize both of them in IGV viewer. Since you have already loaded the reference file and reads from your first mapping, all you have to do is click ‘File’ –> ‘Load from File’ and click on
toy_dataset_mapped_species2_sorted.bam. You should be able to see them side by side.
If you’re working synchronously and are in gather.town, pause to make sure you’re understanding what you’re seeing by answering the questions on the Google Doc.
Mapping your project datasets¶
Now we’re going to map your project datasets. Remember that these are metagenomes, not a genome, so the data will be a bit more complex to interpret.
We’re going to map your reads against your assembled contigs. Why would we do this, you ask? A few reasons:
- to look for single nucleotide variants in specific genes.
- to quantify the relative abundances of different genes, and determine whether specific genes have better coverage than others.
- to quantify the relative abundances of specific taxa, and determine whether specific taxa are more abundant than others.
15. Map to your project datasets¶
Change directory into your project dataset directory folder.
We’re going to map your raw reads against your assembled contigs (not your ORFs). Make sure you know where your project assembly is and where your raw reads are. Follow the instructions to map your raw reads back to your assembled contigs. For example, if you were mapping the reads in the dataset
ERR599166_1mill_sample.fasta against an assembly called
ERR599166_assembly_reformatted.fa, you might do something like this (below). Please be sure to use the assembled files that you’ve already run through anvi-script-reformat-fasta, which you should have done in our first computer lab.
One more thing. Your project datasets are very large– you each have 10 million reads. So mapping will take longer than for the toy datasets. So we’re going to add an extra flag (-p) to the mapping step to tell the computer to use more than one CPU so that this goes faster. You’ll each use 5 CPUs for this process. The mapping may still take about 5-10 minutes, so have patience!
An example set of commands is shown below. Remember to replace the dataset names here with your own project datasets!
bowtie2-build ERR599166_assembly_reformatted.fa ERR599166_assembly_reformatted.btindex bowtie2 -x ERR599166_assembly_reformatted.btindex -f -U ERR599166_sample.fasta -S ERR599166_mapped.sam -p 5 samtools view -bS ERR599166_mapped.sam > ERR599166_mapped.bam samtools sort ERR599166_mapped.bam -o ERR599166_mapped_sorted.bam samtools faidx ERR599166_assembly_reformatted.fa samtools index ERR599166_mapped_sorted.bam
Note that this command assumes that your raw reads and your assembly are in the same directory you’re in. If they are not, you will need to either copy them over or use the correct path in your commands. (A reminder: the path simply gives directions to the computer for where to find a file.) For example, if you are in a mapping directory, your reads file is one directory up in the file hierarchy, and your assembled reads are in your assembly file, you might have to type something like this:
bowtie2 -x ../assembly/ERR599166_assembled.btindex -f -U ../ERR599166_1mill_sample.fasta -S ERR599166_mapped.sam
When you visualize this in IGV, remember that you have multiple contigs. So you have to click the drop-down menu at the top and choose which contig you wish to visualize.
17. Calculating coverage- generate bed file¶
You were able to visualize the mappings in IGV, but sometimes you just want to have a number: for example, you might want to know the average coverage across a specific gene, and compare that to the average coverage of another gene in order to compare their relative abundances in the sample. So, next we’re going to calculate gene coverages based on your mapping.
First we’re going to make what’s called a bed file. We will use it to find the average coverage of every single open reading frame in your dataset. Please make sure that your contigs have names that are something like
c_00000000001 and your ORFs have names that are something like
make_bed_file_from_gff_file_prokka.py [your gff file from prokka]
This will create a bed file that ends in .bed. You can take a look at it if you wish– it should have the contig name, the coordinates of your ORF, and the name of your ORF.
18. Run script to calculate coverage¶
Now run a script that will use your bed file and will calculate the read depth for every single ORF in your ORF file. You might have to provide the path to your bed file, or copy it into your project directory.
samtools bedcov [your bed file] [your sorted bam file] > [ an output file that ends in _ORF_coverage.txt]
samtools bedcov PROKKA_09222020.bed ERR599166_mapped_sorted.bam > ERR599166_ORF_coverage.txt
It is extremely important that you ran both bowtie2 and prokka on the reformatted assembly! If you get errors, this is probably because you did not run them against the reformatted assembly!!
How do you check that you ran it on the reformatted dataset? First, check your Prokka output (replace with the name of your PROKKA file):
It should look something like this, with the second column showing numbers like c_000000000001 and so on.
##gff-version 3 ##sequence-region c_000000000001 1 2569 ##sequence-region c_000000000002 1 1660 ##sequence-region c_000000000003 1 1376 ##sequence-region c_000000000004 1 1037 ##sequence-region c_000000000005 1 945 ##sequence-region c_000000000006 1 917 ##sequence-region c_000000000007 1 902 ##sequence-region c_000000000008 1 892 ...
If it looks more like this, you did not run prokka against the reformatted dataset.
##gff-version 3 ##sequence-region contig-100_0 1 44454 ##sequence-region contig-100_1 1 44231 ##sequence-region contig-100_2 1 35217 ##sequence-region contig-100_3 1 29595 ##sequence-region contig-100_4 1 29534 ...
If it looks like this, then you should run
anvi-script-reformat again and then run prokka again. For example:
anvi-script-reformat-fasta contig-100.fa -o ERR599899_assembled_reformatted.fa -l 0 --simplify-names prokka ERR599899_assembled_reformatted.fa --outdir prokka_ERR599899
You should also make sure that you ran bowtie2 against the reformatted version. How do you know? Go back up up to Step 13 in this week’s protocol and check the assembly file with
less. Make sure the contigs start with
c_00000001 and not
19. Calculate coverage in Excel¶
samtools bedcov command will give you a file that ends in
scp to move it to your local computer, then open with Excel. It should give the name of your contig, the start coordinate, the stop coordinate, the name of your open reading frame, and then the sum of the per-base coverage. To get the average coverage, divide the sum of the per-base coverage by the difference between the stop and start coordinates. In other words, type this in the top column and then fill down to the bottom:
Pro tip: Rather than drag for 16,000 rows in Excel, highlight the top cell, then scroll down to the bottom of the column, then hold ‘Shift’ while you click the bottom cell of the column where the data ends, then click Edit –> Fill–> Down.
Now you have the average coverage of every ORF for this particular bam file, and you should be able to match the name of your ORF from Prokka to the ORF in this output file.
20. Check for understanding¶
This is a really common type of analysis for ‘omics-based studies– you can compare the coverage of specific genes of interest. For example, you might compare the coverage of genes related to photosythesis, respiration, and nitrogen fixation if you’re interested in how abundant those metabolisms are in the community. We also use a very similar technique when we’re doing expression analyses using something like RNASeq (more on that in coming weeks).
If you’re working synchronously and are in gather.town, check that you’re understanding by discussing the questions in the Google Doc.
22. This week’s postlab writeup¶
For this week’s post-lab writeup:
Mini Research Question
Write either a question or generate a hypothesis about the relative coverage of this set of genes with respect to your project datasets.
If you’re having trouble thinking of a good question, come find Rika in gather.town or Slack, tell me what you’re interested in, and we can bounce around some ideas.
Example #1: I hypothesize that there will be lower coverage of genes related to photosynthesis (i.e. the psb genes) in the mesopelagic zone relative to the surface. This is because at the surface there will be more organisms that photosynthesize compared to the mesopelagic zone, where less light is available. Therefore, a lower proportion of genes in the microbial community in the mesopelagic zone will be related to photosynthesis compared to the surface, and therefore, fewer reads will map to photosynthesis genes in the mesopelagic zone.
Example #2: I hypothesize that there will be higher coverage of genes related to viruses in my sample relative to the deeper samples because there are more viruses in surface waters than in deeper waters, simply because there are more organisms to infect in surface waters.
Once you’ve identified the set of genes related to a specific metabolism/function/type of organism, and you have written a question or generated a hypothesis, find the average coverage to each of those ORFs in your dataset. Remember that more than one ORF may have that function.
Many of you will probably want to compare your mapping of your own reads to your own dataset to a mapping made by one of your classmates to their own dataset. The bam files and ORF coverage files should be saved in /Accounts/Genomics_Bioinformatics_shared/mapping.
Describe your results and create at least one graph to visualize those results. This should represent a mini ‘Results’ section in a lab report or paper. Interpret your results within the context of the ecosystem you are investigating. This should represent a mini ‘Discussion’ section in a lab report or paper.
Submit on Moodle by lab time next week.