Week 2: Introduction to Unix/Linux and the Server; Assembly with IDBA-UD; ORF Calling and Annotation with Prokka¶
Rika Anderson, Carleton College
Getting help in lab this week and onwards¶
Questions? Problems? Running into errors? Post on the lab Slack. I’ll be monitoring the Slack during the weekly lab sessions (and outside of those times as well, though I can’t always guarantee a rapid response). It’s easiest to answer your questions if you post a thorough explanation of exactly what command you ran and exactly what error message you’re getting. If you can, please answer each other’s questions too– some of you may run into the same problems.
While you’ll be working through the labs independently, there are two steps that involve group discussion. If you are working on lab synchronously, hop on gather.town (see Moodle for link) while you’re working. Sit in groups of ~3-4 at Tables 1-5. You can turn videos and mics off while you work, but you can turn videos/mics on when you want to ask each other questions and for the group discussion phase. I’ll be logged on and you can find me in “Rika’s corner” if you have a question that can’t be answered easily on Slack or just to say hi. If you’re on a different time zone or can’t work synchronously for whatever reason, just move on to the next step, and you can post questions on Slack if they come up.
Connecting to baross¶
1. About baross¶
We are going to do most of our computational work on a remote server (i.e., a remote computer) called baross, which has 96 CPUs, 50 TB of storage, and 768 GB of RAM. baross lives in the basement of the CMC. You can access it from lab computers on campus, and you can also access it from your own computer at home. First, you have to learn how to access baross.
For everyone: please log on to the Carleton VPN first if you can.
2. Opening Terminal¶
If you’re on a Mac, find and open the Terminal application (it should be in “Applications” in the folder called “Utilities”). If you’re on a PC or other OS, open the window you’ll use to ssh into a remote server (like Ubuntu or PuTTY or something similar).
The terminal is the interface you will use to log in to the remote server for the rest of the course. It is also the interface we will be using to run almost all of the bioinformatics software we will be learning in this course. You can navigate through files on the server or on your computer, copy, move, and create files, and run programs using the Unix commands we learned earlier this week.
We will use something called ssh, or secure shell, to remotely connect to another computer (in this case it will be our class server, baross). Type the following (substituting username with your own Carleton username– the same name as your email address):
ssh username@baross.its.carleton.edu
4. ssh (cont.)¶
You should see something like this:
[your username]@baross.its.carleton.edu's password:
Type in your Carleton password. NOTE!! You will not see the cursor moving when you type your password. Never fear, it is still registering what you type. Be sure to use the correct capitalization and be wary of typos.
Whenever you ssh in to baross, you end up in your own home directory. Each of you has your own home directory on baross. To see where you are, print your current working directory.
# How to print your working directory (find out where you are in the system)
pwd
6. Your path¶
You should see something like this:
This tells you what your current location is within baross, also known as your path. But before we actually run any processes, we need to learn a few important things about using a remote shared server.
Creating an assembly and evaluating it (toy dataset)¶
Let’s say we’ve gone out into the field and taken our samples, we’ve extracted the DNA, and we’ve sent them to a sequencer. Then we get back the raw sequencing reads. One of the first things we have to do is assemble them. To do that, we’re going to use a software package called IDBA-UD. Bioinformaticians love to debate about which assembler is best, but ultimately it usually depends on the nature of your own dataset. If you find yourself with data like this someday and you want to know which assembler to use, my advice is to try a bunch of them and then compare the results using something like Quast, which we’ll test below. For now, we’re going to use IDBA-UD, which I’ve found to be a good general-purpose assembler.
Make a new directory in your home folder:
mkdir toy_dataset_directory
Change directory into your toy dataset directory:
cd toy_dataset_directory
11. Copy toy dataset¶
Copy the toy dataset into your assembly directory. Don’t forget the period at the end! This means you are copying into the directory you are currently in.
cp /usr/local/data/toy_datasets/toy_dataset_reads.fasta .
12. Run idba-ud on toy dataset¶
Run idba_ud on the toy dataset. Here is what the command means:
- idba_ud invokes the program.
- The -r flag gives it the “path” (directions) to the reads for your toy dataset. (A path starting with ../ would mean the file is in the directory one level above the one you’re in; here the reads are in your current directory, so the bare file name works.)
- The -o flag tells the program that you want the output directory to be called “toy_assembly”.
idba_ud -r toy_dataset_reads.fasta -o toy_assembly
13. cd to output directory¶
When that’s done running, go into your new output directory and take a look at what’s inside:
cd toy_assembly
ls
14. The output files¶
You should see several files, including these:
contig.fa contig-20.fa contig-40.fa contig-60.fa contig-80.fa contig-100.fa scaffold.fa
We will talk more about how assembly works in class. Briefly, De Bruijn graph assembly works by searching reads for overlapping words, or “kmers.” IDBA-UD starts by searching for small kmers in the short reads to assemble those short reads into longer contigs. Then it uses those constructed contigs for a new round of assembly with longer kmers, repeating with progressively longer kmers of length 20, 40, 60, 80, and 100. Each “contig-” file gives the results of one iteration at a given kmer size. The “scaffold.fa” file provides the final scaffolded assembly, which pulls contigs together using the paired-end data.
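To make the kmer idea concrete, here is a toy Python sketch. Everything in it is invented for illustration (the two short “reads,” the kmer length of 5); IDBA-UD’s actual De Bruijn graph construction is far more sophisticated, but the core idea is that reads sharing kmers can be linked together.

```python
# Toy illustration of kmers: split a read into overlapping words of length k.
# Reads and kmer size here are made up; this is the concept, not IDBA-UD's code.

def kmers(read, k):
    """Return all overlapping kmers of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

read1 = "ATGGCGTACA"
read2 = "CGTACATTGA"

# Reads that overlap share kmers; shared kmers are what link reads
# together into longer contigs during assembly.
shared = set(kmers(read1, 5)) & set(kmers(read2, 5))
print(sorted(shared))  # ['CGTAC', 'GTACA']
```

Notice that the two reads overlap by six bases, so they share the kmers spanning that overlap; an assembler exploits exactly this to merge them.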
15. Examine output¶
Let’s examine the assembly output files. First, take a look at your final scaffold output:
less scaffold.fa
16. Fasta file¶
You should see a fasta file with some very long sequences in it. When you’re done looking at it, type:
q
17. Evaluate assembly quality¶
Now we’re going to evaluate the quality of our assembly. One measure of assembly quality is N50, which is the contig length for which the collection of all contigs of that length or longer contains at least 50% of the sum of the lengths of all contigs. To do that, we will use an assembly evaluator called Quast. Run it on your toy dataset. (Note that the command is kind of long, so scroll right in the little green box below.)
quast.py contig-20.fa contig-40.fa contig-60.fa contig-80.fa contig-100.fa scaffold.fa
Here is what the command means:
- quast.py invokes the program.
- We tell it to run on all of the output files of your toy assembly. Every fasta file that we list after quast.py will be included in the analysis.
Quast calls the output directory quast_results by default. You can change that using the mv command:
mv quast_results toy_assembly_quast_evaluation
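As an aside, the N50 statistic defined above is simple enough to compute by hand from a list of contig lengths. Here is a minimal Python sketch of that definition (illustrative only; the contig lengths are made up, and Quast reports N50 along with many other metrics):

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L together
    contain at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

# Made-up example: total = 2850 bases, so half = 1425.
# Cumulative sums over sorted lengths: 1000, then 1800 >= 1425, so N50 = 800.
print(n50([500, 800, 1000, 300, 250]))  # 800
```

A helpful intuition: a higher N50 means more of your assembly sits in long contigs, which is usually (though not always) a sign of a better assembly.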
18. View output by copying to your local computer¶
Quast gives you some nice visual outputs, but it’s easier to look at these if you copy them from the server to your local computer. For this, we will use scp, or ‘secure copy.’ It’s a lot like cp, except that you’re copying from a remote server to your own computer. We will use scp -r because we’re copying a directory recursively, not just a file.
Open up another Terminal window on your own computer. Don’t ssh into anything. Navigate to your home directory.
Type this (substituting your own username below):
scp -r username@baross.its.carleton.edu:~/toy_dataset_directory/toy_assembly/toy_assembly_quast_evaluation .
That means: copy something from the remote server baross and put it right here on my local computer.
Find your newly copied folder on your own computer. Double-click on the file called “report.html” and discuss it with your group (below).
19. Pause to check for understanding¶
For those of you able to work synchronously: In gather.town, discuss the questions in the shared Google Doc at your table (press x to access). For those of you doing this on Tuesday, we likely won’t have covered the details of kmers and N50 yet, so discuss the patterns you see and we’ll fill in the concepts in class. For those of you doing this on Thursday, you should be able to explicitly discuss kmer length and N50 with regards to your assemblies. Feel free to find Rika to ask questions if something isn’t working or doesn’t make sense. When you’re done, move on to step 20 below.
For the sake of future bookkeeping, you’ll need to modify your scaffold assembly file so that another software program we will use in the future (called anvi’o) is happy. (Spaces and other weird characters are so tricky sometimes.) This will be very important for your project datasets as well. Do this while you are inside the directory with your assembly files:
anvi-script-reformat-fasta scaffold.fa -o toy_dataset_assembled_reformatted.fa -l 0 --simplify-names
Create and evaluate assembly (project dataset)¶
You’re now going to run an assembly on your project dataset. Before we start working on your project datasets, I encourage you all to start a computational lab notebook. I recommend that you open a text document in BBEdit or a similar text editor and save it on the cloud somewhere. In that notebook, it will be crucial that you keep a record of all the commands you run, especially for your project datasets. That way, if you get a funny result, or if you want to repeat an analysis, you can always go back to your lab notebook to see exactly what command you ran. In my own computational lab notebook, I record the dates and I include notes to myself about the analysis, or observations of the graphs I make. I recommend using a text document (like in BBEdit) rather than Word or Google Docs because you can copy-paste Unix commands from a text editor without worrying about formatting. These notebooks are just for you– I won’t be checking them– but now that we’re about to get into the real analysis, you should start one for your project datasets and maintain them throughout the rest of this course.
Since the assembly will take about 10-20 minutes, I recommend running this on screen. Type:
screen -S assembly
22. Make and change directories¶
Make a new directory in your home folder called “project_directory,” and then change directory into your newly created directory:
cd ~
mkdir project_directory
cd project_directory
23. Copy project dataset¶
Next, copy your project dataset into that directory. Your assigned project dataset is listed in an Excel file that is on Moodle.
For example, if you have dataset ERR598966 from the Arabian Sea, you would do this:
cp /usr/local/data/Tara_datasets/Arabian_Sea/ERR598966_sample.fasta .
24. Start assembly¶
Next, start your assembly. Substitute the name of your own sample dataset for the dataset shown here.
idba_ud -r ERR598966_sample.fasta -o ERR598966_assembly
25. Detach from screen¶
After you’ve started the process, detach from screen by typing Ctrl+A and then d.
These assemblies will likely take some time. You can move on to the next steps (“Searching for and annotating open reading frames”) while this runs, and then come back to this when it’s done.
To check on the progress of your assemblies, you can either use top to see if your assembly is still running, or you can re-attach to your screen using the instructions above.
Searching for and annotating open reading frames¶
Now that we have assembled our contigs, we want to find genes on them. That requires identifying open reading frames (ORFs), and then comparing the ORF sequences with existing databases to see if we can figure out what kinds of genes they are. There are lots of programs to identify and annotate ORFs. We’re going to use a program called Prokka, which wraps a bunch of different software into a pipeline that is nice and streamlined. Prokka basically does three things:
- Identify ORFs
- Compare those ORFs to databases of known genes to find the closest match and assign their putative functions
- Several other things (like finding rRNA or tRNA genes, CRISPRs, etc.) that we aren’t going to worry about right now.
Prokka works well and it’s free, so we’re using it today. It works best for archaeal and bacterial genomes. If you want to identify and annotate ORFs for eukaryotic genomes, there’s lots of similar software out there to do that.
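To give a flavor of what “identifying ORFs” means, here is a toy Python sketch of single-frame ORF finding: scan for a start codon (ATG) and read codon by codon until a stop. The sequence and minimum length here are invented, and real gene callers (like the one Prokka wraps) scan all six reading frames on both strands and use statistical gene models, so this is only the core idea.

```python
# Toy ORF finder: one strand, one frame at a time, made-up example sequence.
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return ORFs: an ATG start through an in-frame stop codon."""
    orfs = []
    i = 0
    while i < len(seq) - 2:
        if seq[i:i + 3] == "ATG":
            # Walk forward in steps of 3 (codons) looking for a stop.
            for j in range(i + 3, len(seq) - 2, 3):
                if seq[j:j + 3] in STOPS:
                    if (j - i) // 3 >= min_codons:
                        orfs.append(seq[i:j + 3])
                    i = j  # resume scanning after this ORF
                    break
        i += 1
    return orfs

print(find_orfs("CCATGAAATTTTAGGG"))  # ['ATGAAATTTTAG']
```

The printed ORF reads ATG AAA TTT TAG: a start, two amino-acid codons, and a stop. Prokka’s annotation step then compares the translated version of each ORF against databases of known proteins.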
Go to your toy dataset directory and make a new directory:
cd ~/toy_dataset_directory
mkdir ORF_finding
cd ORF_finding
26. Run Prokka¶
Now run Prokka on your toy assembly, which is located in the toy_assembly folder:
prokka ../toy_assembly/toy_dataset_assembled_reformatted.fa --outdir prokka_toy
This means you’re invoking prokka on your reformatted toy dataset assembly, and you’re putting the output in a new directory called prokka_toy.
27. View output in FASTA format¶
You should see a directory called prokka_toy. Use cd to go into that folder, then use the program less to look at PROKKA_09222020.faa (or something like that– adjust according to the date). You should see a fasta file with amino acid sequences. Each amino acid sequence is an open reading frame (ORF), or a putative gene that has been identified from your assembly.
The cool thing is that Prokka has also annotated your genes with their putative function. You can see that in each sequence title, which provides the name of the sequence assigned by Prokka (e.g. KGLPOPID_00002) and the putative function (e.g. Proline/betaine transporter). A lot of them will say hypothetical protein, which simply means that Prokka couldn’t find a good match for that protein in public databases.
Note that the file that ends in .ffn contains the same thing, except the ORF sequences are in nucleotide format, not amino acid. And the file that ends in .gbk is in GenBank format, which is commonly used in the National Institutes of Health (NIH) National Center for Biotechnology Information (NCBI) database, which we’ll be using a lot over the next few weeks.
28. View output in tab-separated column (tsv) format¶
Let’s look at one last output format, which is in tab-separated columns, and therefore best visualized in a spreadsheet application like Excel. Use scp to copy this file from the server to your own computer. Remember, type this into a Terminal window that is NOT logged on to the server.
scp username@baross.its.carleton.edu:~/toy_dataset_directory/ORF_finding/prokka_toy/PROKKA_09252020.tsv .
29. Open tsv file¶
Find your file on your local computer, and open your .tsv file in Excel (or Google Sheets). The columns are as follows, from left to right:
locus_tag: the name that Prokka assigned the ORF
ftype (feature type): whether the thing it found is an ORF (CDS) or a tRNA or rRNA gene or something else.
length_bp: length of the gene
gene: if there was a match to an identified gene, it will give the name of that gene.
EC_number: the Enzyme Commission assigns numbers to known genes according to their function. If there was a match to a known gene with an EC number, that is given here.
COG: the Clusters of Orthologous Groups (COG) database groups proteins into categories of known function. If your ORF matched a specific COG, that is listed here. There is a website describing all of the COGs.
product: The putative annotation for your ORF.
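If you’d rather explore this table programmatically than in Excel, the tab-separated format is easy to parse. Here is a hedged Python sketch: the three sample rows are invented to mimic the column layout described above (the locus tags and products echo the examples earlier in this lab), and for real data you would open your own dated Prokka file instead of the inline string.

```python
import csv
import io
from collections import Counter

# Invented sample rows mimicking the Prokka .tsv layout described above.
# For real data, replace io.StringIO(SAMPLE) with open("PROKKA_09252020.tsv").
SAMPLE = (
    "locus_tag\tftype\tlength_bp\tgene\tEC_number\tCOG\tproduct\n"
    "KGLPOPID_00001\tCDS\t693\t\t\t\thypothetical protein\n"
    "KGLPOPID_00002\tCDS\t1254\tproP\t\tCOG0477\tProline/betaine transporter\n"
    "KGLPOPID_00003\tCDS\t450\t\t\t\thypothetical protein\n"
)

def product_counts(handle):
    """Count how many ORFs received each product annotation."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["product"] for row in reader)

counts = product_counts(io.StringIO(SAMPLE))
print(counts["hypothetical protein"])  # 2
print(counts.most_common(1))           # [('hypothetical protein', 2)]
```

A tally like this is one quick way to see, for example, what fraction of your ORFs came back as hypothetical proteins.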
Annotate your own project datasets¶
Now that you have the general lay of the land, you’re going to annotate your own project dataset assemblies using Prokka.
30. Check on your project assemblies¶
Hopefully your project assemblies are done by now. Log back into your screen session (screen -r assembly). When IDBA-UD finishes, it will say:
invalid insert distance
Segmentation fault
This is because we gave IDBA-UD single-end reads instead of paired-end reads. Never fear! Despite appearances, your assembly actually did finish. However, the use of single-end reads instead of paired-end reads means that your assemblies will NOT have scaffolds, because we couldn’t take advantage of the paired-end data to make scaffolds. So, your final assembly is just the final contig file from the longest kmer, called contig-100.fa.
31. Rename assembly¶
When your project assembly is done, change the name of your final assembly from contig-100.fa to a name that starts with your dataset number, followed by _assembly.fasta. For example, you might call it something like ERR598966_assembly.fasta. This way we will have standardized names and save ourselves future heartache. While in your project_directory, type:
mv ERR598966_assembly/contig-100.fa ERR598966_assembly.fasta
(You learned how to do this already– see your Unix cheat sheet.)
32. Run anvi-script-reformat-fasta¶
Run anvi-script-reformat-fasta on your completed project assemblies (see step 19 above). For example:
anvi-script-reformat-fasta ERR598966_assembly.fasta -o ERR598966_assembly_reformatted.fa -l 0 --simplify-names
33. Run Prokka on your project dataset¶
If you aren’t in your project directory right now, move into it and start prokka. Remember that you’re still in screen. This shouldn’t take too long, but if you log out of screen, don’t forget to re-enter and then terminate the screen session!
cd ~/project_directory
screen
prokka [your assembly reformatted] --outdir prokka_project
Ctrl+A d
Playing with “big data”¶
Congratulations! You now have annotations for your entire project dataset. This means you have annotations for thousands of genes in your dataset. Feeling overwhelmed? Feeling excited about the endless possibilities? Feeling hungry for some questionable yet tasty LDC food? If it’s one of the first two– welcome to big data! If it’s the last one, maybe lab has gone on for too long.
We’re almost done. But first we’re going to see an example of how we might analyze this kind of data.
34. COG categories¶
First, there’s a text file on the server that assigns every single COG in the COG database to a specific gene category, like Translation, ribosomal structure, and biogenesis. You can see it by typing this:
35. Getting COG categories of your genes¶
Let’s say you want to know how many of the genes in your dataset are responsible for translation, how many are for energy metabolism, how many are viruses, etc. Fortunately, there is a magic script available for you to do that on the server. Change directories into wherever your Prokka results are and type this:
get_ORF_COG_categories.py [name of your Prokka tsv file]
And voilà! You’ll get a new file that ends in _cog_categories.txt with a list of all the different COG categories and the number of genes that fall into that category. Some of the COGs fall into multiple categories, denoted with, for example, Signal_transduction_mechanisms/Transcription. Imagine the possibilities! You can investigate so many questions about gene function in your dataset!
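For a sense of what such a tally involves under the hood, here is a rough Python sketch. Everything in it is illustrative: the COG-to-category mapping is an invented placeholder (the real get_ORF_COG_categories.py script uses the full COG table on the server, and its internals aren’t shown in this lab), and the list of COG IDs stands in for the COG column of your Prokka .tsv file.

```python
from collections import Counter

# Invented placeholder mapping, for illustration only; the real script
# reads the full COG-to-category table that lives on the server.
COG_TO_CATEGORY = {
    "COG0001": "Translation_ribosomal_structure_and_biogenesis",
    "COG0002": "Signal_transduction_mechanisms/Transcription",
}

def category_counts(cog_ids):
    """Tally functional categories for a list of COG IDs (e.g. drawn from
    the COG column of a Prokka .tsv); unmapped IDs are lumped together."""
    return Counter(
        COG_TO_CATEGORY.get(cog, "Uncategorized") for cog in cog_ids
    )

counts = category_counts(["COG0001", "COG0002", "COG0002", "COG9999"])
print(counts["Signal_transduction_mechanisms/Transcription"])  # 2
print(counts["Uncategorized"])                                 # 1
```

Note that a combined label like Signal_transduction_mechanisms/Transcription is counted as its own category here, just as it appears in the _cog_categories.txt output.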
37. Compare datasets to prep for this week’s post-lab¶
If you are able to work synchronously: On gather.town, discuss the questions on the shared Google doc with your group. If it’s helpful, it’s possible to share your screen in gather.town. This will help you with this week’s post-lab assignment.
If you haven’t killed your screen yet, you should do so. As you did above, while in your screen session, type Ctrl+A and then k. The computer will say:
Really kill this window [y/n]
You type:
y
When you are all done with your work on the server, you can log off the server by typing:
exit
Lab assignment this week¶
Write a mini research question based on your ORF annotations. If you like, you can compare your sample to someone else’s. This doesn’t have to be overly in-depth, just ask a simple question that can be answered by the files you’ve generated today.
For example, you might ask, “Is there a difference in the number of genes related to the mobilome/prophage in the surface ocean compared to the mesopelagic zone?”
Or for example, you might ask “What genes are present in the mesopelagic zones that are missing from the surface ocean?”
Or, “Are there more genes labeled as “hypothetical” genes in the Antarctic Ocean compared to the North Pacific?”
Or, “Within a single sample, are there more genes related to energy metabolism or replication/recombination/repair?”
Make an Excel bar plot or table that demonstrates the answer to your question. In a paragraph or so, explain your results and speculate as to why your results look the way they do, and what else you might want to investigate related to this idea in the future. Submit this on Moodle by lab time next week.
I prefer to grade these blind, so please put your student ID number, rather than your name, on your assignment. (This applies to all future assignments as well.)