Bioinformatics Tools¶
Summary of purpose & usage of bioinformatics tools seen in class.
IDBA-UD¶
Assembles reads into contigs
Example usage (replace names in brackets with your own names and remove brackets):
idba_ud -r [reads file] -o [output folder name]
- The
-rgives it the “path” (directions) to the reads for your reads.../means it is in the directory outside of the one you’re in. - The
-oflag tells the program what you want the output directory to be called.
Prokka¶
Finds ORFs in assembled contigs and annotates them
Example usage (replace names in brackets with your own names and remove brackets):
prokka [assembled contigs fasta file] --outdir [name of output directory]
To get a summary of how many ORFs were assigned to different COG categories based on your Prokka data, do this:
get_ORF_COG_categories.py [name of your Prokka tsv file]
BLAST¶
Find a match to a sequence in your own sequence file, not the NCBI database
Example usage (replace names in brackets with your own names and remove brackets):
First, make a BLAST database:
makeblastdb -in [example sequence file] -dbtype [prot or nucl]
- -in gives the name of the file you want to BLAST against.
- for dbtype, use
protfor an amino acid file andnuclfor a nucleotide file.
Then, do your BLAST.
blastp¶
blastp -query [example query file] -db [example database file] -evalue 1e-05 -outfmt 6 -out [example_output.blastp]
blastpdoes a protein vs protein blast. (other choices areblastn,blastx,tblastn,tblastx.)-querydefines your blast query– in this case, the Pfam seed sequences for the CRISPR RAMP proteins.-dbdefines your database– in this case, the toy assembly ORFs.-evaluedefines your maximum e-value, in this case 1x10-5-outfmtdefines the output format of your blast results. option 6 is common; you can check out https://www.ncbi.nlm.nih.gov/books/NBK279675/ for other options.-outdefines the name of your output file. I like to title mine with the general formatquery_vs_db.blastpor something similar.
tblastn¶
tblastn -query [example query file] -db [example database file] -evalue 1e-05 -outfmt 6 -out [example_output.tblastn]
tblastndoes a protein vs translated nucleotide blast. (other choices areblastn,blastx,blastp,tblastx.)-querydefines your blast query– in this case, the Pfam seed sequences for the CRISPR RAMP proteins.-dbdefines your database– in this case, the toy assembly ORFs.-evaluedefines your maximum e-value, in this case 1x10-5-outfmtdefines the output format of your blast results. option 6 is common; you can check out https://www.ncbi.nlm.nih.gov/books/NBK279675/ for other options.-outdefines the name of your output file. I like to title mine with the general formatquery_vs_db.blastpor something similar.
Making a tree¶
First make an alignment with muscle, then make a tree with RAxML
First, use nano or some other text editing tool to make a fasta file with your sequences of interest. The following is an example of how to make a tree using amino acid (not nucleotide) sequences.
Example usage (replace names in brackets with your own names and remove brackets):
muscle -in [example_sequence_file.fasta] --out [example_sequence_file_aligned.afa]
raxmlHPC-PTHREADS-AVX -f a -# 20 -m PROTGAMMAAUTO -p 12345 -x 12345 -s [example_sequence_file_aligned.afa] -n [example_tree_name.tree] -T 4
Open the RAxML_bipartitions.example_tree_name.tree file in ITOL.
Mapping¶
Map short reads against a reference
Example usage (replace names in brackets with your own names and remove brackets):
bowtie2-build [reference fasta file] [reference_file.btindex]
bowtie2 -x [reference_file.btindex] -f -U [reads_to_map.fasta] -S [output_file.sam]
samtools view -bS [output_file.sam] > [output_file.bam]
samtools sort [output_file.bam] -o [output_file_sorted.bam]
samtools index [output_file_sorted.bam]
To get ORF coverage:
make_bed_file_from_gff_file_prokka.py [your gff file from prokka]
samtools bedcov [your bed file] [your sorted bam file] > [ an output file that ends in _ORF_coverage.txt]
mothur¶
Analyze 16S data to find taxonomy, OTUs
The protocol for week 6 laid out mothur step by step. The mothur wiki is another good source of information.
anvi’o¶
Make bins from metagenomic assemblies and BAM files
The protocol for week 7 laid out anvi’o step by step. The anvi’o tutorial is another good source of information.
Legacy Items¶
Prodigal¶
Finds ORFs in assembly file
Example usage:
mkdir ORF_finding
cd ORF_finding
prodigal -i ../toy_assembly/toy_dataset_assembly_subsample.fa -o toy_assembly_ORFs.gbk -a toy_assembly_ORFs.faa -p single
- The
-iflag gives the input file, which is the assembly you just made. - The
-oflag gives the output file in Genbank format - The ‘-a” flag gives the output file in fasta format
- The
-pflag states which procedure you’re using: whether this is a single genome or a metagenome. This toy dataset is > a single genome so we are using –p single, but for your project dataset, you will use –p meta.
Interproscan¶
ORF Fasta File -> Annotated TSV
Used to efficiently and effectively annotate proteins. Compares your open reading frames against several protein databases and looks for protein “signatures,” or regions that are highly conserved among proteins, and uses that to annotate your open reading frames. It will do this for every single open reading frame in your dataset, if it can find a match.
Example usage:
interproscan.sh -i toy_assembly_ORFs.noasterisks.faa -f tsv
- The
-iflag gives the input file, which is your file with ORF sequences identified by Prodigal, with the asterisks removed. - The
-fflag tells Interproscan that the format you want is a tab-separated vars, or “tsv,” file.