Genome annotation

Structural genome annotation

Main steps:

QC assembly -> structural annotation -> manual curation -> functional annotation -> Submission or Downstream analysis

Approaches for annotations:

Popular tools:

Supported by the maker tool:

  • SNAP - Works OK and is easy to train, but is less accurate than the others, especially on genomes with long introns.
  • Augustus - Works great, hard to train (but getting better)
  • GeneMark-ES - Self-training, no hints, buggy, not good for fragmented genomes or long introns (best suited for fungi).
  • FGENESH - Works great, costs money even for training.
  • GlimmerHMM (Eukaryote)
  • GenScan
  • Gnomon (NCBI)

PIPELINES:

PASA

  • Produces evidence-driven consensus gene models
  • (-) minimalist pipeline
  • (+) good for detecting isoforms
  • (+) biologically relevant predictions
  • combined with ab initio tools and EVM it does a pretty good job
  • the PASA + ab initio + EVM combination is not automated

NCBI pipeline

  • best one yet, but difficult to install
  • NCBI staff will help you if you ask them
  • Evidence + ab initio (Gnomon), repeat masking, gene naming, data formatting, miRNAs, tRNAs

Ensembl

  • Evidence based only (comparative + homology)

MAKER2

  • Evidence based and/or ab initio
  • developed as an easy-to-use alternative to other pipelines
  • Easy to use and to configure
  • Almost unlimited parallelism built-in (limited by data and hardware)
  • Largely independent from the underlying system it is run on
  • Everything is run through one command, no manual combining of data/outputs
  • Follows common standards, produces GMOD compliant output
  • Annotation Edit Distance (AED) metric for improved quality control
  • Provides a mechanism to train and retrain ab initio gene predictors
  • Annotations can be updated by re-launching Maker with new evidence
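
As an illustration of how the AED metric can be used for quality control: MAKER stores the AED of each transcript as an `_AED=` attribute on the mRNA lines of its GFF3 output (0 means the model agrees fully with the evidence, 1 means no evidence support). The sketch below counts the mRNA features at or below a chosen cutoff. The function name `count_mrna_by_aed` and the cutoff of 0.5 are assumptions made for this example, not part of MAKER.

```shell
# Count mRNA features in a MAKER GFF3 whose _AED attribute is <= a cutoff.
# count_mrna_by_aed is a hypothetical helper, not a MAKER command.
count_mrna_by_aed() {
  aed_gff="$1"
  aed_cutoff="$2"
  awk -F'\t' -v cutoff="$aed_cutoff" '
    $3 == "mRNA" {
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        # "_AED=" is 5 characters long, so the value starts at position 6
        if (attrs[i] ~ /^_AED=/) {
          aed = substr(attrs[i], 6) + 0
          if (aed <= cutoff + 0) kept++
        }
      }
    }
    END { print kept + 0 }
  ' "$aed_gff"
}

# Example call (the path is the one used later in these notes):
#   count_mrna_by_aed maker_with_abinitio/annotationByType/maker.gff 0.5
```

Comparing the count at a strict cutoff with the total number of mRNAs gives a quick impression of how well the gene models are supported.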

Genome annotation with augustus - practical

1. BUSCO for Assembly check

wget http://busco.ezlab.org/datasets/metazoa_odb9.tar.gz
tar xzvf metazoa_odb9.tar.gz
BUSCO.py -i ~/annotation_course/data/genome/4.fa -o 4_dmel_busco -m geno -c 8 -l metazoa_odb9
fasta_statisticsAndPlot.pl -f ~/annotation_course/data/genome/4.fa
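
As a rough cross-check of the assembly statistics, the number of sequences, the total length and the N50 can also be computed with standard tools. This is only a sketch: `fasta_n50` is a hypothetical helper written for these notes, not the GAAS script used above, and it ignores empty records.

```shell
# Print "number_of_sequences total_bases N50" for a FASTA file.
# fasta_n50 is a hypothetical helper, not part of BUSCO or GAAS.
fasta_n50() {
  awk '
    # A header line closes the previous record: emit its length.
    /^>/ { if (len > 0) print len; len = 0; next }
    # Sequence line: drop whitespace and add to the current record.
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END { if (len > 0) print len }
  ' "$1" | sort -nr | awk '
    { lens[NR] = $1; total += $1 }
    END {
      running = 0
      # N50: length of the sequence at which the running sum of the
      # longest sequences first reaches half of the total length.
      for (i = 1; i <= NR; i++) {
        running += lens[i]
        if (running * 2 >= total) { print NR, total, lens[i]; exit }
      }
    }
  '
}

# Example call:
#   fasta_n50 ~/annotation_course/data/genome/4.fa
```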

2. Augustus, an ab initio gene finder

Call Augustus with a genome assembly and redirect its output to a file; with --gff3=yes the annotation is written in GFF3 format.

augustus --species=fly ~/annotation_course/data/genome/4.fa --gff3=yes > augustus_drosophila.gff
#to get additional isoforms use this:
augustus --species=fly ~/annotation_course/data/genome/4.fa --gff3=yes --alternatives-from-sampling=true > augustus_drosophila_isoform.gff
gff3_sp_statistics.pl --gff augustus_drosophila.gff
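
For a quick look at what a GFF3 file contains, without the GAAS script, the feature types in column 3 can be tallied directly. A minimal sketch, assuming a tab-separated nine-column GFF3; `gff_feature_counts` is a hypothetical name chosen for this example.

```shell
# Tally GFF3 features by type (column 3), skipping comment lines.
# gff_feature_counts is a hypothetical helper, not part of Augustus or GAAS.
gff_feature_counts() {
  awk -F'\t' '
    !/^#/ && NF >= 9 { count[$3]++ }
    END { for (t in count) print t "\t" count[t] }
  ' "$1" | LC_ALL=C sort
}

# Example call:
#   gff_feature_counts augustus_drosophila.gff
```

Running it on both Augustus outputs shows how many extra transcripts the --alternatives-from-sampling option adds.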

3. Visualisation

maker gene build pipeline - practical

MAKER is a computational pipeline to automatically generate annotations from a range of input data. The Maker pipeline can work with any combination of the following data sets, which are put into the maker_opts.ctl:

1. Start maker

2.1 Evidence-based annotation

We need to prepare various files and add their paths to the maker_opts.ctl file (use commas to separate multiple files):

  • name of the genome sequence (genome=)
  • name of the 'EST' set file(s) (est=)
  • name of the 'Protein' set file(s) (protein=)
  • name of the repeatmasker and repeatrunner files (rm_gff=)
  • build gene models directly from the evidence, without ab initio predictors, by setting protein2genome=1 and est2genome=1
  • leave model_org= and repeat_protein= blank so that MAKER skips the time-consuming RepeatMasker and RepeatRunner runs

mpiexec -n 8 maker # -n is the number of cores (MPI processes) to use
# and compile the output with:
maker_merge_outputs_from_datastore.pl --output maker_no_abinitio
gff3_sp_statistics.pl --gff maker_no_abinitio/annotationByType/maker.gff
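
The control-file settings listed above can be edited by hand, or scripted so that a run is reproducible. The sketch below rewrites a single `key=value` line in a control file and keeps the trailing `#comment`. `set_ctl_opt` is a hypothetical helper, and the file names in the example calls are placeholders, not the course's actual evidence files.

```shell
# Set KEY=VALUE in a MAKER control file, keeping any trailing "#comment".
# set_ctl_opt is a hypothetical helper, not shipped with MAKER.
set_ctl_opt() {
  ctl_file="$1"
  ctl_key="$2"
  ctl_value="$3"
  ctl_new="$(mktemp)"
  awk -v key="$ctl_key" -v value="$ctl_value" '
    # Only touch the line that starts with exactly "key=".
    index($0, key "=") == 1 {
      comment = ""
      hash = index($0, "#")
      if (hash > 0) comment = " " substr($0, hash)
      print key "=" value comment
      next
    }
    { print }
  ' "$ctl_file" > "$ctl_new" && mv "$ctl_new" "$ctl_file"
}

# Example calls for the evidence-based run (placeholder file names):
#   set_ctl_opt maker_opts.ctl genome 4.fa
#   set_ctl_opt maker_opts.ctl est2genome 1
#   set_ctl_opt maker_opts.ctl protein2genome 1
#   set_ctl_opt maker_opts.ctl model_org ""
#   set_ctl_opt maker_opts.ctl repeat_protein ""
```

Matching on the full `key=` prefix keeps `est=` from being confused with `est2genome=`, and `protein=` with `protein2genome=`.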

2.2 Run Maker with ab-initio predictions

mpiexec -n 8 maker
#compiling
maker_merge_outputs_from_datastore.pl --output maker_with_abinitio
#check statistics with this GAAS script
gff3_sp_statistics.pl --gff maker_with_abinitio/annotationByType/maker.gff

Summary: We created two maker.gff files, one in the maker_no_abinitio/ directory and one in the maker_with_abinitio/ directory. Step 2.2 made the gene predictions less fragmented. For the best results, go all the way to step 2.2.

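For manual curation of intron/exon boundaries, look for the canonical GT/AG dinucleotides at the two ends of each intron (the second line is the same intron read in the opposite direction):

  5'-3'  [EXON]++GT--intron--AG++[EXON]
  3'-5'  [EXON]++GA--intron--TG++[EXON]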
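
The GT/AG rule can also be checked on an extracted intron sequence. The sketch below classifies a sequence by its first and last two bases. One addition that is not in the notes above: when a minus-strand gene is viewed on the forward strand of a genome browser, the same intron appears as its reverse complement, CT...AC. `classify_intron` is a hypothetical helper written for this example.

```shell
# Classify an intron sequence by its boundary dinucleotides.
# Prints "+" for GT...AG (canonical, read 5'->3' on the gene's strand),
# "-" for CT...AC (a canonical intron seen as its reverse complement),
# and "?" for anything else.
# classify_intron is a hypothetical helper, not part of any tool above.
classify_intron() {
  intron="$(printf '%s' "$1" | tr 'acgtn' 'ACGTN')"
  case "$intron" in
    GT*AG) echo "+" ;;
    CT*AC) echo "-" ;;
    *)     echo "?" ;;
  esac
}

# Example calls:
#   classify_intron GTAAGTTTTCAG   # canonical intron
#   classify_intron ctgaaaacttac   # reverse complement of the line above
```

A "?" result does not prove that a model is wrong, since non-canonical splice sites exist, but it marks a boundary worth a closer look.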

Functional genome annotation

Computationally:

Sequence based (the most commonly used approach)

  • based on similarity/motif/profile
  • orthology based on evolutionary relationship
  • Clustering with KOG/COG
  • Synteny: Satsuma + kraken + custom script
  • phylogeny based

Structure based

  • global structure comparison
  • localized regions
  • active site residues
  • Protein-Protein Interaction data

BLAST-based functional genome annotation

Database | Information | Comment
KEGG | Pathway | Kyoto Encyclopedia of Genes and Genomes
MetaCyc | Pathway | Curated database of experimentally elucidated metabolic pathways from all domains of life (NIH)
Reactome | Pathway | Curated and peer reviewed pathway database
UniPathway | Pathway | Manually curated resource of enzyme-catalyzed and spontaneous chemical reactions.
GO | Gene Ontology | Three structured, controlled vocabularies (ontologies): biological processes, cellular components and molecular functions
Pfam | Protein families | Multiple sequence alignments and hidden Markov models
Interpro | P. fam., domains & functional sites | Run separate search applications, and create a signature to search against Interpro.

Tool | Approach | Comment
Trinotate | Best blast hit, protein domain identification (HMMER/PFAM), protein signal peptide and transmembrane domain prediction (signalP/tmHMM), and leveraging various annotation databases (eggNOG/GO/Kegg databases). | Not automated
Annocript | Best blast hit | Collects the best-hit and related annotations (proteins, domains, GO terms, Enzymes, pathways, short)
Annot8r | Best blast hits | A tool for Gene Ontology, KEGG biochemical pathways and Enzyme Commission EC number annotation of nucleotide and peptide sequences.
Sma3s | Best blast hit, Best reciprocal blast hit, clusterisation | 3 annotation levels
afterParty | BLAST, InterProScan | web application
Interproscan | Separate search applications, HMMs, fingerprints, patterns of InterPro | Created to unite secondary databases
Blast2Go | get best blast hits | Retrieve only GO, Commercial!

Interproscan approach - practical

gff3_sp_extract_sequences.pl --gff maker_with_abinitio.gff -f 4.fa -p --cfs -o AA.fa
#takes 2-3 seconds per protein
interproscan.sh -i AA.fa -t p -dp -pa -appl Pfam,ProDom-2006.1,SuperFamily-1.75 --goterms --iprlookup
ipr_update_gff maker_with_abinitio.gff AA.fa.tsv > maker_with_abinitio_with_interpro.gff
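
Before merging the results into the GFF, it can help to see how many proteins received a hit at all. The sketch below summarises an InterProScan TSV file, assuming the standard tab-separated layout (protein ID in column 1, analysis in column 4, GO terms in column 14 when --goterms is used). `ipr_summary` is a hypothetical helper, not part of InterProScan.

```shell
# Summarise an InterProScan TSV: distinct proteins per analysis, plus the
# number of proteins with at least one GO term.
# ipr_summary is a hypothetical helper, not part of InterProScan.
ipr_summary() {
  awk -F'\t' '
    {
      # Count each protein once per analysis (Pfam, SUPERFAMILY, ...).
      key = $4 SUBSEP $1
      if (!(key in seen)) { seen[key] = 1; per_analysis[$4]++ }
    }
    # Column 14 holds GO terms; it is empty or "-" when there are none.
    NF >= 14 && $14 != "" && $14 != "-" { with_go[$1] = 1 }
    END {
      for (a in per_analysis) print a "\t" per_analysis[a]
      n = 0
      for (p in with_go) n++
      print "proteins_with_GO\t" n
    }
  ' "$1" | LC_ALL=C sort
}

# Example call:
#   ipr_summary AA.fa.tsv
```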

BLAST approach - practical

blastp -db ~/annotation_course/data/blastdb/uniprot_dmel/uniprot_dmel.fa -query AA.fa -outfmt 6 -out blast.out -num_threads 8
git clone https://github.com/genomeannotation/Annie.git
# This is the annie.py script
Annie/annie.py -b blast.out -db ~/annotation_course/data/blastdb/uniprot_dmel/uniprot_dmel.fa -g maker_with_abinitio.gff -o annotation_blast.annie
maker_gff3manager_JD_v8.pl -f maker_with_abinitio_with_interpro.gff -b annotation_blast.annie --ID FLY -o finalOutputDir
/home/student/.local/GAAS/annotation/WebApollo/gff3_webApollo_compliant.pl --gff finalOutputDir/codingGeneFeatures.gff -o final_annotation.gff
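
To inspect the BLAST results independently of Annie, the best hit per query can be pulled out of the tabular output. A minimal sketch, assuming the default twelve columns of -outfmt 6 (query ID in column 1, bit score in column 12). `best_blast_hits` is a hypothetical helper and makes no claim about how Annie itself selects hits.

```shell
# Keep the highest-scoring hit (by bit score, column 12) for each query
# in a BLAST tabular (-outfmt 6) file.
# best_blast_hits is a hypothetical helper, not part of BLAST+ or Annie.
best_blast_hits() {
  tab="$(printf '\t')"
  # Sort by query ID, then by bit score from high to low; awk then keeps
  # the first line it sees for each query.
  LC_ALL=C sort -t "$tab" -k1,1 -k12,12nr "$1" | awk -F'\t' '!seen[$1]++'
}

# Example call:
#   best_blast_hits blast.out | cut -f1,2,11,12
```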

Submitting to EBI using a tool

To submit to EBI, a tool like EMBLmyGFF3 is your best choice.

Let's prepare your annotation for submission to ENA (EBI). You need to create an account and a project, and request a locus_tag for your annotation. You also have to fill in a lot of metadata about the assembly and so on. We will skip those tasks and use fake information. First, download and install EMBLmyGFF3:

pip install --user git+https://github.com/NBISweden/EMBLmyGFF3.git
EMBLmyGFF3 finalOutputDir/codingGeneFeatures.gff 4.fa -o my_annotation_ready_to_submit.embl

You now have an EMBL flat file ready to submit.
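
A quick sanity check before submission is to count the records and features in the flat file. The sketch below relies only on the general EMBL flat-file layout, in which each record starts with an "ID" line and feature keys follow an "FT" prefix. `embl_summary` is a hypothetical helper, not part of EMBLmyGFF3, and it does not replace ENA's own validation.

```shell
# Count records, gene features and CDS features in an EMBL flat file.
# embl_summary is a hypothetical helper, not part of EMBLmyGFF3.
embl_summary() {
  awk '
    /^ID   /      { records++ }
    /^FT   gene / { genes++ }
    /^FT   CDS /  { cds++ }
    END { print "records=" records + 0, "genes=" genes + 0, "CDS=" cds + 0 }
  ' "$1"
}

# Example call:
#   embl_summary my_annotation_ready_to_submit.embl
```

The counts should roughly match those reported by gff3_sp_statistics.pl for the GFF file that was converted.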