Functional annotation

After your genome has gone through the gene prediction module and you have gene models that pass NCBI specs, the next step is to add functional annotation to the protein-coding genes. Funannotate accomplishes this using several curated databases and is run using the funannotate annotate command.

Funannotate will parse the protein-coding models from the annotation and identify Pfam domains, CAZymes, secreted proteins, proteases (MEROPS), and BUSCO groups. If you provide the script with InterProScan5 data via --iprscan, funannotate will also generate additional annotation: InterPro terms, GO terms, and fungal transcription factors. If Eggnog-mapper is installed locally or you pass Eggnog-mapper results via --eggnog, then Eggnog annotations and COGs will be added to the functional annotation. The script also parses the UniProtKB/SwissProt search results, optionally together with the Eggnog-mapper results, to generate gene names and product descriptions.

InterProScan5 and Eggnog-mapper are two functional annotation pipelines whose results funannotate can parse; however, due to the large database sizes, they are not run directly. The exception is Eggnog-mapper: if emapper.py is installed, it will be run automatically during the functional annotation process. Because InterProScan5 is Linux only, it must be run outside funannotate and the results passed to the script. If you are on Mac, I’ve included a method to run InterProScan5 using Docker, and the funannotate predict output will tell you how to run this script. Alternatively, you can run the InterProScan5 search remotely using the funannotate remote command.
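As a rough illustration of passing pre-computed results, the sketch below assembles a funannotate annotate command line and adds --iprscan and --eggnog only when the corresponding file exists. The function name build_annotate_cmd and all file and folder names are hypothetical; only the flags come from the usage text of funannotate annotate. The command is printed for review instead of being executed.

```shell
# Sketch: build a `funannotate annotate` command line, adding optional
# pre-computed inputs only when the file is actually present.
# Assumption: paths contain no spaces (the result is a plain string).
build_annotate_cmd() {
    # $1 = folder from funannotate predict
    # $2 = InterProScan5 XML file (optional)
    # $3 = Eggnog-mapper annotations file (optional)
    cmd="funannotate annotate -i $1"
    if [ -f "$2" ]; then cmd="$cmd --iprscan $2"; fi
    if [ -f "$3" ]; then cmd="$cmd --eggnog $3"; fi
    printf '%s\n' "$cmd"
}

# Hypothetical names; substitute your own folder and result files.
build_annotate_cmd fun_out iprscan.xml genome.emapper.annotations
```

Once the printed command looks right, it can be copied to the terminal and run as is.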

Phobius and SignalP will be run automatically if they are installed (i.e. in the PATH); however, Phobius will not run on Mac. If you are on Mac, you can run Phobius with the funannotate remote script.
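To see in advance which optional tools would be picked up, you can check whether their executables are in the PATH. The following is a minimal sketch: emapper.py is the name given above, while phobius.pl and signalp are assumed executable names that may differ on your installation, and check_tool is a hypothetical helper.

```shell
# Sketch: report whether an executable can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

# phobius.pl and signalp are assumed names; adjust them to your install.
for tool in emapper.py phobius.pl signalp; do
    check_tool "$tool"
done
```

Any tool reported as not found needs its results supplied some other way, for example via --eggnog, --phobius, or --signalp.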

If you are annotating a fungal genome, you can run secondary metabolite gene cluster prediction using antiSMASH. This can be done on the webserver by submitting your GBK file from predict (predict_results/yourGenome.gbk), or alternatively you can submit from the command line using funannotate remote. Of course, if you are on Linux you can install antiSMASH locally and run it that way as well. The annotated GBK file is then fed back to funannotate annotate with the --antismash option.

Similarly to funannotate predict, the output from funannotate annotate will be written to the output/annotate_results folder. The output files are:

File Name	Description
Basename.gbk	Annotated Genome in GenBank Flat File format
Basename.contigs.fsa	Multi-fasta file of contigs, split at gaps (use for NCBI submission)
Basename.agp	AGP file; showing linkage/location of contigs (use for NCBI submission)
Basename.tbl	NCBI tbl annotation file (use for NCBI submission)
Basename.sqn	NCBI Sequin genome file (use for NCBI submission)
Basename.scaffolds.fa	Multi-fasta file of scaffolds
Basename.proteins.fa	Multi-fasta file of protein coding genes
Basename.transcripts.fa	Multi-fasta file of transcripts (mRNA)
Basename.discrepency.report.txt	tbl2asn summary report of annotated genome
Basename.annotations.txt	TSV file of all annotations added to genome (e.g. import into Excel)
Gene2Products.must-fix.txt	TSV file of Gene Name/Product deflines that failed to pass tbl2asn checks and must be fixed
Gene2Products.need-curating.txt	TSV file of Gene Name/Product deflines that need to be curated
Gene2Products.new-names-passed.txt	TSV file of Gene Name/Product deflines that passed tbl2asn but are not in Gene2Products database. Please submit a PR with these.
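
The TSV outputs can also be inspected from the shell without a spreadsheet program. The helpers below are a generic sketch for any tab-separated file with a single header line, which is an assumption about the layout of Basename.annotations.txt; list_columns, count_records, and the example path are hypothetical.

```shell
# Sketch: quick look at a tab-separated file with one header line.
list_columns() {
    # Print "index<TAB>column name" for every field in the header.
    head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

count_records() {
    # Count the lines after the header.
    tail -n +2 "$1" | wc -l | tr -d ' '
}

# Hypothetical path; replace Basename with your own output prefix.
tsv="output/annotate_results/Basename.annotations.txt"
if [ -f "$tsv" ]; then
    list_columns "$tsv"
    echo "records: $(count_records "$tsv")"
fi
```

The column indices printed by list_columns can then be used with cut -f to pull out individual columns.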

      $ funannotate annotate

Usage:       funannotate annotate <arguments>
version:     1.8.16

Description: Script functionally annotates the results from funannotate predict.  It pulls
             annotation from PFAM, InterPro, EggNog, UniProtKB, MEROPS, CAZyme, and GO ontology.

Required:
  -i, --input          Folder from funannotate predict
    or
  --genbank            Genome in GenBank format
  -o, --out            Output folder for results
    or
  --gff                Genome GFF3 annotation file
  --fasta              Genome in multi-fasta format
  -s, --species        Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
  -o, --out            Output folder for results

Optional:
  --sbt                NCBI submission template file. (Recommended)
  -a, --annotations    Custom annotations (3 column tsv file)
  -m, --mito-pass-thru Mitochondrial genome/contigs. append with :mcode
  --eggnog             Eggnog-mapper annotations file (if NOT installed)
  --antismash          antiSMASH secondary metabolism results (GBK file from output)
  --iprscan            InterProScan5 XML file
  --phobius            Phobius pre-computed results (if phobius NOT installed)
  --signalp            SignalP pre-computed results (-org euk -format short)
  --isolate            Isolate name
  --strain             Strain name
  --rename             Rename GFF gene models with locus_tag from NCBI.
  --fix                Gene/Product names fixed (TSV: GeneID  Name    Product)
  --remove             Gene/Product names to remove (TSV: Gene        Product)
  --busco_db           BUSCO models. Default: dikarya
  -t, --tbl2asn        Additional parameters for tbl2asn. Default: "-l paired-ends"
  -d, --database       Path to funannotate database. Default: $FUNANNOTATE_DB
  --force              Force over-write of output folder
  --cpus               Number of CPUs to use. Default: 2
  --tmpdir             Volume/location to write temporary files. Default: /tmp
  --p2g                protein2genome pre-computed results
  --header_length      Maximum length of FASTA headers. Default: 16
  --no-progress        Do not print progress to stdout for long sub jobs
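
The Required section above lists three alternative ways to supply input, separated by "or". As a reading aid, the sketch below prints one example command per input mode. This is an interpretation of the usage text, not output from funannotate; show_input_modes is a hypothetical helper, and every file and folder name is a placeholder.

```shell
# Sketch: one example command line per input mode from the usage text.
# Nothing is executed; the commands are only printed.
show_input_modes() {
    cat <<'EOF'
funannotate annotate -i fun_out --cpus 2
funannotate annotate --genbank genome.gbk -o annotate_out --cpus 2
funannotate annotate --gff genome.gff3 --fasta genome.fasta -s "Aspergillus fumigatus" -o annotate_out --cpus 2
EOF
}

show_input_modes
```

The first line uses a folder from funannotate predict, the second a GenBank file plus an output folder, and the third a GFF3 file with its FASTA, a species name, and an output folder.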