Preparing your Assembly
There are a few things you can do to your multi-FASTA assembly to get it “ready” for annotation. These steps include removing small repetitive contigs from the assembly, sorting contigs and renaming their headers so they do not cause problems during the prediction step, and repeatmasking your assembly (required).
Cleaning your Assembly
When working with haploid assemblies, you may want to remove repetitive contigs that are contained in other scaffolds of the assembly. If the contigs are indeed unique, we want to keep them in the assembly. Funannotate can help “clean” up repetitive contigs in your assembly. This is done with a “leave one out” methodology using minimap2 or mummer (nucmer), where the shortest contigs/scaffolds are aligned to the rest of the assembly to determine whether they are repetitive. The script loops through the contigs, starting with the shortest and working its way up to the N50 of the assembly, and drops contigs/scaffolds whose overlap with the rest of the assembly exceeds both the percent coverage of overlap (--cov) and the percent identity of overlap (--pident).
$ funannotate clean
Usage: funannotate clean <arguments>
version: 1.8.16
Description: The script sorts contigs by size, starting with shortest contigs it uses minimap2
to find contigs duplicated elsewhere, and then removes duplicated contigs.
Arguments:
-i, --input Multi-fasta genome file (Required)
-o, --out Cleaned multi-fasta output file (Required)
-p, --pident Percent identity of overlap. Default = 95
-c, --cov Percent coverage of overlap. Default = 95
-m, --minlen Minimum length of contig to keep. Default = 500
--exhaustive Test every contig. Default is to stop at N50 value.
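The shortest-first ordering described above can be illustrated with a small Python sketch. This is not funannotate's own code: the function names `n50` and `candidate_order` are hypothetical, and the exact handling of the N50 boundary and of --minlen is an assumption made for illustration. The sketch only decides which contigs would be tested and in what order; the alignment step itself (minimap2) is not reproduced.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def candidate_order(contig_lengths, minlen=500, exhaustive=False):
    """List contig names in the order they would be tested, shortest first.

    contig_lengths: dict mapping contig name -> length in bases.
    Assumptions (not confirmed by the source): the N50 is computed on all
    input contigs, contigs shorter than minlen are discarded without being
    tested, and without exhaustive only contigs shorter than the N50 are
    tested.
    """
    cutoff = n50(list(contig_lengths.values()))
    kept = {name: size for name, size in contig_lengths.items() if size >= minlen}
    # Sort by length, breaking ties by name so the order is deterministic.
    ordered = sorted(kept, key=lambda name: (kept[name], name))
    if exhaustive:
        return ordered
    return [name for name in ordered if kept[name] < cutoff]
```

For example, `candidate_order({"a": 1000, "b": 800, "c": 600, "d": 400})` would test only contig `c` in this sketch, while passing `exhaustive=True` would test every contig that passes the length filter.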
Sorting/Renaming FASTA Headers
NCBI limits FASTA headers to 16 characters for submission, and Augustus also has problems with longer contig/scaffold names. You can use this simple script to sort your assembly by length and then rename the FASTA headers.
$ funannotate sort
Usage: funannotate sort <arguments>
version: 1.8.16
Description: This script sorts the input contigs by size (longest->shortest) and then relabels
the contigs with a simple name (e.g. scaffold_1). Augustus can have problems with
some complicated contig names.
Arguments:
-i, --input Multi-fasta genome file. (Required)
-o, --out Sorted by size and relabeled output file. (Required)
-b, --base Base name to relabel contigs. Default: scaffold
--minlen Shorter contigs are discarded. Default: 0
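To make the effect of sorting and relabeling concrete, here is a minimal Python sketch of the same idea: order records longest to shortest, discard those below a minimum length, and rename them with a base name plus a counter. It is an illustration only, not the funannotate implementation; `read_fasta` and `sort_and_rename` are hypothetical helper names, and the parameters `base` and `minlen` simply mirror the options in the help text above.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples.

    Only the first whitespace-delimited word of each header line is kept.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sort_and_rename(records, base="scaffold", minlen=0):
    """Sort records longest to shortest and relabel them base_1, base_2, ...

    Records shorter than minlen are discarded before renaming.
    """
    kept = [(name, seq) for name, seq in records if len(seq) >= minlen]
    kept.sort(key=lambda rec: len(rec[1]), reverse=True)
    return [(f"{base}_{i}", seq) for i, (_, seq) in enumerate(kept, start=1)]
```

With the default base name, the new headers stay well under 16 characters for any realistic number of contigs, which is the point of relabeling before prediction and submission.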
RepeatMasking your Assembly
This is an essential step in the annotation process. As of v1.4.0, repeatmasking has been decoupled from funannotate predict in order to make it more flexible and to accommodate users who don’t have access to the RepBase library (a requirement of RepeatMasker). By default, the funannotate mask command runs simple masking using tantan. The script can also act as a wrapper for RepeatModeler and RepeatMasker; however, you can use any external program to softmask your assembly. Softmasking is where repeats are represented by lowercase letters and all non-repetitive regions are uppercase letters. One alternative to RepeatMasker is RED (REpeat Detector); a wrapper for this program is available as Redmask.
$ funannotate mask
Usage: funannotate mask <arguments>
version: 1.8.16
Description: This script is a wrapper for repeat masking. Default is to run very simple
repeat masking with tantan. The script can also run RepeatMasker and/or
RepeatModeler. It will generate a softmasked genome. Tantan is probably not
sufficient for soft-masking an assembly, but with RepBase no longer being
available RepeatMasker/Modeler may not be functional for many users.
Arguments:
-i, --input Multi-FASTA genome file. (Required)
-o, --out Output softmasked FASTA file. (Required)
Optional:
-m, --method Method to use. Default: tantan [repeatmasker, repeatmodeler]
-s, --repeatmasker_species Species to use for RepeatMasker
-l, --repeatmodeler_lib Custom repeat database (FASTA format)
--cpus Number of cpus to use. Default: 2
--debug Keep intermediate files
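Because softmasking is just a case convention, it is easy to inspect a masked assembly yourself. The Python sketch below shows what softmasking does to a sequence and how to measure the masked fraction. It is illustrative only: `softmask` and `softmasked_fraction` are hypothetical helpers, and the repeat intervals are supplied by hand rather than detected by tantan, RepeatMasker, or any other tool.

```python
def softmask(sequence, intervals):
    """Return the sequence with the given intervals written in lowercase.

    intervals: iterable of 0-based, half-open (start, end) pairs marking
    repeats. All other bases are written in uppercase.
    """
    bases = list(sequence.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def softmasked_fraction(sequence):
    """Return the fraction of bases that are lowercase (softmasked)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return masked / len(sequence)
```

Applying `softmasked_fraction` to each record of a masked FASTA file gives a quick sanity check that masking actually ran; a fraction of zero across the whole assembly suggests the file was never masked.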