Preprocessing
Use this tutorial to learn how to generate the files needed to train Terrier from scratch. You can use this to replicate the results found in the paper (still to be released).
Repbase
Terrier is trained using Repbase, a database of repetitive elements. Repbase is available by subscription from the Genetic Information Research Institute (GIRI) and is released under an academic user agreement, which is available on the GIRI site. Terrier was trained using the 29.07 release of Repbase (2024-07-24).
Download the Repbase database in FASTA format from https://www.girinst.org/server/RepBase/index.php
Untar the file and you will have a directory with a few dozen FASTA files with a .ref extension.
We will refer to this directory as REPBASE_DIR.
We will ignore the files in the archive directory.
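As a quick sanity check after extracting the archive, you can count the .ref files at the top level of REPBASE_DIR. The sketch below is not part of terrier-tools. The function name list_ref_files is hypothetical, and it assumes the archive directory is a subdirectory of REPBASE_DIR, so scanning only the top level leaves it out.

```python
from pathlib import Path


def list_ref_files(repbase_dir):
    """Return the top-level .ref files in repbase_dir, sorted by name.

    Hypothetical helper, not part of terrier-tools.
    """
    root = Path(repbase_dir)
    # glob (unlike rglob) does not descend into subdirectories,
    # so files inside an archive subdirectory are not returned.
    return sorted(p for p in root.glob("*.ref") if p.is_file())
```

Calling len(list_ref_files(path)) on your extracted directory should give a number in the range of "a few dozen" if the download and extraction worked.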
Preprocess
Terrier takes two input files, a SeqTree and a SeqBank. The SeqBank holds the sequence data for each accession in the dataset. The SeqTree records, for each accession, its cross-validation partition and the node in the taxonomic tree that it corresponds to.
These two files can be generated from the Repbase database using the terrier-tools CLI utility:
terrier-tools preprocess --input $REPBASE_DIR --seqbank $REPBASE_DIR/Repbase-seqbank.sb --seqtree $REPBASE_DIR/Repbase-seqtree.st
This will create a SeqBank file called Repbase-seqbank.sb and a SeqTree file called Repbase-seqtree.st and place them in $REPBASE_DIR.
The SeqTree file will have five cross-validation partitions and a taxonomic tree using the RepeatMasker schema.
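To illustrate what the cross-validation partitions are for, here is a conceptual sketch of how a partition label is typically used: one partition is held out for validation and the rest are used for training. This is not the SeqTree API or file format. The Record class, the split function, and the 0-based partition numbering are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for one SeqTree entry (not Terrier's real type)."""

    accession: str
    partition: int  # assumed to be numbered 0..4 when there are five partitions
    node: str  # taxonomic node, e.g. "LTR/Gypsy"


def split(records, validation_partition):
    """Hold out one partition for validation and train on the others."""
    train = [r for r in records if r.partition != validation_partition]
    validation = [r for r in records if r.partition == validation_partition]
    return train, validation
```

Repeating the split once per partition gives one train/validation pair for each round of cross-validation.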
To create a different number of partitions, run the command with the --partitions flag. For more options see the help:
terrier-tools preprocess --help
Now you are ready to train Terrier using the SeqBank and SeqTree files you have created.
Optional: Display the SeqTree
You can list the number of sequences for each node in the SeqTree file with this command:
seqtree render $REPBASE_DIR/Repbase-seqtree.st --print --count
That will output a tree with the number of sequences like this:
root
├── SINE (107)
│ ├── 7SL (192)
│ ├── tRNA (2312)
│ ├── 5S (35)
│ └── U (17)
├── LTR (1169)
│ ├── ERV (9127)
│ ├── Gypsy (28119)
│ ├── DIRS (1332)
│ ├── Copia (10619)
│ ├── Pao (5320)
│ └── Caulimovirus (207)
├── LINE (446)
│ ├── L1 (4833)
│ ├── R2 (336)
│ ├── CR1 (1237)
│ ├── I (1093)
│ ├── R1 (410)
│ ├── Rex-Babar (141)
│ ├── Dong-R4 (58)
│ ├── L2 (889)
│ ├── RTE (1065)
│ ├── Proto2 (69)
│ ├── CRE (99)
│ ├── Dualen (13)
│ ├── Proto1 (10)
│ └── Tad1 (550)
├── DNA (2404)
│ ├── hAT (6083)
│ ├── PiggyBac (627)
│ ├── TcMar (4884)
│ ├── Kolobok (857)
│ ├── MULE (2531)
│ ├── CMC (1667)
│ ├── Merlin (151)
│ ├── Maverick (237)
│ ├── P (320)
│ ├── Harbinger (2381)
│ ├── Dada (170)
│ ├── Crypton (294)
│ ├── Ginger (91)
│ ├── Academ (526)
│ ├── Zator (102)
│ ├── IS3EU (79)
│ ├── Zisupton (44)
│ ├── Sola (385)
│ └── Novosib (9)
├── Satellite (741)
├── RC
│ └── Helitron (1999)
├── Structural_RNA (86)
├── PLE (1045)
└── Other (11)
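The counts in this output appear to be per node rather than cumulative (SINE shows 107 while its child tRNA shows 2312), so summing them gives the total number of sequences. The sketch below tallies them from the printed text. It is not part of seqtree. It assumes you have captured the output as a string (for example by saving it to a file), that node names contain no spaces, and that each name appears only once. The names NODE_PATTERN and node_counts are hypothetical.

```python
import re

# Matches the trailing "name (count)" on a rendered line, e.g. "│ ├── 7SL (192)"
NODE_PATTERN = re.compile(r"(\S+) \((\d+)\)\s*$")


def node_counts(rendered):
    """Map each node name to the count printed beside it."""
    counts = {}
    for line in rendered.splitlines():
        match = NODE_PATTERN.search(line)
        if match:  # lines without a count, such as "root" or "RC", are skipped
            counts[match.group(1)] = int(match.group(2))
    return counts


sample = "root\n├── SINE (107)\n│ └── U (17)\n└── Other (11)"
print(sum(node_counts(sample).values()))  # 135
```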
You can also display the SeqTree in a Sunburst chart by running:
seqtree sunburst $REPBASE_DIR/Repbase-seqtree.st --show --output $REPBASE_DIR/Repbase-seqtree.html
This will create an HTML file containing a Sunburst chart of the SeqTree.
You can open the HTML file in a browser to view the chart.
You can also save the chart as a .png, .svg, or .pdf file by changing the extension of the output file.
Custom Datasets
You can create a custom repeat library in FASTA format, with the classification of each sequence appended to its ID after a # character, like this:
>SeqID#DNA/Academ
ACTGACTGACTG...
Or with the classification separated with a tab character like this:
>SeqID LTR/Caulimovirus
ACTGACTGACTG...
Then preprocess like this:
terrier-tools preprocess --input custom.fasta --seqbank custom-seqbank.sb --seqtree custom-seqtree.st
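The two header conventions shown above can be sketched as a small parser, which may be useful for checking a custom library before preprocessing it. This is an illustration of the format only, not the parsing code used by terrier-tools. The function name parse_header is hypothetical.

```python
def parse_header(header):
    """Split a FASTA header into (sequence ID, classification).

    Accepts '>SeqID#Class/Family' or '>SeqID<TAB>Class/Family'.
    Hypothetical helper, not part of terrier-tools.
    """
    text = header.lstrip(">").rstrip("\n")
    for separator in ("#", "\t"):
        if separator in text:
            seq_id, classification = text.split(separator, 1)
            return seq_id.strip(), classification.strip()
    raise ValueError(f"No classification found in header: {header!r}")


print(parse_header(">SeqID#DNA/Academ"))  # ('SeqID', 'DNA/Academ')
print(parse_header(">SeqID\tLTR/Caulimovirus"))  # ('SeqID', 'LTR/Caulimovirus')
```

Headers that carry no classification raise a ValueError, which makes missing labels easy to spot.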
Warning
Terrier will only be able to classify sequences into classes found here: https://github.com/rbturnbull/terrier/blob/main/terrier/data/repbase-to-repeatmasker.toml
You can include Repbase with your custom dataset like this:
terrier-tools preprocess --input $REPBASE_DIR --input custom.fasta --seqbank combined-seqbank.sb --seqtree combined-seqtree.st
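Given the warning above about supported classes, it can help to scan a custom FASTA file for classifications outside the set you expect before preprocessing. The sketch below assumes you supply the supported set yourself (for example, copied by hand from the TOML file linked in the warning, whose structure is not described here). The function name unsupported_classifications is hypothetical and is not part of terrier-tools.

```python
def unsupported_classifications(fasta_path, supported):
    """Return the classifications in a FASTA file that are not in `supported`."""
    found = set()
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # skip sequence lines
            header = line[1:].strip()
            # The classification follows '#' or a tab in the header.
            for separator in ("#", "\t"):
                if separator in header:
                    found.add(header.split(separator, 1)[1].strip())
                    break
    return sorted(found - set(supported))
```

An empty list means every labelled sequence in the file uses a classification from your supported set.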