Preprocessing
Use this tutorial to learn how to generate the files needed to train Terrier from scratch. You can use this to replicate the results found in the paper (still to be released).
Repbase
Terrier is trained using Repbase, a database of repetitive elements. Repbase is available with a subscription from the Genetic Information Research Institute (GIRI) and is released under an academic user agreement, which is available on the GIRI site. Terrier was trained using the 29.07 release of Repbase (2024-07-24).
Download the Repbase database in FASTA format from https://www.girinst.org/server/RepBase/index.php
Untar the file and you will have a directory with a few dozen FASTA files with a .ref extension.
We will refer to this directory as REPBASE_DIR.
We will ignore the files in the archive directory.
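As a quick sanity check after extraction, the sketch below counts the .ref files that sit directly inside REPBASE_DIR. It assumes the archive directory is a subdirectory of REPBASE_DIR, so limiting the search depth leaves it out. The helper name list_ref_files and the example path are illustrative and not part of Terrier.

```shell
# Hypothetical helper: list the .ref FASTA files directly inside a Repbase directory.
# -maxdepth 1 keeps the search out of subdirectories such as archive/.
list_ref_files() {
    find "$1" -maxdepth 1 -type f -name '*.ref' | sort
}

# Example usage (the path is a placeholder for wherever you untarred Repbase):
#   export REPBASE_DIR=/path/to/RepBase
#   list_ref_files "$REPBASE_DIR" | wc -l
```

If the count is far from "a few dozen", the archive may not have been extracted where you expected.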
Preprocess
Terrier takes two input files, a SeqTree and a SeqBank. The SeqBank holds the sequence data for each accession in the dataset. The SeqTree records, for each accession, its cross-validation partition and the node in the taxonomic tree that it corresponds to.
These two files can be generated from the Repbase database using the terrier-tools CLI utility:
terrier-tools preprocess --repbase $REPBASE_DIR --seqbank $REPBASE_DIR/Repbase-seqbank.sb --seqtree $REPBASE_DIR/Repbase-seqtree.st
This will create a SeqBank file called Repbase-seqbank.sb and a SeqTree file called Repbase-seqtree.st and place them in $REPBASE_DIR.
The SeqTree file will have five cross-validation partitions and a taxonomic tree using the RepeatMasker schema.
To create a different number of partitions, run the command with the --partitions flag. For more options, see the help:
terrier-tools preprocess --help
Now you are ready to train Terrier using the SeqBank and SeqTree files you have created.
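Before starting a training run, it can be worth confirming that both files were actually written. The following is a minimal sketch, assuming the default file names used in the command above. The helper name check_preprocess_outputs is hypothetical and not part of terrier-tools.

```shell
# Hypothetical helper: confirm that both preprocess outputs exist and are non-empty.
# Takes the directory that was passed as $REPBASE_DIR to the preprocess command.
check_preprocess_outputs() {
    for f in "$1/Repbase-seqbank.sb" "$1/Repbase-seqtree.st"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    echo "both files present"
}

# Example usage:
#   check_preprocess_outputs "$REPBASE_DIR"
```

The function returns a non-zero status on the first missing file, so it can be chained with && in front of a training command.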
Optional: Display the SeqTree
You can list the number of accessions for each node in the SeqTree file with this command:
seqtree render $REPBASE_DIR/Repbase-seqtree.st --print --count
That will output a tree with the number of accessions like this:
root
├── SINE (107)
│ ├── 7SL (192)
│ ├── tRNA (2312)
│ ├── 5S (35)
│ └── U (17)
├── LTR (1169)
│ ├── ERV (9127)
│ ├── Gypsy (28119)
│ ├── DIRS (1332)
│ ├── Copia (10619)
│ ├── Pao (5320)
│ └── Caulimovirus (207)
├── LINE (446)
│ ├── L1 (4833)
│ ├── R2 (336)
│ ├── CR1 (1237)
│ ├── I (1093)
│ ├── R1 (410)
│ ├── Rex-Babar (141)
│ ├── Dong-R4 (58)
│ ├── L2 (889)
│ ├── RTE (1065)
│ ├── Proto2 (69)
│ ├── CRE (99)
│ ├── Dualen (13)
│ ├── Proto1 (10)
│ └── Tad1 (550)
├── DNA (2404)
│ ├── hAT (6083)
│ ├── PiggyBac (627)
│ ├── TcMar (4884)
│ ├── Kolobok (857)
│ ├── MULE (2531)
│ ├── CMC (1667)
│ ├── Merlin (151)
│ ├── Maverick (237)
│ ├── P (320)
│ ├── Harbinger (2381)
│ ├── Dada (170)
│ ├── Crypton (294)
│ ├── Ginger (91)
│ ├── Academ (526)
│ ├── Zator (102)
│ ├── IS3EU (79)
│ ├── Zisupton (44)
│ ├── Sola (385)
│ └── Novosib (9)
├── Satellite (741)
├── RC
│ └── Helitron (1999)
├── Structural_RNA (86)
├── PLE (1045)
└── Other (11)
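To get a single total from this output, the counts can be summed with standard tools. The sketch below assumes each count is the number of accessions assigned directly to that node rather than a cumulative total, which the sample output suggests because parent counts such as SINE (107) are smaller than those of their children. The helper name sum_counts is hypothetical.

```shell
# Hypothetical helper: sum the "(N)" counts at the end of each line read from stdin.
# Lines whose last field is not a parenthesised number (for example "RC") are skipped.
sum_counts() {
    awk '$NF ~ /^\([0-9]+\)$/ { gsub(/[()]/, "", $NF); total += $NF }
         END { print total + 0 }'
}

# Example usage:
#   seqtree render "$REPBASE_DIR/Repbase-seqtree.st" --print --count | sum_counts
```

Under that assumption, the result should match the number of accessions that preprocessing placed in the tree.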
You can also display the SeqTree in a Sunburst chart by running:
seqtree sunburst $REPBASE_DIR/Repbase-seqtree.st --show --output $REPBASE_DIR/Repbase-seqtree.html
This will create an HTML file containing the Sunburst chart of the SeqTree.
You can open the HTML file in a browser to view the chart.
You can also save the chart as a .png, .svg, or .pdf file by changing the extension of the output file.