Preprocessing

Use this tutorial to learn how to generate the files needed to train Terrier from scratch. You can use this to replicate the results found in the paper (still to be released).

Repbase

Terrier is trained using Repbase, a database of repetitive elements. Repbase is available by subscription from the Genetic Information Research Institute (GIRI) and is released under an academic user agreement, which is available on the GIRI site. Terrier was trained using Repbase release 29.07 (2024-07-24).

Download the Repbase database in FASTA format from https://www.girinst.org/server/RepBase/index.php and untar the file. You will have a directory containing a few dozen FASTA files with a .ref extension. We will refer to this directory as REPBASE_DIR. We will ignore the files in the archive directory.
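
As a quick sanity check after untarring, the sketch below lists the .ref files that sit outside the archive directory. It assumes the archive directory is a subdirectory of REPBASE_DIR named archive; the function name list_repbase_refs and the default path ./RepBase are hypothetical and only used for illustration.

```shell
# List the Repbase FASTA files (.ref) under a directory, skipping anything
# inside a subdirectory named "archive" (assumed name and location).
list_repbase_refs() {
    find "$1" -type f -name '*.ref' ! -path '*/archive/*' | sort
}

# Point REPBASE_DIR at the untarred directory (hypothetical default path).
REPBASE_DIR="${REPBASE_DIR:-./RepBase}"

# Print how many .ref files were found, if the directory exists.
if [ -d "$REPBASE_DIR" ]; then
    list_repbase_refs "$REPBASE_DIR" | wc -l
fi
```

If the count is far from "a few dozen", the archive may have been extracted to a different location than expected.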

Preprocess

Terrier takes two input files: a SeqTree and a SeqBank. The SeqBank holds the sequence data for each accession in the dataset. The SeqTree records the cross-validation partition of each accession and the node in the taxonomic tree that the accession corresponds to.

These two files can be generated from the Repbase database using the terrier-tools CLI utility:

terrier-tools preprocess --repbase $REPBASE_DIR --seqbank $REPBASE_DIR/Repbase-seqbank.sb --seqtree $REPBASE_DIR/Repbase-seqtree.st

This will create a SeqBank file called Repbase-seqbank.sb and a SeqTree file called Repbase-seqtree.st and place them in $REPBASE_DIR.

The SeqTree file will have five cross-validation partitions and a taxonomic tree using the RepeatMasker schema.

To create a different number of partitions, run the command with the --partitions flag. For more options see the help:

terrier-tools preprocess --help
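
As an illustration, the sketch below reruns preprocessing with ten partitions. It assumes that --partitions takes an integer value (check the help output to confirm), and the output file names with a -10 suffix are hypothetical, chosen only so that the five-partition files are not overwritten.

```shell
# Sketch: rerun preprocessing with ten partitions instead of the default five.
# Assumptions: --partitions takes an integer value; the "-10" output names are hypothetical.
REPBASE_DIR="${REPBASE_DIR:-./RepBase}"   # hypothetical default location
PARTITIONS=10

# Only run the command if terrier-tools is installed.
if command -v terrier-tools >/dev/null 2>&1; then
    terrier-tools preprocess \
        --repbase "$REPBASE_DIR" \
        --seqbank "$REPBASE_DIR/Repbase-seqbank-$PARTITIONS.sb" \
        --seqtree "$REPBASE_DIR/Repbase-seqtree-$PARTITIONS.st" \
        --partitions "$PARTITIONS"
else
    echo "terrier-tools not found on PATH" >&2
fi
```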

Now you are ready to train Terrier using the SeqBank and SeqTree files you have created.
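
Before training, it can be worth confirming that both outputs exist where the preprocess command above was told to write them. The following is a minimal sketch; the function name check_preprocess_outputs is hypothetical, and it only tests for existence, not for valid contents.

```shell
# Succeed only if both preprocessing outputs exist in the given directory.
check_preprocess_outputs() {
    for f in "$1/Repbase-seqbank.sb" "$1/Repbase-seqtree.st"; do
        if [ ! -e "$f" ]; then
            echo "Missing: $f" >&2
            return 1
        fi
    done
    echo "Both outputs found in $1"
}
```

It can be called as check_preprocess_outputs "$REPBASE_DIR" and returns a non-zero status if either path is missing.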

Optional: Display the SeqTree

You can list the number of accessions for each node in the SeqTree file with this command:

seqtree render $REPBASE_DIR/Repbase-seqtree.st --print --count

That will output a tree with the number of accessions like this:

root
├── SINE (107)
│   ├── 7SL (192)
│   ├── tRNA (2312)
│   ├── 5S (35)
│   └── U (17)
├── LTR (1169)
│   ├── ERV (9127)
│   ├── Gypsy (28119)
│   ├── DIRS (1332)
│   ├── Copia (10619)
│   ├── Pao (5320)
│   └── Caulimovirus (207)
├── LINE (446)
│   ├── L1 (4833)
│   ├── R2 (336)
│   ├── CR1 (1237)
│   ├── I (1093)
│   ├── R1 (410)
│   ├── Rex-Babar (141)
│   ├── Dong-R4 (58)
│   ├── L2 (889)
│   ├── RTE (1065)
│   ├── Proto2 (69)
│   ├── CRE (99)
│   ├── Dualen (13)
│   ├── Proto1 (10)
│   └── Tad1 (550)
├── DNA (2404)
│   ├── hAT (6083)
│   ├── PiggyBac (627)
│   ├── TcMar (4884)
│   ├── Kolobok (857)
│   ├── MULE (2531)
│   ├── CMC (1667)
│   ├── Merlin (151)
│   ├── Maverick (237)
│   ├── P (320)
│   ├── Harbinger (2381)
│   ├── Dada (170)
│   ├── Crypton (294)
│   ├── Ginger (91)
│   ├── Academ (526)
│   ├── Zator (102)
│   ├── IS3EU (79)
│   ├── Zisupton (44)
│   ├── Sola (385)
│   └── Novosib (9)
├── Satellite (741)
├── RC
│   └── Helitron (1999)
├── Structural_RNA (86)
├── PLE (1045)
└── Other (11)

You can also display the SeqTree in a Sunburst chart by running:

seqtree sunburst $REPBASE_DIR/Repbase-seqtree.st --show --output $REPBASE_DIR/Repbase-seqtree.html

This will create an HTML file containing the Sunburst chart of the SeqTree. You can open the HTML file in a browser to view the chart.

To save the chart as an image instead, give the output file a .png, .svg, or .pdf extension.