Data Intake Module

Rule rename_sequences

Renames the sequence IDs in the input file so that each ID includes the taxon name and the filename, and is unique.

It also extracts CDS features from GenBank files if necessary.
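The renaming step can be pictured with a minimal sketch. It is not the workflow's actual script: the function name `rename_fasta_ids`, the `taxon|file stem|counter` ID layout, and the choice to keep the original header as a description are all assumptions made for illustration, and it uses only the standard library rather than Biopython.

```python
from pathlib import Path


def rename_fasta_ids(lines, taxon, filename):
    """Yield FASTA lines with each header rewritten to a unique ID.

    Hypothetical ID layout: <taxon>|<file stem>|<running counter>.
    The original header text is kept after the new ID as a description.
    """
    stem = Path(filename).stem  # file name without directory or extension
    counter = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            counter += 1  # the counter guarantees uniqueness within one file
            original = line[1:].strip()
            header = f">{taxon}|{stem}|{counter}"
            yield f"{header} {original}" if original else header
        else:
            # Sequence lines pass through unchanged.
            yield line


if __name__ == "__main__":
    records = [">seq1 desc\n", "ATG\n", ">seq1\n", "CCC\n"]
    for out_line in rename_fasta_ids(records, "Homo_sapiens", "data/genes.fasta"):
        print(out_line)
```

A running counter is the simplest way to keep IDs unique even when the input file repeats an ID, as in the two `seq1` records above.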

Conda
channels:
  - conda-forge
dependencies:
  - biopython
  - typer

Rule translate

Translates coding sequences to amino acid sequences using BioKIT.

It relies on the translation_table field in the input, which is expected to hold the number of one of the NCBI genetic codes: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?chapter=tgencodes

BioKIT is found here: https://github.com/JLSteenwyk/BioKIT

It also copies the translated files to the protein intake folder.
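To show what the translation_table number controls, here is a small standard-library sketch of codon translation. It does not call BioKIT and is not how the rule is implemented. The `translate` function and the `GENETIC_CODES` dictionary are hypothetical names, and only NCBI tables 1 (standard) and 2 (vertebrate mitochondrial) are included.

```python
from itertools import product

# Amino acids in NCBI order: codons enumerated over the bases T, C, A, G.
# Only two tables are listed here, purely for illustration.
GENETIC_CODES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}
CODONS = ["".join(bases) for bases in product("TCAG", repeat=3)]


def translate(cds, translation_table=1):
    """Translate a coding sequence with the given NCBI genetic code number."""
    try:
        amino_acids = GENETIC_CODES[int(translation_table)]
    except KeyError:
        raise ValueError(f"table {translation_table} is not in this sketch") from None
    lookup = dict(zip(CODONS, amino_acids))
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    # Codons with ambiguous bases are reported as X.
    return "".join(lookup.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("ATGTGA", 1))  # TGA is a stop codon in table 1
    print(translate("ATGTGA", 2))  # TGA encodes tryptophan in table 2
```

The same nucleotide sequence yields different proteins under different tables, which is why a wrong translation_table value silently produces truncated or altered sequences.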

Conda
channels:
  - conda-forge
dependencies:
  - coreutils
  - python=3.9
  - pip
  - pip:
    - git+https://github.com/JLSteenwyk/BioKIT.git