Input

By default, Orthoflow expects the input filename to be input_sources.csv. A different default can be set in the config:

input_sources: "input_sources2.csv"

This default can be overridden on the orthoflow command line:

orthoflow --files input_sources3.csv
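
The precedence described above (built-in default, then config, then command line) can be sketched in Python as follows. This is only an illustration of the order of precedence, not Orthoflow's own code; the function name choose_input_sources and the plain dictionary used for the config are assumptions made for this example.

```python
# Built-in default filename, as documented above.
DEFAULT_INPUT_SOURCES = "input_sources.csv"


def choose_input_sources(config, cli_files=None):
    """Return the list of input sources to use (hypothetical helper).

    Values given on the command line win; otherwise the `input_sources`
    config value is used; otherwise the built-in default applies.
    """
    if cli_files:
        return list(cli_files)
    return [config.get("input_sources", DEFAULT_INPUT_SOURCES)]


no_config = choose_input_sources({})
from_config = choose_input_sources({"input_sources": "input_sources2.csv"})
from_cli = choose_input_sources(
    {"input_sources": "input_sources2.csv"}, ["input_sources3.csv"]
)
```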

For the analysis to work, Orthoflow requires the following information for each input source:

  • file: The path to the input source file. This path is relative to whichever file lists this source file.

  • taxon_string: A name for the taxon associated with all the sequences in the input file. If this value is not given, the taxon string is taken from the organism specified in the metadata of the source file if it is in GenBank format; otherwise it is taken from the filename.

  • translation_table: The translation table number for the input file, corresponding to the NCBI genetic codes. If it is not given, Orthoflow looks in the GenBank file for a translation table; if none is found, it uses the default_translation_table config variable (which is set to 1 by default).

  • data_type: The format of the file. This column should be GenBank for a GenBank-formatted file with CDS annotations, or CDS or Protein for a FASTA file with coding sequences consisting of nucleotides or amino acids, respectively.
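
The fallback rules for taxon_string and translation_table in the list above can be sketched in Python as follows. This is a minimal illustration and not Orthoflow's implementation: the function names are hypothetical, the GenBank metadata is assumed to have been read already and passed in as plain values, and deriving the taxon from the part of the filename before the first dot is an assumption made for this example.

```python
from pathlib import Path
from typing import Optional

# Mirrors the documented default of the default_translation_table config variable.
DEFAULT_TRANSLATION_TABLE = 1


def resolve_translation_table(
    explicit: Optional[int],
    genbank_table: Optional[int],
    default: int = DEFAULT_TRANSLATION_TABLE,
) -> int:
    """Explicit value first, then the GenBank file's value, then the default."""
    if explicit is not None:
        return explicit
    if genbank_table is not None:
        return genbank_table
    return default


def resolve_taxon_string(
    explicit: Optional[str],
    genbank_organism: Optional[str],
    file: str,
) -> str:
    """Explicit value first, then the GenBank organism, then the filename."""
    if explicit:
        return explicit
    if genbank_organism:
        return genbank_organism
    # Assumption for this sketch: use the filename up to the first dot.
    return Path(file).name.split(".")[0]
```

For example, resolve_translation_table(None, None) falls back to the default of 1, while resolve_translation_table(11, 4) keeps the value the user supplied.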

All of this information can be stated explicitly in a CSV file, like this:

input_sources.csv

file,taxon_string,data_type,translation_table
KY509313.gb,Avrainvillea_mazei_HV02664,GenBank,11
NC_026795.txt,Bryopsis_plumosa_WEST4718,GenBank,11
KX808498.gb,Caulerpa_cliftonii_HV03798,GenBank,11
KY819064.cds.fasta,Chlorodesmis_fastigiata_HV03865,CDS,11
KX808497.fa,Derbesia_sp_WEST4838,CDS,11
MH591079.gbk,Dichotomosiphon_tuberosus_HV03781,GenBank,11
MH591080.gbk,Dichotomosiphon_tuberosus_HV03781,GenBank,11
MH591081.gbk,Dichotomosiphon_tuberosus_HV03781,GenBank,11
MH591083.gb,Flabellia_petiolata_HV01202,GenBank,11
MH591084.gb,Flabellia_petiolata_HV01202,GenBank,11
MH591085.gb,Flabellia_petiolata_HV01202,GenBank,11
MH591086.gb,Flabellia_petiolata_HV01202,GenBank,11

This file can also be given in YAML format:

files:
  - file: KY509313.gb
    taxon_string: Avrainvillea_mazei_HV02664
    translation_table: 11
    data_type: GenBank
  - file: NC_026795.txt
    taxon_string: Bryopsis_plumosa_WEST4718
    data_type: GenBank
    translation_table: 11
  - file: KX808498.gb
    taxon_string: Caulerpa_cliftonii_HV03798
    data_type: GenBank
    translation_table: 11
  - file: KY819064.cds.fasta
    taxon_string: Chlorodesmis_fastigiata_HV03865
    translation_table: 11
    data_type: CDS
  - file: KX808497.fa
    taxon_string: Derbesia_sp_WEST4838
    translation_table: 11
    data_type: CDS
  - file: MH591079.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
    data_type: GenBank
    translation_table: 11
  - file: MH591080.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
    data_type: GenBank
    translation_table: 11
  - file: MH591081.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
    data_type: GenBank
    translation_table: 11
  - file: MH591083.gb
    taxon_string: Flabellia_petiolata_HV01202
    data_type: GenBank
    translation_table: 11
  - file: MH591084.gb
    taxon_string: Flabellia_petiolata_HV01202
    data_type: GenBank
    translation_table: 11
  - file: MH591085.gb
    taxon_string: Flabellia_petiolata_HV01202
    data_type: GenBank
    translation_table: 11
  - file: MH591086.gb
    taxon_string: Flabellia_petiolata_HV01202
    data_type: GenBank
    translation_table: 11

Or TOML:

[[files]]
file = "KY509313.gb"
taxon_string = "Avrainvillea_mazei_HV02664"
translation_table = 11
data_type = "GenBank"

[[files]]
file = "NC_026795.txt"
taxon_string = "Bryopsis_plumosa_WEST4718"
data_type = "GenBank"
translation_table = 11

[[files]]
file = "KX808498.gb"
taxon_string = "Caulerpa_cliftonii_HV03798"
data_type = "GenBank"
translation_table = 11

[[files]]
file = "KY819064.cds.fasta"
taxon_string = "Chlorodesmis_fastigiata_HV03865"
translation_table = 11
data_type = "CDS"

[[files]]
file = "KX808497.fa"
taxon_string = "Derbesia_sp_WEST4838"
translation_table = 11
data_type = "CDS"

[[files]]
file = "MH591079.gbk"
taxon_string = "Dichotomosiphon_tuberosus_HV03781"
data_type = "GenBank"
translation_table = 11

[[files]]
file = "MH591080.gbk"
taxon_string = "Dichotomosiphon_tuberosus_HV03781"
data_type = "GenBank"
translation_table = 11

[[files]]
file = "MH591081.gbk"
taxon_string = "Dichotomosiphon_tuberosus_HV03781"
data_type = "GenBank"
translation_table = 11

[[files]]
file = "MH591083.gb"
taxon_string = "Flabellia_petiolata_HV01202"
data_type = "GenBank"
translation_table = 11

[[files]]
file = "MH591084.gb"
taxon_string = "Flabellia_petiolata_HV01202"
data_type = "GenBank"
translation_table = 11

[[files]]
file = "MH591085.gb"
taxon_string = "Flabellia_petiolata_HV01202"
data_type = "GenBank"
translation_table = 11

[[files]]
file = "MH591086.gb"
taxon_string = "Flabellia_petiolata_HV01202"
data_type = "GenBank"
translation_table = 11

Or JSON:

{
    "files": [
      {
        "file": "KY509313.gb",
        "taxon_string": "Avrainvillea_mazei_HV02664",
        "translation_table": 11,
        "data_type": "GenBank"
      },
      {
        "file": "NC_026795.txt",
        "taxon_string": "Bryopsis_plumosa_WEST4718",
        "data_type": "GenBank",
        "translation_table": 11
      },
      {
        "file": "KX808498.gb",
        "taxon_string": "Caulerpa_cliftonii_HV03798",
        "data_type": "GenBank",
        "translation_table": 11
      },
      {
        "file": "KY819064.cds.fasta",
        "taxon_string": "Chlorodesmis_fastigiata_HV03865",
        "translation_table": 11,
        "data_type": "CDS"
      },
      {
        "file": "KX808497.fa",
        "taxon_string": "Derbesia_sp_WEST4838",
        "translation_table": 11,
        "data_type": "CDS"
      },
      {
        "file": "MH591079.gbk",
        "taxon_string": "Dichotomosiphon_tuberosus_HV03781",
        "data_type": "GenBank",
        "translation_table": 11
      },
      {
        "file": "MH591080.gbk",
        "taxon_string": "Dichotomosiphon_tuberosus_HV03781",
        "data_type": "GenBank",
        "translation_table": 11
      },
      {
        "file": "MH591081.gbk",
        "taxon_string": "Dichotomosiphon_tuberosus_HV03781",
        "data_type": "GenBank",
        "translation_table": 11
      },
      {
        "file": "MH591083.gb",
        "taxon_string": "Flabellia_petiolata_HV01202",
        "data_type": "GenBank",
        "translation_table": 11
      },
      {
        "file": "MH591084.gb",
        "taxon_string": "Flabellia_petiolata_HV01202",
        "data_type": "GenBank",
        "translation_table": 11
      },
      {
        "file": "MH591085.gb",
        "taxon_string": "Flabellia_petiolata_HV01202",
        "data_type": "GenBank",
        "translation_table": 11
      },
      {
        "file": "MH591086.gb",
        "taxon_string": "Flabellia_petiolata_HV01202",
        "data_type": "GenBank",
        "translation_table": 11
      }
    ]
}
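
The CSV, YAML, TOML and JSON versions above all carry the same records. As a rough illustration of that equivalence, the following Python sketch turns CSV text with the same column names into the nested "files" structure used by the other formats. It uses only the standard library and is not Orthoflow's own loader; the function name csv_to_sources is hypothetical and only two of the rows are included to keep the example short.

```python
import csv
import io
import json

CSV_TEXT = """file,taxon_string,data_type,translation_table
KY509313.gb,Avrainvillea_mazei_HV02664,GenBank,11
KY819064.cds.fasta,Chlorodesmis_fastigiata_HV03865,CDS,11
"""


def csv_to_sources(text):
    """Convert CSV rows into a {"files": [...]} structure (hypothetical helper)."""
    entries = []
    for row in csv.DictReader(io.StringIO(text)):
        # Drop empty cells so that omitted values stay omitted.
        entry = {key: value for key, value in row.items() if key is not None and value}
        if "translation_table" in entry:
            entry["translation_table"] = int(entry["translation_table"])
        entries.append(entry)
    return {"files": entries}


sources = csv_to_sources(CSV_TEXT)
# Print the JSON equivalent of the CSV rows.
print(json.dumps(sources, indent=2))
```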

Since some of the values can be inferred from the files themselves, the same input can be specified more concisely as follows (here in YAML format):

files:
  - file: KY509313.gb
    taxon_string: Avrainvillea_mazei_HV02664
  - file: NC_026795.txt
    taxon_string: Bryopsis_plumosa_WEST4718
    data_type: GenBank
  - file: KX808498.gb
    taxon_string: Caulerpa_cliftonii_HV03798
  - file: KY819064.cds.fasta
    taxon_string: Chlorodesmis_fastigiata_HV03865
    translation_table: 11
  - file: KX808497.fa
    taxon_string: Derbesia_sp_WEST4838
    translation_table: 11
  - file: MH591079.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
  - file: MH591080.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
  - file: MH591081.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
  - file: MH591083.gb
    taxon_string: Flabellia_petiolata_HV01202
  - file: MH591084.gb
    taxon_string: Flabellia_petiolata_HV01202
  - file: MH591085.gb
    taxon_string: Flabellia_petiolata_HV01202
  - file: MH591086.gb
    taxon_string: Flabellia_petiolata_HV01202

The input sources can also be given as a list of files. For example, this command includes all the GenBank files with a .gb extension in the current directory, and the translation_table, taxon_string, and data_type are all inferred from the files themselves:

orthoflow --files *.gb

If some of the input files are in FASTA format, so that the translation table cannot easily be inferred, you can create an individual TOML, YAML, JSON or CSV file for that input source, like this:

file = "KY819064-truncated.cds.fasta"
translation_table = 11
taxon_string = "Chlorodesmis_fastigiata_HV03865"

Then these files can be included as part of the list of Orthoflow input sources:

orthoflow --files *.gb *.toml
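
A mixed list like this contains both sequence files and files that describe a source. The Python sketch below shows one way such a list could be separated, and how a file entry could be resolved relative to the file that lists it, as described for the file field. It is a hypothetical illustration: the helper names, the set of suffixes, the choice to classify by suffix, the sources/ directory in the usage note, and the handling of absolute paths are all assumptions, not Orthoflow's documented behaviour.

```python
from pathlib import Path

# Assumption: these suffixes mark files that describe input sources.
DESCRIPTION_SUFFIXES = {".csv", ".toml", ".yaml", ".yml", ".json"}


def split_inputs(paths):
    """Separate source-description files from sequence files by suffix."""
    descriptions, sequence_files = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in DESCRIPTION_SUFFIXES:
            descriptions.append(path)
        else:
            sequence_files.append(path)
    return descriptions, sequence_files


def resolve_source_path(listing_file, file_entry):
    """Resolve a `file` entry relative to the file that lists it."""
    entry = Path(file_entry)
    # Assumption: absolute paths are used unchanged.
    if entry.is_absolute():
        return entry
    return Path(listing_file).parent / entry
```

With a description file stored at sources/chloro.toml, the entry KY819064-truncated.cds.fasta would resolve to sources/KY819064-truncated.cds.fasta in this sketch.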

It is possible to ignore files that are not valid. By default, the workflow stops when a file does not meet the program requirements. To make the program skip these invalid files and analyse the remaining ones, set ignore_non_valid_files to True. A warning is then displayed in the report stating which files have been ignored.
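
The difference between the two settings can be sketched in Python as follows. This is a simplified illustration rather than Orthoflow's implementation: the function name filter_valid_sources and the is_valid callback are hypothetical, and the real validity checks are not described here.

```python
def filter_valid_sources(sources, is_valid, ignore_non_valid_files=False):
    """Return (kept, ignored) sources (hypothetical helper).

    With ignore_non_valid_files=False the first invalid source stops
    processing with an error; with True it is collected so that it can
    be listed in a warning later.
    """
    kept, ignored = [], []
    for source in sources:
        if is_valid(source):
            kept.append(source)
        elif ignore_non_valid_files:
            ignored.append(source)
        else:
            raise ValueError(f"Input source {source!r} is not valid")
    return kept, ignored


def looks_like_genbank(name):
    # Stand-in validity check used only for this example.
    return name.endswith((".gb", ".gbk"))


kept, ignored = filter_valid_sources(
    ["KY509313.gb", "notes.docx"], looks_like_genbank, ignore_non_valid_files=True
)
```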