Input
The default input filename expected by Orthoflow is input_sources.csv.
This default filename can be set in the config:
input_sources: "input_sources2.csv"
This default can be overridden in the command line arguments for orthoflow:
orthoflow --files input_sources3.csv
For the analysis to work, Orthoflow requires the following information for each input source:
file
: The path to the input source file. This path is relative to the file that lists this source.
taxon_string
: A name for the taxon associated with all the sequences in the input file. If this value is not given, the taxon string is taken from the organism specified in the metadata of the source file if it is in GenBank format; otherwise it is taken from the filename.
translation_table
: The translation table number for the input file, corresponding to the NCBI genetic codes. If it is not given, Orthoflow looks in the GenBank file for a translation table; otherwise it uses the default_translation_table config variable (which is set to 1 by default).
data_type
: The format of the file. This column should be GenBank when providing a GenBank-formatted file with CDS annotations, or CDS or Protein when providing a FASTA file with coding sequences consisting of nucleotides or amino acids, respectively.
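The fallback order for taxon_string and translation_table described above can be sketched in Python as follows. This is a minimal illustration, not Orthoflow's actual code: the function names are hypothetical, using the filename without its final extension is an assumption, and falling back to the filename when a GenBank file has no organism entry is also an assumption.

```python
from pathlib import Path
from typing import Optional

# Mirrors the documented default of the default_translation_table config variable.
DEFAULT_TRANSLATION_TABLE = 1


def resolve_taxon_string(given: Optional[str], data_type: str,
                         genbank_organism: Optional[str], filename: str) -> str:
    """Hypothetical helper: choose the taxon string in the documented order."""
    if given:
        return given
    if data_type == "GenBank" and genbank_organism:
        return genbank_organism
    # Assumption: "taken from the filename" means the name without its last extension.
    return Path(filename).stem


def resolve_translation_table(given: Optional[int],
                              genbank_table: Optional[int],
                              default: int = DEFAULT_TRANSLATION_TABLE) -> int:
    """Hypothetical helper: explicit value, then GenBank annotation, then default."""
    if given is not None:
        return given
    if genbank_table is not None:
        return genbank_table
    return default
```

An explicit value always takes precedence, so a user can override whatever the source file itself declares.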
All this input information can be stated explicitly in a CSV file, like this:
file | taxon_string | data_type | translation_table |
---|---|---|---|
KY509313.gb | Avrainvillea_mazei_HV02664 | GenBank | 11 |
NC_026795.txt | Bryopsis_plumosa_WEST4718 | GenBank | 11 |
KX808498.gb | Caulerpa_cliftonii_HV03798 | GenBank | 11 |
KY819064.cds.fasta | Chlorodesmis_fastigiata_HV03865 | CDS | 11 |
KX808497.fa | Derbesia_sp_WEST4838 | CDS | 11 |
MH591079.gbk | Dichotomosiphon_tuberosus_HV03781 | GenBank | 11 |
MH591080.gbk | Dichotomosiphon_tuberosus_HV03781 | GenBank | 11 |
MH591081.gbk | Dichotomosiphon_tuberosus_HV03781 | GenBank | 11 |
MH591083.gb | Flabellia_petiolata_HV01202 | GenBank | 11 |
MH591084.gb | Flabellia_petiolata_HV01202 | GenBank | 11 |
MH591085.gb | Flabellia_petiolata_HV01202 | GenBank | 11 |
MH591086.gb | Flabellia_petiolata_HV01202 | GenBank | 11 |
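To check such a file before running the workflow, a short Python sketch like the one below may help. It assumes the CSV is comma-separated with the four column names shown above as its header row; the function name and the strict, case-sensitive check of data_type are illustrative assumptions, not a description of how Orthoflow parses the file.

```python
import csv
import io

# The three data_type values named in this documentation.
ALLOWED_DATA_TYPES = {"GenBank", "CDS", "Protein"}


def read_input_sources(handle):
    """Hypothetical helper: read input sources from an open CSV file handle."""
    reader = csv.DictReader(handle)
    sources = []
    for row in reader:
        # Trim whitespace; skip surplus cells, which DictReader stores under None.
        cleaned = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None
        }
        if not cleaned.get("file"):
            raise ValueError(f"line {reader.line_num}: the 'file' column is empty")
        data_type = cleaned.get("data_type", "")
        if data_type and data_type not in ALLOWED_DATA_TYPES:
            raise ValueError(f"line {reader.line_num}: unknown data_type {data_type!r}")
        sources.append(cleaned)
    return sources


example = io.StringIO(
    "file,taxon_string,data_type,translation_table\n"
    "KY509313.gb,Avrainvillea_mazei_HV02664,GenBank,11\n"
    "KX808497.fa,Derbesia_sp_WEST4838,CDS,11\n"
)
sources = read_input_sources(example)
```

Every value is read as a string, so translation_table would need converting to an integer before use.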
This file can also be given in YAML format:
files:
  - file: KY509313.gb
    taxon_string: Avrainvillea_mazei_HV02664
    translation_table: 11
    data_type: GenBank
  - file: NC_026795.txt
    taxon_string: Bryopsis_plumosa_WEST4718
    data_type: GenBank
    translation_table: 11
  - file: KX808498.gb
    taxon_string: Caulerpa_cliftonii_HV03798
    data_type: GenBank
    translation_table: 11
  - file: KY819064.cds.fasta
    taxon_string: Chlorodesmis_fastigiata_HV03865
    translation_table: 11
    data_type: CDS
  - file: KX808497.fa
    taxon_string: Derbesia_sp_WEST4838
    translation_table: 11
    data_type: CDS
  - file: MH591079.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
    data_type: GenBank
    translation_table: 11
  - file: MH591080.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
    data_type: GenBank
    translation_table: 11
  - file: MH591081.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
    data_type: GenBank
    translation_table: 11
  - file: MH591083.gb
    taxon_string: Flabellia_petiolata_HV01202
    data_type: GenBank
    translation_table: 11
  - file: MH591084.gb
    taxon_string: Flabellia_petiolata_HV01202
    data_type: GenBank
    translation_table: 11
  - file: MH591085.gb
    taxon_string: Flabellia_petiolata_HV01202
    data_type: GenBank
    translation_table: 11
  - file: MH591086.gb
    taxon_string: Flabellia_petiolata_HV01202
    data_type: GenBank
    translation_table: 11
Or TOML:
[[files]]
file = "KY509313.gb"
taxon_string = "Avrainvillea_mazei_HV02664"
translation_table = 11
data_type = "GenBank"
[[files]]
file = "NC_026795.txt"
taxon_string = "Bryopsis_plumosa_WEST4718"
data_type = "GenBank"
translation_table = 11
[[files]]
file = "KX808498.gb"
taxon_string = "Caulerpa_cliftonii_HV03798"
data_type = "GenBank"
translation_table = 11
[[files]]
file = "KY819064.cds.fasta"
taxon_string = "Chlorodesmis_fastigiata_HV03865"
translation_table = 11
data_type = "CDS"
[[files]]
file = "KX808497.fa"
taxon_string = "Derbesia_sp_WEST4838"
translation_table = 11
data_type = "CDS"
[[files]]
file = "MH591079.gbk"
taxon_string = "Dichotomosiphon_tuberosus_HV03781"
data_type = "GenBank"
translation_table = 11
[[files]]
file = "MH591080.gbk"
taxon_string = "Dichotomosiphon_tuberosus_HV03781"
data_type = "GenBank"
translation_table = 11
[[files]]
file = "MH591081.gbk"
taxon_string = "Dichotomosiphon_tuberosus_HV03781"
data_type = "GenBank"
translation_table = 11
[[files]]
file = "MH591083.gb"
taxon_string = "Flabellia_petiolata_HV01202"
data_type = "GenBank"
translation_table = 11
[[files]]
file = "MH591084.gb"
taxon_string = "Flabellia_petiolata_HV01202"
data_type = "GenBank"
translation_table = 11
[[files]]
file = "MH591085.gb"
taxon_string = "Flabellia_petiolata_HV01202"
data_type = "GenBank"
translation_table = 11
[[files]]
file = "MH591086.gb"
taxon_string = "Flabellia_petiolata_HV01202"
data_type = "GenBank"
translation_table = 11
Or JSON:
{
  "files": [
    {
      "file": "KY509313.gb",
      "taxon_string": "Avrainvillea_mazei_HV02664",
      "translation_table": 11,
      "data_type": "GenBank"
    },
    {
      "file": "NC_026795.txt",
      "taxon_string": "Bryopsis_plumosa_WEST4718",
      "data_type": "GenBank",
      "translation_table": 11
    },
    {
      "file": "KX808498.gb",
      "taxon_string": "Caulerpa_cliftonii_HV03798",
      "data_type": "GenBank",
      "translation_table": 11
    },
    {
      "file": "KY819064.cds.fasta",
      "taxon_string": "Chlorodesmis_fastigiata_HV03865",
      "translation_table": 11,
      "data_type": "CDS"
    },
    {
      "file": "KX808497.fa",
      "taxon_string": "Derbesia_sp_WEST4838",
      "translation_table": 11,
      "data_type": "CDS"
    },
    {
      "file": "MH591079.gbk",
      "taxon_string": "Dichotomosiphon_tuberosus_HV03781",
      "data_type": "GenBank",
      "translation_table": 11
    },
    {
      "file": "MH591080.gbk",
      "taxon_string": "Dichotomosiphon_tuberosus_HV03781",
      "data_type": "GenBank",
      "translation_table": 11
    },
    {
      "file": "MH591081.gbk",
      "taxon_string": "Dichotomosiphon_tuberosus_HV03781",
      "data_type": "GenBank",
      "translation_table": 11
    },
    {
      "file": "MH591083.gb",
      "taxon_string": "Flabellia_petiolata_HV01202",
      "data_type": "GenBank",
      "translation_table": 11
    },
    {
      "file": "MH591084.gb",
      "taxon_string": "Flabellia_petiolata_HV01202",
      "data_type": "GenBank",
      "translation_table": 11
    },
    {
      "file": "MH591085.gb",
      "taxon_string": "Flabellia_petiolata_HV01202",
      "data_type": "GenBank",
      "translation_table": 11
    },
    {
      "file": "MH591086.gb",
      "taxon_string": "Flabellia_petiolata_HV01202",
      "data_type": "GenBank",
      "translation_table": 11
    }
  ]
}
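These formats all describe the same list of records. As an illustration only, the Python sketch below converts the JSON layout into the CSV layout using the standard library. The function name is hypothetical, and the comma-separated output with this column order is an assumption based on the table shown earlier.

```python
import csv
import io
import json

# Column order as in the CSV example of this documentation.
COLUMNS = ["file", "taxon_string", "data_type", "translation_table"]


def json_sources_to_csv(json_text: str) -> str:
    """Hypothetical helper: turn the JSON input layout into CSV text."""
    records = json.loads(json_text)["files"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keys missing from a record become empty cells.
        writer.writerow({column: record.get(column, "") for column in COLUMNS})
    return buffer.getvalue()


example_json = """
{"files": [{"file": "KY509313.gb",
            "taxon_string": "Avrainvillea_mazei_HV02664",
            "translation_table": 11,
            "data_type": "GenBank"}]}
"""
csv_text = json_sources_to_csv(example_json)
```

The order of keys within a record does not matter, which is why the examples above can list data_type and translation_table in either order.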
Since some of the values can be inferred from the files themselves, the same input can be specified more concisely as follows (here in YAML format):
files:
  - file: KY509313.gb
    taxon_string: Avrainvillea_mazei_HV02664
  - file: NC_026795.txt
    taxon_string: Bryopsis_plumosa_WEST4718
    data_type: GenBank
  - file: KX808498.gb
    taxon_string: Caulerpa_cliftonii_HV03798
  - file: KY819064.cds.fasta
    taxon_string: Chlorodesmis_fastigiata_HV03865
    translation_table: 11
  - file: KX808497.fa
    taxon_string: Derbesia_sp_WEST4838
    translation_table: 11
  - file: MH591079.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
  - file: MH591080.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
  - file: MH591081.gbk
    taxon_string: Dichotomosiphon_tuberosus_HV03781
  - file: MH591083.gb
    taxon_string: Flabellia_petiolata_HV01202
  - file: MH591084.gb
    taxon_string: Flabellia_petiolata_HV01202
  - file: MH591085.gb
    taxon_string: Flabellia_petiolata_HV01202
  - file: MH591086.gb
    taxon_string: Flabellia_petiolata_HV01202
The input_sources can also be a list of files. For example, this command includes all the GenBank files in a particular directory, and the translation_table, taxon_string, and data_type are all inferred from the files themselves:
orthoflow --files *.gb
If some of the input files are in FASTA format, so that the translation table is not easily inferred, you can create an individual TOML, YAML, JSON, or CSV file for that input source, like this:
file = "KY819064-truncated.cds.fasta"
translation_table = 11
taxon_string = "Chlorodesmis_fastigiata_HV03865"
Then these files can be included as part of the list of Orthoflow input sources:
orthoflow --files *.gb *.toml
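A mixed list like this contains two kinds of entries: sequence files given directly and small description files that point to a sequence file. The Python sketch below shows one way to tell them apart by file extension. The function name and the set of extensions treated as description files are assumptions made for illustration; they do not describe how Orthoflow itself classifies its inputs.

```python
from pathlib import Path

# Assumption: these extensions mark files that describe input sources.
DESCRIPTION_SUFFIXES = {".csv", ".toml", ".yaml", ".yml", ".json"}


def split_inputs(paths):
    """Hypothetical helper: separate description files from sequence files."""
    descriptions, sequences = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in DESCRIPTION_SUFFIXES:
            descriptions.append(path)
        else:
            sequences.append(path)
    return descriptions, sequences


descriptions, sequences = split_inputs(["KY509313.gb", "chlorodesmis.toml"])
```

Note that the shell expands patterns such as *.gb before the program starts, so the program receives an ordinary list of filenames.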
It is possible to ignore files that are not valid. By default, the workflow stops when a file does not meet the program requirements. To have the program ignore these non-valid files and analyse the other files, set ignore_non_valid_files to True. A warning is then displayed in the report stating which files have been ignored.
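The two behaviours can be sketched in Python as follows. This is a minimal illustration of the stop-or-skip logic, not Orthoflow's implementation: the function name and the is_valid callback are hypothetical, and only the ignore_non_valid_files name comes from the documentation above.

```python
def select_valid_sources(sources, is_valid, ignore_non_valid_files=False):
    """Hypothetical helper: keep valid sources; stop or skip on invalid ones."""
    valid, ignored = [], []
    for source in sources:
        if is_valid(source):
            valid.append(source)
        elif ignore_non_valid_files:
            # Remember the skipped source so it can be listed in a warning.
            ignored.append(source)
        else:
            raise ValueError(f"Input source is not valid: {source}")
    return valid, ignored


# Toy validity check for illustration: accept only names ending in .gb.
valid, ignored = select_valid_sources(
    ["KY509313.gb", "broken.xyz"],
    is_valid=lambda name: name.endswith(".gb"),
    ignore_non_valid_files=True,
)
```

With ignore_non_valid_files left at False, the same call would raise an error at the first non-valid file instead of returning.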