Advanced Tutorial

Changing workflow settings

Some workflow settings can be changed by passing arguments to the command-line tool (orthoflow).

To see a complete list of command-line arguments, run:

orthoflow run --help

You can pass any Snakemake arguments to orthoflow. To list these, run:

orthoflow run --help-snakemake

Most settings to operate and tune aspects of the workflow can be changed by editing the standard configuration file (orthoflow/config/config.yml) or writing a custom configuration file that can be passed to the workflow. You will see examples of changes to the configuration file throughout this tutorial.

Cores

To manually set the number of cores, set the --cores flag. For example, to use 24 cores:

orthoflow run --cores 24

If --cores is not set, Orthoflow uses all available CPU cores.

Setting working directory and specifying targets

To set a working directory different from the current directory, use the --directory flag:

orthoflow run --directory=path/to/working/dir

To build a specific target, give it as an argument. For instance, to produce the list of protein alignments, run:

orthoflow run results/alignment/alignments_list.protein.txt

Controlling the flow of operations

By default, Orthoflow uses the de novo orthology inference module (OrthoFinder and OrthoSNAP) and supermatrix-based tree inference (supermatrix module).

This can be changed in the configuration file. Setting use_orthofisher to True enables the ortholog fishing module instead of the de novo orthology inference module. This also requires a set of HMM profiles, listed under orthofisher_hmmer_files. For example:

use_orthofisher: True
orthofisher_hmmer_files:
- hmms/1080at3041.hmm
- hmms/1103at3041.hmm
- hmms/1271at3041.hmm
- hmms/1379at3041.hmm
- hmms/1518at3041.hmm
- hmms/1569at3041.hmm
- hmms/1610at3041.hmm

The orthofisher and orthofinder paths are mutually exclusive.

By default, a phylogeny is inferred from a supermatrix. To use the supertree (ASTRAL) path for tree inference, set supertree: True. If both supermatrix and supertree are set to True, the workflow will run both types of inference.
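As a minimal sketch of the settings just described, the following configuration fragment enables both tree inference paths at once. It uses only the supermatrix and supertree keys named in this section and assumes all other settings are left at their defaults.

```yaml
# Run both tree inference paths: the concatenated supermatrix
# and the supertree (ASTRAL) path
supermatrix: True
supertree: True
```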

You can also run the workflow up to any given point by passing the rule where you would like to stop to the Snakemake --until flag. For example, suppose you want to produce the CDS nucleotide alignments for all orthogroups but no trees. In that case, set infer_tree_with_protein_seqs: False to indicate that nucleotide alignments should be produced, activate supermatrix inference (supermatrix: True), and keep Snakemake from actually producing the supermatrix with --until concatenate_alignments. Snakemake will run everything up until the start of this rule (which concatenates the single-gene alignments into the supermatrix), so the output will include the DNA alignments.
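The CDS alignment scenario above could be set up roughly as follows. This is a sketch that uses only the settings and the rule name mentioned in this section. The command in the final comment assumes that --until is forwarded to Snakemake in the same way as other Snakemake arguments.

```yaml
# Request nucleotide (CDS) alignments instead of protein alignments
infer_tree_with_protein_seqs: False
# Activate the supermatrix path so that the concatenate_alignments rule
# is part of the workflow
supermatrix: True
# Then limit the run with the Snakemake --until flag, for example:
#   orthoflow run --until concatenate_alignments
```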

Gene and alignment filtering settings

There are several steps in the workflow that filter out genes not meeting particular criteria.

Minimum taxa & sequences

The minimum number of taxa and sequences that a gene dataset must contain in order to be retained is one of the most important settings that users need to specify. Set these with the ortholog_min_seqs and ortholog_min_taxa settings in the configuration file. The default value is 5, and the value should not be set to less than 3. The default settings are very permissive, which can lead to huge numbers of files (hundreds of thousands) being produced when analysing eukaryotic genome datasets. This can cause problems, especially on HPC systems that limit the number of files you can create. We recommend increasing these values depending on the number of taxa in your dataset. We typically set them to 30-50% of the total number of taxa being analysed, but we recommend experimenting with these settings, as the best values are highly dataset-dependent. The occupancy setting for OrthoSNAP can be changed with orthosnap_occupancy; by default we use the same value as that for ortholog_min_seqs.
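To illustrate the 30-50% guideline, here is a sketch for a hypothetical dataset of 40 taxa. The dataset size and the choice of 40% are assumptions made for this example only; suitable values will depend on your own data.

```yaml
# Hypothetical dataset of 40 taxa: 40% of 40 = 16,
# which lies within the 30-50% range suggested above
ortholog_min_seqs: 16
ortholog_min_taxa: 16
# orthosnap_occupancy is left unset here, so by default it takes
# the same value as ortholog_min_seqs
```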

Using SC-OGs and/or SNAP-OGs

The traditional approach to inferring species trees from genome data is to select single-copy orthogroups (SC-OGs). One of the innovations we’ve implemented in this workflow is the use of SNAP-OGs, sets of orthologous sequences derived from multi-copy gene families, which can yield orders of magnitude more data. You can choose to build a phylogeny from just the SC-OGs, just the SNAP-OGs, or both combined by setting orthofinder_use_scogs and orthofinder_use_snap_ogs in the configuration file. In this example both SC-OGs and SNAP-OGs are combined for phylogenetic inference:

orthofinder_use_scogs: True
orthofinder_use_snap_ogs: True

SNAP-OGs are currently only implemented in the de novo ortholog analysis path. When using the ortholog fishing path, only SC-OGs will be used for downstream analyses.

Alignment trimming

Alignments are trimmed for quality with the smart gap method implemented in ClipKit.

Removal of heavily trimmed alignments

In some cases, it may make more sense to remove genes that have been decimated by the alignment trimming procedure, particularly if they are going to be used individually to infer gene trees. There are two ways to achieve this. First, any alignments that fall below a given number of positions after trimming will be removed. The default is 167 amino acid positions, or the corresponding 501 nucleotide positions; these can be changed with the minimum_trimmed_alignment_length_proteins and minimum_trimmed_alignment_length_cds parameters in the config file. Second, the workflow also allows removing alignments from which a large proportion was removed in the trimming step. By default, alignments that lose half of their length in trimming are removed (change this by setting the max_trimmed_proportion parameter in the config file).
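The two filters can be written out explicitly as in the sketch below, which restates the defaults described above. Writing "half of their length" as the proportion 0.5 is an assumption about how max_trimmed_proportion is expressed; check the standard configuration file for the exact form.

```yaml
# Minimum alignment length after trimming:
# 167 amino acid positions correspond to 167 x 3 = 501 nucleotide positions
minimum_trimmed_alignment_length_proteins: 167
minimum_trimmed_alignment_length_cds: 501
# Remove alignments that lose this proportion of their length in trimming
# (0.5 is assumed here to represent "half")
max_trimmed_proportion: 0.5
```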

Tree inference settings

The phylogenetic analysis can be run on protein and/or nucleotide sequences. This can be set in the config file with infer_tree_with_cds_seqs and infer_tree_with_protein_seqs. The default setting is to run two separate analyses, one with protein sequences and one with nucleotide sequences. Set infer_tree_with_cds_seqs to False when one or more of the input files contain amino acid sequences.
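For a dataset that includes amino acid input files, the corresponding configuration might look like this sketch, which uses only the two settings named above:

```yaml
# At least one input file contains amino acid sequences,
# so skip the nucleotide (CDS) analysis
infer_tree_with_cds_seqs: False
infer_tree_with_protein_seqs: True
```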

To use an outgroup in the phylogenetic analysis, specify an outgroup taxon (using its value in the taxon_string column in the input sources file). For example, for the demonstration dataset, set outgroup: "Derbesia_sp_WEST4838". The outgroup will only be used in the supermatrix path. This functionality is not included for the gene tree path because the outgroup might not be present in every alignment.

To specify a model of sequence evolution, the config file has a model_string setting where you can specify a model following the IQ-tree syntax. The default setting model_string: "-m TEST" will perform model testing to determine a suitable model, but any model implemented in IQ-tree can be specified here. For instance, "-m GTR+F+G" specifies a nucleotide General Time Reversible (GTR) model with empirical base frequencies (+F) and a discrete gamma model (+G) for rate heterogeneity. For further information on the model options and their specification, see the IQ-tree documentation.

For bootstrapping, you can specify the bootstrap_string variable in the config file. By default, this is set to bootstrap_string: "-bb 1000" to carry out 1000 ultrafast bootstrap replicates. To change this to 100 standard (nonparametric) bootstraps, for instance, use bootstrap_string: "-b 100". See the IQ-tree documentation for further information on how to specify bootstrapping.
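Putting the tree inference settings from this section together, a configuration fragment might look like the sketch below. All values are the examples quoted in the text. Because GTR is a nucleotide model, the fragment assumes a nucleotide-only analysis, which is an illustrative choice and not a recommendation.

```yaml
# Nucleotide-only analysis (assumed here so that the GTR model applies)
infer_tree_with_cds_seqs: True
infer_tree_with_protein_seqs: False
# Outgroup from the demonstration dataset (used in the supermatrix path only)
outgroup: "Derbesia_sp_WEST4838"
# GTR with empirical base frequencies and discrete gamma rate heterogeneity
model_string: "-m GTR+F+G"
# 100 standard (nonparametric) bootstrap replicates instead of the
# default 1000 ultrafast replicates ("-bb 1000")
bootstrap_string: "-b 100"
```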

Other Command Line Tools

To see all the command line tools for Orthoflow, run:

orthoflow --help