API Reference

class seqbank.seqbank.SeqBank(path: Path, write: bool = False)
add(seq: str | Seq | SeqRecord | ndarray, accession: str) None

Adds a sequence to the SeqBank with a given accession.

This method accepts a sequence in various formats (string, Seq, SeqRecord, or NumPy array) and stores it in the SeqBank after appropriate conversion to byte format.

Parameters:
  • seq (str | Seq | SeqRecord | np.ndarray) – The sequence to add to the SeqBank. It can be a string, Bio.Seq object, SeqRecord, or a NumPy array.

  • accession (str) – The accession key for the sequence to be stored under.

Returns:

None
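
As a rough illustration of the conversion step, the sketch below normalises a sequence to bytes before storing it under its accession. The helper name to_bytes and the plain dict standing in for the database are hypothetical and not part of SeqBank. Only str and bytes-like inputs are handled, so the example has no dependencies; the Seq, SeqRecord and NumPy cases listed above are omitted.

```python
def to_bytes(seq) -> bytes:
    """Hypothetical helper: normalise a str or bytes-like sequence to bytes."""
    if isinstance(seq, str):
        return seq.encode("ascii")
    if isinstance(seq, (bytes, bytearray, memoryview)):
        return bytes(seq)
    raise TypeError(f"unsupported sequence type: {type(seq).__name__}")


# A plain dict stands in for the on-disk database (assumption).
store: dict[str, bytes] = {}


def add(seq, accession: str) -> None:
    """Store the byte form of `seq` under `accession`."""
    store[accession] = to_bytes(seq)


add("ACGT", "seq1")
add(bytearray(b"GGCC"), "seq2")
```

Whatever the input type, the stored value is always bytes, which keeps retrieval code independent of how a sequence was originally supplied.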

add_file(path: Path, format: str = '', progress=None, overall_task=None, filter: Path | list | set | None = None) None

Adds sequences from a file to the SeqBank.

This method processes a sequence file in various formats (e.g., FASTA, FASTQ), optionally filtering specific accessions, and adds the sequences to the SeqBank. Progress tracking is available for large file imports.

Parameters:
  • path (Path) – The path to the sequence file.

  • format (str, optional) – The format of the sequence file (e.g., “fasta”, “fastq”). If not provided, it will be auto-detected.

  • progress (Progress, optional) – A rich progress bar to display the import progress. Defaults to None.

  • overall_task (int | None, optional) – An optional task ID for tracking the overall progress. Defaults to None.

  • filter (Path | list | set | None, optional) – A filter for selecting specific accessions. Defaults to None.

Returns:

None

add_files(files: list[str], format: str = '', workers: int = 1, filter: Path | list[str] | set[str] | None = None) None

Adds sequences from multiple files to the SeqBank.

This method processes a list of file paths and adds the sequences from each file to the SeqBank. It supports parallel processing and optional filtering of specific accessions.

Parameters:
  • files (list[str]) – A list of file paths to process.

  • format (str, optional) – The format of the sequence files (e.g., “fasta”, “fastq”). If not provided, it will be auto-detected. Defaults to “”.

  • workers (int, optional) – Number of workers to use for parallel processing. Defaults to 1.

  • filter (Path | list[str] | set[str] | None, optional) – A filter for selecting specific accessions. Defaults to None.

Returns:

None
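
The interplay of workers and filter can be sketched as follows. This is a minimal stand-in, not the library's implementation: read_fasta and add_files_sketch are hypothetical names, a dict replaces the database, a small hand-written FASTA reader replaces the real parser, and a thread pool is assumed (the actual method may use processes or another mechanism).

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_fasta(path, wanted=None):
    """Return {accession: sequence} for one FASTA file, keeping only `wanted` if given."""
    records, accession = {}, None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line.startswith(">"):
            # The first whitespace-delimited token of the header is the accession.
            header = line[1:].strip()
            accession = header.split(maxsplit=1)[0] if header else ""
            if wanted is None or accession in wanted:
                records[accession] = ""
            else:
                accession = None  # skip sequence lines of filtered-out records
        elif line and accession is not None:
            records[accession] += line
    return records


def add_files_sketch(store, files, workers=1, wanted=None):
    """Parse files in a thread pool and merge the results into `store`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for records in pool.map(lambda f: read_fasta(f, wanted), files):
            store.update(records)


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.fasta"
    second = Path(tmp) / "b.fasta"
    first.write_text(">x1 sample\nAC\nGT\n>x2\nTTTT\n")
    second.write_text(">y1\nGGGG\n")
    store = {}
    add_files_sketch(store, [first, second], workers=2, wanted={"x1", "y1"})
```

Parsing happens in the workers while the merge into the store stays in the calling thread, so the store itself never sees concurrent writes.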

add_sequence_from_file(accession: str, path: Path, format: str = '') None

Adds a single sequence from a file to the SeqBank.

This method processes a single sequence from a file in various formats (e.g., FASTA, FASTQ). The accession for this sequence is provided as an argument.

Parameters:
  • accession (str) – The accession key for the sequence to be stored under.

  • path (Path) – The path to the sequence file.

  • format (str, optional) – The format of the sequence file (e.g., “fasta”, “fastq”). If not provided, it will be auto-detected.

Returns:

None

add_url(url: str, progress=None, format: str = '', force: bool = False, overall_task=None, tmp_dir: str | Path | None = None) bool

Downloads and adds sequences from a URL to the SeqBank.

This method downloads a file from a given URL, processes it to extract sequences, and adds them to the SeqBank. If the URL has already been processed, it can be skipped unless force=True is provided.

Parameters:
  • url (str) – The URL to download the sequence file from.

  • progress (Progress, optional) – A rich progress bar to display progress of the download and file processing. Defaults to None.

  • format (str, optional) – The format of the sequence file (e.g., “fasta”, “fastq”). If not provided, it will be auto-detected. Defaults to “”.

  • force (bool, optional) – Whether to force downloading and processing the URL even if it has been seen before. Defaults to False.

  • overall_task (int | None, optional) – An optional task ID for tracking overall progress. Defaults to None.

  • tmp_dir (str | Path | None, optional) – A temporary directory to store the downloaded file. Defaults to None.

Returns:

True if the URL was successfully processed and added, False otherwise.

Return type:

bool
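
The skip-unless-forced behaviour can be sketched with a dict in place of the database. Everything here is illustrative: add_url_sketch and fake_fetch are hypothetical names, the ASCII encoding of keys and the ISO timestamp format are assumptions, and returning False for a skipped URL is one plausible reading of the return value described above. No network access takes place.

```python
from datetime import datetime, timezone

URL_PREFIX = "/seqbank/url/"  # internal namespace named in the key_url documentation


def key_url(url: str) -> bytes:
    # Assumption: the prefixed key is encoded as ASCII bytes.
    return (URL_PREFIX + url).encode("ascii")


def add_url_sketch(store: dict, url: str, fetch, force: bool = False) -> bool:
    """Return True if `url` was processed, False if it was skipped as already seen."""
    if key_url(url) in store and not force:
        return False
    for accession, sequence in fetch(url):
        store[accession.encode("ascii")] = sequence.encode("ascii")
    # Record the URL as seen, with a timestamp as its value.
    store[key_url(url)] = datetime.now(timezone.utc).isoformat().encode("ascii")
    return True


def fake_fetch(url):
    # Stand-in for downloading and parsing a sequence file.
    return [("acc1", "ACGT")]


store = {}
first = add_url_sketch(store, "https://example.org/a.fasta", fake_fetch)
second = add_url_sketch(store, "https://example.org/a.fasta", fake_fetch)
forced = add_url_sketch(store, "https://example.org/a.fasta", fake_fetch, force=True)
```

Here the first call processes the URL, the second is skipped because the URL key already exists, and the third runs again because force=True overrides the check.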

add_urls(urls: list[str], max: int = 0, format: str = '', force: bool = False, workers: int = -1, tmp_dir: str | Path | None = None) None

Downloads and adds sequences from a list of URLs to the SeqBank.

This method processes a list of URLs, downloads the corresponding sequence files, and adds them to the SeqBank. It filters out URLs that have already been processed unless force=True is specified, and it can limit the number of URLs processed based on the max argument. The processing can be parallelized using the workers argument.

Parameters:
  • urls (list[str]) – A list of URLs to download and process.

  • max (int, optional) – Maximum number of URLs to process. If set to 0, all URLs will be processed. Defaults to 0.

  • format (str, optional) – The format of the sequence files (e.g., “fasta”, “fastq”). If not provided, it will be auto-detected. Defaults to “”.

  • force (bool, optional) – Whether to force re-processing of URLs even if they were processed before. Defaults to False.

  • workers (int, optional) – Number of workers to use for parallel processing. If set to -1, all available CPU cores will be used. Defaults to -1.

  • tmp_dir (str | Path | None, optional) – A temporary directory to store downloaded files. Defaults to None.

Returns:

None
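
A small sketch of how the max, force and workers arguments might combine before any download starts. The function plan_urls is hypothetical, the parameter limit plays the role of max (renamed to avoid shadowing the Python builtin), and applying the limit after removing already-seen URLs is an assumption about ordering.

```python
import os


def plan_urls(urls, seen, limit=0, force=False, workers=-1):
    """Return (urls_to_process, worker_count) for a batch of URLs."""
    # Drop URLs that were processed before, unless re-processing is forced.
    todo = [url for url in urls if force or url not in seen]
    # A limit of 0 means "process everything".
    if limit > 0:
        todo = todo[:limit]
    # -1 means "use all available CPU cores".
    worker_count = (os.cpu_count() or 1) if workers == -1 else workers
    return todo, worker_count


todo, worker_count = plan_urls(["u1", "u2", "u3"], seen={"u1"}, limit=1)
forced_todo, forced_workers = plan_urls(["u1", "u2"], seen={"u1"}, force=True, workers=4)
```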

close() None

Closes the SeqBank database connection.

Attempts to close the database connection, if it’s open, and silently handles any exceptions.

copy(other: SeqBank) None

Copies all entries from the current SeqBank to another SeqBank instance.

This method iterates over all key-value pairs in the current SeqBank and adds them to the other SeqBank instance. The other SeqBank must be writable.

Parameters:

other (SeqBank) – The target SeqBank instance where entries will be copied.

Returns:

None

delete(accession: str) None

Deletes a sequence entry from the SeqBank by its accession.

Parameters:

accession (str) – The accession of the sequence to delete.

Returns:

None

export(output: Path | str, format: str = '', accessions: list[str] | str | Path | None = None) None

Exports the data from the SeqBank to a file using BioPython’s SeqIO.

Parameters:
  • output (Path | str) – The path or filename where the data should be exported.

  • format (str, optional) – The file format for exporting. If not specified, it will be inferred from the file extension.

  • accessions (list[str] | str | Path | None, optional) – A list of accessions to export. If a file path or string is provided, it will be read to obtain the list of accessions. If None, all accessions in the SeqBank are exported.

Returns:

None
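
The three accepted forms of the accessions argument can be resolved along these lines. This is only a sketch: resolve_accessions is a hypothetical helper, the one-accession-per-line file layout is an assumption, and the result is sorted in the None case purely to make the example deterministic.

```python
import tempfile
from pathlib import Path


def resolve_accessions(accessions, all_accessions):
    """Turn the `accessions` argument into a concrete list of accessions."""
    if accessions is None:
        return sorted(all_accessions)
    if isinstance(accessions, (str, Path)):
        # Assumption: the file lists one accession per line; blank lines are ignored.
        lines = Path(accessions).read_text().splitlines()
        return [line.strip() for line in lines if line.strip()]
    return list(accessions)


everything = {"a2", "a1", "a3"}
with tempfile.TemporaryDirectory() as tmp:
    listing = Path(tmp) / "accessions.txt"
    listing.write_text("a1\n\na3\n")
    from_file = resolve_accessions(listing, everything)
from_list = resolve_accessions(["a2"], everything)
from_none = resolve_accessions(None, everything)
```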

property file: Rdict

Initializes and configures the Rdict database for sequence storage.

Configures options for the database, such as compression type, optimization, and maximum open files. Registers the close method to be executed upon program exit to ensure the database is closed.

Returns:

The configured Rdict database object.

Return type:

Rdict

get_accessions() set[str]

Retrieves all accessions stored in the SeqBank.

This method iterates through the SeqBank database keys and collects all accessions that do not belong to the internal ‘/seqbank/’ namespace.

Returns:

A set of all accessions present in the SeqBank.

Return type:

set[str]
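
The namespace filtering described above amounts to a prefix check on each key. In this sketch a dict with bytes keys stands in for the database, and get_accessions_sketch is a hypothetical name; the ASCII decoding mirrors the behaviour documented for ls.

```python
INTERNAL_PREFIX = b"/seqbank/"  # keys under this prefix are internal bookkeeping


def get_accessions_sketch(store) -> set:
    """Collect every key that is not in the internal namespace."""
    return {
        key.decode("ascii")
        for key in store
        if not key.startswith(INTERNAL_PREFIX)
    }


store = {
    b"NC_000001": b"ACGT",
    b"/seqbank/url/https://example.org/a.fasta": b"2024-01-01",
    b"NC_000002": b"GG",
}
accessions = get_accessions_sketch(store)
```

The seen-URL entry is stored in the same key space as the sequences, which is why it has to be excluded explicitly.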

histogram(nbins: int = 30, max: int = 0, min: int = 0) Figure

Creates a histogram of the lengths of all sequences and returns the Plotly figure object.

Parameters:
  • nbins (int) – The number of bins for the histogram. Default is 30.

  • max (int) – The maximum length of the sequence to include in the histogram. Default is 0, which includes sequences of all lengths.

  • min (int) – The minimum length of the sequence to include in the histogram. Default is 0.

Returns:

A Plotly figure object representing the histogram of sequence lengths.

Return type:

go.Figure
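
The length selection that precedes plotting can be sketched without Plotly. The helper select_lengths is hypothetical; treating both bounds as inclusive, and a max of 0 as "no upper limit", are assumptions based on the parameter descriptions above.

```python
def select_lengths(lengths, min_len=0, max_len=0):
    """Keep lengths in [min_len, max_len]; max_len == 0 disables the upper bound."""
    return [
        n for n in lengths
        if n >= min_len and (max_len == 0 or n <= max_len)
    ]


all_lengths = select_lengths([5, 50, 500])
windowed = select_lengths([5, 50, 500], min_len=10, max_len=100)
```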

items()

Yields all key-value pairs from the SeqBank.

The key is the accession, and the value is the sequence data in NumPy array format.

Yields:

tuple – A tuple containing the accession (key) and the sequence data (value) as a NumPy array.

key(accession: str) str

Generates a key for a given accession.

Parameters:

accession (str) – Accession string used to generate the key.

Returns:

A byte-encoded key as a string.

Return type:

str

key_url(url: str) str

Generates a key for a given URL.

Parameters:

url (str) – The URL for which the key is generated.

Returns:

A byte-encoded key string prefixed with ‘/seqbank/url/’.

Return type:

str

lengths_dict() dict[str, int]

Returns a dictionary where the keys are the accessions and the values are the corresponding lengths of each sequence.

Returns:

A dictionary mapping each accession to the length of its corresponding sequence.

Return type:

dict[str, int]

ls() None

Lists all accessions in the SeqBank.

Iterates through the keys in the SeqBank and prints each one, decoded from bytes to ASCII.

Returns:

None

missing(accessions: list[str] | set[str]) set[str]

Finds accessions that are not present in the SeqBank.

This method checks a list or set of accessions and returns those that are missing from the SeqBank.

Parameters:

accessions (list[str] | set[str]) – A list or set of accessions to check for presence in the SeqBank.

Returns:

A set of accessions that are missing from the SeqBank.

Return type:

set[str]
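
Conceptually this is a set difference between the requested accessions and those already stored. The sketch below uses a plain set for the stored accessions; missing_sketch is a hypothetical name, and the real method may check keys one by one rather than materialising the full set.

```python
def missing_sketch(requested, present) -> set:
    """Return the requested accessions that are not in `present`."""
    return set(requested) - set(present)


present = {"a1", "a2"}
to_fetch = missing_sketch(["a1", "a3", "a3", "a4"], present)
```

Because the result is a set, duplicate entries in the input collapse to a single accession.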

numpy(accession: str) ndarray

Retrieves the sequence data for a given accession and returns it as an unsigned char NumPy array.

Parameters:

accession (str) – The accession key for which the sequence data is retrieved.

Returns:

The sequence data associated with the given accession, represented as an unsigned char NumPy array.

Return type:

np.ndarray

record(accession: str) SeqRecord

Retrieves the sequence data for a given accession and returns it as a BioPython SeqRecord object.

Parameters:

accession (str) – The accession key for which the sequence data is retrieved.

Returns:

A BioPython SeqRecord object containing the sequence data, with the given accession as its ID and an empty description.

Return type:

SeqRecord

save_seen_url(url: str) None

Saves a URL as ‘seen’ by adding it to the SeqBank with a timestamp.

Parameters:

url (str) – The URL to save as seen.

Returns:

None

seen_url(url: str) bool

Checks whether a given URL has been seen (i.e., processed) before and is present in the SeqBank.

Parameters:

url (str) – The URL to check.

Returns:

True if the URL has been seen (i.e., exists in the SeqBank), otherwise False.

Return type:

bool

string(accession: str) str

Retrieves the sequence data for a given accession and returns it as a string.

Parameters:

accession (str) – The accession key for which the sequence data is retrieved.

Returns:

The sequence data associated with the given accession, represented as a string.

Return type:

str