Credits

Robert Turnbull, Emily Fitzgerald, Karen Thompson, and Jo Birch from The University of Melbourne.

BioScience Cover

The paper describing the pipeline available in BioScience:

Turnbull, Robert, Emily Fitzgerald, Karen Thompson, and Joanne L. Birch. “Hespi: a Pipeline for Automatically Detecting Information from Herbarium Specimen Sheets.” BioScience (August 2025). DOI: 10.1093/biosci/biaf042.

You can also find the preprint of the paper on arXiv: arXiv:2410.08740.

Here is the BibTeX entry for the paper:

@article{hespi,
    author = {Turnbull, Robert and Fitzgerald, Emily and Thompson, Karen M and Birch, Joanne L},
    title = {Hespi: a pipeline for automatically detecting information from herbarium specimen sheets},
    journal = {BioScience},
    pages = {biaf042},
    year = {2025},
    month = {08},
    abstract = {Specimen-associated biodiversity data are crucial for biological, environmental, and conservation sciences. A rate shift is needed to extract data from specimen images efficiently, moving beyond human-mediated transcription. We developed Hespi (for herbarium specimen sheet pipeline) using advanced computer vision techniques to extract authoritative data applicable for a range of research purposes from primary specimen labels on herbarium specimens. Hespi integrates two object detection models: one for detecting the components of the sheet and another for fields on the primary specimen label. It classifies labels as printed, typed, handwritten, or mixed and uses optical character recognition and handwritten text recognition for extraction. The text is then corrected against authoritative taxon databases and refined using a multimodal large language model. Hespi accurately detects and extracts text from specimen sheets across international herbaria, and its modular design allows users to train and integrate custom models.},
    issn = {1525-3244},
    doi = {10.1093/biosci/biaf042},
    url = {https://doi.org/10.1093/biosci/biaf042},
    eprint = {https://academic.oup.com/bioscience/advance-article-pdf/doi/10.1093/biosci/biaf042/63667847/biaf042.pdf},
}

This research was supported by The University of Melbourne’s Research Computing Services and the Petascale Campus Initiative. The authors thank collaborators Niels Klazenga, Heroen Verbruggen, Nunzio Knerr, Noel Faux, Simon Mutch, Babak Shaban, Andrew Drinnan, Michael Bayly and Hannah Turnbull.

Plant reference data obtained from the Australian National Species List (auNSL), as of March 2024, using the:

  • Australian Plant Name Index (APNI)

  • Australian Bryophyte Name Index (AusMoss)

  • Australian Fungi Name Index (AFNI)

  • Australian Lichen Name Index (ALNI)

  • Australian Algae Name Index (AANI)

and the World Flora Online Taxonomic Backbone v.2023.12, accessed 13 June 2024.

This pipeline depends on YOLOv8, torchapp, Microsoft’s TrOCR.

Logo derived from artwork by ka reemov.

BibTeX

@article{hespi,
    author = {Turnbull, Robert and Fitzgerald, Emily and Thompson, Karen M and Birch, Joanne L},
    title = {Hespi: a pipeline for automatically detecting information from herbarium specimen sheets},
    journal = {BioScience},
    pages = {biaf042},
    year = {2025},
    month = {08},
    abstract = {Specimen-associated biodiversity data are crucial for biological, environmental, and conservation sciences. A rate shift is needed to extract data from specimen images efficiently, moving beyond human-mediated transcription. We developed Hespi (for herbarium specimen sheet pipeline) using advanced computer vision techniques to extract authoritative data applicable for a range of research purposes from primary specimen labels on herbarium specimens. Hespi integrates two object detection models: one for detecting the components of the sheet and another for fields on the primary specimen label. It classifies labels as printed, typed, handwritten, or mixed and uses optical character recognition and handwritten text recognition for extraction. The text is then corrected against authoritative taxon databases and refined using a multimodal large language model. Hespi accurately detects and extracts text from specimen sheets across international herbaria, and its modular design allows users to train and integrate custom models.},
    issn = {1525-3244},
    doi = {10.1093/biosci/biaf042},
    url = {https://doi.org/10.1093/biosci/biaf042},
    eprint = {https://academic.oup.com/bioscience/advance-article-pdf/doi/10.1093/biosci/biaf042/63667847/biaf042.pdf}
}

@article{thompson2023_identification,
    author = {Thompson, Karen M. and Turnbull, Robert and Fitzgerald, Emily and Birch, Joanne L.},
    title = {{Identification of Herbarium Specimen Sheet Components From High-resolution Images using Deep Learning}},
    journal = {Ecology and Evolution},
    volume = {13},
    number = {8},
    pages = {e10395},
    doi = {https://doi.org/10.1002/ece3.10395},
    url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/ece3.10395},
    note = {e10395 ECE-2023-05-00833.R1},
    year = {2023}
}

@misc{sheet_component_data,
    author = {Karen Thompson and Robert Turnbull and Emily Fitzgerald},
    title = {{Data available for `Identification of herbarium specimen sheet components from high-resolution images using deep learning': Annotations for selected MELU specimen sheet digital images}},
    year = {2023},
    month = {7},
    url = {https://melbourne.figshare.com/articles/dataset/_strong_Data_available_for_Identification_of_herbarium_specimen_sheet_components_from_high-resolution_images_using_deep_learning_Annotations_for_selected_MELU_specimen_sheet_digital_images_strong_/23597013},
    doi = {10.26188/23597013.v2}
}

This list of references is also available by using the following command:

hespi-tools bibtex