fossilbrush: An R package for automated detection and resolution of anomalies in palaeontological occurrence data

Joseph T. Flannery‐Sutherland*, Nussaïbah B. Raja, Ádám T. Kocsis, Wolfgang Kiessling

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

1. Fossil occurrence databases are indispensable resources to the palaeontological community, yet present unique data cleaning challenges. Many studies devote significant attention to cleaning fossil occurrence data prior to analysis, but such efforts are typically bespoke and difficult to reproduce. There are also no standardised methods to detect and resolve errors despite the development of an ecosystem of cleaning tools fuelled by the concurrent growth of neontological occurrence databases.
2. As fossil occurrence databases continue to increase in size, the demand for standardised, automated and reproducible methods to improve data quality will only grow. Here, we present semi-automated cleaning solutions to address these issues in a new R package, fossilbrush. We apply our cleaning protocols to the Paleobiology Database to assess the prevalence of anomalous entries and the efficacy and impact of our methods.
3. We find that anomalies can be resolved effectively by comparison against a published compendium of stratigraphic ranges, which improves the stratigraphic quality of the data, and by methods that detect outliers in the stratigraphic distribution of each taxon's occurrences. Despite this, anomalous entries remain prevalent throughout major clades: often more than 30% of genera in major fossil groups (e.g. bivalves, echinoderms) display stratigraphically suspect occurrence records.
4. Our methods provide a way to flag and resolve anomalous taxonomic data before downstream palaeobiological analysis and may also aid in the automation and targeting of future cleaning efforts. We stress, however, that our methods are semi-automated and are primarily for the detection of potential anomalies for further scrutiny, as full automation should not be a substitute for expert vetting. We note that some of our methods do not rely on external databases for anomaly resolution and so are also applicable to occurrences in neontological databases, expanding the utility of the fossilbrush R package.
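The range-comparison idea in point 3 can be illustrated with a minimal sketch. It is not the fossilbrush API and not necessarily how the package implements the check: the `Occurrence` type, the `flag_out_of_range` function, the `tolerance_ma` parameter and the genus names and ages are all hypothetical, chosen only to show how an occurrence might be flagged when its age interval does not overlap a reference range for its genus.

```python
from dataclasses import dataclass


@dataclass
class Occurrence:
    genus: str
    max_ma: float  # older age bound, in millions of years (Ma)
    min_ma: float  # younger age bound, in Ma


def flag_out_of_range(occurrences, reference_ranges, tolerance_ma=0.0):
    """Return occurrences whose age interval does not overlap the
    reference range (first appearance, last appearance) of their genus.

    reference_ranges maps genus -> (first_ma, last_ma), first_ma >= last_ma.
    Genera missing from the reference are skipped, not flagged.
    """
    flagged = []
    for occ in occurrences:
        ref = reference_ranges.get(occ.genus)
        if ref is None:
            continue  # no reference range, so nothing to compare against
        first_ma, last_ma = ref
        too_old = occ.min_ma > first_ma + tolerance_ma
        too_young = occ.max_ma < last_ma - tolerance_ma
        if too_old or too_young:
            flagged.append(occ)
    return flagged


# Invented illustrative data (not real records).
occurrences = [
    Occurrence("GenusA", 250.0, 240.0),  # inside the reference range
    Occurrence("GenusA", 60.0, 50.0),    # far younger than the reference range
    Occurrence("GenusB", 100.0, 90.0),   # genus absent from the reference
]
reference_ranges = {"GenusA": (260.0, 200.0)}

suspect = flag_out_of_range(occurrences, reference_ranges)
for occ in suspect:
    print(f"{occ.genus}: {occ.max_ma}-{occ.min_ma} Ma flagged for review")
```

In keeping with point 4, the sketch only flags records for further scrutiny; it does not correct or delete them.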
Original language: English
Pages (from-to): 2404-2418
Number of pages: 15
Journal: Methods in Ecology and Evolution
Volume: 13
Issue number: 11
Early online date: 26 Aug 2022
Publication status: Published - Nov 2022

Keywords

  • chronostratigraphy
  • data cleaning
  • fossil occurrence
  • palaeobiology database
  • Sepkoski Compendium
  • stratigraphic density
