Detecting duplicate records in scientific workflow results

Khalid Belhajjame*, Paolo Missier, Carole A. Goble

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Scientific workflows are often data intensive. The datasets obtained by enacting scientific workflows have several applications, e.g., they can be used to identify data correlations or to understand phenomena, and therefore are worth storing in repositories for future analyses. Our experience suggests that such datasets often contain duplicate records. Indeed, scientists tend to enact the same workflow multiple times using the same or overlapping datasets, which gives rise to duplicates in workflow results. The presence of duplicates may increase the complexity of interpreting and analyzing workflow results. Moreover, it unnecessarily increases the size of datasets within workflow result repositories. In this paper, we present an approach whereby duplicate detection is guided by workflow provenance traces. The hypothesis that we explore and exploit is that the operations that compose a workflow are likely to produce the same (or overlapping) output dataset given the same (or overlapping) input dataset. A preliminary analytic and empirical validation shows the effectiveness and applicability of the proposed method.
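The hypothesis stated in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the `Invocation` record, its fields, and the `candidate_duplicates` function are hypothetical names introduced here. The sketch assumes a provenance trace that lists, for each operation invocation, the operation name, the identifiers of its input records, and the identifier of the record it produced. It also assumes operations are deterministic, so the groups it returns are only candidates that would still need to be compared.

```python
from collections import defaultdict
from typing import NamedTuple


class Invocation(NamedTuple):
    """One row of a hypothetical provenance trace (illustrative schema)."""
    operation: str      # name of the workflow operation that ran
    inputs: frozenset   # identifiers of the input records it consumed
    output_id: str      # identifier of the record it produced


def candidate_duplicates(trace):
    """Group output records produced by the same operation on the same inputs."""
    groups = defaultdict(list)
    for inv in trace:
        # Same operation + same input set => outputs are likely duplicates.
        groups[(inv.operation, inv.inputs)].append(inv.output_id)
    # Only groups holding more than one output are duplicate candidates.
    return [ids for ids in groups.values() if len(ids) > 1]


trace = [
    Invocation("align", frozenset({"seq1", "seq2"}), "out-a"),
    Invocation("align", frozenset({"seq2", "seq1"}), "out-b"),  # same inputs, rerun
    Invocation("align", frozenset({"seq3"}), "out-c"),
]
print(candidate_duplicates(trace))  # [['out-a', 'out-b']]
```

Grouping by provenance first means only records within a group need a record-level comparison, rather than every pair of records in the repository. Handling overlapping (not identical) inputs, which the abstract also mentions, is outside this sketch.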

Original language: English
Title of host publication: Provenance and Annotation of Data and Processes - 4th International Provenance and Annotation Workshop, IPAW 2012, Revised Selected Papers
Pages: 126-138
Number of pages: 13
Publication status: Published - 2012
Event: 4th International Provenance and Annotation Workshop, IPAW 2012 - Santa Barbara, CA, United States
Duration: 19 Jun 2012 → 21 Jun 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7525 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 4th International Provenance and Annotation Workshop, IPAW 2012
Country/Territory: United States
City: Santa Barbara, CA
Period: 19/06/12 → 21/06/12

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
