TY - GEN
T1 - Detecting duplicate records in scientific workflow results
AU - Belhajjame, Khalid
AU - Missier, Paolo
AU - Goble, Carole A.
PY - 2012
Y1 - 2012
N2 - Scientific workflows are often data intensive. The datasets obtained by enacting scientific workflows have several applications, e.g., they can be used to identify data correlations or to understand phenomena, and are therefore worth storing in repositories for future analyses. Our experience suggests that such datasets often contain duplicate records. Indeed, scientists tend to enact the same workflow multiple times using the same or overlapping datasets, which gives rise to duplicates in workflow results. The presence of duplicates may complicate the interpretation and analysis of workflow results. Moreover, it unnecessarily increases the size of datasets within workflow results repositories. In this paper, we present an approach whereby duplicate detection is guided by workflow provenance traces. The hypothesis that we explore and exploit is that the operations that compose a workflow are likely to produce the same (or overlapping) dataset given the same (or overlapping) dataset. A preliminary analytic and empirical validation shows the effectiveness and applicability of the proposed method.
AB - Scientific workflows are often data intensive. The datasets obtained by enacting scientific workflows have several applications, e.g., they can be used to identify data correlations or to understand phenomena, and are therefore worth storing in repositories for future analyses. Our experience suggests that such datasets often contain duplicate records. Indeed, scientists tend to enact the same workflow multiple times using the same or overlapping datasets, which gives rise to duplicates in workflow results. The presence of duplicates may complicate the interpretation and analysis of workflow results. Moreover, it unnecessarily increases the size of datasets within workflow results repositories. In this paper, we present an approach whereby duplicate detection is guided by workflow provenance traces. The hypothesis that we explore and exploit is that the operations that compose a workflow are likely to produce the same (or overlapping) dataset given the same (or overlapping) dataset. A preliminary analytic and empirical validation shows the effectiveness and applicability of the proposed method.
UR - http://www.scopus.com/inward/record.url?scp=84868292523&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-34222-6_10
DO - 10.1007/978-3-642-34222-6_10
M3 - Conference contribution
AN - SCOPUS:84868292523
SN - 9783642342219
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 126
EP - 138
BT - Provenance and Annotation of Data and Processes - 4th International Provenance and Annotation Workshop, IPAW 2012, Revised Selected Papers
T2 - 4th International Provenance and Annotation Workshop, IPAW 2012
Y2 - 19 June 2012 through 21 June 2012
ER -