Differentially Private String Sanitization for Frequency-Based Mining Tasks

Huiping Chen, Changyu Dong, Liyue Fan, Grigorios Loukides, Solon P. Pissis, Leen Stougie

Research output: Chapter in Book/Report/Conference proceedingConference contribution

1 Citation (Scopus)

Abstract

Strings are used to model genomic, natural language, and web activity data, and are thus often shared broadly. However, string data sharing has raised privacy concerns stemming from the fact that knowledge of length-k substrings of a string and their frequencies (multiplicities) may be sufficient to uniquely reconstruct the string; and from that the inference of such substrings may leak confidential information. We thus introduce the problem of protecting length-k substrings of a single string S by applying Differential Privacy (DP) while maximizing data utility for frequency-based mining tasks. Our theoretical and empirical evidence suggests that classic DP mechanisms are not suitable to address the problem. In response, we employ the order-k de Bruijn graph G of S and propose a sampling-based mechanism for enforcing DP on G. We consider the task of enforcing DP on G using our mechanism while preserving the normalized edge multiplicities in G. We define an optimization problem on integer edge weights that is central to this task and develop an algorithm based on dynamic programming to solve it exactly. We also consider two variants of this problem with real edge weights. By relaxing the constraint of integer edge weights, we are able to develop linear-time exact algorithms for these variants, which we use as stepping stones towards effective heuristics. An extensive experimental evaluation using real-world large-scale strings (in the order of billions of letters) shows that our heuristics are efficient and produce near-optimal solutions which preserve data utility for frequency-based mining tasks.

Original languageEnglish
Title of host publication2021 IEEE International Conference on Data Mining (ICDM)
EditorsJames Bailey, Pauli Miettinen, Yun Sing Koh, Dacheng Tao, Xindong Wu
PublisherInstitute of Electrical and Electronics Engineers (IEEE)
Pages41-50
Number of pages10
ISBN (Electronic)9781665423984
ISBN (Print)9781665423991 (PoD)
DOIs
Publication statusPublished - 24 Jan 2022
Event21st IEEE International Conference on Data Mining, ICDM 2021 - Virtual, Online, New Zealand
Duration: 7 Dec 202110 Dec 2021

Publication series

NameProceedings. IEEE International Conference on Data Mining
PublisherIEEE
ISSN (Print)1550-4786
ISSN (Electronic)2374-8486

Conference

Conference21st IEEE International Conference on Data Mining, ICDM 2021
Country/TerritoryNew Zealand
CityVirtual, Online
Period7/12/2110/12/21

Bibliographical note

Funding Information:
HC is supported by a CSC scholarship, LF by the National Science Foundation CNS-1951430, GL by the Leverhulme Trust RPG-2019-399, and LS by NWO through Gravitation-grant NETWORKS-024.002.003. This paper is part of the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively.

Publisher Copyright:
© 2021 IEEE.

Keywords

  • data sanitization
  • differential privacy
  • frequent pattern mining
  • string algorithms

ASJC Scopus subject areas

  • General Engineering

Fingerprint

Dive into the research topics of 'Differentially Private String Sanitization for Frequency-Based Mining Tasks'. Together they form a unique fingerprint.

Cite this