Abstract
Data scientists have applied various analytic models and techniques to address the oft-cited problems of large volume, high velocity data rates and diversity in semantics. Such approaches have traditionally employed analytic techniques in a streaming or batch processing paradigm. This paper presents CRUCIBLE, a first-in-class framework for the analysis of large-scale datasets that exploits both streaming and batch paradigms in a unified manner. The CRUCIBLE framework includes a domain specific language for describing analyses as a set of communicating sequential processes, a common runtime model for analytic execution in multiple streamed and batch environments, and an approach to automating the management of cell-level security labelling that is applied uniformly across runtimes. This paper shows the applicability of CRUCIBLE to a variety of state-of-the-art analytic environments, and compares a range of runtime models for their scalability and performance against a series of native implementations. The work demonstrates the significant impact of runtime model selection, including improvements of between 2.3× and 480× between runtime models, with an average performance gap of just 14× between CRUCIBLE and a suite of equivalent native implementations.
Original language | English |
---|---|
Pages (from-to) | 738-753 |
Number of pages | 16 |
Journal | Parallel Computing |
Volume | 40 |
Issue number | 10 |
Publication status | Published - Dec 2014 |
Bibliographical note
Publisher Copyright: © 2014 The Authors.
Keywords
- Analytics
- Data intensive computing
- Data science
- Domain specific languages
- Hadoop
- Streaming analysis
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Hardware and Architecture
- Computer Networks and Communications
- Computer Graphics and Computer-Aided Design
- Artificial Intelligence