Understanding data quality: ensuring data quality by design in the rail industry

Qian Fu; John Easton

doi:10.1109/BigData.2017.8258380

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

1138 Downloads (Pure)

Abstract

The railways worldwide are increasingly looking to the integration of their data resources coupled with advanced analytics to enhance traffic management, to provide new insights on the health of infrastructure assets, to provide soft linkages to other transport modes, and ultimately to enable them to better serve their customers. As in many industrial sectors, over the past decade the rail industry has been investing heavily in sensing technologies that record every aspect of the operation of the railway network. However, as any data scientist knows, it does not matter how good an algorithm is, if you put rubbish in, you get rubbish out; and as the traditional industry model of working with data only within the system that it was collected by becomes increasingly fragile, the industry is discovering that it knows less than it thought about the data it is gathering. When coupled with legacy data resources of unknown accuracy, such as design diagrams for assets that in many cases are decades old, the rail industry now faces a crisis in which its data may become essentially worthless due to a poor understanding of the quality of its data. This paper reports the findings of the first phase of a three-phase systematic review of literature about how data quality can be managed and evaluated in the rail domain. It begins by discussing why data quality matters in a rail context, before going on to define data quality and to introduce and expand the concept of a data quality schema.

Original language	English
Title of host publication	Proceedings of the 2017 IEEE International Conference on Big Data (BIGDATA)
Publisher	IEEE Xplore
Pages	3792-3799
ISBN (Electronic)	9781538627150
DOIs	https://doi.org/10.1109/BigData.2017.8258380
Publication status	Published - 15 Jan 2018
Event	2017 IEEE International Conference on Big Data - Westin Copley Plaza Hotel, 10 Huntington Avenue, Boston, MA 02116, Boston, United States; Duration: 11 Dec 2017 → 14 Dec 2017

Conference

Conference	2017 IEEE International Conference on Big Data
Abbreviated title	BigData 2017
Country/Territory	United States
City	Boston
Period	11/12/17 → 14/12/17

Keywords

Data quality
Rail
Quality by design
Data quality schema

Access to Document

10.1109/BigData.2017.8258380

Fu_Easton_Understanding_data_quality_IEEE_International_Conference_on_Big_Data_2017 (Accepted author manuscript, 223 KB)

https://ieeexplore.ieee.org/document/8258380

Licence: None: All rights reserved

Cite this

@inproceedings{dbf15b02238240788725a32c39576fcd,

title = "Understanding data quality: ensuring data quality by design in the rail industry",

abstract = "The railways worldwide are increasingly looking to the integration of their data resources coupled with advanced analytics to enhance traffic management, to provide new insights on the health of infrastructure assets, to provide soft linkages to other transport modes, and ultimately to enable them to better serve their customers. As in many industrial sectors, over the past decade the rail industry has been investing heavily in sensing technologies that record every aspect of the operation of the railway network. However, as any data scientist knows, it does not matter how good an algorithm is, if you put rubbish in, you get rubbish out; and as the traditional industry model of working with data only within the system that it was collected by becomes increasingly fragile, the industry is discovering that it knows less than it thought about the data it is gathering. When coupled with legacy data resources of unknown accuracy, such as design diagrams for assets that in many cases are decades old, the rail industry now faces a crisis in which its data may become essentially worthless due to a poor understanding of the quality of its data. This paper reports the findings of the first phase of a three-phase systematic review of literature about how data quality can be managed and evaluated in the rail domain. It begins by discussing why data quality matters in a rail context, before going on to define the quality, introduce and expand the concept of a data quality schema.",

keywords = "Data quality, Rail, Quality by design, Data quality schema",

author = "Qian Fu and John Easton",

year = "2018",

month = jan,

day = "15",

doi = "10.1109/BigData.2017.8258380",

language = "English",

pages = "3792--3799",

booktitle = "Proceedings of the 2017 IEEE International Conference on Big Data (BIGDATA)",

publisher = "IEEE Xplore",

note = "2017 IEEE International Conference on Big Data, BigData 2017 ; Conference date: 11-12-2017 Through 14-12-2017",

}

TY - GEN

T1 - Understanding data quality

T2 - 2017 IEEE International Conference on Big Data

AU - Fu, Qian

AU - Easton, John

PY - 2018/1/15

Y1 - 2018/1/15

N2 - The railways worldwide are increasingly looking to the integration of their data resources coupled with advanced analytics to enhance traffic management, to provide new insights on the health of infrastructure assets, to provide soft linkages to other transport modes, and ultimately to enable them to better serve their customers. As in many industrial sectors, over the past decade the rail industry has been investing heavily in sensing technologies that record every aspect of the operation of the railway network. However, as any data scientist knows, it does not matter how good an algorithm is, if you put rubbish in, you get rubbish out; and as the traditional industry model of working with data only within the system that it was collected by becomes increasingly fragile, the industry is discovering that it knows less than it thought about the data it is gathering. When coupled with legacy data resources of unknown accuracy, such as design diagrams for assets that in many cases are decades old, the rail industry now faces a crisis in which its data may become essentially worthless due to a poor understanding of the quality of its data. This paper reports the findings of the first phase of a three-phase systematic review of literature about how data quality can be managed and evaluated in the rail domain. It begins by discussing why data quality matters in a rail context, before going on to define the quality, introduce and expand the concept of a data quality schema.

AB - The railways worldwide are increasingly looking to the integration of their data resources coupled with advanced analytics to enhance traffic management, to provide new insights on the health of infrastructure assets, to provide soft linkages to other transport modes, and ultimately to enable them to better serve their customers. As in many industrial sectors, over the past decade the rail industry has been investing heavily in sensing technologies that record every aspect of the operation of the railway network. However, as any data scientist knows, it does not matter how good an algorithm is, if you put rubbish in, you get rubbish out; and as the traditional industry model of working with data only within the system that it was collected by becomes increasingly fragile, the industry is discovering that it knows less than it thought about the data it is gathering. When coupled with legacy data resources of unknown accuracy, such as design diagrams for assets that in many cases are decades old, the rail industry now faces a crisis in which its data may become essentially worthless due to a poor understanding of the quality of its data. This paper reports the findings of the first phase of a three-phase systematic review of literature about how data quality can be managed and evaluated in the rail domain. It begins by discussing why data quality matters in a rail context, before going on to define the quality, introduce and expand the concept of a data quality schema.

KW - Data quality

KW - Rail

KW - Quality by design

KW - Data quality schema

UR - https://ieeexplore.ieee.org/xpl/conhome/1802964/all-proceedings

U2 - 10.1109/BigData.2017.8258380

DO - 10.1109/BigData.2017.8258380

M3 - Conference contribution

SP - 3792

EP - 3799

BT - Proceedings of the 2017 IEEE International Conference on Big Data (BIGDATA)

PB - IEEE Xplore

Y2 - 11 December 2017 through 14 December 2017

ER -
