Data Quality in Data Streams by Modular Change Point Detection
Yaron Kanza, Rajat Malik, Divesh Srivastava, Caroline Stone and Gordon Woodhull
AT&T Chief Data Office, New Jersey, USA
Abstract
Sensors that collect data from complex systems generate a stream of measurements, for example, measuring CPU utilization of machines in a data center, gathering meteorological data like atmospheric pressure and humidity levels across the USA, or tracking the occupancy of taxis in a large city. Downstream systems use the streamed data in a variety of applications, including training machine learning models and making data-driven decisions as part of automation. This makes data quality critical and requires detecting significant, unexpected, and rapid changes in indicative features of the streaming data. This can be done by detecting change points in the stream – points where the underlying distribution of a statistical feature of the stream fundamentally changes. In this paper, we discuss different types of change points in the data stream – changes that indicate a potential data quality problem. We present a modular method for combining operations on data streams to examine data quality in a flexible and adaptable way. Experiments over real-world and synthetic data streams show the effectiveness of the modular approach in comparison to traditional anomaly detection methods.
Keywords
Anomaly detection, change point detection, data streams, modular architecture
1. Introduction
When monitoring complex systems like cellular networks, data centers, cloud infrastructures, and content delivery networks, the monitoring system generates a data stream of telemetry, such as processing times, data transfer times, communication latency, CPU utilization, memory usage, network throughput, and other statistics that can help to track the health of the system. Monitoring is also used for collecting meteorological data for weather forecasting, collecting traffic data to regulate and mitigate congestion on highways and heavily used roads, tracking the operation of machines and facilities, and continuously gathering data for real-time systems.
Data streams are often analyzed to detect anomalies and irregularities. Anomalies and irregularities in the stream may indicate a problem in the underlying system or may reveal an event that requires intervention. Since the data in the stream is the basis for critical decisions, poor data quality may affect those decisions. In addition, collected data sets are often used for training machine learning models. The models are trained to learn the expected behavior of systems and applications. Thus, the data that is fed into these models in the training process should be accurate and representative. This requires high data quality. Otherwise, the trained models could be biased or yield inaccurate results. The impact of data quality on machine learning is discussed in [1].
Maintaining high-quality data is crucial when critical applications depend on the monitored system or on models that are trained over the data. This is essential in applications for forecasting events, and for detecting security attacks, fraud, outages, and the effects of natural events like storms on infrastructures and services.
Data quality has many aspects, including completeness (no missing data), consistency (the data does not lead to contradictory inferences), cleanliness (no noise), conformity (complying with standards and rules), and continuity (uniformity in the arrival of the data). Some of these aspects can be evaluated using standard anomaly detection tools, but only to a limited extent. Therefore, there is a need to combine a variety of tools for effective data-quality assurance.
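To make these aspects concrete, the following Python sketch illustrates how some of them could be estimated over a window of timestamped measurements. It is an illustration only, not part of the system described in this paper; the record representation (pairs of timestamp and value, with None for a missing value) and the chosen proxies are assumptions.

# Illustrative sketch (not the system described in this paper): simple
# per-window indicators for some data quality aspects, computed over a
# window of (timestamp, value) records where None denotes a missing value.
from statistics import pstdev

def quality_indicators(window):
    values = [v for _, v in window if v is not None]
    timestamps = [t for t, _ in window]
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return {
        # completeness: fraction of records that carry a value
        "completeness": len(values) / len(window) if window else 0.0,
        # cleanliness (proxy): spread of the values; a sudden increase may indicate noise
        "value_spread": pstdev(values) if len(values) > 1 else 0.0,
        # continuity (proxy): variability of inter-arrival times
        "gap_spread": pstdev(gaps) if len(gaps) > 1 else 0.0,
    }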
There are many tools and methods for detecting anomalies (outliers) in streaming data [2]. Anomalies are values in the data stream that are significantly different from the values that are expected based on previous observations. Often, anomalies can indicate that the system does not function properly. However, most anomalies are ephemeral and can be ignored, because by the time they are noticed the system is already back to normal. So, it is often essential to focus on lasting changes in the data stream, detect them, and alert on them. This raises several questions. First, what type of changes should the system detect? Second, how should changes be detected? Third, how should the changes be reported to users in a way that is effective and actionable, without overwhelming the user with too many alerts but also without missing critical alerts?
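The distinction between ephemeral anomalies and lasting changes can be illustrated with a simple Python sketch (not the method proposed in this paper): a rolling z-score flags point anomalies, and an alert is raised only when the anomaly persists for several consecutive points. The window size, threshold, and persistence parameters are arbitrary choices for illustration.

# Illustrative sketch: flag a point as anomalous when it deviates from the
# rolling baseline, but alert only when the deviation persists, filtering
# out ephemeral spikes.
from collections import deque
from statistics import mean, pstdev

def lasting_change_alerts(stream, window=50, z_threshold=3.0, persistence=5):
    history = deque(maxlen=window)
    consecutive = 0
    for i, x in enumerate(stream):
        if len(history) >= 10:  # wait for enough history to estimate a baseline
            mu, sigma = mean(history), pstdev(history)
            is_anomaly = sigma > 0 and abs(x - mu) / sigma > z_threshold
            consecutive = consecutive + 1 if is_anomaly else 0
            if consecutive == persistence:
                yield i  # the deviation has persisted; likely a lasting change
        history.append(x)

For example, list(lasting_change_alerts(cpu_utilization)) would return the positions at which a persistent deviation was first confirmed, rather than every transient spike.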
In this paper our focus is on detection of change points, that is, points where the underlying distribution of a statistical feature of the stream changes in a significant, non-ephemeral, and unexpected way. We present a modular architecture for change point detection over streaming data, to provide flexibility and adaptability for a large variety of data streams and diverse use cases.
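As a rough illustration of the idea (and not the modular method developed in this paper), a candidate change point can be scored by comparing the stream before and after each position; the following Python sketch compares window means, with the window size and threshold as assumed parameters.

# Illustrative sketch: a naive offline change point score that compares the
# mean of a "before" window with the mean of an "after" window at each
# position; a large standardized difference suggests a distribution shift.
from statistics import mean, pstdev

def mean_shift_candidates(series, half_window=30, threshold=1.0):
    candidates = []
    for i in range(half_window, len(series) - half_window):
        before = series[i - half_window:i]
        after = series[i:i + half_window]
        pooled_sd = pstdev(before + after) or 1e-9
        score = abs(mean(after) - mean(before)) / pooled_sd
        if score > threshold:
            candidates.append((i, score))
    # nearby candidates typically belong to the same change; in practice one
    # would keep only local maxima of the score
    return candidates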
The paper is organized as follows. In Section 2 we discuss related work. Section 3 introduces quality measures for data streams. In Section 4 we present methods for detection of change points. Section 5 describes our modular architecture and its benefits. Section 6 presents the results of our experimental evaluation. In Section 7 we discuss our conclusions and future work.
...