Authors:

Firas Bayram, Bestoun S. Ahmed, Erik Hallin

Paper:

https://arxiv.org/abs/2408.06724

Introduction

In the modern industrial landscape, data has become a critical asset, driving the success of artificial intelligence (AI) and machine learning (ML) solutions. Ensuring the quality of this data is paramount for reliable decision-making. This paper introduces the Adaptive Data Quality Scoring Operations Framework, a novel approach designed to address the dynamic nature of data quality in industrial applications. By integrating a dynamic change detector mechanism, this framework ensures that data quality scores remain relevant and accurate over time.

Conceptual Background

Continuous Monitoring and Drift Detection

Continuous monitoring of data streams is essential to capture dynamic changes and maintain high data quality. Drift in data streams refers to changes in the statistical properties of the data over time. Detecting drift involves quantifying the dissimilarity between data distributions at different time points. The framework uses the Jensen-Shannon divergence to measure this dissimilarity, dynamically adjusting the threshold for drift detection to reflect evolving system conditions.
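The divergence check itself is straightforward to sketch. The snippet below is a minimal illustration, not the paper's implementation: it compares a reference window against the current window using the Jensen-Shannon divergence, and the fixed `threshold` argument stands in for the framework's dynamically adjusted one. The function name and binning scheme are assumptions.

```python
# Minimal drift check via Jensen-Shannon divergence between two data windows.
# Illustrative only: the paper adjusts the threshold dynamically, whereas a
# fixed `threshold` is passed in here for simplicity.
import numpy as np
from scipy.spatial.distance import jensenshannon

def detect_drift(reference, current, threshold, bins=20):
    """Return (drift_flag, divergence) for two 1-D sample windows."""
    # Shared bin edges so both histograms are directly comparable
    edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=bins)
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(current, bins=edges)
    # SciPy returns the JS distance (the square root of the divergence)
    divergence = jensenshannon(p, q) ** 2
    return divergence > threshold, divergence
```

Exceeding the threshold flags drift and would trigger the adaptation process described later in the framework.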

Data Quality Assurance

Data quality assurance involves evaluating various dimensions of data quality, such as accuracy, completeness, consistency, timeliness, and skewness. Each dimension provides a unique perspective on the data’s characteristics. The framework focuses on quantitative data scoring, providing valuable insights for industrial applications.
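As a toy illustration of how such dimensions can be measured on a tabular batch (these are simplified stand-ins, not the paper's formulas; the `expected_rows` timeliness proxy is an assumption):

```python
# Toy per-dimension measurements for one tabular batch; formulas are
# simplified stand-ins for the dimensions named above, not the paper's own.
import pandas as pd
from scipy.stats import skew

def dimension_values(batch: pd.DataFrame, expected_rows: int) -> dict:
    numeric = batch.select_dtypes("number")
    return {
        # completeness: share of non-missing cells in the batch
        "completeness": float(1.0 - batch.isna().mean().mean()),
        # timeliness: share of expected records that actually arrived on time
        "timeliness": min(len(batch) / expected_rows, 1.0),
        # skewness: mean absolute skew across numeric columns
        "skewness": float(numeric.apply(skew).abs().mean()),
    }
```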

Related Work

Previous research has explored various methods for data quality assessment across different domains, including healthcare, finance, and IoT. Traditional methods often lack the adaptability required for dynamic industrial environments. Recent studies have introduced ML-based techniques for data quality scoring, but these methods often neglect the adaptive nature of data quality dimensions.

The Adaptive Data Quality Scoring Operations Framework

The proposed framework employs an ML-based approach to score data quality in industrial applications. It addresses the limitations of non-adaptive frameworks by dynamically adjusting data quality scores based on detected drifts in the data streams.

An Overview of ML-Based Data Quality Scoring

The ML-based scoring framework uses an ML predictor to generate a unified data quality score. The framework incorporates MLOps principles, ensuring continuous monitoring and validation of the ML model. The scoring process involves aggregating the calculated values of various data quality dimensions using Principal Component Analysis (PCA).
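A compact way to realize the PCA aggregation step might look as follows; the standardization and the min-max rescaling of the first component are assumptions, chosen so the unified score lands in [0, 1]:

```python
# Aggregate per-batch dimension values into one unified quality score by
# projecting onto the first principal component. Scaling choices are
# illustrative assumptions, not taken from the paper.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def unified_scores(dimension_matrix: np.ndarray) -> np.ndarray:
    """dimension_matrix: rows = data batches, columns = quality dimensions."""
    scaled = StandardScaler().fit_transform(dimension_matrix)
    pc1 = PCA(n_components=1).fit_transform(scaled).ravel()
    # Min-max rescale so scores are comparable across runs
    return (pc1 - pc1.min()) / (pc1.max() - pc1.min() + 1e-12)
```

The resulting scores serve as targets for the ML predictor, which can then estimate data quality directly without recomputing every dimension on each batch.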

Development Phase

The development phase involves initializing system artifacts, such as the ML predictor, divergence values, anomaly detector, data distribution, and historical samples. These artifacts provide the baseline state the solution relies on once it is running in production.
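One plausible way to bundle these artifacts is shown below; only the artifact names come from the paper, while the container structure itself is an assumption:

```python
# Container for the development-phase artifacts listed above. The field names
# mirror the paper's description; the dataclass itself is an assumption.
from dataclasses import dataclass, field
from typing import Any, Optional
import numpy as np

@dataclass
class SystemArtifacts:
    predictor: Any                                       # trained DQ-score predictor
    divergence_history: list = field(default_factory=list)  # past JS divergences
    anomaly_detector: Optional[Any] = None               # flags abnormal divergence
    reference_distribution: Optional[np.ndarray] = None  # baseline data distribution
    historical_samples: Optional[np.ndarray] = None      # retained for retraining
```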

Deployment Phase

In the deployment phase, the framework integrates the developed components into the operational system. The change detector assesses the occurrence of drift, triggering the adaptation process when necessary. This ensures that the ML model remains up-to-date with the evolving data characteristics.
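Putting the pieces together, the deployment loop can be sketched as below. It reuses the `detect_drift` helper from the earlier snippet; the callable `predictor` and the adaptation step are placeholders, not the paper's API.

```python
# Hedged sketch of the deployment loop: score each batch, check for drift,
# and adapt when drift is flagged. `detect_drift` is the earlier sketch;
# `predictor` is any callable returning a quality score for a batch.
def process_stream(batches, predictor, reference, threshold):
    for batch in batches:
        quality = predictor(batch)                      # ML-based DQ score
        drifted, divergence = detect_drift(reference, batch, threshold)
        if drifted:
            # Adaptation step: refresh the reference window; in the full
            # framework the predictor would also be retrained here.
            reference = batch
        yield quality, divergence, drifted
```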

Experimental Results

The framework was evaluated in a real-world industrial use case at Uddeholms AB, a leading steel manufacturer. The experiments assessed the framework’s predictive performance, processing time, and resource consumption.

Drift Detection Sensitivity Analysis

The sensitivity analysis of the drift detection mechanism showed that the number of detected changes depends on the chosen significance threshold: a lower threshold yields fewer detections, while a higher threshold makes the detector more sensitive and flags more changes.

Performance Analysis of DQ Scoring Predictions

The predictive performance of the ML model was evaluated using Mean Absolute Error (MAE) and R-squared (R2) metrics. The results showed that the adaptive approach improves predictive performance after executing adaptation mechanisms.
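For reference, both metrics are standard and computable with scikit-learn; the arrays below are illustrative placeholders, not the paper's results:

```python
# Standard computation of the two reported metrics on placeholder values.
from sklearn.metrics import mean_absolute_error, r2_score

y_true = [0.82, 0.75, 0.91, 0.66]   # reference unified DQ scores (illustrative)
y_pred = [0.80, 0.78, 0.88, 0.70]   # model predictions (illustrative)

print(f"MAE: {mean_absolute_error(y_true, y_pred):.3f}")
print(f"R^2: {r2_score(y_true, y_pred):.3f}")
```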


Time Required Analysis

The time required for different scoring methodologies was analyzed. The adaptive approach demonstrated significant improvements in processing efficiency compared to the standard and static approaches.

Analysis of Dynamic Data Quality Dimension Scores

The analysis of dynamic data quality dimensions, such as timeliness and skewness, showed substantial changes in scores after drift occurrences. This highlights the importance of adaptation in reflecting the evolving characteristics of the data.

Resource Consumption

The resource consumption analysis revealed that the adaptive approach consumes slightly more CPU compared to the static approach, while the memory usage is similar. The standard approach showed the lowest resource consumption.


Conclusion

The Adaptive Data Quality Scoring Operations Framework addresses the challenges of scoring dynamic data quality dimensions in industrial processes. By integrating a dynamic change detector, the framework ensures that data quality scores remain relevant and accurate over time. The experimental results demonstrate substantial improvements in processing time efficiency and predictive performance, making the framework a feasible solution for critical industrial applications.

Moving forward, the integration of this framework into broader data-driven AI systems will be explored, leveraging real-time data quality scores to enhance ML model training and decision-making processes in industrial environments.
