Optimizing distributed data stream processing by tracing

Zvara, Zoltán and Szabó, Péter and Balázs, Barnabás Lóránt and Benczúr, András Jr. (2019) Optimizing distributed data stream processing by tracing. FUTURE GENERATION COMPUTER SYSTEMS, 90. pp. 578-591. ISSN 0167-739X. DOI: 10.1016/j.future.2018.06.047

Text: Zvara_578_30360609_z.pdf (2MB) — Restricted to Registered users only
Text: Zvara_578_30360609_ny.pdf (834kB) — Open access

Abstract

Heterogeneous mobile, sensor, IoT, smart environment, and social networking applications have recently started to produce unbounded, fast, and massive-scale streams of data that have to be processed “on the fly”. Systems that process such data have to be enhanced with detection for operational exceptions and with triggers for both automated and manual operator actions. In this paper, we illustrate how tracing in distributed data processing systems can be applied to detecting changes in data and operational environment to maintain the efficiency of heterogeneous data stream processing systems under potentially changing data quality and distribution. By the tracing of individual input records, we can (1) identify outliers in a web crawling and document processing system and use the insights to define URL filtering rules; (2) identify heavy keys, such as NULL, that should be filtered before processing; (3) give hints to improve the key-based partitioning mechanisms; and (4) measure the limits of overpartitioning if heavy thread-unsafe libraries are imported. By using Apache Spark as illustration, we show how various data stream processing efficiency issues can be mitigated or optimized by our distributed tracing engine. We describe and qualitatively compare two different designs, one based on reporting to a distributed database and another based on trace piggybacking. Our prototype implementation consists of wrappers suitable for JVM environments in general, with minimal impact on the source code of the core system. Our tracing framework is the first to solve tracing in multiple systems across boundaries and to provide detailed performance measurements suitable for automated optimization, not just debugging.
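The abstract's second design, trace piggybacking, can be illustrated with a minimal JVM sketch: the trace travels with each record through the operator pipeline instead of being reported to a distributed database. All names here (`Traced`, `traced`, the operator labels) are hypothetical illustrations, not the authors' API; the real framework wraps records inside Spark with minimal changes to the core system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// A minimal sketch of trace piggybacking: each record carries its own
// per-operator trace, so no external trace store is needed.
public class PiggybackTrace {

    // A payload bundled with the list of operator hops it has passed through.
    public static final class Traced<T> {
        public final T payload;
        public final List<String> hops;
        Traced(T payload, List<String> hops) {
            this.payload = payload;
            this.hops = hops;
        }
    }

    // Enter a record into the traced pipeline with an empty hop list.
    public static <T> Traced<T> source(T payload) {
        return new Traced<>(payload, new ArrayList<>());
    }

    // Wrap a plain operator so each application appends its name and
    // elapsed time to the record's piggybacked trace.
    public static <A, B> Function<Traced<A>, Traced<B>> traced(
            String opName, Function<A, B> op) {
        return in -> {
            long t0 = System.nanoTime();
            B out = op.apply(in.payload);
            long elapsedNs = System.nanoTime() - t0;
            List<String> hops = new ArrayList<>(in.hops);
            hops.add(opName + ":" + elapsedNs + "ns");
            return new Traced<>(out, hops);
        };
    }

    public static void main(String[] args) {
        Traced<String> rec = source("https://example.com/page");
        Traced<Integer> out =
            traced("parseLength", (String s) -> s.length())
                .andThen(traced("square", (Integer n) -> n * n))
                .apply(rec);
        System.out.println("result=" + out.payload + " trace=" + out.hops);
    }
}
```

Because the hop list rides along with the record, per-record latencies survive shuffles and system boundaries, which is what makes the trace usable for detecting outlier records and heavy keys rather than only for debugging.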

Item Type: Article
Uncontrolled Keywords: Distributed data processing; Apache Spark; Data stream processing; Distributed tracing; Data provenance;
Subjects: Q Science > QA Mathematics and Computer Science > QA75 Electronic computers. Computer science
Divisions: Informatics Laboratory
SWORD Depositor: MTMT Injector
Depositing User: MTMT Injector
Date Deposited: 08 Oct 2019 16:04
Last Modified: 17 Nov 2021 14:06
URI: https://eprints.sztaki.hu/id/eprint/9802
