Locality-based Partitioning for Spark
Authors
Yuchong Xia, Fangfang Yang
Corresponding Author
Yuchong Xia
Available Online April 2017.
- DOI
- 10.2991/fmsmt-17.2017.233
- Keywords
- Spark, shuffle, locality, data skew.
- Abstract
Spark is a memory-based distributed data processing framework. Large volumes of data are transmitted over the network during the shuffle, which is the main bottleneck in Spark. Because partitions are unbalanced across nodes, the inputs to the Reduce tasks are also unbalanced. To solve this problem, a partitioning policy based on the task locality level is designed to balance task input. Finally, the optimization is verified by experiments, which show that it can alleviate data skew and improve the efficiency of job processing.
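The paper's exact algorithm is not given on this page, but the idea of balancing Reduce-task input can be illustrated with a custom Spark `Partitioner`. The following Scala sketch is an assumption-laden illustration, not the authors' implementation: `SkewAwarePartitioner`, its parameters, and the sampling step are hypothetical names showing one common way to spread skewed keys more evenly across reducers.

```scala
import org.apache.spark.Partitioner

// Hypothetical sketch of a skew-aware partitioner in the spirit of a
// locality/balance-based policy; names and logic are illustrative only.
class SkewAwarePartitioner(numParts: Int, heavyKeys: Seq[Any]) extends Partitioner {

  // Dedicate one partition to each sampled heavy key, keeping at least
  // one partition free for the remaining (long-tail) keys.
  private val dedicated: Map[Any, Int] =
    heavyKeys.distinct.take(numParts - 1).zipWithIndex.toMap

  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int =
    dedicated.getOrElse(key, {
      // Hash long-tail keys into the partitions not reserved for heavy keys.
      val tailParts = numParts - dedicated.size
      val mod = key.hashCode % tailParts
      dedicated.size + (if (mod < 0) mod + tailParts else mod)
    })
}

// Usage sketch: sample key frequencies, pick heavy hitters, repartition.
// val freq  = pairs.sample(withReplacement = false, 0.01).countByKey()
// val heavy = freq.toSeq.sortBy(-_._2).take(8).map(_._1)
// val balanced = pairs.partitionBy(new SkewAwarePartitioner(64, heavy))
```

Isolating the heaviest keys in their own partitions bounds the largest Reduce input, while hashing the long tail keeps the remaining partitions roughly uniform, which is the balancing effect the abstract describes.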
- Copyright
- © 2017, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
Yuchong Xia, Fangfang Yang. "Locality-based Partitioning for Spark." In Proceedings of the 2017 5th International Conference on Frontiers of Manufacturing Science and Measuring Technology (FMSMT 2017), Atlantis Press, April 2017, pp. 1188-1192. ISSN 2352-5401. https://doi.org/10.2991/fmsmt-17.2017.233