Research of Spark SQL Query Optimization Based on Massive Small Files on HDFS
Authors
Kefei Cheng, Xudong Chen, Ke Zhou, Xianjun Deng, Zhao Luo
Corresponding Author
Kefei Cheng
Available Online May 2019.
- DOI
- 10.2991/cnci-19.2019.25
- Keywords
- Industry Application Card data, Spark SQL, HDFS, Small Files, Parquet
- Abstract
This paper addresses the low efficiency of Spark SQL when reading massive numbers of small files on HDFS in a 4G Industry Application Card (IAC) business analysis system. To solve this issue, we propose a Local Merge Storage Model (LMSM) for 4G IAC small files. In this model, data locality is enhanced by exploiting the type and time attributes of the small files. Spark is then used to merge the small files into Parquet column-storage files and store them on HDFS. The experimental results show that, after the small files are merged and stored in partitions, Spark SQL query efficiency improves by up to a factor of 60.
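The paper itself does not include code; the following is only a minimal Spark (Scala) sketch of the kind of merge step the abstract describes, where small files are grouped by type and time and rewritten as partitioned Parquet on HDFS. The HDFS paths, input format, and column names (`record_type`, `record_date`) are assumptions for illustration, not the authors' implementation.

```scala
// Minimal sketch of the merge step described in the abstract (not the authors' code).
// Paths, schema, and column names ("record_type", "record_date") are assumed for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SmallFileMergeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MergeIacSmallFilesToParquet")
      .getOrCreate()

    // Read the many small CSV records produced by the 4G IAC business system.
    val records = spark.read
      .option("header", "true")
      .csv("hdfs:///iac/raw/*.csv")

    // Cluster rows that share the same business type and day, then write them
    // back to HDFS as a much smaller number of Parquet (columnar) files,
    // partitioned on those two attributes.
    records
      .repartition(col("record_type"), col("record_date"))
      .write
      .partitionBy("record_type", "record_date")
      .mode("overwrite")
      .parquet("hdfs:///iac/merged_parquet")

    // Spark SQL can then query the merged Parquet data with partition pruning.
    spark.read.parquet("hdfs:///iac/merged_parquet")
      .createOrReplaceTempView("iac_records")
    spark.sql("SELECT record_type, COUNT(*) AS n FROM iac_records GROUP BY record_type").show()

    spark.stop()
  }
}
```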
- Copyright
- © 2019, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Kefei Cheng
AU - Xudong Chen
AU - Ke Zhou
AU - Xianjun Deng
AU - Zhao Luo
PY - 2019/05
DA - 2019/05
TI - Research of Spark SQL Query Optimization Based on Massive Small Files on HDFS
BT - Proceedings of the 2019 International Conference on Computer, Network, Communication and Information Systems (CNCI 2019)
PB - Atlantis Press
SP - 180
EP - 190
SN - 2352-538X
UR - https://doi.org/10.2991/cnci-19.2019.25
DO - 10.2991/cnci-19.2019.25
ID - Cheng2019/05
ER -