Deduplication for Data Profiling using Open Source Platform
- DOI
- 10.2991/icoiese-18.2019.48
- Keywords
- data preprocessing, data governance, Levenshtein distance
- Abstract
Many companies have yet to recognize the importance of data quality for their own improvement. Many companies in Indonesia, especially state-owned enterprises (BUMN) and government companies, run only a single application with a single database, which causes data to be duplicated across columns, tables, and applications when that application is integrated with others. This problem can be handled through data preprocessing, one method of which is data profiling. Data profiling is the process of gathering information about the data, which can be determined by process or logic. Profiling can be done with various tools, both paid and open source, each with its own advantages in performance and in data processing depending on the case study. This study focuses on data analysis through data profiling, using a deduplication method based on Levenshtein distance to check for duplicate data. The profiling results will be implemented as logic in open source applications, and the open source applications will then be compared.
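As an illustration of the duplicate check named in the abstract, the following is a minimal Python sketch of Levenshtein distance applied to a single column of values. It is not the authors' implementation: the function names, the lowercase/trim normalization, the `max_distance` threshold of 2, and the sample values are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, counting insertions,
    deletions, and substitutions as cost 1 each."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def find_near_duplicates(values, max_distance=2):
    """Return (i, j, distance) for every pair i < j whose normalized
    values are within max_distance edits of each other.

    The threshold and the normalization are hypothetical choices,
    not taken from the paper.
    """
    normalized = [v.strip().lower() for v in values]
    pairs = []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            distance = levenshtein(normalized[i], normalized[j])
            if distance <= max_distance:
                pairs.append((i, j, distance))
    return pairs


# Hypothetical sample column with one misspelled entry.
sample = ["Jakarta Selatan", "Jakarta Selatn", "Bandung"]
near_duplicates = find_near_duplicates(sample)
```

The pairwise comparison is quadratic in the number of values, so a sketch like this suits small columns only; larger data sets would typically need blocking or indexing before the distance is computed.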
- Copyright
- © 2019, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Margo Gunatama
AU - Tien Fabrianti
AU - Muhammad Azani Hasibuan
PY - 2019/03
DA - 2019/03
TI - Deduplication for Data Profiling using Open Source Platform
BT - Proceedings of the 2018 International Conference on Industrial Enterprise and System Engineering (IcoIESE 2018)
PB - Atlantis Press
SP - 272
EP - 276
SN - 2589-4943
UR - https://doi.org/10.2991/icoiese-18.2019.48
DO - 10.2991/icoiese-18.2019.48
ID - Gunatama2019/03
ER -