An Analysis of Characters and Structures of Web Pages Based on Regular Expressions
Authors
Xu Lei
Corresponding Author
Xu Lei
Available Online June 2014.
- DOI
- 10.2991/csss-14.2014.22
- Keywords
- information extraction; HTML; regular expressions
- Abstract
This paper introduces a method for analyzing the characters and structure of web pages using regular expressions. Characters in web pages are counted one by one, from the encoding to the HTML elements. The effectiveness of the tool is demonstrated in experiments on more than one hundred real-world web pages. This work prepares the ground for large-scale web information extraction.
- Copyright
- © 2014, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
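The abstract describes counting the characters and elements of a page with regular expressions. A minimal Python sketch of that general idea follows. It is not the author's tool: the patterns, the function names (`detect_charset`, `count_tags`, `count_characters`), and the sample page are all assumptions made for illustration, and regular expressions only approximate HTML parsing (comments and script contents, for example, can be miscounted).

```python
import re
from collections import Counter

# Hypothetical patterns for illustration; not taken from the paper.
# Matches a charset declaration inside a <meta> tag, in either the
# <meta charset="..."> or the http-equiv/content form.
CHARSET_RE = re.compile(r'<meta[^>]*charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
# Matches opening (or self-closing) tags and captures the element name.
TAG_RE = re.compile(r'<\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>')
# Matches any tag, opening or closing, so it can be stripped out.
ANY_TAG_RE = re.compile(r'<[^>]+>')


def detect_charset(html, default="utf-8"):
    """Return the declared charset in lowercase, or a default if none is found."""
    match = CHARSET_RE.search(html)
    return match.group(1).lower() if match else default


def count_tags(html):
    """Count opening tags per element name (case-insensitive)."""
    return Counter(name.lower() for name in TAG_RE.findall(html))


def count_characters(html):
    """Split the page's character count into markup and visible text."""
    text = ANY_TAG_RE.sub("", html)
    return {"total": len(html), "markup": len(html) - len(text), "text": len(text)}


# Made-up sample page, used only to exercise the functions.
sample = ('<html><head><meta charset="UTF-8"></head>'
          '<body><p>Hi</p><p>there</p></body></html>')

charset = detect_charset(sample)
tags = count_tags(sample)
chars = count_characters(sample)
print(charset, dict(tags), chars)
```

For extraction at scale, a dedicated HTML parser is usually more robust than regular expressions. The sketch only shows how far simple patterns go for counting.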
Cite this article
TY - CONF
AU - Xu Lei
PY - 2014/06
DA - 2014/06
TI - An Analysis of Characters and Structures of Web Pages Based on Regular Expressions
BT - Proceedings of the 3rd International Conference on Computer Science and Service System
PB - Atlantis Press
SP - 98
EP - 101
SN - 1951-6851
UR - https://doi.org/10.2991/csss-14.2014.22
DO - 10.2991/csss-14.2014.22
ID - Lei2014/06
ER -