ISSN : 2583-2646

AI-Enhanced Natural Language Processing for Improving Web Page Classification Accuracy

ESP Journal of Engineering & Technology Advancements
© 2024 by ESP JETA
Volume 4  Issue 1
Year of Publication : 2024
Authors : Dhruv Patel
DOI : 10.56472/25832646/JETA-V4I1P119

Citation:

Dhruv Patel, 2024. AI-Enhanced Natural Language Processing for Improving Web Page Classification Accuracy, ESP Journal of Engineering & Technology Advancements 4(1): 133-140.

Abstract:

The number of websites has grown rapidly since the World Wide Web became part of everyday life, and it continues to grow daily. To be used effectively, web pages must be categorized correctly, so classification techniques are needed that match pages to user queries. Text content is particularly important for this task. This paper proposes a Long Short-Term Memory (LSTM) network model that classifies web text by capturing its sequential and contextual relationships. The approach is evaluated on the WebKB dataset, which consists of web pages from several academic institutions. Features are extracted from structural elements such as titles, headings, body content, and URLs. In the experiments, the proposed LSTM model achieves an overall accuracy of 88.73%, a precision of 89.49%, a recall of 88%, and an F1-score of 88.84%. It outperforms Decision Tree, CNN, and BERT models, making it an effective solution for web page classification. Modeling pages as text sequences allowed the network to identify dependencies effectively, which improved classification performance, and adding web structure components made the model more accurate and more robust.
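
The feature extraction step mentioned in the abstract (titles, headings, body content, and URLs) can be illustrated with a minimal sketch. This is not the paper's implementation: the class `StructuralFeatureParser`, the helpers `tokenize` and `extract_features`, the tokenization rule, and the example page and URL are all hypothetical, and the sketch assumes only the Python standard library. It shows one plausible way to split a page into per-element token sequences that a sequence model such as an LSTM could later consume.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}  # assumption: their text is not page content


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class StructuralFeatureParser(HTMLParser):
    """Hypothetical helper: collects title, heading, and body text separately."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "headings": [], "body": []}
        # Nesting depth for each tag that changes where text is filed.
        self._depth = {t: 0 for t in HEADING_TAGS | SKIPPED_TAGS | {"title"}}

    def handle_starttag(self, tag, attrs):
        if tag in self._depth:
            self._depth[tag] += 1

    def handle_endtag(self, tag):
        if tag in self._depth and self._depth[tag] > 0:
            self._depth[tag] -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or any(self._depth[t] for t in SKIPPED_TAGS):
            return
        if self._depth["title"]:
            self.fields["title"].append(text)
        elif any(self._depth[t] for t in HEADING_TAGS):
            self.fields["headings"].append(text)
        else:
            self.fields["body"].append(text)


def extract_features(html, url):
    """Return one token list per structural element of the page."""
    parser = StructuralFeatureParser()
    parser.feed(html)
    parser.close()
    features = {k: tokenize(" ".join(v)) for k, v in parser.fields.items()}
    parts = urlparse(url)
    features["url"] = tokenize(parts.netloc + parts.path)
    return features


if __name__ == "__main__":
    page = (
        "<html><head><title>CS Department</title></head>"
        "<body><h1>Faculty Page</h1><p>Research on machine learning.</p>"
        "</body></html>"
    )
    print(extract_features(page, "http://www.cs.example.edu/faculty/index.html"))
```

Keeping the four token lists separate is a design choice of this sketch; a downstream model could concatenate them, weight them differently, or encode each with its own input branch.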

Keywords:

Web Page Classification, Text Mining, Information Retrieval, Web Mining, Machine Learning, Natural Language Processing, WebKB Dataset.