An Ensemble Model of Dual Learning for Gambling and Pornographic Websites Classification
Abstract
The rapid proliferation of pornographic and gambling websites poses significant challenges, as these platforms increasingly employ sophisticated techniques to evade detection. Traditional classification approaches that rely on a single feature often fail to achieve high detection rates because these websites use diverse strategies to bypass detection systems. To address this limitation, this study introduces an ensemble model that classifies pornographic and gambling websites by integrating two features: URLs and textual content. A web scraping script was developed to extract textual data from the HTML elements of 3,000 websites curated for Thai users and evenly distributed across benign, pornographic, and gambling categories. The URLs are preprocessed to capture their semantic properties, which reflect the characteristics of the corresponding websites. A separate classifier was trained on each feature, and the two classifiers were then combined into an ensemble model for the final prediction. This approach achieved an accuracy of 96.83%, significantly surpassing the single-feature classifiers. Moreover, the findings demonstrate the proposed model's robustness against obfuscation techniques and anti-crawling mechanisms, underscoring its potential for effective automated detection.
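The two-feature pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the URL preprocessing rules, the HTML elements scraped, or how the two classifiers are combined, so everything below is an assumption made for illustration: the helper names `tokenize_url`, `extract_text`, and `soft_vote` are hypothetical, and weighted probability averaging (soft voting) stands in for whatever ensemble rule the study actually uses. The probability vectors would come from the two trained classifiers, which are not reproduced here.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Class order assumed for the probability vectors (illustrative only).
CLASSES = ("benign", "pornographic", "gambling")


def tokenize_url(url):
    """Split a URL's host and path into lowercase alphanumeric tokens.

    Hypothetical preprocessing; the study's actual rules are not given.
    """
    parsed = urlparse(url if "://" in url else "http://" + url)
    raw = (parsed.netloc + " " + parsed.path).lower()
    tokens = re.split(r"[^a-z0-9]+", raw)
    return [t for t in tokens if t and t != "www"]


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def soft_vote(url_probs, text_probs, url_weight=0.5):
    """Average two class-probability vectors and return (label, scores).

    Soft voting is an assumed combination rule, not the paper's stated one.
    """
    scores = [
        url_weight * u + (1.0 - url_weight) * t
        for u, t in zip(url_probs, text_probs)
    ]
    best = max(range(len(CLASSES)), key=scores.__getitem__)
    return CLASSES[best], scores
```

In this sketch, a page whose scraped text is uninformative (for example, because an anti-crawling mechanism returned little content) can still be labeled from its URL tokens, since the URL classifier's probabilities contribute to the averaged score.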