Capture Methods of Gambling Related Illegal Websites in Massive Websites
(Original title: 海量网站中博彩类违法网站的捕获方法)

Author:

刘家银, 印杰, 牛博威, 诸葛程晨, 贺海辰

Affiliation:

1. Department of Computer Information and Cyber Security, Jiangsu Police Institute, Nanjing 210031, China; 2. Jiangsu Electronic Data Forensics and Analysis Engineering Research Center, Jiangsu Police Institute, Nanjing 210031, China; 3. Key Laboratory of Digital Forensics of Jiangsu Provincial Public Security Department, Jiangsu Police Institute, Nanjing 210031, China; 4. Cyber Security Guard Corps, Jiangsu Provincial Public Security Department, Nanjing 210024, China; 5. Big Data Center, Nanjing Municipal Public Security Bureau, Nanjing 210005, China

Fund Project:

Science and Technology Research Project of Jiangsu Provincial Public Security Department (2020KX008); Natural Science Foundation of the Jiangsu Higher Education Institutions (19KJB510022); Scientific Research Start-up Fund for High-level Introduced Talents of Jiangsu Police Institute.

    Abstract:

    To detect illegal gambling websites among massive numbers of websites, this paper proposes a website classification method based on BERT-BiLSTM and multi-classifier decision-level fusion. The method improves classification performance in three ways. First, when extracting feature text from a web page, it gives priority to high-value text such as the meta information in the HTML head (including the page title) and hyperlink titles, which enriches the textual features. Second, it introduces a text classification model based on BERT-BiLSTM, which learns good sentence feature representations and thereby improves classification performance. Finally, it performs decision-level fusion on the classification results obtained from three descriptive dimensions of a website (website title, keywords, and page text), which further improves the performance and robustness of the whole system. In addition, suspected gambling domain names are generated with a variety of strategies, which strengthens the method's ability to actively capture illegal gambling websites. Experimental results, together with results from running the method in real cyberspace, demonstrate its effectiveness.
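The decision-level fusion step can be illustrated with a minimal sketch. The abstract does not state which fusion rule is used, so the weighted averaging of per-view probabilities below is an assumption, as are the function name `fuse_decisions`, the view names, the equal default weights, and the 0.5 threshold. Each of the three classifiers (title, keywords, page text) is assumed to output a probability that the site is a gambling site.

```python
from typing import Dict, Optional, Tuple

# The three descriptive dimensions named in the abstract.
VIEWS = ("title", "keywords", "page_text")


def fuse_decisions(
    probs: Dict[str, Optional[float]],
    weights: Optional[Dict[str, float]] = None,
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Fuse per-view gambling probabilities by weighted averaging.

    This is an illustrative rule, not the paper's confirmed method.
    A view whose probability is None (for example, a page without a
    keywords meta tag) is skipped and the remaining weights are
    renormalised.
    """
    if weights is None:
        weights = {view: 1.0 for view in VIEWS}  # assumed equal weights

    weighted_sum = 0.0
    weight_total = 0.0
    for view in VIEWS:
        p = probs.get(view)
        if p is None:
            continue  # this view is unavailable for the site
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {view!r} out of range: {p}")
        w = weights.get(view, 0.0)
        weighted_sum += w * p
        weight_total += w

    if weight_total == 0.0:
        raise ValueError("no usable view to fuse")

    score = weighted_sum / weight_total
    return score, score >= threshold


if __name__ == "__main__":
    # Hypothetical classifier outputs for one website.
    score, is_gambling = fuse_decisions(
        {"title": 0.9, "keywords": 0.6, "page_text": 0.3}
    )
    print(f"fused score = {score:.2f}, flagged = {is_gambling}")
```

Averaging probabilities rather than hard labels lets a confident classifier outweigh two uncertain ones, and skipping missing views keeps the rule usable for sites that lack one of the three text sources.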

    Table 1 Classification performance comparison of different text preprocessing methods
    Table 2 Classification performance comparison between BERT-BiLSTM and other combination models
    Table 3 Classification performance comparison between decision-level fusion and single classification methods
    Table 4 Classification performance comparison between the proposed method and other methods
    Fig.1 Operation structure of gambling website
    Fig.2 Flow chart of the proposed method
    Fig.3 Flow chart of web crawler
    Fig.4 Model of Transformer encoder
    Fig.5 Multi-head attention structure
    Fig.6 Cell structure of LSTM
    Fig.7 BERT-BiLSTM model
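Cite this article

刘家银, 印杰, 牛博威, 诸葛程晨, 贺海辰. Capture methods of gambling related illegal websites in massive websites[J]. Journal of Data Acquisition and Processing, 2021, 36(5): 1050-1061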

History
  • Received: 2020-10-09
  • Last revised: 2021-01-20
  • Published online: 2021-10-22