Text Clustering Algorithm Based on Feature Matrix Optimization and Data Dimensionality Reduction
Authors: Chen Wei (陈玮), Lu Jiawei (卢佳伟)
Affiliation: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

    Abstract:

    Text clustering often performs poorly because of the curse of dimensionality and the loss of feature information. To address this, this paper proposes a clustering algorithm based on feature matrix optimization and an improved principal component analysis (PCA) for dimensionality reduction. Building on the original term frequency-inverse document frequency (TF-IDF) algorithm, an adaptive length frequency weight (ALFW) weighting scheme is proposed, which gives the feature matrix a better distribution and makes the representation of the feature terms more pronounced. For dimensionality reduction, the PCA algorithm is optimized using the joint entropy criterion from information theory, and the resulting united entropy PCA (UE-PCA) algorithm further reduces the dimensionality of sparse high-dimensional data while better preserving the fidelity of the original high-dimensional data. Simulation experiments show that the proposed algorithm (K-means + UE-PCA + ALFW) outperforms other algorithms of the same type.
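For orientation, the baseline that ALFW builds on can be sketched as follows. This is a minimal sketch of plain, unsmoothed TF-IDF only; the abstract does not define the ALFW weights or the UE-PCA criterion, so neither is implemented here. The function name `tf_idf`, the toy documents, and the choice of relative term frequency with a natural-log IDF are assumptions made for illustration, not the authors' formulation.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, matrix) of plain TF-IDF weights.

    `docs` is a list of non-empty token lists (an assumption of this sketch).
    This is the textbook baseline, not the ALFW scheme from the paper.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for term in vocab:
            tf = counts[term] / len(doc)       # relative term frequency
            idf = math.log(n_docs / df[term])  # unsmoothed inverse document frequency
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix


# Hypothetical toy corpus, already tokenized.
docs = [
    ["text", "clustering", "pca"],
    ["text", "entropy"],
    ["text", "clustering", "kmeans", "kmeans"],
]
vocab, matrix = tf_idf(docs)
```

Each row of `matrix` is one document and each column one term of `vocab`. A term that occurs in every document (here `"text"`) gets weight zero, and most entries are zero for realistic vocabularies, which is the sparse high-dimensional setting the abstract refers to.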

    Fig.1 Algorithm flowchart
    Fig.2 Comparison of silhouette coefficients on the large-sample data set
    Fig.3 Comparison of silhouette coefficients on the small-sample data set
    Table 1 Feature frequency matrix
    Table 2 TF-IDF matrix
    Table 3 ALFW matrix
    Table 4 Data set information
    Table 5 Silhouette coefficient comparison for the large-sample data set
    Table 6 Silhouette coefficient comparison for the small-sample data set
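Cite this article

Chen Wei, Lu Jiawei. Text Clustering Algorithm Based on Feature Matrix Optimization and Data Dimensionality Reduction[J]. Journal of Data Acquisition and Processing (数据采集与处理), 2021, 36(3): 587-594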

History
  • Received: 2019-11-15
  • Last revised: 2020-08-27
  • Published online: 2021-06-16