Text Clustering Algorithm Based on Feature Matrix Optimization and Data Dimensionality Reduction
Author:
Affiliation:
School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
Fund Project:
Abstract:
Text clustering suffers from the curse of dimensionality and from the loss of feature information, both of which degrade clustering quality. To address this, this paper proposes a clustering algorithm based on feature matrix optimization and an improved principal component analysis (PCA) dimensionality reduction. Building on the original term frequency-inverse document frequency (TF-IDF) algorithm, an adaptive length frequency weight (ALFW) weighting scheme is proposed, which gives the feature matrix a better distribution and makes the feature terms more distinct. For dimensionality reduction, the PCA algorithm is optimized using the joint entropy criterion from information theory, and the resulting UE-PCA (United entropy-PCA) algorithm further reduces the dimensionality of sparse high-dimensional data while better preserving the information in the original data. Simulation experiments show that the proposed algorithm (K-means + UE-PCA + ALFW) outperforms other algorithms of the same type.
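The abstract does not give the ALFW formula, so it is not reproduced here. As a point of reference only, the plain TF-IDF baseline that ALFW is said to build on can be sketched as follows. The function name `tf_idf`, the toy documents, and the choice of relative term frequency with an unsmoothed logarithmic IDF are all assumptions made for illustration; the paper may use a different TF-IDF variant.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, matrix) of plain TF-IDF weights.

    `docs` is a list of tokenized documents (lists of strings).
    This is the textbook baseline, not the paper's ALFW scheme.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for term in vocab:
            # Relative term frequency; an empty document gets all zeros.
            tf = counts[term] / len(doc) if doc else 0.0
            # Unsmoothed inverse document frequency.
            idf = math.log(n_docs / df[term])
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix


# Hypothetical toy corpus, already tokenized.
docs = [
    "text clustering with pca".split(),
    "text clustering with k-means".split(),
    "entropy in information theory".split(),
]
vocab, matrix = tf_idf(docs)
```

Each row of `matrix` is one document and each column one term in `vocab`. A term that occurs in every document would get weight zero under this variant, which is one reason weighting refinements such as the one proposed here are of interest.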
Table 1 Feature frequency matrix
Table 2 TF-IDF matrix
Table 4 Data set information
Table 5 Silhouette coefficient comparison on the large data set
Table 6 Silhouette coefficient comparison on the small data set
Fig.1 Algorithm flowchart
Fig.2 Silhouette coefficient comparison of algorithms on the large-sample data set
Fig.3 Silhouette coefficient comparison of algorithms on the small-sample data set
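The comparisons listed above are reported in terms of the silhouette coefficient. For readers unfamiliar with the metric, a minimal sketch of its standard definition follows. The function name `silhouette_score`, the use of Euclidean distance, and the toy points are assumptions for illustration; the distance measure used in the paper's experiments is not stated on this page.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = mean_dist(i, own)  # mean distance to the point's own cluster
        b = min(               # mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across it
```

Values lie in [-1, 1]. A labeling that matches the geometry scores close to 1, while one that cuts across it scores below 0, so a higher value indicates a better clustering.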