Text Clustering Algorithm Based on Feature Matrix Optimization and Data Dimensionality Reduction

doi:10.16337/j.1004-9037.2021.03.016

2021, No. 3, pp. 587-594

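Affiliation: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China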


Abstract:

To address the poor clustering performance caused by the curse of dimensionality and the loss of feature information in text clustering, this paper proposes a clustering algorithm based on feature matrix optimization and improved principal component analysis (PCA) dimensionality reduction. Building on the original term frequency-inverse document frequency (TF-IDF) algorithm, an adaptive length frequency weight (ALFW) optimization scheme is proposed, which gives the feature matrix a better distribution and makes the representation of feature terms more distinct. For dimensionality reduction, the PCA algorithm is optimized using the joint entropy criterion from information theory, and the resulting united entropy-PCA (UE-PCA) algorithm further reduces the dimensionality of sparse high-dimensional data while better preserving the fidelity of the original high-dimensional data. Simulation experiments show that the proposed algorithm (K-means + UE-PCA + ALFW) outperforms other algorithms of the same type.
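
For orientation, the three-stage pipeline that the abstract builds on (term weighting, dimensionality reduction, K-means) can be sketched with the standard baseline components. This is only an illustrative sketch: it uses plain TF-IDF and plain PCA, not the ALFW weighting or the UE-PCA criterion, whose definitions are not given in the abstract. The function names (`tfidf_matrix`, `pca`, `kmeans`, `cluster`), the whitespace tokenizer, and the farthest-point initialization are assumptions made for the example.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Plain TF-IDF (not ALFW): rows[i][j] = tf(term j, doc i) * log(N / df(term j))."""
    tokenized = [doc.lower().split() for doc in docs]  # assumes non-empty documents
    vocab = sorted({t for toks in tokenized for t in toks})
    df = Counter(t for toks in tokenized for t in set(toks))
    n_docs = len(docs)
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[t] / len(toks) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


def pca(rows, k, iters=200, seed=0):
    """Plain PCA (not UE-PCA): project centered rows onto the top-k principal
    directions, found by power iteration with deflation on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(x[a] * x[b] for x in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.random() - 0.5 for _ in range(d)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue; subtract it out (deflation).
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        components.append(v)
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return [[sum(x[j] * c[j] for j in range(d)) for c in components] for x in centered]


def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100):
    """Lloyd's K-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster(docs, n_components, n_clusters):
    """Baseline pipeline: TF-IDF weighting, PCA reduction, then K-means."""
    _, rows = tfidf_matrix(docs)
    return kmeans(pca(rows, n_components), n_clusters)


docs = [
    "cat dog pet animal", "dog pet animal fur", "cat pet fur animal",
    "stock market trade price", "market price trade bank", "stock bank price market",
]
labels = cluster(docs, n_components=2, n_clusters=2)
print(labels)
```

On this toy corpus the two topics use disjoint vocabularies, so the first principal direction separates them and the three pet-related documents receive one label while the three finance-related documents receive the other. In terms of this sketch, the proposed method would replace the weighting in `tfidf_matrix` with ALFW and the component selection in `pca` with the joint-entropy-based UE-PCA criterion.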