Text Clustering Algorithm Based on Feature Matrix Optimization and Data Dimensionality Reduction
Affiliation:

School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

CLC Number:

TP391

    Abstract:

    To address the low clustering efficiency caused by the curse of dimensionality and the loss of feature information in text clustering, this paper proposes a clustering algorithm based on feature matrix optimization and dimensionality reduction with an improved principal component analysis (PCA). Building on the original term frequency-inverse document frequency (TF-IDF) algorithm, an adaptive length frequency weight (ALFW) optimization scheme is proposed, which improves the distribution of the feature matrix and makes the feature terms more representative. For dimensionality reduction, the PCA algorithm is optimized using the joint entropy criterion from information theory, yielding the UE-PCA (United entropy-PCA) algorithm, which further reduces the dimensionality of sparse high-dimensional data while better preserving the fidelity of the original high-dimensional data. Simulation experiments show that the proposed algorithm (K-means + UE-PCA + ALFW) outperforms other similar algorithms.
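
    For orientation, the feature matrix that the abstract takes as its starting point is the standard TF-IDF matrix. The abstract does not give the ALFW or UE-PCA formulas, so the sketch below shows only plain TF-IDF and is not the authors' method. The function name `tf_idf_matrix`, the whitespace tokenizer, the unsmoothed `log(N / df)` form of IDF, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_matrix(docs):
    """Build a plain TF-IDF feature matrix (baseline only, no ALFW).

    tf(t, d) = occurrences of t in d / number of tokens in d
    idf(t)   = log(N / df(t)), with df(t) = number of documents containing t
    """
    # Assumption: lowercase whitespace tokenization is enough for the toy corpus.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: count each term at most once per document.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        row = [
            (counts[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocab
        ]
        matrix.append(row)
    return vocab, matrix


# Hypothetical three-document corpus.
docs = [
    "text clustering with k-means",
    "principal component analysis reduces dimensionality",
    "k-means clustering after dimensionality reduction",
]
vocab, matrix = tf_idf_matrix(docs)
```

    Each row of `matrix` is one document and each column one vocabulary term. Even on this toy corpus most entries are zero, which is the sparse, high-dimensional setting that the dimensionality reduction step in the abstract targets.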

Get Citation

CHEN Wei, LU Jiawei. Text Clustering Algorithm Based on Feature Matrix Optimization and Data Dimensionality Reduction[J].,2021,36(3):587-594.

History
  • Received: November 15, 2019
  • Revised: August 27, 2020
  • Online: May 25, 2021