Text preprocessing based on part of speech (POS), and clustering

MATLAB

Follows: 114

Downloads: 0

File size: 8.90 MB

Category: Other

Development platform: MATLAB

Points required to download: 2


Code description

Traditional text feature extraction selects features by building a stop-word list (dictionary). When the texts are long and numerous, the stop-word list becomes very large and extracting feature words becomes very inefficient. This algorithm instead extracts text features based on part of speech. Because the number of Chinese part-of-speech categories is limited, feature extraction is efficient, which makes later text clustering or classification easier. Word segmentation depends on the ICTCLAS50 segmentation component (ICTCLAS is developed at the Institute of Computing Technology, Chinese Academy of Sciences). I compiled and ran the code successfully on MATLAB R2011b.
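The idea of selecting features by part of speech rather than by a stop-word list can be sketched as follows. This is a minimal illustration in Python, not the MATLAB code in the package. It assumes that the segmenter output is a whitespace-separated string of `word/tag` tokens, and that tags beginning with `n`, `v`, or `a` (nouns, verbs, adjectives) mark the words worth keeping. The names `pos_filter`, `term_counts`, and `KEEP_PREFIXES`, as well as the sample sentence and its tags, are hypothetical.

```python
from collections import Counter

# Assumption: tags starting with these letters denote content words
# (nouns, verbs, adjectives). Adjust to the tag set actually in use.
KEEP_PREFIXES = ("n", "v", "a")


def pos_filter(tagged_text, keep_prefixes=KEEP_PREFIXES):
    """Return the words whose POS tag starts with one of keep_prefixes.

    tagged_text is assumed to look like "word/tag word/tag ...".
    """
    kept = []
    for token in tagged_text.split():
        # Split on the last slash so a slash inside the word is preserved.
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            continue  # skip tokens that are not in word/tag form
        if tag.startswith(keep_prefixes):
            kept.append(word)
    return kept


def term_counts(tagged_text):
    """Count the kept words, giving a simple bag-of-words feature vector."""
    return Counter(pos_filter(tagged_text))


if __name__ == "__main__":
    # Hypothetical tagged sentence; the tags are illustrative only.
    sample = "文本/n 的/ude1 聚类/vn 效率/n 很/d 高/a 。/wj"
    print(pos_filter(sample))
    print(term_counts(sample))
```

The filter only checks each token's tag against a small, fixed set of prefixes, so its cost does not grow with the size of a stop-word list. The resulting term counts could then serve as input to a clustering or classification step.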


Code preview