[1] CHEN Kai, HUANG Yinglai, GAO Wentao, et al. An Improved Naive Bayesian Text Classification Algorithm Based on Weighted Features and Its Complementary Set[J]. Journal of Harbin University of Science and Technology, 2018, (04): 69-74. [doi:10.15938/j.jhust.2018.04.013]

An Improved Naive Bayesian Text Classification Algorithm Based on Weighted Features and Its Complementary Set

Journal of Harbin University of Science and Technology [ISSN: 1007-2683 / CN: 23-1404/N]

Volume:
Issue:
2018, No. 04
Pages:
69-74
Column:
Computer and Control Engineering
Publication date:
2018-08-25

Article Info

Title:
An Improved Naive Bayesian Text Classification Algorithm Based on Weighted Features and Its Complementary Set
Author(s):
CHEN Kai 1,2; HUANG Yinglai 1; GAO Wentao 1; ZHAO Peng 1
1. Information and Computer Engineering College, Northeast Forestry University, Harbin 150040, China; 2. Harbin Metro Group Co., Ltd., Harbin 150000, China
Keywords:
feature weight; text classification; naive Bayes; unbalanced dataset
CLC number:
TP391
DOI:
10.15938/j.jhust.2018.04.013
Document code:
A
Abstract:
When the classes in a text training set are unevenly distributed, the features of minority classes are swamped by those of majority classes. To address this problem, a TF-IDF weighted complement naive Bayes algorithm (TFWCNB) is proposed. TFWCNB improves the complement naive Bayes algorithm with feature weighting: the TF-IDF algorithm computes the weight of each feature word in the current document, and the features of the current class's complement set are used to represent the current class. Combining the complement-set representation with the per-document feature weights counters the classifier's tendency to favor large classes and ignore small ones. In comparison experiments against traditional naive Bayes and complement naive Bayes, TFWCNB performs best when the sample set is unevenly distributed, with classification precision, recall, and G-mean reaching 82.92%, 84.6%, and 88.76%, respectively.
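The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' TFWCNB implementation: the abstract gives no formulas, so the smoothed TF-IDF variant, the Laplace smoothing parameter `alpha`, the scoring rule (in the style of standard complement naive Bayes), and the function names `tfidf_weights`, `train_complement_nb`, and `predict` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tfidf_weights(docs):
    """Return one {term: TF-IDF weight} dict per tokenized document.

    Assumed variant: relative term frequency times a smoothed IDF.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            t: (c / len(doc)) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return weighted


def train_complement_nb(docs, labels, alpha=1.0):
    """Estimate log P(term | complement of class) from TF-IDF-weighted counts."""
    weighted = tfidf_weights(docs)
    vocab = sorted({t for doc in docs for t in doc})
    model = {}
    for c in sorted(set(labels)):
        # Accumulate weights from every document that is NOT in class c.
        comp = defaultdict(float)
        for w, y in zip(weighted, labels):
            if y != c:
                for t, v in w.items():
                    comp[t] += v
        total = sum(comp.values()) + alpha * len(vocab)
        model[c] = {t: math.log((comp[t] + alpha) / total) for t in vocab}
    return model


def predict(model, doc):
    """Pick the class whose complement explains the document worst."""
    tf = Counter(doc)
    scores = {
        c: -sum(n * logp[t] for t, n in tf.items() if t in logp)
        for c, logp in model.items()
    }
    return max(scores, key=scores.get)
```

Because each class is described by the documents outside it, a minority class is estimated from the large remainder of the corpus rather than from its own few samples, which is the property the abstract relies on for unbalanced data. Terms unseen in training are skipped at prediction time, which is a simplification of this sketch.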

Memo:
Funding: New Century Excellent Talents Fund (NCET120809); National Natural Science Foundation of China (31670717)
Last Update: 2018-10-25