Learning from Imbalanced, Only Positive and Unlabeled Data

Yetian Chen, 04-29-2009

Outline
- Introduction and problem statement: the 2008 UC San Diego Data Mining Competition
- Task 1: supervised learning from imbalanced data sets (over-sampling and under-sampling)
- Task 2: semi-supervised learning from only positive and unlabeled data (the two-step strategy)

Statement of Problems: 2008 UC San Diego Data Mining Competition
- Task 1, standard binary classification: a binary classification task involving 20 real-valued features from an experiment in the physical sciences. The training data consist of 40,000 examples, but there are roughly ten times as many negative examples as positive. The test set, however, is evenly distributed between positive and negative examples.
- Task 2, positive-only semi-supervised task: also a binary classification task, but most of the training examples are unlabeled. In fact, only a few of the positive examples have labels. There are both positive and negative unlabeled examples, but there are several times as many negative training examples as positive. This class distribution is reflected in the test sets.

Task 1: Learning from Imbalanced Data
- Class imbalance is prevalent in many applications: fraud/intrusion detection, risk management, text classification, medical diagnosis/monitoring, etc.
- Standard classifiers tend to be overwhelmed by the large class and to ignore the small one, i.e., they produce high predictive accuracy on the majority class but poor predictive accuracy on the minority class.

Solutions to the Class Imbalance Problem
- At the data level (re-sampling):
  - Over-sampling: increase the number of minority instances by over-sampling them.
  - Under-sampling: extract a smaller set of majority instances while preserving all the minority instances.
- At the algorithmic level:
  - Cost-sensitive methods: adjust the costs of the various classes so as to counter the class imbalance.

Over-sampling
- SMOTE (Synthetic Minority Over-sampling Technique): the minority class is over-sampled by taking each minority-class sample and introducing synthetic examples along the line segments joining it to any or all of its k nearest minority-class neighbors.
- Alternatively, over-sample by simply duplicating the minority examples.

Under-sampling
- Randomly select a subset of the majority class, with size roughly equal to the size of the minority class.
- After re-sampling, apply standard classifiers to the rebalanced data sets and compare the accuracies: decision tree, Naive Bayes, and neural network (one hidden layer). A sketch of both re-sampling schemes follows below.
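To make the two re-sampling schemes above concrete, here is a minimal NumPy sketch. It is my own illustration rather than the competition code; the function names, the default k = 5, and the default amount of over-sampling are assumptions.

```python
import numpy as np

def smote(X_min, k=5, n_synthetic=None, rng=None):
    """Over-sample the minority class: for each synthetic point, pick a
    minority sample, pick one of its k nearest minority neighbors, and
    interpolate a random point on the segment between them."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    n_synthetic = n_synthetic if n_synthetic is not None else n
    # Pairwise distances within the minority class (self excluded).
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(n)              # a minority sample
        j = rng.choice(neighbors[i])     # one of its k nearest neighbors
        gap = rng.random()               # position along the line segment
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

def random_undersample(X_maj, n_target, rng=None):
    """Keep a random subset of the majority class of size n_target
    (roughly the size of the minority class)."""
    rng = np.random.default_rng(rng)
    keep = rng.choice(len(X_maj), size=n_target, replace=False)
    return X_maj[keep]
```

In practice a library implementation (e.g., SMOTE in the imbalanced-learn package) would normally be used; the sketch only spells out the line-segment interpolation described on the Over-sampling slide.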
Results for Task 1
- For the neural network classifiers, I experimented with different numbers of hidden units (5, 11, 15, 20); 11 gave the best accuracies.

My Ranking (52nd / 199)

Conclusions for Task 1
- For Naive Bayes classifiers, re-sampling does not improve the accuracy significantly.
- For decision tree classifiers, random under-sampling and over-sampling with SMOTE significantly improve the accuracy.
- For the neural network, all three re-sampling techniques significantly improve the accuracy.
- The neural network classifier with SMOTE over-sampling gives the best accuracy of all the classifier/re-sampling combinations.

Task 2: Learning from Only Positive and Unlabeled Data
- Positive examples: a set P of examples of the target class.
- Unlabeled set: a set U of unlabeled (mixed) examples, containing instances both from P's class and not from it (negative examples).
- Goal: build a classifier to classify the examples in U and/or future (test) data.
- Key feature of the problem: there are no labeled negative training data. We call this problem PU-learning.

Examples in Real Life
- Specialized molecular biology databases: a database defines a set of positive examples (genes/proteins related to a certain disease or function); there is no information about examples that should not be included, and it is unnatural to build such a negative set.
- Learning a user's preference for web pages: the user's bookmarks can be considered positive examples; all other web pages are unlabeled.
- Direct marketing: a company's current list of customers serves as the positive examples.
- Text classification, where labeling is labor intensive.

Are Unlabeled Examples Helpful?
[Figure: positive (+) and unlabeled (u) examples plotted against a threshold on x1; the target function is known to be one of two candidate thresholds ("which one is it?"), but the positive examples alone cannot distinguish them.]
- The concept is "not learnable" with only positive examples; however, the addition of unlabeled examples makes it learnable.

The Two-Step Strategy
- Step 1: identify a set of reliable negative examples from the unlabeled set.
  - S-EM (Liu et al., 2002) uses a spy technique.
  - PEBL (Yu et al., 2002) uses a 1-DNF technique.
  - Roc-SVM (Li & Liu, 2003) uses the Rocchio algorithm.
- Step 2: build a sequence of classifiers by iteratively applying a classification algorithm, then select a good classifier.
  - S-EM uses the Expectation-Maximization (EM) algorithm, with an error-based classifier selection mechanism.
  - PEBL uses SVM and takes the classifier at convergence, i.e., no classifier selection.
  - Roc-SVM uses SVM with a heuristic method for selecting the final classifier.

Step 1 and Step 2 Together
[Diagram: U is split into the reliable negatives RN and the remainder Q = U - RN. The final classifier is then built either iteratively from P, RN and Q, or directly from P and RN alone.]

Step 1: The Spy Technique
- Sample a certain percentage of the positive examples and put them into the unlabeled set to act as "spies".
- Run a classification algorithm assuming all unlabeled examples are negative.
- The spies reveal how the actual positive examples hidden in the unlabeled set behave.
- Use the Expectation-Maximization (EM) algorithm to assign each unlabeled example a probabilistic class label.
- Reliable negative examples can then be extracted from the unlabeled set more accurately (see the sketch below).
[Figure: illustration of the spy technique.]
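A minimal sketch of this spy step, assuming real-valued features and a single Gaussian Naive Bayes round in place of the full EM loop of S-EM; the function name, the 15% spy fraction, and the 5% noise level are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def reliable_negatives_via_spies(P, U, spy_frac=0.15, noise=0.05, seed=0):
    """Step 1 with spies: hide a fraction of the positives in U, train a
    classifier that treats everything unlabeled as negative, and keep as
    reliable negatives the unlabeled points scored below almost all spies."""
    rng = np.random.default_rng(seed)
    spy_idx = rng.choice(len(P), size=max(1, int(spy_frac * len(P))),
                         replace=False)
    spies = P[spy_idx]
    P_rest = np.delete(P, spy_idx, axis=0)

    # Pretend every unlabeled example -- spies included -- is negative.
    X = np.vstack([P_rest, U, spies])
    y = np.r_[np.ones(len(P_rest)), np.zeros(len(U) + len(spies))]
    clf = GaussianNB().fit(X, y)  # S-EM iterates this step with EM;
                                  # a single round is shown for brevity.

    # Threshold chosen so that a `noise` fraction of spies fall below it.
    t = np.quantile(clf.predict_proba(spies)[:, 1], noise)
    rn_mask = clf.predict_proba(U)[:, 1] < t
    return U[rn_mask]  # the reliable negative set RN
```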
Step 2: Building the Final Classifier
- Use a Naive Bayes classifier to build the final classifier, with P as the positive class and RN (the reliable negative examples) as the negative class (sketched below).
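A matching sketch of this final step, under the same assumptions as the Step 1 sketch (Gaussian Naive Bayes over real-valued features; the function name is mine).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def pu_final_classifier(P, RN):
    """Step 2 (non-iterative variant): Naive Bayes with P as the positive
    class and the reliable negatives RN as the negative class."""
    X = np.vstack([P, RN])
    y = np.r_[np.ones(len(P)), np.zeros(len(RN))]
    return GaussianNB().fit(X, y)

# Chaining the two steps together (using the Step 1 sketch):
#   RN    = reliable_negatives_via_spies(P, U)
#   clf   = pu_final_classifier(P, RN)
#   y_hat = clf.predict(X_test)
```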
Results and Conclusion for Task 2
- Baseline: use P as the positive class and U as the negative class, with SMOTE over-sampling P until it is roughly the size of U; this gives F1 = 0.545.
- The two-step algorithm gives F1 = 0.651.
- The highest score in the competition was F1 = 0.721.
- Only-positive-and-unlabeled data is learnable with the two-step strategy.

Future Work
- For Task 1: try cost-sensitive methods.
- For Task 2, other choices within the two-step strategy:
  - Step 1: the 1-DNF or Rocchio algorithms.
  - Step 2: SVM.

References
- B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), pages 179-188, 2003.
- B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), Sydney, Australia, July 2002.
- W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), Washington, DC, USA, August 2003.
- G. H. Nguyen, A. Bouzerdoum, and S. L. Phung. A supervised learning approach for imbalanced data sets. Proceedings of ICPR 2008, pages 1-4, 2008.
- N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations 6(1):1-6, 2004.
- N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16:321-357, 2002.

Thank you!