Learning from Imbalanced, Only Positive and Unlabeled Data

Yetian Chen, 04-29-2009

Outline
- Introduction and problem statement: the 2008 UC San Diego Data Mining Competition
- Task 1: supervised learning from imbalanced data sets (over-sampling and under-sampling)
- Task 2: semi-supervised learning from only positive and unlabeled data (the two-step strategy)

Statement of Problems: 2008 UC San Diego Data Mining Competition
- Task 1, standard binary classification: a binary classification task involving 20 real-valued features from an experiment in the physical sciences. The training data consist of 40,000 examples, but there are roughly ten times as many negative examples as positive. The test set, however, is evenly distributed between positive and negative examples.
- Task 2, positive-only semi-supervised task: also a binary classification task, but most of the training examples are unlabeled. In fact, only a few of the positive examples have labels. There are both positive and negative unlabeled examples, but there are several times as many negative training examples as positive. This class distribution is reflected in the test sets.

Task 1: Learning from Imbalanced Data
- Class imbalance is prevalent in many applications: fraud/intrusion detection, risk management, text classification, medical diagnosis/monitoring, etc.
- Standard classifiers tend to be overwhelmed by the large class and to ignore the small one, i.e., they produce high predictive accuracy on the majority class but poor predictive accuracy on the minority class.

Solutions to the Class Imbalance Problem
- At the data level (re-sampling):
  - Over-sampling: increase the number of minority instances by over-sampling them.
  - Under-sampling: extract a smaller set of majority instances while preserving all the minority instances.
- At the algorithmic level:
  - Cost-sensitive methods: adjust the costs of the various classes so as to counter the class imbalance.

Over-sampling
- SMOTE (Synthetic Minority Over-sampling Technique): the minority class is over-sampled by taking each minority-class sample and introducing synthetic examples along the line segments joining it to any or all of its k nearest minority-class neighbors.
- Alternatively, over-sample by simply duplicating the minority examples.

Under-sampling
- Randomly select a subset of the majority class, with size roughly equal to the size of the minority class.
- After re-sampling, apply standard classifiers to the rebalanced data sets and compare the accuracies: decision tree, Naive Bayes, and neural network (one hidden layer). A sketch of both re-sampling schemes follows below.
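To make the two re-sampling schemes above concrete, here is a minimal NumPy sketch. It is my own illustration rather than the competition code; the function names, the default k = 5, and the default amount of over-sampling are assumptions.

```python
import numpy as np

def smote(X_min, k=5, n_synthetic=None, rng=None):
    """Over-sample the minority class: for each synthetic point, pick a
    minority sample, pick one of its k nearest minority neighbors, and
    interpolate a random point on the segment between them."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    n_synthetic = n_synthetic if n_synthetic is not None else n
    # Pairwise distances within the minority class (self excluded).
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(n)              # a minority sample
        j = rng.choice(neighbors[i])     # one of its k nearest neighbors
        gap = rng.random()               # position along the line segment
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

def random_undersample(X_maj, n_target, rng=None):
    """Keep a random subset of the majority class of size n_target
    (roughly the size of the minority class)."""
    rng = np.random.default_rng(rng)
    keep = rng.choice(len(X_maj), size=n_target, replace=False)
    return X_maj[keep]
```

In practice a library implementation (e.g., SMOTE in the imbalanced-learn package) would normally be used; the sketch only spells out the line-segment interpolation described on the Over-sampling slide.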
Results for Task 1
- For the neural network classifiers, I experimented with different numbers of hidden units (5, 11, 15, 20); 11 gave the best accuracies.

My Ranking (52nd / 199)

Conclusions for Task 1
- For Naive Bayes classifiers, re-sampling does not improve the accuracy significantly.
- For decision tree classifiers, random under-sampling and over-sampling with SMOTE significantly improve the accuracy.
- For the neural network, all three re-sampling techniques significantly improve the accuracy.
- The neural network classifier with SMOTE over-sampling gives the best accuracy of all the classifier/re-sampling combinations.

Task 2: Learning from Only Positive and Unlabeled Data
- Positive examples: a set P of examples of the target class.
- Unlabeled set: a set U of unlabeled (mixed) examples, containing instances both from P's class and not from it (negative examples).
- Goal: build a classifier to classify the examples in U and/or future (test) data.
- Key feature of the problem: there are no labeled negative training data. We call this problem PU-learning.

Examples in Real Life
- Specialized molecular biology databases: a database defines a set of positive examples (genes/proteins related to a certain disease or function); there is no information about examples that should not be included, and it is unnatural to build such a negative set.
- Learning a user's preference for web pages: the user's bookmarks can be considered positive examples; all other web pages are unlabeled.
- Direct marketing: a company's current list of customers serves as the positive examples.
- Text classification, where labeling is labor intensive.

Are Unlabeled Examples Helpful?
[Figure: positive (+) and unlabeled (u) examples plotted against a threshold on x1; the target function is known to be one of two candidate thresholds ("which one is it?"), but the positive examples alone cannot distinguish them.]
- The concept is "not learnable" with only positive examples; however, the addition of unlabeled examples makes it learnable.

The Two-Step Strategy
- Step 1: identify a set of reliable negative examples from the unlabeled set.
  - S-EM (Liu et al., 2002) uses a spy technique.
  - PEBL (Yu et al., 2002) uses a 1-DNF technique.
  - Roc-SVM (Li & Liu, 2003) uses the Rocchio algorithm.
- Step 2: build a sequence of classifiers by iteratively applying a classification algorithm, then select a good classifier.
  - S-EM uses the Expectation-Maximization (EM) algorithm, with an error-based classifier selection mechanism.
  - PEBL uses SVM and takes the classifier at convergence, i.e., no classifier selection.
  - Roc-SVM uses SVM with a heuristic method for selecting the final classifier.

Step 1 and Step 2 Together
[Diagram: U is split into the reliable negatives RN and the remainder Q = U - RN. The final classifier is then built either iteratively from P, RN and Q, or directly from P and RN alone.]

Step 1: The Spy Technique
- Sample a certain percentage of the positive examples and put them into the unlabeled set to act as "spies".
- Run a classification algorithm assuming all unlabeled examples are negative.
- The spies reveal how the actual positive examples hidden in the unlabeled set behave.
- Use the Expectation-Maximization (EM) algorithm to assign each unlabeled example a probabilistic class label.
- Reliable negative examples can then be extracted from the unlabeled set more accurately (see the sketch below).
[Figure: illustration of the spy technique.]
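A minimal sketch of this spy step, assuming real-valued features and a single Gaussian Naive Bayes round in place of the full EM loop of S-EM; the function name, the 15% spy fraction, and the 5% noise level are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def reliable_negatives_via_spies(P, U, spy_frac=0.15, noise=0.05, seed=0):
    """Step 1 with spies: hide a fraction of the positives in U, train a
    classifier that treats everything unlabeled as negative, and keep as
    reliable negatives the unlabeled points scored below almost all spies."""
    rng = np.random.default_rng(seed)
    spy_idx = rng.choice(len(P), size=max(1, int(spy_frac * len(P))),
                         replace=False)
    spies = P[spy_idx]
    P_rest = np.delete(P, spy_idx, axis=0)

    # Pretend every unlabeled example -- spies included -- is negative.
    X = np.vstack([P_rest, U, spies])
    y = np.r_[np.ones(len(P_rest)), np.zeros(len(U) + len(spies))]
    clf = GaussianNB().fit(X, y)  # S-EM iterates this step with EM;
                                  # a single round is shown for brevity.

    # Threshold chosen so that a `noise` fraction of spies fall below it.
    t = np.quantile(clf.predict_proba(spies)[:, 1], noise)
    rn_mask = clf.predict_proba(U)[:, 1] < t
    return U[rn_mask]  # the reliable negative set RN
```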
Step 2: Building the Final Classifier
- Use a Naive Bayes classifier to build the final classifier, with P as the positive class and RN (the reliable negative examples) as the negative class (sketched below).
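A matching sketch of this final step, under the same assumptions as the Step 1 sketch (Gaussian Naive Bayes over real-valued features; the function name is mine).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def pu_final_classifier(P, RN):
    """Step 2 (non-iterative variant): Naive Bayes with P as the positive
    class and the reliable negatives RN as the negative class."""
    X = np.vstack([P, RN])
    y = np.r_[np.ones(len(P)), np.zeros(len(RN))]
    return GaussianNB().fit(X, y)

# Chaining the two steps together (using the Step 1 sketch):
#   RN    = reliable_negatives_via_spies(P, U)
#   clf   = pu_final_classifier(P, RN)
#   y_hat = clf.predict(X_test)
```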
Results and Conclusion for Task 2
- Baseline: use P as the positive class and U as the negative class, with SMOTE over-sampling P until it is roughly the size of U; this gives F1 = 0.545.
- The two-step algorithm gives F1 = 0.651.
- The highest score in the competition was F1 = 0.721.
- Only-positive-and-unlabeled data is learnable with the two-step strategy.

Future Work
- For Task 1: try cost-sensitive methods.
- For Task 2, other choices within the two-step strategy:
  - Step 1: the 1-DNF or Rocchio algorithms.
  - Step 2: SVM.

References
- B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), pages 179-188, 2003.
- B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), Sydney, Australia, July 2002.
- W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), Washington, DC, USA, August 2003.
- G. H. Nguyen, A. Bouzerdoum, and S. L. Phung. A supervised learning approach for imbalanced data sets. Proceedings of ICPR 2008, pages 1-4, 2008.
- N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations 6(1):1-6, 2004.
- N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16:321-357, 2002.

Thank you!