Introduction to Data Mining, Lecture 4: Data Classification and Prediction

Lecture 4: Data Classification and Prediction
Xu Congfu, Associate Professor
Institute of Artificial Intelligence, Zhejiang University
Courseware for the undergraduate course "Introduction to Data Mining", Zhejiang University

Outline
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Prediction
- Summary
- References

I. Classification vs. Prediction
- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications: credit approval, target marketing, medical diagnosis, fraud detection

Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - The accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set is independent of the training set, otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction
Training data:

  NAME   RANK            YEARS  TENURED
  Mike   Assistant Prof  3      no
  Mary   Assistant Prof  7      yes
  Bill   Professor       2      yes
  Jim    Associate Prof  7      yes
  Dave   Assistant Prof  6      no
  Anne   Associate Prof  3      no

A classification algorithm learns the classifier (model) from this training data, e.g.:
  IF rank = "professor" OR years > 6 THEN tenured = "yes"

Classification Process (2): Use the Model in Prediction
Testing data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -- Tenured?

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

II. Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Accuracy: classifier accuracy and predictor accuracy
- Speed and scalability: time to construct the model (training time); time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency for disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
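The following minimal Python sketch (not part of the original slides) walks through the two-step process on the tenured example above: the classify function simply hard-codes the rule produced in the model-construction step, and the accuracy rate is the fraction of test samples whose known label matches the model's output.

    # Step 1 produced the rule: IF rank = "professor" OR years > 6 THEN tenured = "yes".
    # Step 2: apply the model to the independent test set and estimate accuracy.

    testing_data = [
        ("Tom",     "Professor" == "x" and "" or "Assistant Prof", 2, "no"),
    ]
    testing_data = [
        ("Tom",     "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George",  "Professor",      5, "yes"),
        ("Joseph",  "Assistant Prof", 7, "yes"),
    ]

    def classify(rank, years):
        # The learned model from the model-construction slide.
        return "yes" if rank == "Professor" or years > 6 else "no"

    correct = sum(1 for name, rank, years, label in testing_data
                  if classify(rank, years) == label)
    print("accuracy rate =", correct / len(testing_data))   # 3 of 4 correct -> 0.75

    # Classify the unseen tuple (Jeff, Professor, 4)
    print("Jeff tenured?", classify("Professor", 4))         # -> yes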
III. Classification by Decision Tree Induction

Decision Tree Induction: Training Dataset

  age      income  student  credit_rating  buys_computer
  <=30     high    no       fair           no
  <=30     high    no       excellent      no
  31...40  high    no       fair           yes
  >40      medium  no       fair           yes
  >40      low     yes      fair           yes
  >40      low     yes      excellent      no
  31...40  low     yes      excellent      yes
  <=30     medium  no       fair           no
  <=30     low     yes      fair           yes
  >40      medium  yes      fair           yes
  <=30     medium  yes      excellent      yes
  31...40  medium  no       excellent      yes
  31...40  high    yes      fair           yes
  >40      medium  no       excellent      no

This follows an example of Quinlan's ID3 ("playing tennis").

Output: A Decision Tree for "buys_computer"
- Root node: age?
  - age <= 30: test student?
    - student = no: buys_computer = no
    - student = yes: buys_computer = yes
  - age 31...40: buys_computer = yes
  - age > 40: test credit_rating?
    - credit_rating = excellent: buys_computer = no
    - credit_rating = fair: buys_computer = yes

Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  - There are no samples left

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Let S contain s_i tuples of class C_i for i = 1, ..., m.
- The information measure (info required to classify any arbitrary tuple):
  I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}
- The entropy of attribute A with values {a_1, a_2, ..., a_v}:
  E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})
- The information gained by branching on attribute A:
  Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)
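As an illustration (not from the original slides), here is a short Python sketch of these formulas applied to the age attribute of the training set above; the branch counts are the per-group (yes, no) tallies.

    import math

    # (yes, no) counts per branch of the attribute "age" in the training set above
    branches = {"<=30": (2, 3), "31...40": (4, 0), ">40": (3, 2)}

    def info(*counts):
        """I(s1,...,sm) = -sum(si/s * log2(si/s)); empty classes contribute 0."""
        s = sum(counts)
        return -sum(c / s * math.log2(c / s) for c in counts if c > 0)

    def expected_info(branches):
        """E(A) = sum over branches of ((s1j+...+smj)/s) * I(s1j,...,smj)."""
        s = sum(sum(c) for c in branches.values())
        return sum(sum(c) / s * info(*c) for c in branches.values())

    I_total = info(9, 5)                  # ~0.940
    E_age   = expected_info(branches)     # ~0.694
    print("Gain(age) =", I_total - E_age) # ~0.247 (the slide's 0.246 rounds I and E first)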
Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age. "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's, so it contributes \frac{5}{14} I(2,3); similarly for the other branches:

  age      p_i  n_i  I(p_i, n_i)
  <=30     2    3    0.971
  31...40  4    0    0.000
  >40      3    2    0.971

  E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694
- Hence Gain(age) = I(p, n) - E(age) = 0.246
- Similarly (over the same training dataset): Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
- Age has the highest information gain, so it is selected as the root test of the tree above.

Computing Information Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute.
- We must determine the best split point for A:
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}
  - The point with the minimum expected information requirement for A is selected as the split point for A
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example (from the tree above):
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
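To make the rule-extraction procedure above concrete, here is a small Python sketch (my own illustration, not from the slides); the nested-dict encoding of the buys_computer tree is just an assumed representation.

    # Turn each root-to-leaf path of the buys_computer tree into an IF-THEN rule.
    tree = {
        "age": {
            "<=30":    {"student": {"no": "no", "yes": "yes"}},
            "31...40": "yes",
            ">40":     {"credit_rating": {"excellent": "no", "fair": "yes"}},
        }
    }

    def extract_rules(node, conditions=()):
        """One rule per path: the attribute-value pairs form the conjunction,
        and the leaf holds the class prediction."""
        if isinstance(node, str):                          # leaf node
            antecedent = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
            return [f'IF {antecedent} THEN buys_computer = "{node}"']
        (attr, subtrees), = node.items()                   # internal node: one test attribute
        rules = []
        for value, child in subtrees.items():
            rules += extract_rules(child, conditions + ((attr, value),))
        return rules

    for rule in extract_rules(tree):
        print(rule)
    # e.g. IF age = "<=30" AND student = "no" THEN buys_computer = "no"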
Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - It is difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication

Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - Relatively fast learning speed (compared with other classification methods)
  - Convertible to simple and easy-to-understand classification rules
  - Can use SQL queries for accessing databases
  - Comparable classification accuracy with other methods

Scalable Decision Tree Induction Methods
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)

Presentation of Classification Results
- Visualization of a decision tree in SGI/MineSet 3.0 (figure slide)
- Interactive visual mining by Perception-Based Classification (PBC) (figure slide)

IV. Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown
- Let H be the hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; reflects the background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayesian Theorem
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}
- Informally: posterior = likelihood x prior / evidence
- MAP (maximum a posteriori) hypothesis:
  h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\, P(h)
- Practical difficulty: requires initial knowledge of many probabilities, and has significant computational cost

Naïve Bayes Classifier
- A simplified assumption: attributes are conditionally independent:
  P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
- The joint probability of, say, two elements y_1 and y_2, given the current class C, is the product of the probabilities of each element taken separately, given the same class: P(y_1, y_2 | C) = P(y_1 | C) * P(y_2 | C)
- No dependence relation between attributes is modeled
- Greatly reduces the computation cost: only the class distribution needs to be counted
- Once the probability P(X|C_i) is known, assign X to the class with the maximum P(X|C_i) * P(C_i)

Training Dataset
- Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"
- Data sample: X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")
- The training data is the same 14-tuple buys_computer dataset used for decision tree induction above.

Naïve Bayesian Classifier: An Example
- Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- For X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair"):
  P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
- P(X|Ci) * P(Ci):
  P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.044 x 9/14 = 0.028
  P(X | buys_computer = "no") * P(buys_computer = "no") = 0.019 x 5/14 = 0.007
- Therefore, X belongs to class "buys_computer = yes"
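The worked example above can be reproduced with a short Python sketch (not from the slides); it estimates each P(x_k|Ci) by counting over the 14-tuple training set and picks the class with the largest P(X|Ci) * P(Ci).

    from collections import Counter

    # (age, income, student, credit_rating, buys_computer) -- the training set above
    data = [
        ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
        ("31...40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
        ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
        ("31...40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
    ]

    class_counts = Counter(row[-1] for row in data)        # {'yes': 9, 'no': 5}

    def p_attr(attr_index, value, cls):
        """P(x_k | Ci): fraction of class-Ci tuples having the given attribute value."""
        hits = sum(1 for row in data if row[-1] == cls and row[attr_index] == value)
        return hits / class_counts[cls]

    X = ("<=30", "medium", "yes", "fair")                  # the unknown sample
    scores = {}
    for cls, count in class_counts.items():
        p_x_given_c = 1.0
        for k, value in enumerate(X):
            p_x_given_c *= p_attr(k, value, cls)           # conditional independence
        scores[cls] = p_x_given_c * (count / len(data))    # P(X|Ci) * P(Ci)

    print(scores)                        # {'yes': ~0.028, 'no': ~0.007}
    print(max(scores, key=scores.get))   # -> 'yes'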
Naïve Bayesian Classifier: Comments
- Advantages
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages
  - Assumes class-conditional independence, which causes a loss of accuracy
  - In practice, dependencies exist among variables. E.g., in hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.); dependencies among these cannot be modeled by the naïve Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks

Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- It is a graphical model of causal relationships
  - Represents dependencies among the variables
  - Gives a specification of the joint probability distribution
- Nodes: random variables; links: dependencies
- Example: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P
- The graph has no loops or cycles

Bayesian Belief Network: An Example
- Variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea; FH and S are the parents of LC
- The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

           (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
  LC       0.8      0.5       0.7       0.1
  ~LC      0.2      0.5       0.3       0.9

- The joint probability over the network variables is
  P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i))

Learning Bayesian Networks
- Several cases:
  - Given both the network structure and all variables observable: learn only the CPTs
  - Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
  - Unknown structure, all hidden variables: no good algorithms are known for this purpose
- See D. Heckerman, "Bayesian networks for data mining"

V. What Is Prediction?
- (Numerical) prediction is similar to classification
  - Construct a model
  - Use the model to predict a continuous or ordered value for a given input
- Prediction is different from classification
  - Classification refers to predicting a categorical class label
  - Prediction models continuous-valued functions
- Major method for prediction: regression
  - Model the relationship between one or more independent (predictor) variables and a dependent (response) variable
- Regression analysis
  - Linear and multiple regression
  - Non-linear regression
  - Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

Linear Regression
- Linear regression involves a response variable y and a single predictor variable x:
  y = w_0 + w_1 x
  where w_0 (y-intercept) and w_1 (slope) are the regression coefficients
- Method of least squares: estimates the best-fitting straight line from training data of the form (x_1, y_1), (x_2, y_2), ..., (x_{|D|}, y_{|D|}):
  w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}
- Multiple linear regression: involves more than one predictor variable
  - Training data is of the form (X_1, y_1), (X_2, y_2), ..., (X_{|D|}, y_{|D|})
  - Ex. for 2-D data we may have: y = w_0 + w_1 x_1 + w_2 x_2
  - Solvable by an extension of the least-squares method or by using software such as SAS or S-Plus
  - Many nonlinear functions can be transformed into the above
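As a quick illustration of the least-squares formulas above (not from the slides, and with made-up sample points), the following Python sketch fits w_0 and w_1 for a single predictor variable.

    # Simple linear regression by the method of least squares:
    # w1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2),  w0 = y_bar - w1 * x_bar
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [1.9, 4.1, 6.0, 8.2, 9.8]          # roughly y = 2x, made-up data

    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)

    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar

    print(f"y = {w0:.3f} + {w1:.3f} x")     # close to y = 0 + 2x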
Nonlinear Regression
- Some nonlinear models can be modeled by a polynomial function
- A polynomial regression model can be transformed into a linear regression model. For example,
  y = w_0 + w_1 x + w_2 x^2 + w_3 x^3
  is convertible to linear form with the new variables x_2 = x^2 and x_3 = x^3:
  y = w_0 + w_1 x + w_2 x_2 + w_3 x_3
- Other functions, such as the power function, can also be transformed into a linear model
- Some models are intractably nonlinear (e.g., a sum of exponential terms); it is still possible to obtain least-squares estimates through extensive calculation on more complex formulae

Other Regression-Based Models
- Generalized linear model: the foundation on which linear regression can be applied to modeling categorical response variables
  - The variance of y is a function of the mean value of y, not a constant
  - Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
  - Poisson regression: models data that exhibit a Poisson distribution
- Log-linear models (for categorical data)
  - Approximate discrete multidimensional probability distributions
  - Also useful for data compression and smoothing
- Regression trees and model trees: trees used to predict continuous values rather than class labels

Regression Trees and Model Trees
- Regression tree: proposed in the CART system (Breiman et al., 1984)
  - CART: Classification And Regression Trees
  - Each leaf stores a continuous-valued prediction: the average value of the predicted attribute for the training tuples that reach the leaf
- Model tree: proposed by Quinlan (1992)
  - Each leaf holds a regression model, i.e., a multivariate linear equation for the predicted attribute
  - A more general case than the regression tree
- Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model

Predictive Modeling in Multidimensional Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data
- One can only predict value ranges or category distributions
- Method outline: minimal generalization, attribute relevance analysis, generalized linear model construction, prediction
- Determine the major factors that influence the prediction: data relevance analysis via uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis
- Figure slides: "Prediction: Numerical Data" and "Prediction: Categorical Data"

VI. Summary
- Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
- Effective and scalable methods have been developed for decision tree induction, naive Bayesian classification, Bayesian belief networks, rule-based classifiers, backpropagation, Support Vector Machines (SVM), associative classification, nearest-neighbor classifiers, case-based reasoning, and other classification methods such as genetic algorithms, rough set and fuzzy set approaches.
- Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.

VII. References
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.). John Wiley & Sons, 2001.
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.