Pattern Recognition Lecture Slides

Pattern Recognition
Nanyang Technological University
Dr. Shi, Daming
Harbin Engineering University

What is Pattern Recognition
- Classify raw data into the category of the pattern.
- A branch of artificial intelligence concerned with the identification of visual or audio patterns by computers, for example character recognition, speech recognition, face recognition, etc.
- Two categories: syntactic (or structural) pattern recognition and statistical pattern recognition.
- Pattern Recognition = Pattern Classification.

(Figure: the two phases of a pattern recognition system. Training phase: training data -> feature extraction -> learning (feature selection, clustering, discriminant function generation, grammar parsing) -> knowledge. Recognition phase: unknown data -> feature extraction -> recognition (statistical, structural) -> results.)

Categorisation
- Based on application areas: face recognition, speech recognition, character recognition, etc.
- Based on decision-making approaches: syntactic pattern recognition and statistical pattern recognition.

Syntactic Pattern Recognition
Any problem is described with a formal language, and the solution is obtained through grammatical parsing. In memory of Prof. FU, King-Sun and Prof. Shu Wenhao.

Statistical Pattern Recognition
In the statistical approach, each pattern is viewed as a point in a multi-dimensional space. The decision boundaries are determined by the probability distribution of the patterns belonging to each class, which must either be specified or learned.

Scope of the Seminar
- Module 1: Distance-Based Classification
- Module 2: Probabilistic Classification
- Module 3: Linear Discriminant Analysis
- Module 4: Neural Networks for Pattern Recognition
- Module 5: Clustering
- Module 6: Feature Selection

Module 1: Distance-Based Classification

Overview
- Distance-based classification is the most common type of pattern recognition technique.
- Its concepts are a basis for other classification techniques.
- First, a prototype is chosen through training to represent a class.
- Then, the distance from an unknown sample to each class is calculated using the prototypes.

Classification by Distance
Objects are represented by vectors in a feature space. In training, we are given labelled samples for each class; in recognition, an unknown sample is assigned to the class whose prototype is nearest under the chosen distance.

Prototype
To find the pattern-to-class distance, we need a class prototype (pattern):
(1) Sample mean: for class ci, take the mean of the training samples of that class.
(2) Most typical sample: choose the training sample whose distance to the sample mean is minimized.
(3) Nearest neighbour: choose the training sample nearest to the unknown pattern. Nearest-neighbour prototypes are sensitive to noise and outliers in the training set.
(4) k-nearest neighbours: k-NN is more robust against noise, but is more computationally expensive. The pattern y is classified in the class most represented among its k nearest neighbours from the training samples. The chosen distance determines how "near" is defined. (A minimal sketch of these rules follows below.)
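To make the prototype and k-NN rules concrete, here is a minimal NumPy sketch; it is not taken from the slides, and the toy data, function names, and the choice of Euclidean distance are illustrative assumptions.

```python
import numpy as np

def mean_prototypes(X, y):
    """Sample-mean prototype for each class label."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def med_classify(x, prototypes):
    """Minimum Euclidean Distance: assign x to the nearest class prototype."""
    labels = list(prototypes)
    dists = [np.linalg.norm(x - prototypes[c]) for c in labels]
    return labels[int(np.argmin(dists))]

def knn_classify(x, X, y, k=3):
    """k-NN: majority vote among the k training samples nearest to x."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = y[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Toy 2-D data: class 0 around (0, 0), class 1 around (3, 3).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

x_new = np.array([2.5, 2.0])
print(med_classify(x_new, mean_prototypes(X, y)))  # distance to class-mean prototypes
print(knn_classify(x_new, X, y, k=5))              # vote of the 5 nearest samples
```

The mean prototype and the k-NN vote trade robustness for cost exactly as described above: the mean is cheap but sensitive to class shape, while k-NN needs the whole training set at recognition time.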
Distance Measures
- The most familiar distance metric is the Euclidean distance.
- Another example is the Manhattan (city-block) distance.
- Many other distance measures exist.

Minimum Euclidean Distance (MED) Classifier
Assign the unknown sample x to the class whose prototype z_k minimizes the Euclidean distance ||x - z_k||. Equivalently, since the squared distance expands to x'x - 2 z_k'x + z_k'z_k and the term x'x is common to all classes, the MED rule can be written as a linear discriminant function of x.

Decision Boundary
Given a prototype and a distance metric, it is possible to find the decision boundary between classes; it may be a linear or a nonlinear boundary. Decision boundary = discriminant function. (Figure: two-class regions plotted in the lightness-length plane, one with a linear and one with a nonlinear boundary.)

Example
Any fish is a vector in the two-dimensional space of length and lightness. (Figure: fish samples plotted in the lightness-length plane with the resulting decision boundary.)

Summary (Module 1)
- Classification by the distance from an unknown sample to class prototypes.
- Choosing the prototype: sample mean, most typical sample, nearest neighbour, k-nearest neighbours.
- Decision boundary = discriminant function.

Module 2: Probabilistic Classification

Maximum A Posteriori (MAP) Classifier
Ideally, we want to favour the class with the highest probability for the given pattern: choose the class Ci that maximizes P(Ci|x), where P(Ci|x) is the a posteriori probability of class Ci given x.

Bayesian Classification
Bayes theorem: P(Ci|x) = P(x|Ci) P(Ci) / P(x), where P(x|Ci) is the class-conditional probability density (p.d.f.), which needs to be estimated from the available samples or otherwise assumed, and P(Ci) is the a priori probability of class Ci.

MAP Classifier
The Bayesian classifier, also known as the MAP classifier, assigns the pattern x to the class with the maximum weighted p.d.f., i.e. the class maximizing P(x|Ci) P(Ci).

Accuracy vs. Risk
However, in the real world, life is not just about accuracy. In some cases a small misclassification may result in a big disaster, for example in medical diagnosis or fraud detection. The MAP classifier is biased towards the most likely class, as in maximum likelihood classification.

Loss Function
When P(C1) >> P(C2), the lowest error rate can be attained by always classifying as C1; this is also known as the problem of imbalanced training data. A solution is to assign a loss to each kind of misclassification, which leads to the risk-based decision rule below.

Conditional Risk
Instead of using the posterior P(Ci|x) alone, we use the conditional risk, i.e. the expected cost of taking action i for the pattern x: R(a_i|x) = sum_j lambda(a_i|C_j) P(C_j|x), where lambda(a_i|C_j) is the cost of action i when the true class is C_j. To minimize the overall risk, choose the action with the lowest conditional risk for the pattern.

Example
Assume that fraudulent activity is about 1% of the total credit card activity: C1 = fraud, P(C1) = 0.01; C2 = no fraud, P(C2) = 0.99. If the losses for the two kinds of misclassification are equal, the decision rule compares P(x|C1) · 0.01 against P(x|C2) · 0.99, so the prior heavily favours the no-fraud class. However, the losses are probably not the same. Classifying a fraudulent transaction as legitimate leads to direct dollar losses as well as intangible losses (e.g. reputation, hassles for consumers). Classifying a legitimate transaction as fraudulent inconveniences consumers, as their purchases are denied, which could lead to loss of future business. Let us assume that the ratio of loss for "no fraud" to "fraud" is 1 to 50, i.e. a missed fraud is 50 times more expensive than accidentally freezing a card due to legitimate use.

Example (continued)
By including the loss function, the decision boundaries change significantly: instead of comparing P(x|Ci) P(Ci), we compare the loss-weighted conditional risks, as in the sketch below.
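A minimal sketch of the minimum-risk decision for this fraud example; the priors (0.01 / 0.99) and the 50:1 loss ratio follow the text, while the 1-D "suspicion score" and its Gaussian class-conditional densities are invented purely for illustration.

```python
import math

# Priors from the example: 1% fraud, 99% legitimate activity.
priors = {"fraud": 0.01, "no_fraud": 0.99}

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density, standing in for the class-conditional p.d.f. P(x|Ci)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood(c, x):
    # Hypothetical densities over a 1-D "suspicion score" x.
    return gauss_pdf(x, 5.0, 1.5) if c == "fraud" else gauss_pdf(x, 1.0, 1.0)

# Loss lambda(action | true class): missing a fraud costs 50x a false alarm.
loss = {
    "flag":  {"fraud": 0.0,  "no_fraud": 1.0},
    "allow": {"fraud": 50.0, "no_fraud": 0.0},
}

def conditional_risk(action, x):
    """R(a|x) is proportional to sum_j lambda(a|C_j) P(x|C_j) P(C_j); P(x) cancels."""
    return sum(loss[action][c] * likelihood(c, x) * priors[c] for c in priors)

def decide(x):
    """Minimum-risk rule: take the action with the lowest conditional risk."""
    return min(loss, key=lambda a: conditional_risk(a, x))

for x in (1.0, 2.5, 4.0):
    print(x, decide(x), {a: round(conditional_risk(a, x), 4) for a in loss})
```

With equal losses the large no-fraud prior dominates; the 50:1 loss ratio pulls the decision boundary toward flagging suspicious transactions much earlier.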
Probability Density Function
Relatively speaking, it is much easier to estimate the a priori probability, e.g. simply take the fraction of training samples belonging to each class. To estimate the p.d.f. P(x|Ci), we can either (1) assume a known form of p.d.f. and estimate its parameters, or (2) estimate a non-parametric p.d.f. from the training samples.

Maximum Likelihood Parameter Estimation
Without loss of generality, we consider a Gaussian density: P(x|Ci) is Gaussian with parameter values (mean vector and covariance matrix) to be identified from the training examples of class Ci. We look for the parameter values that maximize the likelihood of the training samples; for the Gaussian, the maximum likelihood estimates turn out to be the sample mean and the sample covariance matrix.

Density Estimation
If we do not know the specific form of the p.d.f., then we need a different density estimation approach: a non-parametric technique that uses variations of histogram approximation.
(1) The simplest density estimation is to use "bins". For example, in the 1-D case, divide the x-axis into bins of length h and estimate the probability of a sample falling in each bin; if k_N of the N samples fall in a bin, the density there is estimated as k_N / (N h).
(2) Alternatively, we can take windows of unit volume and apply these windows to each sample; the overlap of the windows defines the estimated p.d.f. This technique is known as Parzen windows or kernels. (A small numerical sketch of both estimates follows below.)
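A minimal 1-D sketch contrasting the two non-parametric estimates just described, histogram bins and Parzen (Gaussian-kernel) windows; the sample data and bandwidths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
# "Unknown" p.d.f.: a two-component mixture, sampled 500 times.
samples = np.concatenate([rng.normal(-2, 0.6, 250), rng.normal(1.5, 1.0, 250)])
N = len(samples)

def hist_estimate(x, data, h=0.5):
    """Bin estimate: p(x) ~ k_N / (N * h), where k_N counts samples in the bin containing x."""
    left = np.floor(x / h) * h
    k_n = np.sum((data >= left) & (data < left + h))
    return k_n / (N * h)

def parzen_estimate(x, data, h=0.5):
    """Parzen window: average of Gaussian kernels of width h centred on each sample."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u ** 2) / (h * np.sqrt(2 * np.pi)))

for x in (-2.0, 0.0, 1.5):
    print(x, round(hist_estimate(x, samples), 3), round(parzen_estimate(x, samples), 3))
```

The Parzen estimate is smooth because every sample contributes a kernel at every query point, whereas the bin estimate jumps at bin edges; both converge to the underlying density as N grows and h shrinks appropriately.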
Summary (Module 2)
- Bayes theorem.
- The maximum a posteriori (MAP) classifier, which reduces to the maximum likelihood classifier when the priors are equal.
- Density estimation (parametric and non-parametric).

Module 3: Linear Discriminant Analysis

Linear Classifier (1)
A linear classifier implements a discriminant function, or a decision boundary, represented by a straight line (a hyperplane) in the multidimensional space. Given an input x = (x1, ..., xm)T, the decision boundary of a linear classifier is given by a discriminant function of the form f(x) = wTx + b, with weight vector w = (w1, ..., wm)T.

Linear Classifier (2)
The output of the function f(x) for any input depends on (1) the weight vector and (2) the input vector. For example, the following class definition may be employed: if f(x) > 0 then x is a ballet dancer; if f(x) < 0 then x is a rugby player.

Linear Classifier (3)
(Figure: the line f(x) = 0 in the (x1, x2) plane separates the region where f(x) > 0 from the region where f(x) < 0.)

Perceptron
(Figure: a perceptron with inputs x = (x1, ..., xm)T, weights w = (w1, ..., wm)T and bias b feeding a linear combiner, followed by an activation function that produces the output y.)

Multi-class Problem
(Figure only: an example of a multi-class problem.)

Limitation of Perceptron
A single-layer perceptron can perform pattern classification only on linearly separable patterns, as illustrated in the sketch below. (Figure: (a) linearly separable patterns; (b) non-linearly separable patterns.)
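As a companion to the perceptron description above, here is a minimal sketch of the classic perceptron learning rule on a linearly separable toy problem; the data, learning rate, and epoch count are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Single-layer perceptron: learn w, b so that sign(w.x + b) matches y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            if target * (np.dot(w, xi) + b) <= 0:   # misclassified sample
                w += lr * target * xi                # move the boundary toward it
                b += lr * target
                errors += 1
        if errors == 0:                              # converged: data are separable
            break
    return w, b

# Linearly separable toy data: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

w, b = train_perceptron(X, y)
print("weights:", w, "bias:", b)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```

On data like case (b) above, the inner loop never reaches zero errors: the update rule keeps cycling, which is exactly the limitation that motivates the generalized discriminant functions that follow.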
Generalized Linear Discriminant Functions
- Decision boundaries which separate classes may not always be linear.
- The complexity of the boundaries may sometimes require the use of highly non-linear surfaces.
- A popular approach to generalizing linear decision functions is to consider a generalized decision function of the form g(x) = wT phi(x), where phi is a nonlinear mapping function.

Summary (Module 3)
- Linear classifiers and vector analysis.
- The perceptron.
- A perceptron cannot classify linearly non-separable patterns.
- Nonlinear alternatives: MLP, RBF networks, SVM.

Module 4: Neural Networks for Pattern Recognition
Details are covered in another seminar: Neural Networks.

Module 5: Clustering

Supervised Learning vs. Unsupervised Learning
- Supervised learning (the target output is known): for each training input pattern, the network is presented with the correct target answer (the desired output) by a teacher.
- Unsupervised learning (the target output is unknown): for each training input pattern, the network adjusts its weights without knowing the correct target. In unsupervised training, the network self-organizes to classify similar input patterns into clusters.

Clustering
- Cluster: a set of patterns that are more similar to each other than to patterns not in the cluster.
- We are given unlabelled samples and have no information about the classes.
- We want to discover whether there are any naturally occurring clusters in the data.
- Two approaches: clustering by distance measure and clustering by density estimation.

Clustering by Distance
Two issues arise: how to measure the similarity between samples, and how to evaluate a partitioning of a set into clusters. Typical distance metrics include the Euclidean distance, the Hamming distance, etc.

Goodness of Partitioning
- We can use a measure of the scatter of each cluster to gauge how good the overall clustering is.
- In general, we would like compact clusters with a lot of space between them.
- We can use the measure of goodness to iteratively move samples from one cluster to another to optimize the grouping.

Criterion: Sum of Squared Error
J_e is the sum, over clusters i and over the samples x in cluster i, of the squared error ||x - m_i||^2, where m_i is the cluster mean vector. This criterion defines clusters by their mean vectors in the sense that it minimizes the sum of the squared errors. The optimal partition is the one that minimizes J_e, also called the minimum-variance partition. It works well when clusters form well-separated compact clouds, less well when there are great differences in the number of samples in different clusters.

Criterion: Scatter
- Scatter matrices are used as in multiple discriminant analysis, i.e. the within-cluster scatter matrix S_W and the between-cluster scatter matrix S_B; their sum S_T = S_B + S_W depends only on the set of samples, not on the partitioning.
- The criterion can therefore be to minimize the within-cluster scatter or, equivalently, to maximize the between-cluster scatter.
- The trace (sum of diagonal elements) is the simplest scalar measure of a scatter matrix, as it is proportional to the sum of the variances in the coordinate directions.

Iterative Optimization
- Once a criterion function has been selected, clustering becomes a problem of discrete optimization.
- As the sample set is finite there is a finite number of possible partitions, and the optimal one can always be found by exhaustive search.
- Most frequently, an iterative optimization procedure is adopted to select the optimal partition.
- The basic idea is to start from a reasonable initial partition and "move" samples from one cluster to another, trying to minimize the criterion function.
- In general, this kind of approach guarantees a local, not a global, optimum.

K-Means Clustering (1)
The k-means clustering algorithm (see the sketch below):
1) Initialization: t = 0; choose random values for the initial centers c_k(t), k = 1, ..., K.
2) Sampling: draw a sample x from the training sample set.
3) Similarity matching: let k(x) denote the index of the best matching (closest) center.
4) Updating: for every k = 1, ..., K, move the center towards the sample if it is the best match, otherwise leave it unchanged.
5) Continuation: t = t + 1; go back to step 2 until no noticeable changes are observed.

K-Means Clustering (2) and (3)
(Figures: the centers c1, c2, and then c1, c2, c3, moving towards the cluster centres over the iterations.)
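The five steps listed above can be condensed into a short k-means sketch; the toy data and K = 2 are illustrative assumptions, and a batch mean update is used here in place of the sample-by-sample update described on the slide.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Batch k-means: assign each sample to its nearest center, then recompute the means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]    # 1) initialization
    for _ in range(iters):
        # 3) similarity matching: index of the best matching center for every sample
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # 4) updating: move each center to the mean of the samples assigned to it
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                  # 5) continuation check
            break
        centers = new_centers
    return centers, labels

# Toy data: two clouds around (0, 0) and (4, 4).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.7, (30, 2)), rng.normal(4, 0.7, (30, 2))])
centers, labels = kmeans(X, K=2)
print(centers)
```

Each iteration can only decrease the sum-of-squared-error criterion J_e, so the procedure converges, but, as noted above, only to a local optimum that depends on the initial centers.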
Clustering by Density Estimation
Example: finding the nucleus and cytoplasm pels (pixels) in white blood cells. From the image grey-level histogram, set the threshold at a valley (local minimum); pels whose grey level falls on one side of the threshold are labelled cytoplasm and those on the other side nucleus. This is clustering based on density estimation: peaks correspond to cluster centres and valleys to cluster boundaries.

Parameterized Density Estimation
We begin with a parameterized p.d.f., in which the only thing that must be learned is the value of an unknown parameter vector. We make the following assumptions:
- The samples come from a known number c of classes.
- The prior probabilities P(w_j) for each class are known.
- The forms of P(x|w_j, theta_j), j = 1, ..., c, are known.
- The values of the c parameter vectors theta_1, ..., theta_c are unknown.

Mixture Density
The category labels are unknown, and the resulting density, p(x|theta) = sum_j P(x|w_j, theta_j) P(w_j), is called a mixture density. Our goal is to use samples drawn from this mixture density to estimate the unknown parameter vector theta. Once theta is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities.

Chinese Ying-Yang Philosophy
Everything in the universe can be viewed as a product of a constant conflict between the opposites Ying and Yang: Ying is negative, female and invisible, while Yang is positive, male and visible. The optimal status is reached when Ying and Yang achieve harmony.

Bayesian Ying-Yang Clustering
- The goal is to find clusters y that partition the input data x.
- x is visible but y is invisible; x decides y in training, but y decides x in running.
- The joint density can be factored in two ways, p(x, y) = p(y|x) p(x) and p(x, y) = p(x|y) p(y), which form the Ying-Yang pair.

Bayesian Ying-Yang Harmony Learning (1)
- Minimise the difference between the Ying-Yang pair.
- Select the optimal model (cluster number) by the same harmony criterion.

Bayesian Ying-Yang Harmony Learning (2)
Parameter learning uses the EM algorithm, alternating an E-step and an M-step.

Summary (Module 5)
- Clustering by distance: goodness of partitioning, k-means.
- Clustering by density estimation: Bayesian Ying-Yang (BYY) clustering.

Module 6: Feature Selection

Motivation
Classifier performance depends on a combination of the number of samples, the number of features, and the complexity of the classifier.
- Q1: The more samples, the better?
- Q2: The more features, the better?
- Q3: The more complex, the better?
However, the number of samples is fixed at training time, so both considerations require reducing the number of features.

Curse of Dimensionality
If the number of training samples is small relative to the number of features, performance may be degraded, because as the number of features increases, the number of unknown parameters increases accordingly and the reliability of the parameter estimation decreases.

Occam's Razor Hypothesis
"Pluralitas non est ponenda sine necessitate" - plurality should not be posited without necessity (William of Ockham, ca. 1285-1349). To make the system simpler, unnecessary features must be removed.

Feature Selection
In general, we would like a classifier to use a minimum number of dimensions, in order to achieve less computation and more reliable statistical estimation. Feature selection: given m measurements, choose the n < m best ones as features. We require:
1. A criterion to evaluate features.
2. An algorithm to optimize the criterion.

Criterion
Typically, the interclass distance normalized by the intraclass scatter is used. In the two-class case, a per-feature score of the form (m_i1 - m_i2)^2 / (s_i1 + s_i2) can be used, where m_i1 is the mean of the ith feature in class 1 and s_i1 is the scatter (variance) of the ith feature in class 1; the k-class case generalizes by combining the pairwise terms.

Optimizing
Choosing n features from m measurements gives C(m, n) = m! / (n! (m - n)!) combinations, so an exhaustive comparison is usually not feasible. Some sub-optimal strategies include:
- Rank features by effectiveness and choose the best ones.
- Incrementally add features to the set of chosen features.
- Successively add and delete features from the chosen set.
A sketch of the criterion together with a greedy forward selection is given below.
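A minimal sketch of the two ingredients named above, a criterion and a search strategy: a normalized interclass-distance score and a greedy forward selection that incrementally adds features. The two-class toy data and the exact form of the score are illustrative assumptions, not the formula from the original slide.

```python
import numpy as np

def interclass_score(X1, X2, features):
    """Sum over the chosen features of (mean difference)^2 / (variance_1 + variance_2)."""
    score = 0.0
    for i in features:
        m1, m2 = X1[:, i].mean(), X2[:, i].mean()
        s1, s2 = X1[:, i].var(), X2[:, i].var()
        score += (m1 - m2) ** 2 / (s1 + s2 + 1e-12)
    return score

def forward_select(X1, X2, n):
    """Greedily add, n times, the feature that most improves the criterion."""
    chosen, remaining = [], list(range(X1.shape[1]))
    for _ in range(n):
        best = max(remaining, key=lambda f: interclass_score(X1, X2, chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: 5 measurements; only features 0 and 3 actually separate the classes.
rng = np.random.default_rng(4)
X1 = rng.normal(0, 1, (100, 5)); X1[:, 0] += 3; X1[:, 3] += 2
X2 = rng.normal(0, 1, (100, 5))
print(forward_select(X1, X2, n=2))   # expected to pick features 0 and 3 first
```

Greedy forward selection is one of the sub-optimal strategies listed above: it evaluates only O(m·n) feature subsets instead of all C(m, n) combinations, at the cost of possibly missing the globally best subset.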