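Pattern Recognition
Chapter 5: FEATURE SELECTION
1 November 2024

FEATURE SELECTION

The goals:
- Select the "optimum" number $l$ of features.
- Select the "best" $l$ features.

Large $l$ has a three-fold disadvantage:
- High computational demands
- Low generalization performance
- Poor error estimates

Given $N$ training patterns:
- $l$ must be large enough to learn what makes classes different and what makes patterns in the same class similar.
- $l$ must be small enough not to learn what makes patterns of the same class different.
- In practice, a suitably small $l$ relative to $N$ has been reported to be a sensible choice for a number of cases.

Once $l$ has been decided, choose the $l$ most informative features.
Best: large between-class distance, small within-class variance.

The basic philosophy
- Discard individual features with poor information content.
- The remaining information-rich features are examined jointly as vectors.

Feature Selection based on Statistical Hypothesis Testing

The goal: for each individual feature, find out whether the values it takes in the different classes differ significantly. That is, answer:
- $H_1$: The values differ significantly (the feature separates the classes).
- $H_0$: The values do not differ significantly (the feature does not separate the classes).
If they do not differ significantly, reject the feature from subsequent stages.

Hypothesis Testing Basics

The steps:
- $N$ measurements $x_i,\ i = 1, 2, \ldots, N$ are known.
- Define a function of them, the test statistic $q = f(x_1, x_2, \ldots, x_N)$, so that $p_q(q; \theta)$ is easily parameterized in terms of $\theta$.
- Let $D$ be an interval where $q$ has a high probability to lie under $H_0$, i.e., under $p_q(q \mid \theta_0)$.
- Let $\bar{D}$ be the complement of $D$. $D$ is the acceptance interval and $\bar{D}$ is the critical interval.
- If $q$, resulting from $x_1, x_2, \ldots, x_N$, lies in $D$ we accept $H_0$; otherwise we reject it.

Probability of an error:
  $p_q(q \in \bar{D} \mid H_0) = \rho$
$\rho$ is preselected and it is known as the significance level. The acceptance interval $D$ carries probability $1 - \rho$ under $H_0$.

Application: The known variance case

Let $x$ be a random variable and the experimental samples $x_i,\ i = 1, 2, \ldots, N$ are assumed mutually independent. Also let
  $E[x] = \mu, \quad E[(x - \mu)^2] = \sigma^2$
Compute the sample mean
  $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
This is also a random variable, with mean value
  $E[\bar{x}] = \frac{1}{N} \sum_{i=1}^{N} E[x_i] = \mu$
That is, it is an unbiased estimator.

The variance $\sigma_{\bar{x}}^2$:
  $E[(\bar{x} - \mu)^2] = \frac{1}{N^2} \sum_{i=1}^{N} E[(x_i - \mu)^2] + \frac{1}{N^2} \sum_{i} \sum_{j \ne i} E[(x_i - \mu)(x_j - \mu)]$
Due to independence the cross terms vanish, so
  $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{N}$
That is, it is asymptotically efficient.

Hypothesis test:
  $H_1: E[x] \ne \hat{\mu}$
  $H_0: E[x] = \hat{\mu}$
Test statistic: define the variable
  $q = \frac{\bar{x} - \hat{\mu}}{\sigma / \sqrt{N}}$
Central limit theorem: under $H_0$,
  $p_{\bar{x}}(\bar{x}) = \frac{\sqrt{N}}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{N (\bar{x} - \hat{\mu})^2}{2 \sigma^2} \right)$
Thus, under $H_0$,
  $p_q(q) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{q^2}{2} \right)$, i.e., $q \sim N(0, 1)$

The decision steps:
- Compute $q$ from $x_i,\ i = 1, 2, \ldots, N$.
- Choose the significance level $\rho$.
- Compute from $N(0, 1)$ tables the acceptance interval $D = [-x_\rho, x_\rho]$, which has probability $1 - \rho$.

An example: A random variable $x$ has variance $\sigma^2 = (0.23)^2$. $N = 16$ measurements are obtained, giving $\bar{x} = 1.35$. The significance level is $\rho = 0.05$. Test the hypothesis
  $H_0: \mu = \hat{\mu}$ against $H_1: \mu \ne \hat{\mu}$

Since $\sigma^2$ is known, $q = \frac{\bar{x} - \hat{\mu}}{\sigma / 4}$ is $N(0, 1)$. From tables, we obtain the values $x_\rho$ that define the acceptance intervals $[-x_\rho, x_\rho]$ for the normal $N(0, 1)$:

  1 - rho   0.8    0.85   0.9    0.95   0.98   0.99   0.998  0.999
  x_rho     1.28   1.44   1.64   1.96   2.32   2.57   3.09   3.29

Thus
  $\mathrm{Prob}\left\{ -1.96 < \frac{\bar{x} - \hat{\mu}}{0.23 / 4} < 1.96 \right\} = 0.95$
or
  $\mathrm{Prob}\{ 1.237 < \hat{\mu} < 1.463 \} = 0.95$
Since $\hat{\mu}$ lies within the above acceptance interval, we accept $H_0$, i.e., $\mu = \hat{\mu}$.
The interval $[1.237, 1.463]$ is also known as the confidence interval at the $1 - \rho = 0.95$ level.
We say that: there is no evidence at the 5% level that the mean value is not equal to $\hat{\mu}$.

The Unknown Variance Case

Estimate the variance. The estimate
  $\hat{\sigma}^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2$
is unbiased, i.e.,
  $E[\hat{\sigma}^2] = \sigma^2$
Define the test statistic
  $q = \frac{\bar{x} - \hat{\mu}}{\hat{\sigma} / \sqrt{N}}$
This is no longer Gaussian. If $x$ is Gaussian, then $q$ follows a t-distribution with $N - 1$ degrees of freedom.

Table of acceptance intervals for the t-distribution (entries are the half-width of $D$ for the given $1 - \rho$):

  Degrees of freedom   1 - rho = 0.9   0.95   0.975   0.99
  12                   1.78            2.18   2.56    3.05
  13                   1.77            2.16   2.53    3.01
  14                   1.76            2.15   2.51    2.98
  15                   1.75            2.13   2.49    2.95
  16                   1.75            2.12   2.47    2.92
  17                   1.74            2.11   2.46    2.90
  18                   1.73            2.10   2.44    2.88

Application in Feature Selection

The goal here is to test against zero the difference $\mu_1 - \mu_2$ of the respective means in $\omega_1$, $\omega_2$ of a single feature.
- Let $x_i,\ i = 1, \ldots, N$ be the values of a feature in $\omega_1$.
- Let $y_i,\ i = 1, \ldots, N$ be the values of the same feature in $\omega_2$.
- Assume that in both classes $\sigma_1^2 = \sigma_2^2 = \sigma^2$ (unknown or not).
The test becomes
  $H_0: \Delta\mu = \mu_1 - \mu_2 = 0$
  $H_1: \Delta\mu \ne 0$

Define $z = x - y$. Obviously $E[z] = \mu_1 - \mu_2$.
Define the average
  $\bar{z} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i) = \bar{x} - \bar{y}$

Known Variance Case: Define
  $q = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sigma \sqrt{2 / N}}$
This is $N(0, 1)$ and one follows the procedure as before.

Unknown Variance Case: Define the test statistic
  $q = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{s_z \sqrt{2 / N}}$
  $s_z^2 = \frac{1}{2N - 2} \left( \sum_{i=1}^{N} (x_i - \bar{x})^2 + \sum_{i=1}^{N} (y_i - \bar{y})^2 \right)$
$q$ follows a t-distribution with $2N - 2$ degrees of freedom. Then apply the appropriate tables as before.

Example: The values of a feature in two classes are:
  $\omega_1$: 3.5, 3.7, 3.9, 4.1, 3.4, 3.5, 4.1, 3.8, 3.6, 3.7
  $\omega_2$: 3.2, 3.6, 3.1, 3.4, 3.0, 3.4, 2.8, 3.1, 3.3, 3.6
Test if the mean values in the two classes differ significantly, at the significance level $\rho = 0.05$.

We have
  $\omega_1$: $\bar{x} = 3.73,\ \hat{\sigma}_1^2 = 0.0601$
  $\omega_2$: $\bar{y} = 3.25,\ \hat{\sigma}_2^2 = 0.0672$
For $N = 10$:
  $s_z^2 = \frac{1}{2} (\hat{\sigma}_1^2 + \hat{\sigma}_2^2) = 0.0637$
  $q = \frac{(\bar{x} - \bar{y}) - 0}{s_z \sqrt{2 / 10}} = 4.25$
From the table of the t-distribution with $2N - 2 = 18$ degrees of freedom and $\rho = 0.05$, we obtain $D = [-2.10, 2.10]$, and since $q = 4.25$ is outside $D$, $H_1$ is accepted and the feature is selected.

Class Separability Measures (p. 113)

So far we have considered only individual features, one at a time. This cannot take into account the correlation between features. For example, two features may both be rich in information, yet because they are correlated there is no need to keep both. To study possible correlations, we must examine several features jointly, as elements of a vector.

To this end:
- Discard features that are poor in information, by means of a statistical test.
- Choose the maximum number $l$ of features to be used. This is dictated by the specific problem (e.g., the number $N$ of available training patterns and the type of the classifier to be adopted).
- Combine remaining features to sear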
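
The two worked examples in this chapter (the known-variance interval and the two-class t-test) can be checked numerically. The following is a minimal Python sketch, not part of the original slides. The function names are hypothetical, the critical values 1.96 and 2.10 are copied from the tables in the slides, and the sample mean 1.35 is an assumption taken as the midpoint of the stated interval [1.237, 1.463].

```python
import math
from statistics import mean


def known_variance_interval(x_bar, sigma, n, x_rho):
    """Confidence interval x_bar +/- x_rho * sigma / sqrt(n) for a known variance."""
    half_width = x_rho * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


def pooled_t_statistic(x, y):
    """Test statistic q = (mean(x) - mean(y)) / (s_z * sqrt(2/N)) under H0: mu_1 = mu_2.

    Assumes two samples of equal size N and equal (unknown) class variances.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    # Pooled sum of squared deviations, divided by 2N - 2 degrees of freedom.
    sum_sq = sum((v - x_bar) ** 2 for v in x) + sum((v - y_bar) ** 2 for v in y)
    s_z = math.sqrt(sum_sq / (2 * n - 2))
    return (x_bar - y_bar) / (s_z * math.sqrt(2.0 / n))


# Known variance example: sigma = 0.23, N = 16, 1 - rho = 0.95 gives x_rho = 1.96.
# x_bar = 1.35 is assumed (midpoint of the interval stated in the slides).
low, high = known_variance_interval(x_bar=1.35, sigma=0.23, n=16, x_rho=1.96)

# Two-class example: feature values in omega_1 and omega_2.
omega_1 = [3.5, 3.7, 3.9, 4.1, 3.4, 3.5, 4.1, 3.8, 3.6, 3.7]
omega_2 = [3.2, 3.6, 3.1, 3.4, 3.0, 3.4, 2.8, 3.1, 3.3, 3.6]

# Table value for 18 degrees of freedom and 1 - rho = 0.95.
T_CRITICAL_18_DOF = 2.10

q = pooled_t_statistic(omega_1, omega_2)
feature_selected = abs(q) > T_CRITICAL_18_DOF  # q outside D means H1 is accepted

print(f"confidence interval: [{low:.3f}, {high:.3f}]")
print(f"q = {q:.2f}, feature selected: {feature_selected}")
```

Run as a script, this prints the interval [1.237, 1.463] and q = 4.25 with the feature selected, in agreement with the slides. For unequal sample sizes or other significance levels, the critical value would have to come from a fuller t-table or a statistics library rather than the hard-coded constant.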