贝叶斯学习综述课件

资源描述

机器学习与人工神经网络Machine learning and Artificial Neural Networks西安交通大学电信学院自动化系杜友田2第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络 3.4.1 贝叶斯网络概念 3.4.2 贝叶斯网络推理第三章第三章贝叶斯贝叶斯学习方法学习方法3.1 极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络3极大似然参数估计所谓极大似然法(maximum likelihood method)是指选择使事件发生概率最大的可能情况的参数估计方法。极大似然法包括2个步骤：1）建立包括有该参数估计量的似然函数(likelihood function)2）根据实验数据求出似然函数达极值时的参数估计量或估计值Maximum likelihood estimation(MLE)第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络4 对于离散型随机变量，似然函数是多个独立事件的概率函数的乘积，该乘积是概率函数值，它是关于总体参数的函数。例如，一只大口袋里有红、白、黑3种球，采用复置抽样50次，得到红、白、黑3种球的个数分别为12，24，14，那么根据多项式的理论，可以建立似然函数为：其中p1，p2，p3分别为口袋中红、白、黑3种球的概率(p3=1p1p2)，它们是需要估计的。Maximum likelihood estimation(MLE)第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络5对于连续型随机变量，似然函数是每个独立随机观测值的概率密度函数的乘积，则似然函数为：若yi 服从正态分布，则，上式可变为：Maximum likelihood estimation(MLE)第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络6为了计算上的方便，一般将似然函数取对数，称为对数似然函数，因为取对数后似然函数由乘积变为加式，其表达式为：求极大似然估计量可以通过令对数似然函数对总体参数的偏导数等于0来获得，即当，有由此获得总体参数的极大似然估计量。Maximum likelihood estimation(MLE)第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络7设y1,y2,yn是正态总体的随机样本，求正态分布参数的极大似然估计量。Maximum likelihood estimation(MLE)第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络8Use parameterized pdfs,e.g.,Gaussian or mixtures of GaussiansCriteria for parameter estimationMaximum LikelihoodMaximum likelihood estimation(MLE)第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络9Gaussian caseUnknown meanUnkown mean andcovariance matrixDifferentiate wrt means and variances,set to zero and solve!Gaussian mixture caseDifferentiating wrt weights,means,and covariances leads to an iterative algorithm第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络10Assume there exist hidden or missing data Instead of maximizing the original incomplete likelihoodconsider the complete likelihoodExpectation Maximization(EM)第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络11Jensens inequality:第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络121)Maximizes ,therefore ,Hence,at each iteration,cannot decrease.2)When EM reaches a fixed point at some (a maximum of ),provided and are differentiable,must also be a stationary point of -not necessarily a local maximum.PDF Estimation第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络13第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络14The EM algorithm increases the complete likelihood in each iterationE-step:M-step:Expectation Maximization(EM)第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络15Introduce yi,if sample xi is generated from the k-th component,yi=kWhere 第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络16is the given parameterscan be thought of the prior probability of the j-th component,that is 第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络17第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络18第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络19第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络20第三章贝叶斯学习方法3.1 极大似然估计极大似然估计3.2 贝叶斯学习3.3 朴素贝叶斯方法3.4 贝叶斯网络21Goal:To determine the most probable hypothesis,given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.Prior probability of h,P(h):it reflects any background knowledge we have about the chance that h is a correct hypothesis(before having observed the data).Prior probability of D,P(D):it reflects the probability that training data D will be observed given no knowledge about which hypothesis h holds.Conditional Probability of observation D,P(D|h):it denotes the probability of observing data D hypothesis h.Bayesian decision第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习贝叶斯学习 3.2.1 贝叶斯决贝叶斯决策策3.3 朴素贝叶斯方法3.4 贝叶斯网络22Posterior probability of h,P(h|D):it represents the probability that h holds given the observed training data D.It reflects our confidence that h holds after we have seen the training data D and it is the quantity that Machine Learning researchers are interested in.Bayes Theorem allows us to compute P(h|D):P(h|D)=P(D|h)P(h)/P(D)Bayesian decision第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决贝叶斯决策策3.3 朴素贝叶斯方法3.4 贝叶斯网络23Goal:To find the most probable hypothesis h from a set of candidate hypotheses H given the observed data D.MAP Hypothesis,hMAP=argmax h H P(h|D)=argmax h H P(D|h)P(h)/P(D)=argmax h H P(D|h)P(h)If every hypothesis in H is equally probable a priori,we only need to consider the likelihood of the data D given h,P(D|h).Then,hMAP becomes the Maximum Likelihood,hML=argmax h H P(D|h)P(h)Bayesian decision第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决贝叶斯决策策3.3 朴素贝叶斯方法3.4 贝叶斯网络24给定：m个类，训练样本和未知数据目标：给每个输入数据标记一个类属性两个阶段：建模/学习：基于训练样本学习分类规则.分类/测试：对输入数据应用分类规则P(f1)f1鹅卵石救命稻草杆Pebbles StrawspebblesStrawsf2f1决策边界Bayesian decision第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决贝叶斯决策策3.3 朴素贝叶斯方法3.4 贝叶斯网络25什么是最优分类器?已有:类条件概率密度函数This is called the class-conditional probability describing the probability of occurrence of the features on category.欲求:后验概率make a decision that maximize the conditional probability of the object,given certain feature measurements.Also called posterior probability function.p(x|1)p(x|2)类条件概率密度函数p(1|x)后验概率p(2|x)Bayesian decision第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决贝叶斯决策策3.3 朴素贝叶斯方法3.4 贝叶斯网络26MAP决策:以后验概率为判决函数:Choose category/class that has the maximumThis produces the optimal performance:minimum probability of error:A classifier that achieves this optimal performance is called Bayesian classifier.Bayesian decision第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决贝叶斯决策策3.3 朴素贝叶斯方法3.4 贝叶斯网络27Bayesian decision第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决贝叶斯决策策3.3 朴素贝叶斯方法3.4 贝叶斯网络28决策的风险：做决策要考虑决策可能引起的损失。以医生根据白细胞浓度判断一个人是否患血液病为例：没病(1)被判为有病(2)，还可以做进一步检查，损失不大；有病(2)被判为无病(1)，损失严重。Decision Risk tableThe risk to make a decision :classify x(belong to class j)to class i,so:Decision Rule:Bayesian decision第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决贝叶斯决策策3.3 朴素贝叶斯方法3.4 贝叶斯网络29基于Bayes决策的最优分类器Bayes决策的三个前提：类别数确定各类的先验概率P(Ci)已知各类的条件概率密度函数p(x|Ci)已知问题的转换：基于样本估计P(Ci)和p(x|Ci)基于样本直接确定判别函数学习问题Bayesian decision第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决贝叶斯决策策3.3 朴素贝叶斯方法3.4 贝叶斯网络30类的先验概率P(Ci)的估计：用训练数据中各类出现的频率估计依靠经验类条件概率密度p(x|Ci)估计的两种主要方法：参数估计：概率密度函数的形式已知，而表征函数的参数未知，通过训练数据来估计最大似然估计最大后验估计非参数估计：密度函数的形式未知，也不作假设，利用训练数据直接对概率密度进行估计Parzen窗法Bayesian decision第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决贝叶斯决策策3.3 朴素贝叶斯方法3.4 贝叶斯网络31Bayesian estimationBayesian learning considers (the parameter vector to be estimated)to be a random variable.Before we observe the data,the parameters are described by a prior which is typically very broad.Once we observed the data,we can make use of Bayes formula to find posterior.Since some values of the parameters are more consistent with the data than others,the posterior is narrower than prior.This is Bayesian learning 第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估贝叶斯估计与预测计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络32第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估贝叶斯估计与预测计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络Bayesian estimation33Suppose we know the distribution of possible values of that is a prior Suppose we also have a loss function which measures the penalty for estimating when actual value is Then we may formulate the estimation problem as Bayesian decision making:choose the value of which minimizes the riskNote that the loss function is usually continuous.Bayesian estimation第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估贝叶斯估计与预测计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络34Example 1:(q is unidimensional).The total Bayesian risk here:We seek its minimum:At the which is a solution we haveThat is,for the the optimal Bayesian estimator for the parameter is the median of the distribution第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估贝叶斯估计与预测计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络Bayesian estimation35Example 2:(squared error).Total Bayesian risk:Again,in order to find the minimum,let the derivative be equal 0:The optimal estimator here is the conditional expectation of given the data X(n).q第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估贝叶斯估计与预测计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络Bayesian estimation36Maximum A-Posteriori(MAP)Estimation第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估贝叶斯估计与预测计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络37So,the we are looking for is Maximum A-Posteriori(MAP)EstimationRegularizationLikelihoodIn MAP estimator,the larger n(the size of the data),the less important the prior is.It can motivate us to omit the prior.第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估贝叶斯估计与预测计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络38Example 3:(0-1 loss)HenceMaximum A-Posteriori(MAP)Estimation第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估贝叶斯估计与预测计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络39Density function for x,given the training data set From the definition of conditional probability densitiesThe first factor is independent of X(n)since it just our assumed form for parameterized density.Therefore Bayesian learning第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估贝叶斯估计与预测计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络40Instead of choosing a specific value for ,the Bayesian approach performs a weighted average over all values of .If the weighting factor ,which is a posterior of peaks very sharply about some value we obtain .Bayesian learning第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估贝叶斯估计与预测计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络41Prediction based on three estimations第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估贝叶斯估计与预测计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络42朴素贝叶斯学习模型（NB）将训练实例表示成属性(特征)向量A和决策类别变量C。假定特征向量的各分量间相对于决策变量是相对独立的，也就是说各分量独立地作用于决策变量。降低了学习的复杂性在许多领域，表现出相当的健壮性和高效性NB的特点结构简单只有两层结构推理复杂性与网络节点个数呈线性关系Ca1a2an-1anNave Bayesian classifier第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方朴素贝叶斯方法法3.4 贝叶斯网络43NB假设：设样本A表示成属性向量，如果属性ak对于给定的类别独立，那么P(A|Ci)可以分解成几个分量的积：简单贝叶斯分类(SBC:Simple Bayesian Classification)一般认为，只有在独立性假定成立的时候，SBC才能获得精度最优的分类效率；或者在属性相关性较小的情况下，能获得近似最优的分类效果。Nave Bayesian classifier第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方朴素贝叶斯方法法3.4 贝叶斯网络44R Saw“Return of the King”more than onceZ Live in zipcode 15213C Brought Coat to ClassroomJ Person is a JuniorWhat parameters arestored in the CPTs ofthis Bayes Net?Nave Bayesian classifier第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方朴素贝叶斯方法法3.4 贝叶斯网络45R Saw“Return of the King”more than onceZ Live in zipcode 15213C Brought Coat to ClassroomJ Person is a JuniorP(J)=P(C|J)=P(C|J)=P(Z|J)=P(Z|J)=P(R|J)=P(R|J)=Suppose we have a database from 20 people who attended a lecture.How could we use that to estimate the values in this CPT?Nave Bayesian classifier第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方朴素贝叶斯方法法3.4 贝叶斯网络46R Saw“Return of the King”more than onceZ Live in zipcode 710000C Brought Coat to ClassroomJ Person is a JuniorP(J)=P(C|J)=P(C|J)=P(Z|J)=P(Z|J)=P(R|J)=P(R|J)=A new person shows up at class wearing an“I live in Baoji city where I saw all the Lord of The Rings Movies every night”overcoat.What is the probability that they are a Junior?Input AttributesOuput AttributesNave Bayesian classifier第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方朴素贝叶斯方法法3.4 贝叶斯网络47Nave Bayesian classifier第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方朴素贝叶斯方法法3.4 贝叶斯网络481.Estimate P(Y=v)as fraction of records with Y=v2.Estimate P(Xi=u|Y=v)as fraction of“Y=v”records that alsohave X=u.3.To predict the Y value given observations of all the Xi values,computeThe General CaseNave Bayesian classifier第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方朴素贝叶斯方法法3.4 贝叶斯网络49Nave Bayesian classifier第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方朴素贝叶斯方法法3.4 贝叶斯网络50124536145263通俗的说，图是由一些点和连接这些点的线组成的。Node,edge图论主要在离散数学中使用，应用于计算机相关科学图描述复杂系统成分之间的关系。图G有两个集合组成：非空的结点集V和有限的边集E若集合V中的结点组成点对，或者说是图的边则称这两个节点邻接邻接，否则称为非邻接非邻接。Bayesian network第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络51有向图(directed graph)非空有穷集合V，称G=(V,E)为一个有向图。其中V中的元素称为顶点顶点(vertex)，V称为顶点集顶点集；是有方向的，称为从顶点v1到顶点v2的有有向边向边，或者叫做弧弧(arc)，E为有向边集；有向边集；v1称为前导，v2称为后继 124536V=1,2,3,4,5,6,7，|V|=7E=(1,2),(2,2),(2,4),(4,5),(4,1),(5,4),(6,3)，|E|=7自循环7孤立点Bayesian network第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络52无向图(undirected graph)非空有穷集合V，称G=(V,E)为一个有向图。其中V中的元素称为顶点顶点，V称为顶点集顶点集；是无方向的，称从无向边无向边，E为无向边集；无向边集；和是一样的 ADEFBCV=A,B,C,D,E,F|V|=6E=A,B,A,E,B,E,C,F|E|=4Bayesian network第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络53Directed acyclic graphNodes are variables(discrete or continuous)Arcs indicate dependence between variablesConditional Probabilities(local distributions)Missing arcs implies conditional independenceIndependencies+local distributions=joint distributionX1X2X3Bayesian network第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络54WetGrassCloudyRainSprinklerBayesian network(example)55变量-阴天(C),下雨(R),洒水(S),草地湿(W)CRSWProb.FFFF0.01FFFT0.04FFTF0.05FFTT0.01FTFF0.02FTFT0.07FTTF0.2FTTT0.1TFFF0.01TFFT0.07TFTF0.13TFTT0.04TTFF0.06TTFT0.05TTTF0.1TTTT0.0524-1 独立参数模型对草地是否是潮湿的进行建模如果下列条件满足，草地是湿的下雨洒水车洒水是否下雨依赖于是否阴天如果阴天则不太可能洒水Bayesian network(example)第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络56WetGrassCloudyRainSprinklerWSRFTFF10FT0.10.9TF0.10.9TT0.010.99SCFTF0.50.5T0.90.1RCFTF0.80.2T0.20.8CFT0.50.59 独立参数Bayesian network(example)第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络57You have a new burglar alarm installedIt is reliable about detecting burglary,but responds to minor earthquakesTwo neighbors(John,Mary)promise to call you at work when they hear the alarmJohn always calls when hears alarm,but confuses alarm with phone ringing(and calls then also)Mary likes loud music and sometimes misses alarm!Given evidence about who has and hasnt called,estimate the probability of a burglaryBayesian network(example)第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络58Im at work,John calls to say my alarm is ringing,Mary doesnt call.Is there a burglary?5 Variables Network topology reflects causal knowledgeBayesian network(example)第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络59l In order for a Bayesian network to model a probability distribution,the following must be true by definition:Each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.l This impliesBayesian network第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络60Global and local semanticsGlobal semantics defines the full joint distribution as the product of the local conditional distributionsFor defining this product,a linear ordering of the nodes of the network has to be given:X1 Xn P(X1 Xn)=ni=1 P(Xi|Parents(Xi)ordering in the example:B,E,A,J,MP(J M A B E)=P(B)P(E)P(A|B E)P(J|A)P(M|A)第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络61Local semantics defines a series of statements of conditional independenceEach node is conditionally independent of its nondescendants given its parents:I(X,Nondescentents(X)|Parents(X)ExamplesX Y Z I(X,Y)?I(X,Z)?X Y Z I(X,Z|Y)?X Y Z I(X,Y)?I(X,Z)?Global and local semantics第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络62Another important independence property is implied by the topological semantics:a node is conditionally independent of all other nodes in the network,given its parents,children,and childrens parents that is,given its Markov blanket.Markov blanketGlobal and local semantics第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络63If a local semantics in form of the independeny statements is given,i.e.I(X,Nondescendants(X)|Parents(X)for each node X of the network,then the global semantics results:P(X1 Xn)=ni=1 P(Xi|Parents(Xi),and vice versa.For proving local semantics global semantics,we assume an ordering of the variables that makes sure that parents appear earlier in the ordering:Xi is parent of Xj then Xi Xj Global and local semantics第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络64P(X1,Xn)=ni=1 P(Xi|X1,Xi 1)chain ruleParents(Xi)X1,Xi 1P(Xi|X1,Xi 1)=P(Xi|Parents(Xi),Rest)local semantics:I(X,Nondescendants(X)|Parents(X)The elements of Rest are nondescendants of Xi,hence we can skip RestHence,P(X1 Xn)=ni=1 P(Xi|Parents(Xi),Global and local semantics第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络65Global and local semantics第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络66Assume:Then Bayess ruleChain ruleBy assumptionBayess ruleGlobal and local semantics第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络67Let I(X,Z|Y)represent X and Z being conditionallyindependent given Y.Yes,just as in previous example:All Xs parents given,and Z is not a descendant.I(X,Z|Y)?zUXVI(X,Z|U)?I(X,Z|U,V)?Global and local semantics第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络68 X has no parents,so were know all its parents values trivially Z is not a descendant of X What if we do know the value of Y,though?Or one of its descendants?Global and local semantics第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络691.Choose an ordering of variables X1,Xn2.For i=1 to nadd Xi to the networkselect parents from X1,Xi 1 such that p(Xi|Parents(Xi)=p(Xi|X1,Xi 1)This choice guarantees the global semantics:p(X1,Xn)=ni=1 p(Xi|X1,Xi 1)(chain rule)=ni=1 p(Xi|Parents(Xi)by constructionConstructing belief networks第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络70What is an appropriate ordering?In principle,each ordering is allowed!heuristic rule:start with causes,go to direct effects(B,E),A,(J,M)4 possible orderings Constructing belief networksEarthquake example第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络71Suppose we chose the ordering M,J,A,B,EP(J|M)=P(J)?P(A|J,M)=P(A|J)?P(A|J,M)=P(A)?P(B|A,J,M)=P(B|A)?P(B|A,J,M)=P(B)?P(E|B,A,J,M)=P(E|A)?P(E|B,A,J,M)=P(E|A,B)?MaryCallsJohnCallsAlarmBurglaryEarthquakeNoNoYesNoYesNoConstructing belief networks第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络72Types of inference:Q query variable,E evidence variable诊断预测因果推理WetGrassCloudyRainSprinklerTask of inference is to compute the posterior probability distribution for a set of query variables,given exact values for some evidence variables.Inference第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络73If E are the evidence(observed)variables and Y are the other(unobserved)variables,then:P(X|E)=P(X,E)=Y P(X,E,Y)Sum all of the terms(atomic event probabilities)from the full joint distributionEach P(X,E,Y)term can be computed using the chain ruleComputationally expensive!Inference by enumeration第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络74P(xi)=i P(xi|i)P(i)Suppose we want P(B=true),and only the value of J and M are given as trueInference by enumerationExampleBEAJM第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络75Inference by enumeration第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络76smartstudypreparedfairpassp(smart)=.8p(study)=.6p(fair)=.9p(prep|)smartsmartstudy.9.7study.5.1p(pass|)smartsmartprepprepprepprepfair.9.7.7.2fair.3.2.05.01Query:What is the probability that a student studied,given that he/she passed the exam?Inference by enumeration第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.3 朴素贝叶斯方法3.4 贝叶斯网络贝叶斯网络77Basically just enumeration,but with caching of local calculationsLinear for polytrees(singly connected BNs)Potentially exponential for multiply connected BNsExact inference in Bayesian networks is NP-hard!Junction tree algorithms are an extension of variable elimination methods that compute posterior probabilities for all nodes in a BN simultaneouslyInference by Variable elimination第三章贝叶斯学习方法3.1 极大似然估计3.2 贝叶斯学习 3.2.1 贝叶斯决策 3.2.2 贝叶斯估计与预测3.

展开阅读全文

贝叶斯学习综述课件

最新文档