The GBDT Algorithm and Its Applications

Resource description

Gradient Boosting Decision Tree and Its Application

Class: *
Student: *
Student ID: *
01/26/16

Report outline

Part 1: Introduction (concepts)
  Decision trees
  Boosting methods
  Loss functions
  Definition of GBDT
Part 2: GBDT algorithm principles
  Additive model
  Forward stagewise algorithm
  Boosting tree algorithm
  Gradient boosting tree algorithm
  Regularization
Part 3: GBDT applications
  Scope of application
  Example: CTR prediction
  GBDT feature transformation
  LR+GBDT
Part 4: Summary

Part 1: Concepts

Decision tree: a method that partitions the feature space with hyperplanes. There are two kinds, classification trees and regression trees. A single decision tree has low time complexity and is easy to visualize, but it overfits easily.

Boosting with decision trees: an iterative process in which each new round of training aims to improve on the previous result.
  Traditional boosting: reweights the correctly and incorrectly classified samples. After each step, the weights of misclassified points are increased and the weights of correctly classified points are decreased.
  GB (Gradient Boosting): each new model is built along the gradient-descent direction of the loss function of the model built so far.

Loss function: describes how unreliable the model is. The larger the loss, the more error-prone the model. Different loss functions have different gradient expressions.

GBDT (Gradient Boosting Decision Tree): an iterative decision tree algorithm made up of many decision trees, whose outputs are summed to give the final result. The algorithm is also known as MART (Multiple Additive Regression Tree), GBRT (Gradient Boost Regression Tree), TreeNet, and Treelink.

Part 2: GBDT algorithm principles

Boosting trees carry out the optimization using an additive model and the forward stagewise algorithm.

A decision tree can be expressed with a parameter that describes how the tree partitions the input space into regions and which constant it assigns to each region.

For regression problems, boosting trees use the forward stagewise algorithm. With squared-error loss, each step of the regression boosting tree algorithm therefore only needs to fit the residuals of the current model.

When the loss function is squared loss or exponential loss, each optimization step is simple, but for a general loss function it is not. Friedman proposed the Gradient Boosting algorithm, which approximates steepest descent. The key idea is to take the value of the negative gradient of the loss function at the current model as an approximation of the residual in the regression boosting tree algorithm, and to fit a regression tree to it.

Stochastic Gradient Boosting: when the number of samples N is large, fitting is very time-consuming. In that case each tree can be fitted on a randomly selected subset of the data.

Regularization
  Cross validation.
  Shrinkage: the parameter v (0 < v < 1) can be regarded as the learning rate of the boosting method. With a very small v, a larger number of iterations M is needed to reach a comparable training error, and vice versa. In general, a smaller v performs better on an independent test set, but it requires a larger M and therefore more time.
  Subsampling: using the stochastic gradient boosting described above not only reduces training time but also has a bagging-like effect, because the random sampling in each round reduces the chance of overfitting.

Part 3: GBDT applications

Scope of application
  GBDT can be used for almost all regression problems (linear or nonlinear).
  It can also be used for binary classification (set a threshold; values above it are positive, the rest negative). It is less suited to multiclass classification.
  Ranking problems.
  Widely used in data mining competitions (model ensembling).
  Advertising recommendation.

CTR prediction: Click-Through Rate Prediction for advertisements.
The most widely used model in CTR prediction is LR (Logistic Regression). LR is a generalized linear model. Compared with a traditional linear model, LR applies the logistic (inverse-logit) transform to map the function value into the interval (0, 1), and the mapped value is the CTR estimate.
As a linear model, LR is easy to parallelize, and handling hundreds of millions of training samples is not a problem. However, the learning capacity of a linear model is limited, so extensive feature engineering is needed to identify effective features and feature combinations in advance, which indirectly strengthens LR's ability to learn nonlinear relationships.

Feature combinations are critical in an LR model, but they cannot be obtained simply by taking the Cartesian product of features. They have to rely on human experience, which costs time and effort and does not necessarily improve results. How to discover effective features and feature combinations automatically, compensate for the limits of human experience, and shorten the LR feature experiment cycle is a problem that urgently needs solving.
The 2014 Facebook paper described using GBDT (Gradient Boost Decision Tree) to solve the feature combination problem for LR. A Kaggle competition entry later applied the same idea with GBDT+FM, and the combination of GBDT and LR began to attract attention in industry.

GBDT+LR
The way GBDT works gives it a natural advantage: it can discover many discriminative features and feature combinations. The paths of the decision trees can be used directly as input features for LR, which removes the step of manually searching for features and feature combinations.

Each path of a tree is a discriminative path produced by splits that minimize a criterion such as mean squared error, so the features and feature combinations derived from that path are relatively discriminative. In theory, the result should be no worse than handcrafted features based on human experience.

Experiment
Kaggle competition: Display Advertising Challenge

References

Friedman J H. Greedy function approximation: a gradient boosting machine[J]. Annals of Statistics, 2001: 1189-1232.
Friedman J H. Stochastic gradient boosting[J]. Computational Statistics & Data Analysis, 2002, 38(4): 367-378.
He X, Pan J, Jin O, et al. Practical Lessons from Predicting Clicks on Ads at Facebook[C]// Eighth International Workshop on Data Mining for Online Advertising. ACM, 2014: 1-9.
Yuan T T, Chen Z, Mathieson M. Predicting eBay listing conversion[C]// Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2011: 1335-1336.
Tyree S, Weinberger K Q, Agrawal K, et al. Parallel boosted regression trees for web search ranking[C]// Proceedings of the 20th International Conference on World Wide Web. ACM, 2011: 387-396.

Thank you!
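
Supplement to Part 2 (formulas). Part 2 describes the boosting tree model, the forward stagewise step, and the negative-gradient approximation in words only. The block below is a hedged reconstruction in standard textbook notation, not the deck's own formulas. The symbols T, \Theta_m, R_j, c_j, J, and r_{mi} are assumptions introduced here; v and M are the shrinkage parameter and the number of iterations named in the Regularization section.

```latex
% Additive model of M trees; each tree is piecewise constant on regions R_j
f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m), \qquad
T(x; \Theta) = \sum_{j=1}^{J} c_j \, I(x \in R_j)

% Forward stagewise step: fit the m-th tree with f_{m-1} held fixed
\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N}
    L\bigl(y_i,\; f_{m-1}(x_i) + T(x_i; \Theta_m)\bigr)

% Squared loss: the step reduces to fitting the current residual r
L\bigl(y,\; f_{m-1}(x) + T(x; \Theta_m)\bigr)
    = \bigl(r - T(x; \Theta_m)\bigr)^2, \qquad r = y - f_{m-1}(x)

% General loss: the negative gradient at the current model replaces the residual
r_{mi} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}

% Negative gradients for some common losses
L = \tfrac{1}{2}(y - f)^2 \;\Rightarrow\; r = y - f
L = |y - f| \;\Rightarrow\; r = \operatorname{sign}(y - f)
L = -\bigl[y \log p + (1 - y)\log(1 - p)\bigr],\; p = \frac{1}{1 + e^{-f}},\; y \in \{0, 1\}
    \;\Rightarrow\; r = y - p

% Update with shrinkage
f_m(x) = f_{m-1}(x) + v \, T(x; \hat{\Theta}_m), \qquad 0 < v < 1
```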
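
Supplement to Part 2 (code sketch). The residual-fitting loop, shrinkage, and subsampling described in Part 2 can be illustrated with a minimal pure-Python sketch. It assumes squared-error loss, a single numeric feature, and depth-1 trees (stumps). All function names and default values are hypothetical choices made for this illustration, not part of the original deck or of any library.

```python
import random

def fit_stump(xs, targets):
    """Least-squares regression stump on one feature.

    Returns (threshold, left_value, right_value).
    """
    mean = sum(targets) / len(targets)
    best = (float("inf"), xs[0], mean, mean)  # fallback when no split is possible
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        c_l, c_r = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - c_l) ** 2 for t in left) + sum((t - c_r) ** 2 for t in right)
        if sse < best[0]:
            best = (sse, thr, c_l, c_r)
    return best[1:]

def predict_stump(stump, x):
    thr, c_l, c_r = stump
    return c_l if x <= thr else c_r

def fit_gbdt(xs, ys, n_trees=100, shrinkage=0.1, subsample=0.8, seed=0):
    """Gradient boosting for squared loss with shrinkage and row subsampling."""
    rng = random.Random(seed)
    f0 = sum(ys) / len(ys)                 # initial constant model
    preds = [f0] * len(xs)
    k = min(len(xs), max(1, int(subsample * len(xs))))
    stumps = []
    for _ in range(n_trees):
        # For L = (y - f)^2 / 2 the negative gradient is the residual y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        # Stochastic gradient boosting: fit each tree on a random subset.
        idx = rng.sample(range(len(xs)), k)
        stump = fit_stump([xs[i] for i in idx], [residuals[i] for i in idx])
        stumps.append(stump)
        # Shrinkage: add only a fraction of the new tree's output.
        preds = [p + shrinkage * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return f0, shrinkage, stumps

def predict_gbdt(model, x):
    f0, shrinkage, stumps = model
    return f0 + shrinkage * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - predict_gbdt(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy usage: approximate y = x^2 on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
model = fit_gbdt(xs, ys)
print("training MSE:", mse(model, xs, ys))
```

Lowering `shrinkage` while keeping `n_trees` fixed leaves a larger training error, which matches the trade-off between v and M noted in the Regularization section.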
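
Supplement to Part 3 (code sketch). The GBDT+LR idea, in which the path a sample takes through each tree becomes an input feature for LR, can be sketched as follows. This is a toy illustration under stated assumptions: the two trees are hand-written stand-ins for trained GBDT trees, the one-hot leaf encoding is one common way to turn a tree path into a feature, and every name here (`TREES`, `transform`, `fit_lr`, `predict_ctr`) is hypothetical.

```python
import math

# Hypothetical stand-ins for two trained GBDT trees. An internal node is
# (feature_index, threshold, left_subtree, right_subtree); a leaf is an int id.
TREE_1 = (0, 0.5, 0, (1, 2.0, 1, 2))   # 3 leaves
TREE_2 = (1, 1.0, 0, 1)                # 2 leaves
TREES = [(TREE_1, 3), (TREE_2, 2)]     # (tree, number_of_leaves)

def leaf_index(tree, x):
    """Follow the root-to-leaf path for sample x and return the leaf id."""
    while not isinstance(tree, int):
        feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree

def transform(x, trees=TREES):
    """One-hot encode the leaf reached in each tree and concatenate the codes."""
    features = []
    for tree, n_leaves in trees:
        one_hot = [0.0] * n_leaves
        one_hot[leaf_index(tree, x)] = 1.0
        features.extend(one_hot)
    return features

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_lr(rows, labels, lr=0.5, epochs=200):
    """Batch gradient descent for logistic regression (weights plus bias)."""
    w = [0.0] * len(rows[0])
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for row, y in zip(rows, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, row))) - y
            grad_b += err
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_ctr(x, w, b):
    """Estimated click probability: LR applied to the tree-derived features."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, transform(x))))

# Toy data: two raw features per sample, label 1 means "clicked".
samples = [[0.2, 3.0], [0.1, 0.5], [0.9, 1.5], [0.8, 0.7], [0.7, 2.5]]
labels = [0, 0, 1, 1, 1]
w, b = fit_lr([transform(x) for x in samples], labels)
print([round(predict_ctr(x, w, b), 3) for x in samples])
```

Each one-hot position corresponds to one root-to-leaf path, that is, one conjunction of split conditions, so the LR weights are learned over feature combinations that the trees selected rather than over hand-built crosses.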
