Spark MLlib Regression: Decision Trees
(1) Decision Tree Concepts
1. A comparison of decision tree algorithms (ID3, C4.5, CART):
1) When selecting the split attribute at the root and at each internal node, ID3 uses information gain as its criterion. The drawback of information gain is that it is biased toward attributes with many distinct values, which in some cases contribute little useful information.
2) ID3 can only build trees over datasets whose descriptive attributes are all discrete; C4.5 and CART can handle both discrete and continuous attributes.
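To make the information-gain bias concrete, here is a small plain-Python sketch (illustrative only; the function names are mine, not from any library). An attribute that is unique per record, such as an ID column, achieves maximal information gain even though it generalizes to nothing; this is exactly the weakness C4.5's gain ratio was designed to correct:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    n = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(subset) / n * entropy(subset) for subset in split.values())
    return entropy(labels) - remainder

# Toy data: attribute 0 has two values; attribute 1 is a unique ID per row.
rows   = [("a", 1), ("a", 2), ("b", 3), ("b", 4)]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0))  # 1.0 -- a genuinely useful split
print(information_gain(rows, labels, 1))  # 1.0 -- a useless ID column scores just as high
```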
2. A worked C4.5 example (see http://m.blog.csdn.net/article/details?id=44726921)
C4.5 post-pruning strategy: primarily pessimistic error pruning (see http://www.cnblogs.com/zhangchaoyang/articles/2842490.html)
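A rough plain-Python sketch of the pessimistic-pruning decision rule (following Quinlan's continuity-corrected formulation; the helper names and the example numbers are mine, not from the linked article): a subtree is replaced by a leaf when the leaf's pessimistic error is within one standard error of the subtree's.

```python
import math

def pessimistic_error(misclassified, leaves):
    """Training errors plus a 0.5 continuity correction per leaf."""
    return misclassified + 0.5 * leaves

def should_prune(subtree_errors, subtree_leaves, leaf_errors, n_samples):
    """Prune when the single-leaf pessimistic error is within one standard
    error of the subtree's pessimistic error."""
    e_subtree = pessimistic_error(subtree_errors, subtree_leaves)
    se = math.sqrt(e_subtree * (n_samples - e_subtree) / n_samples)
    e_leaf = pessimistic_error(leaf_errors, 1)
    return e_leaf <= e_subtree + se

# A subtree with 4 leaves misclassifies 7 of 100 cases; collapsing it to a
# single leaf would misclassify 10 -- close enough, so it gets pruned.
print(should_prune(7, 4, 10, 100))   # True
# A subtree whose removal would triple the errors is kept.
print(should_prune(2, 4, 20, 100))   # False
```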
(2) Applying Spark MLlib Decision Tree Regression
1,數(shù)據(jù)集來源及描述:參考http://www.cnblogs.com/ksWorld/p/6891664.html
2. Implementation:
2.1 構(gòu)建輸入數(shù)據(jù)格式:
```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

val file_bike = "hour_nohead.csv"
// Keep columns 2 .. length-4 as features (dropping the record id and date at
// the front and the three count columns at the end); the final column, the
// total count, becomes the label.
val file_tree = sc.textFile(file_bike).map(_.split(",")).map { x =>
  val feature = x.slice(2, x.length - 3).map(_.toDouble)
  val label = x(x.length - 1).toDouble
  LabeledPoint(label, Vectors.dense(feature))
}
println(file_tree.first())

// No categorical features declared: every feature is treated as continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val model_DT = DecisionTree.trainRegressor(file_tree, categoricalFeaturesInfo, "variance", 5, 32)
```

2.2 Model evaluation metrics (MSE, MAE, RMSLE)
```scala
val predict_vs_train = file_tree.map { point =>
  (model_DT.predict(point.features), point.label)
}
predict_vs_train.take(5).foreach(println)

// Mean squared error (MSE)
val mse = predict_vs_train.map(x => math.pow(x._1 - x._2, 2)).mean()
// Mean absolute error (MAE)
val mae = predict_vs_train.map(x => math.abs(x._1 - x._2)).mean()
// Root mean squared log error (RMSLE)
val rmsle = math.sqrt(predict_vs_train.map(x => math.pow(math.log(x._1 + 1) - math.log(x._2 + 1), 2)).mean())
println(s"mse is $mse and mae is $mae and rmsle is $rmsle")
// mse is 11611.485999495755 and mae is 71.15018786490428 and rmsle is 0.6251152586960916
```

(3) Improving Model Performance and Tuning Parameters
1. Transforming the target variable (training on the log of the target value), by changing the following lines:
```scala
LabeledPoint(math.log(label), Vectors.dense(feature))
```

and

```scala
val predict_vs_train = file_tree.map { point =>
  // exponentiate to map predictions and labels back to the original scale
  (math.exp(model_DT.predict(point.features)), math.exp(point.label))
}
// Result: mse is 14781.575988339053 and mae is 76.41310991122032 and rmsle is 0.6405996100717035
```

The decision tree's performance actually drops slightly after this transformation.
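One motivation for modelling log targets in the first place: RMSLE on the original scale is exactly RMSE computed on log1p-transformed values, so a squared-error tree trained on log targets is optimizing something close to RMSLE. A quick plain-Python check of the identity (the prediction/label values below are made up):

```python
import math

preds   = [10.0, 100.0, 500.0]
actuals = [12.0, 90.0, 450.0]

# RMSLE computed directly on the original scale
rmsle = math.sqrt(sum((math.log(p + 1) - math.log(a + 1)) ** 2
                      for p, a in zip(preds, actuals)) / len(preds))

# Plain RMSE computed on log1p-transformed values
rmse_log = math.sqrt(sum((math.log1p(p) - math.log1p(a)) ** 2
                         for p, a in zip(preds, actuals)) / len(preds))

print(abs(rmsle - rmse_log) < 1e-12)  # True
```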
2. Tuning model parameters
1,構(gòu)建訓練集和測試集
```scala
val file_tree = sc.textFile(file_bike).map(_.split(",")).map { x =>
  val feature = x.slice(2, x.length - 3).map(_.toDouble)
  val label = x(x.length - 1).toDouble
  LabeledPoint(label, Vectors.dense(feature))
  // LabeledPoint(math.log(label), Vectors.dense(feature))
}
// 80/20 split with a fixed seed so the experiment is reproducible
val tree_orgin = file_tree.randomSplit(Array(0.8, 0.2), 11L)
val tree_train = tree_orgin(0)
val tree_test = tree_orgin(1)
```

2) Tuning the tree depth
```scala
val categoricalFeaturesInfo = Map[Int, Int]()

// Train one model per candidate depth and score it on the held-out test set
val Deep_Results = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
  val model = DecisionTree.trainRegressor(tree_train, categoricalFeaturesInfo, "variance", param, 32)
  val scoreAndLabels = tree_test.map { point =>
    (model.predict(point.features), point.label)
  }
  // note: computed as log(x) here, without the +1 used in section 2.2
  val rmsle = math.sqrt(scoreAndLabels.map(x => math.pow(math.log(x._1) - math.log(x._2), 2)).mean())
  (s"$param lambda", rmsle)
}
// Depth results
Deep_Results.foreach { case (param, rmsl) => println(f"$param, rmsle = ${rmsl}") }
/*
1 lambda, rmsle = 1.0763369409492645
2 lambda, rmsle = 0.9735820606349874
3 lambda, rmsle = 0.8786984993014815
4 lambda, rmsle = 0.8052113493915528
5 lambda, rmsle = 0.7014036913077335
10 lambda, rmsle = 0.44747906135994925
20 lambda, rmsle = 0.4769214752638845
*/
```

Deeper trees overfit; judging by these results, the optimal depth for this dataset is around 10.
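The best depth can also be read off programmatically rather than by eye; a trivial plain-Python sketch over the RMSLE values quoted above (rounded to four places):

```python
# Depth-vs-RMSLE pairs, rounded from the tuning run above
results = [(1, 1.0763), (2, 0.9736), (3, 0.8787), (4, 0.8052),
           (5, 0.7014), (10, 0.4475), (20, 0.4769)]
# Select the depth with the lowest test-set RMSLE
best_depth, best_rmsle = min(results, key=lambda r: r[1])
print(best_depth)  # 10
```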
3,調(diào)節(jié)劃分數(shù)
/*調(diào)節(jié)劃分數(shù)*/val ClassNum_Results = Seq(2, 4, 8, 16, 32, 64, 100).map { param =>val model = DecisionTree.trainRegressor(tree_train, categoricalFeaturesInfo,"variance",10,param)val scoreAndLabels = tree_test.map { point =>(model.predict(point.features), point.label)}val rmsle = math.sqrt(scoreAndLabels.map(x => math.pow(math.log(x._1) - math.log(x._2), 2)).mean)(s"$param lambda", rmsle)}/*劃分數(shù)的結(jié)果輸出*/ClassNum_Results.foreach { case (param, rmsl) => println(f"$param, rmsle = ${rmsl}")} /* 2 lambda, rmsle = 1.2995002615220668 4 lambda, rmsle = 0.7682777577495858 8 lambda, rmsle = 0.6615110909041817 16 lambda, rmsle = 0.4981237727958235 32 lambda, rmsle = 0.44747906135994925 64 lambda, rmsle = 0.4487531073836407 100 lambda, rmsle = 0.4487531073836407 */更多的劃分數(shù)會使模型變復雜,并且有助于提升特征維度較大的模型性能。劃分數(shù)到一定程度之后,對性能的提升幫助不大。實際上,由于過擬合的原因會導致測試集的性能變差。可見分類數(shù)應(yīng)在32左右。。
轉(zhuǎn)載于:https://www.cnblogs.com/ksWorld/p/6899594.html
總結(jié)
以上是生活随笔為你收集整理的SparkMLlib回归算法之决策树的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 【中文分词】隐马尔可夫模型HMM
- 下一篇: code第一部分:数组