
Maven dependencies, Spark SQL: pitfalls hit when running spark-xgboost 0.81 on Windows

Published: 2023/12/10


Running spark-xgboost on Windows raises a few problems. These notes record them.

Environment: Windows 7 + Spark 2.3.1 + XGBoost 0.81 + IDEA + Maven

I. Dependencies and code

Dataset download

UCI Machine Learning Repository: Iris Data Set (archive.ics.uci.edu)

pom dependencies

<dependency>
    <groupId>ml.dmlc</groupId>
    <artifactId>xgboost4j</artifactId>
    <version>0.81</version>
</dependency>
<dependency>
    <groupId>ml.dmlc</groupId>
    <artifactId>xgboost4j-spark</artifactId>
    <version>0.81</version>
</dependency>

Test code

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}

/**
 * author: wy
 * todo: iris classification with xgboost
 * Created by pc-admin on 2020-03-12 11:21
 */
object xgboostIrisDataTest {
  def main(args: Array[String]): Unit = {
    val ss = SparkSession.builder()
      .master("local[4]")
      .appName("xgboostIrisDataTest")
      .getOrCreate()

    val dataPath = "iris.data"
    val schema = new StructType(Array(
      StructField("sepal length", DoubleType, true),
      StructField("sepal width", DoubleType, true),
      StructField("petal length", DoubleType, true),
      StructField("petal width", DoubleType, true),
      StructField("class", StringType, true)))
    val rawInput = ss.read.schema(schema).csv(dataPath)

    // Convert the string column "class" into a numeric class index
    val stringIndexer = new StringIndexer()
      .setInputCol("class")
      .setOutputCol("classIndex")
      .fit(rawInput)
    // Apply the conversion and drop the original string column
    val labelTransformed = stringIndexer.transform(rawInput).drop("class")

    // Merge the four measurement columns into a single "features" vector
    val vectorAssembler = new VectorAssembler()
      .setInputCols(Array("sepal length", "sepal width", "petal length", "petal width"))
      .setOutputCol("features")
    val xgbInput = vectorAssembler.transform(labelTransformed).select("features", "classIndex")

    // Split the data into a training set and a test set
    val splitXgbInput = xgbInput.randomSplit(Array(0.9, 0.1))
    val trainXgbInput = splitXgbInput(0)
    val testXgbInput = splitXgbInput(1)

    // Careful: num_workers must be less than or equal to the thread count
    // in local[4], otherwise the program hangs.
    val xgbParam = Map(
      "eta" -> 0.1f,
      "max_depth" -> 2,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> 4)

    // Create the xgboost classifier and set the feature and label columns
    val xgbClassifier = new XGBoostClassifier(xgbParam)
      .setFeaturesCol("features")
      .setLabelCol("classIndex")

    // Train
    val xgbClassificationModel: XGBoostClassificationModel = xgbClassifier.fit(trainXgbInput)

    // Predict
    val result = xgbClassificationModel.transform(testXgbInput)

    // Show the predictions
    result.show(1000)
    ss.stop()
  }
}

II. Bugs encountered and how to fix them

1. java.io.FileNotFoundException: File /lib/xgboost4j.dll was not found inside JAR.

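Go to $MAVEN_HOME\conf\repository\ml\dmlc and find xgboost4j.

Find the version you are using (0.81 here) and open that folder.

Open the jar with WinRAR.

The file /lib/xgboost4j.dll is indeed missing.

Follow the link below and pick the version you are using.

criteo-forks/xgboost-jars (github.com)

Download the win64 jar for that version.

Once the download finishes, unpack it. The DLL is in the lib folder.

With xgboost4j-0.81.jar open in WinRAR, drag xgboost4j.dll from xgboost4j-0.81-criteo-20180821_2.11-win64.jar\lib straight into $MAVEN_HOME\conf\repository\ml\dmlc\xgboost4j\0.81\xgboost4j-0.81.jar\bin. Note that the exception looks for the file at /lib/xgboost4j.dll inside the jar, so if the error persists, check that the DLL ends up under the jar's lib folder.

Run the program again and the problem is gone.

If WinRAR reports that the file is in use and cannot be modified, close IDEA and try again.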
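The check for the missing DLL described above can also be scripted instead of opening the jar in WinRAR. The following is a minimal sketch, assuming the loader looks up the resource /lib/xgboost4j.dll as the jar entry lib/xgboost4j.dll. The object name JarEntryCheck and the command-line usage are made up for illustration; only the standard java.util.zip API is used.

```scala
import java.io.File
import java.util.zip.ZipFile

object JarEntryCheck {
  /** Returns true if the archive at `jarPath` contains an entry named `entry`. */
  def containsEntry(jarPath: String, entry: String): Boolean = {
    val zip = new ZipFile(new File(jarPath))
    try zip.getEntry(entry) != null // getEntry returns null when the entry is absent
    finally zip.close()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical usage: pass the path of the xgboost4j jar in your local repository.
    if (args.isEmpty) println("usage: JarEntryCheck <path-to-xgboost4j-jar>")
    else {
      val found = containsEntry(args(0), "lib/xgboost4j.dll")
      println(s"lib/xgboost4j.dll present: $found")
    }
  }
}
```

Running it before and after patching the jar shows whether the DLL actually landed where the loader expects it.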

2. XGBoostModel training failed

Exception in thread "main" ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:363)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:334)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:139)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:36)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:191)

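If you hit this bug too, congratulations, we are on the same path. It cost me a whole afternoon. These are the explanations I found online:

The runtime environment contains several Scala and Java versions.
The Spark and xgboost versions do not match: for example, xgboost 0.90 needs Spark 2.4 or later, and xgboost 0.81 needs Spark 2.3.1 or later.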
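To rule out the two explanations above, it can help to print the versions the program actually runs with rather than the ones you believe are installed. This is a small sketch using only the standard library; the object name RuntimeVersions is made up for illustration.

```scala
object RuntimeVersions {
  /** Version of the Scala library that is actually on the classpath. */
  def scalaVersion: String = scala.util.Properties.versionNumberString

  /** Version of the JVM that is running the program. */
  def javaVersion: String = System.getProperty("java.version")

  def main(args: Array[String]): Unit = {
    println(s"Scala: $scalaVersion")
    println(s"Java:  $javaVersion")
    // In the test program above, ss.version reports the Spark version as well.
  }
}
```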

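I checked both and neither was the cause. Out of frustration I rebooted the computer, and guess what? The problem miraculously disappeared...

Conclusion: reboot the computer first. If that fixes it, you have just saved yourself an afternoon...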

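3. The program hangs and makes no progress

This happens when the thread count you give the Spark master at initialization is smaller than num_workers. Remember:

master("local[m]")

"num_workers" -> n

m >= n must always hold.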
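The m >= n rule can be checked before training starts, so a bad setting fails fast instead of hanging. This is a minimal sketch under my own assumptions: the names WorkerCheck, localThreads and workersFit are hypothetical, only local master strings are parsed, and any other master is left unchecked.

```scala
object WorkerCheck {
  private val LocalN = """local\[(\d+)\]""".r

  /** Thread count implied by a local master string, or None for non-local masters. */
  def localThreads(master: String): Option[Int] = master match {
    case "local"    => Some(1)
    case "local[*]" => Some(Runtime.getRuntime.availableProcessors())
    case LocalN(n)  => Some(n.toInt)
    case _          => None
  }

  /** True when numWorkers fits into the local thread count (m >= n). */
  def workersFit(master: String, numWorkers: Int): Boolean =
    localThreads(master).forall(_ >= numWorkers)

  def main(args: Array[String]): Unit = {
    println(workersFit("local[4]", 4)) // true
    println(workersFit("local[2]", 4)) // false
  }
}
```

In the test program, a line such as require(WorkerCheck.workersFit("local[4]", 4)) placed before building xgbParam would catch the mismatch.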

III. Results

Original label: classIndex

Predicted label: prediction
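Comparing the two columns by eye works for a small test set, but a single accuracy number is easier to read. The following is a plain Scala sketch, not part of the original program: the object name Accuracy and the sample pairs are made up for illustration.

```scala
object Accuracy {
  /** Fraction of (label, prediction) pairs that agree; 0.0 for an empty input. */
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    if (pairs.isEmpty) 0.0
    else pairs.count { case (label, pred) => label == pred }.toDouble / pairs.size

  def main(args: Array[String]): Unit = {
    // Made-up pairs for illustration. In the test program they could be collected with
    // result.select("classIndex", "prediction").collect().map(r => (r.getDouble(0), r.getDouble(1)))
    val sample = Seq((0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0))
    println(accuracy(sample)) // 0.75
  }
}
```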

That really was not easy~~~

References:

Official XGBoost API documentation

XGBoost4J-Spark Tutorial (version 0.8+) (xgboost.readthedocs.io)

A developer abroad whose situation was similar to mine

https://github.com/dmlc/xgboost/issues/2780
