pyspark pipeline
An introduction to ML Pipelines

A Pipeline chains a sequence of stages (Transformers and Estimators) into a single workflow: calling fit() on the raw text below runs the Tokenizer and HashingTF feature steps and then trains the LogisticRegression, returning one PipelineModel that can score new documents end to end.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Jun 7 16:49:03 2018

@author: luogan
"""

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("dataFrame") \
    .getOrCreate()

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

Output:

(4, spark i j k) --> prob=[0.15554371384424398,0.844456286155756], prediction=1.000000
(5, l m n) --> prob=[0.8307077352111738,0.16929226478882617], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.06962184061952888,0.9303781593804711], prediction=1.000000
(7, apache hadoop) --> prob=[0.9815183503510166,0.018481649648983405], prediction=0.000000
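Once fitted, the PipelineModel can be inspected and persisted. Below is a minimal sketch (not from the original article): the last fitted stage is the trained LogisticRegressionModel, and the whole pipeline can be saved to disk and reloaded with PipelineModel.load. The save path /tmp/spark-lr-pipeline is a placeholder of my choosing.

from pyspark.ml import PipelineModel

# The fitted stages mirror the Pipeline definition:
# [Tokenizer, HashingTF, LogisticRegressionModel].
lr_model = model.stages[-1]
print(lr_model.coefficients)  # learned weights over the hashed term features
print(lr_model.intercept)

# Persist the entire fitted pipeline (path is a placeholder).
model.write().overwrite().save("/tmp/spark-lr-pipeline")

# Reload it later and score new data without refitting.
same_model = PipelineModel.load("/tmp/spark-lr-pipeline")
same_model.transform(test).select("id", "prediction").show()

Saving the PipelineModel rather than just the classifier keeps the tokenization and hashing steps bundled with the model, so the exact same feature preparation is applied at scoring time.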
Summary

That is the complete example of building and applying a PySpark Pipeline; hopefully it helps with the problem you were looking to solve.