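Kubeflow Installation Guide
[Abstract] Kubeflow is a machine learning toolkit from Google that runs on Kubernetes. It defines resource types such as TFJob, so that distributed model training with a TFJob can be carried out in the same way as deploying an application.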
Components
http://pachyderm.io/
http://www.argoproj.io/
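Kubeflow use cases
- You want to train TensorFlow models and publish them as application services through a model-serving interface in a k8s environment (e.g. local, on-prem, cloud).
- You want to use Jupyter notebooks to debug code, with a multi-user notebook server.
- Your training jobs need their CPU or GPU resources scheduled and orchestrated.
- You want to combine TensorFlow with other components to publish services.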
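Dependencies
- ksonnet 0.11.0 or later (it can be downloaded directly from GitHub; scp the ks binary to /usr/local/bin)
- kubernetes 1.8 or later (this guide uses CCE service nodes directly: create a CCE cluster with several nodes and bind an EIP to one of the nodes)
- kubectl tools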
1. Install ksonnet
To install ksonnet, first check the releases page on GitHub for the latest ks version.
    wget https://github.com/ksonnet/ksonnet/releases/download/v0.13.0/ks_0.13.0_linux_amd64.tar.gz
    tar -vxf ks_0.13.0_linux_amd64.tar.gz
    cd ks_0.13.0_linux_amd64
    sudo cp ks /usr/local/bin
The installation is now complete.
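2. Install the kubectl tool
    wget https://cce-storage.obs.cn-north-1.myhwclouds.com/kubectl.zip
    yum install unzip
    unzip kubectl.zip
    cp kubectl /usr/local/bin/
    # Open the kubectl tool section on the cluster page
    # Download the config file shown there and copy its contents
    mkdir /root/.kube/
    touch /root/.kube/config
    vi /root/.kube/config
    # Paste the contents, then save with :wq!
    # This node is already bound to an EIP, so in-cluster access can be selected directly
    kubectl config use-context internal
After the installation succeeds, run kubectl version and check that the reported version meets the requirement above.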
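The version requirements listed under Dependencies (kubernetes 1.8 or later, ksonnet 0.11.0 or later) can be checked numerically rather than by eye. The following is a minimal sketch, assuming the version string has already been copied out of the tool's output; the helper names parse_version and meets_minimum are hypothetical and not part of kubectl or ks.

```python
import re


def parse_version(text):
    """Extract the first 'major.minor[.patch]' number found in a string."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def meets_minimum(found, minimum):
    """Return True when 'found' is at least 'minimum', compared numerically."""
    return parse_version(found) >= parse_version(minimum)


# Hypothetical version strings, used only for illustration.
print(meets_minimum("v1.9.2", "1.8"))
print(meets_minimum("0.13.0", "0.11.0"))
```

Comparing tuples of integers avoids the trap of string comparison, where "1.10" would sort before "1.8".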
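3. Install kubeflow
Configuration parameters
- KUBEFLOW_SRC: the directory where the downloaded files are stored
- KUBEFLOW_TAG: the tag of the code branch, e.g. master (there is currently also a v0.2 branch)
- KFAPP: the directory that holds the deployment, tfjob and other application definitions; the ksonnet app is saved under ${KFAPP}/ks_app
- KUBEFLOW_REPO: the path to the kubeflow repository. You can download it yourself and place it at a location of your choice, for example: curl -L -o /root/kubeflow/kubeflow.tar.gz https://github.com/kubeflow/kubeflow/archive/master.tar.gz; tar -xzvf kubeflow.tar.gz; KUBEFLOW_REPO=/root/kubeflow/kubeflow-master/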
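How these parameters relate to one another can be sketched in a few lines. The example below is only an illustration: the function name kubeflow_layout is hypothetical, the KFAPP value /root/kfapp is an assumed placeholder, and the URL and path patterns are taken from the commands used in this guide.

```python
import posixpath


def kubeflow_layout(kubeflow_repo, kubeflow_tag, kfapp):
    """Derive the URL and paths used during installation from the parameters."""
    download_url = (
        "https://raw.githubusercontent.com/kubeflow/kubeflow/"
        "%s/scripts/download.sh" % kubeflow_tag
    )
    return {
        # Script that fetches the kubeflow sources for the chosen tag.
        "download_url": download_url,
        # Deployment script inside the downloaded repository.
        "kfctl": posixpath.join(kubeflow_repo, "scripts", "kfctl.sh"),
        # Directory in which the ksonnet app is stored.
        "ks_app": posixpath.join(kfapp, "ks_app"),
    }


# /root/kfapp is an assumed example value for KFAPP.
layout = kubeflow_layout("/root/kubeflow/kubeflow-master/", "master", "/root/kfapp")
for key in sorted(layout):
    print(key, "=", layout[key])
```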
Installation steps
    mkdir ${KUBEFLOW_SRC}
    cd ${KUBEFLOW_SRC}
    export KUBEFLOW_TAG=<version>
    curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
    # To install manually instead, read through the download.sh script
    ${KUBEFLOW_REPO}/scripts/kfctl.sh init ${KFAPP} --platform none   # initialize; this creates the KFAPP directory
    cd ${KFAPP}
    ${KUBEFLOW_REPO}/scripts/kfctl.sh generate k8s   # generate the required components
    ${KUBEFLOW_REPO}/scripts/kfctl.sh apply k8s      # create the corresponding resources
4. View the workloads that are now running
- Argo: a workflow engine built on k8s https://argoproj.github.io/
- Ambassador: API Gateway https://www.getambassador.io/
- tf-operator: https://github.com/kubeflow/tf-operator/blob/master/developer_guide.md
The workflow UI and the Jupyter notebook are exposed externally and provide a visual, interactive interface.
5. Download the official kubeflow example: distributed CNN model training with TensorFlow
Run the script
    CNN_JOB_NAME=mycnnjob
    VERSION=v0.2-branch
    KS_APP=cnnjob
    KF_ENV=default
    ks init ${KS_APP}
    cd ${KS_APP}
    ks registry add kubeflow-git github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
    ks pkg install kubeflow-git/examples
    ks generate tf-job-simple ${CNN_JOB_NAME} --name=${CNN_JOB_NAME}
    ks apply ${KF_ENV} -c ${CNN_JOB_NAME}
View the TFJob resource that was defined
One ps (the parameter server) and one worker (the compute server) were started.
YAML definition of the tfjob
    apiVersion: kubeflow.org/v1alpha2
    kind: TFJob
    metadata:
      annotations:
        ksonnet.io/managed: '{"pristine":"H4sIAAAAAAAA/+yRTWvcTAzH78/H0Nle7zpPaGPwqSWUHtqlCc2hBCOPZe/U84ZGs8Fd9ruXcUi3L5+g0DkM0l8aifn/ToBBfyaO2jtoYE49jcY/bTxP1XGHJhywhgJm7QZo4P72ve+hAEuCAwpCcwKDPZmYozl650g22lfK2+AdOYEG7KKc++p7OBfg0NLP0rMSA6osDzRiMpIbYyCVZ8r4iYLRCu8CqXXLfr2FbDAolOOXXuWdoHbEEZovJ0CecgBhkYN3UICMnXKu68mpg0We4yYsUEBZ9ijq0EX9jdqrelWsH8i0TNGRXG9X6YissTfUpTCgUBuQ0ZIQd5H4SLw2jSbFQxdl8Ela4USr6pLtppBiu1tT4xWa7vJ+oKNW1KqQ1vLvKQp2o2eL0n549/AGHgvQFqds2KQ4u/1CrZKxvPyuVCE1x3q7e7Wrt3XZD9hf39yUg2ZZyvr1/zhewQ8iQi56zkOggCfPs3bTW83QQOWD/Dq4iop1kFj9YSicHwtgioIse2+0WqCBj+4WtUlMcD4XF6D3S8iL93cZ94PnmThD5OdqhGZX/KP8d1LO57/vAAAA//8BAAD//zMKTHZaBAAA"}'
      clusterName: ""
      creationTimestamp: 2018-09-26T15:33:07Z
      generation: 0
      labels:
        app.kubernetes.io/deploy-manager: ksonnet
        ksonnet.io/component: mycnnjob
      name: mycnnjob
      namespace: default
      resourceVersion: "2293964"
      selfLink: /apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/mycnnjob
      uid: 777da1bb-c1a1-11e8-8661-fa163e4006b8
    spec:
      cleanPodPolicy: Running
      tfReplicaSpecs:
        PS:
          replicas: 1
          restartPolicy: Never
          template:
            metadata:
              creationTimestamp: null
            spec:
              containers:
              - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=cpu
                - --data_format=NHWC
                image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
                name: tensorflow
                ports:
                - containerPort: 2222
                  name: tfjob-port
                resources: {}
                workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
              restartPolicy: OnFailure
        Worker:
          replicas: 1
          restartPolicy: Never
          template:
            metadata:
              creationTimestamp: null
            spec:
              containers:
              - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=cpu
                - --data_format=NHWC
                image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
                name: tensorflow
                ports:
                - containerPort: 2222
                  name: tfjob-port
                resources: {}
                workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
              restartPolicy: OnFailure
    status:
      completionTime: 2018-09-26T15:33:58Z
      conditions:
      - lastTransitionTime: 2018-09-26T15:33:07Z
        lastUpdateTime: 2018-09-26T15:33:07Z
        message: TFJob mycnnjob is created.
        reason: TFJobCreated
        status: "True"
        type: Created
      - lastTransitionTime: 2018-09-26T15:33:07Z
        lastUpdateTime: 2018-09-26T15:33:10Z
        message: TFJob mycnnjob is running.
        reason: TFJobRunning
        status: "False"
        type: Running
      - lastTransitionTime: 2018-09-26T15:33:07Z
        lastUpdateTime: 2018-09-26T15:33:58Z
        message: TFJob mycnnjob is failed.
        reason: TFJobFailed
        status: "True"
        type: Failed
      startTime: 2018-09-26T15:33:10Z
      tfReplicaStatuses:
        Chief: {}
        Master: {}
        PS: {}
        Worker: {}
6. Distributed TensorFlow programming
- For details, see https://www.tensorflow.org/deploy/distributed
- Example: element-wise matrix multiplication using workers and parameter servers https://github.com/tensorflow/k8s/tree/master/examples/tf_sample
    # Excerpt: cluster_spec, width, height and results are assumed to be defined earlier
    for job_name in cluster_spec.keys():
        for i in range(len(cluster_spec[job_name])):
            d = "/job:{0}/task:{1}".format(job_name, i)
            with tf.device(d):
                a = tf.constant(range(width * height), shape=[height, width])
                b = tf.constant(range(width * height), shape=[height, width])
                c = tf.multiply(a, b)
                results.append(c)
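The device strings that the loop above hands to tf.device can be inspected without TensorFlow. The following is a minimal sketch, assuming cluster_spec is a plain dictionary that maps each job name to its list of task addresses; the host names are hypothetical, and the jobs are visited in sorted order only to keep the output deterministic.

```python
def device_names(cluster_spec):
    """List one '/job:<name>/task:<index>' string per task in the cluster."""
    names = []
    for job_name in sorted(cluster_spec):
        for task_index in range(len(cluster_spec[job_name])):
            names.append("/job:{0}/task:{1}".format(job_name, task_index))
    return names


# Hypothetical cluster with one parameter server and two workers.
cluster_spec = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}
for name in device_names(cluster_spec):
    print(name)
```

Each string names exactly one task, so the same multiplication is placed once on every ps and worker task in the cluster.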
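- tf.train.ClusterSpec describes the jobs and tasks in the cluster; the master uses this information to schedule computation inside the cluster.
(1. Jobs of type ps define the parameter values to be computed, and the workers carry out the computation on those values. 2. Parameter updates and transfer: the ps sends the updated parameters to the workers, and the workers return the computed gradients to the ps.)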
    with tf.device("/job:ps/task:0"):
        weights_1 = tf.Variable(...)
        biases_1 = tf.Variable(...)

    with tf.device("/job:ps/task:1"):
        weights_2 = tf.Variable(...)
        biases_2 = tf.Variable(...)

    with tf.device("/job:worker/task:7"):
        input, labels = ...
        layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
        logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
        # ...
        train_op = ...

    with tf.Session("grpc://worker7.example.com:2222") as sess:
        for _ in range(10000):
            sess.run(train_op)
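The exchange between ps and worker described above (the ps sends parameters, the workers send gradients back) can be imitated in a single process. The following is a toy sketch in plain Python, not TensorFlow's implementation: the class ParameterServer, the functions worker_gradient and train, the synchronous averaging of gradients and the data shards are all assumptions made for illustration.

```python
class ParameterServer:
    """Holds a single model parameter and applies gradient updates to it."""

    def __init__(self, initial_weight, learning_rate):
        self.weight = initial_weight
        self.learning_rate = learning_rate

    def pull(self):
        # ps -> worker: hand out the current parameter value.
        return self.weight

    def push(self, gradients):
        # worker -> ps: average the workers' gradients and take one step.
        mean_gradient = sum(gradients) / len(gradients)
        self.weight -= self.learning_rate * mean_gradient


def worker_gradient(weight, samples):
    """Gradient of mean((weight * x - y) ** 2) over this worker's samples."""
    return sum(2.0 * x * (weight * x - y) for x, y in samples) / len(samples)


def train(shards, steps=200, learning_rate=0.05):
    ps = ParameterServer(0.0, learning_rate)
    for _ in range(steps):
        weight = ps.pull()
        gradients = [worker_gradient(weight, shard) for shard in shards]
        ps.push(gradients)
    return ps.weight


# Two workers, each with its own shard of samples drawn from y = 3 * x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
print(round(train(shards), 3))
```

The learned weight approaches 3, the slope of the data. In a real cluster the pull and push steps are network transfers between the ps and worker tasks.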