Steps to Use GPUs in a Kubernetes Cluster and Install Kubeflow 1.0.RC
Kubeflow Use Cases

- You want to train TensorFlow models and publish them as application services through model-serving APIs in a Kubernetes environment (e.g. local, on-prem, cloud).
- You want to use Jupyter notebooks to debug code, with a multi-user notebook server.
- Your training jobs need CPU or GPU resources scheduled and orchestrated.
- You want to combine TensorFlow with other components to publish services.
Dependencies

- ksonnet 0.11.0 or later (can be downloaded directly from GitHub; scp the ks binary to /usr/local/bin)
- Kubernetes 1.8 or later (when using the CCE service directly, create a CCE cluster with several nodes and bind an EIP to one of them)
- kubectl tools
1. Install ksonnet

Check the ksonnet GitHub releases page for the latest ks version, then:

```bash
wget https://github.com/ksonnet/ksonnet/releases/download/v0.13.0/ks_0.13.0_linux_amd64.tar.gz
tar -vxf ks_0.13.0_linux_amd64.tar.gz
cd ks_0.13.0_linux_amd64
sudo cp ks /usr/local/bin
```
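After the copy, a quick sanity check that the binary is on your PATH (`ks version` prints the client version):

```bash
# print the installed ksonnet client version
ks version
```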
Install the GPU Driver
Install CUDA
```bash
sudo yum-config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
sudo yum clean all
sudo yum -y install nvidia-driver-latest-dkms cuda
sudo yum -y install cuda-drivers
```

If the gcc dependency is missing, run:
```bash
yum install kernel-devel kernel-doc kernel-headers gcc\* glibc\* glibc-\*
```

Install the NVIDIA driver:
```bash
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
yum install -y kmod-nvidia
```

Disable nouveau:
```bash
### add rdblacklist=nouveau to GRUB_CMDLINE_LINUX
echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf
```
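On CentOS 7 the blacklist usually also has to be rebuilt into the initramfs before the reboot takes effect; an extra step you may need (assuming the stock dracut tooling):

```bash
# regenerate the initramfs for the running kernel so the nouveau blacklist is honored at boot
dracut --force
```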
Reboot, then check whether nouveau has been disabled:

```bash
lsmod | grep nouv
```

No output means nouveau is disabled. Check the server's GPU info:
```
[root@master ~]# nvidia-smi
Tue Jan 14 03:46:41 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:18:00.0 Off |                    0 |
| N/A   29C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   25C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
Install nvidia-docker

Download the nvidia-docker.repo file:

```bash
curl -s -L https://nvidia.github.io/nvidia-docker/centos7/x86_64/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
```

- Check the available nvidia-docker versions
- Install nvidia-docker

The Docker version here is 18.09.7.ce, so install the matching nvidia-docker packages:
```bash
yum install -y nvidia-docker2
pkill -SIGHUP dockerd
nvidia-docker version   # shows the installed nvidia-docker version
```
Change the default Docker runtime to nvidia:

```
[root@ks-allinone ~]# cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "registry-mirrors": ["https://o96k4rm0.mirror.aliyuncs.com"]
}
```

Restart Docker and the kubelet:
```bash
systemctl daemon-reload
systemctl restart docker.service
systemctl restart kubelet
```
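Before wiring this into Kubernetes, it is worth a quick smoke test that containers can actually see the GPUs. The nvidia/cuda:10.2-base image is an assumption here; any CUDA base image reachable from your registry mirror works:

```bash
# run nvidia-smi inside a container via the nvidia runtime;
# it should print the same two Tesla T4s as on the host
docker run --runtime=nvidia --rm nvidia/cuda:10.2-base nvidia-smi
```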
Install gpushare-scheduler-extender:

```bash
cd /etc/kubernetes/
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.json
cd /tmp/
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
kubectl create -f gpushare-schd-extender.yaml
```

Install the device-plugin RBAC:
```bash
kubectl create -f device-plugin-rbac.yaml
```

```yaml
# rbac.yaml
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: gpushare-device-plugin
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - update
  - patch
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpushare-device-plugin
  namespace: kube-system
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: gpushare-device-plugin
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpushare-device-plugin
subjects:
- kind: ServiceAccount
  name: gpushare-device-plugin
  namespace: kube-system
```

Install the device-plugin DaemonSet:
```bash
kubectl create -f device-plugin-ds.yaml
```

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: gpushare-device-plugin-ds
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        component: gpushare-device-plugin
        app: gpushare
        name: gpushare-device-plugin-ds
    spec:
      serviceAccount: gpushare-device-plugin
      hostNetwork: true
      nodeSelector:
        gpushare: "true"
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-35eccab
        name: gpushare
        # Make this pod a Guaranteed pod that will never be evicted because of the node's resource consumption.
        command:
        - gpushare-device-plugin-v2
        - -logtostderr
        - --v=5
        #- --memory-unit=Mi
        resources:
          limits:
            memory: "300Mi"
            cpu: "1"
          requests:
            memory: "300Mi"
            cpu: "1"
        env:
        - name: KUBECONFIG
          value: /etc/kubernetes/kubelet.conf
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```

References:
https://github.com/AliyunContainerService/gpushare-scheduler-extender
https://github.com/AliyunContainerService/gpushare-device-plugin
Label the nodes that will share GPUs with gpushare:

```bash
kubectl label node mynode gpushare=true
```
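The DaemonSet above selects nodes with gpushare: "true", so once a node is labeled one plugin pod should appear on it; a quick check using the app: gpushare label from the manifest:

```bash
# one gpushare device-plugin pod per labeled GPU node
kubectl -n kube-system get pods -l app=gpushare -o wide
```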
Install the kubectl extension:

```bash
curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.12.1/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/bin/kubectl
cd /usr/bin/
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x /usr/bin/kubectl-inspect-gpushare
kubectl inspect gpushare   ## check cluster GPU usage
```
Install a Kubernetes load balancer, MetalLB (optional; the manifest below installs v0.7.3):

```bash
wget https://raw.githubusercontent.com/google/metallb/v0.7.3/manifests/metallb.yaml
kubectl apply -f metallb.yaml
```

metallb-config.yaml:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.18.5.30-10.18.5.50
```

```bash
kubectl apply -f metallb-config.yaml
```
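A quick check that MetalLB came up and picked up the address pool (metallb-system is the namespace used by the manifests above):

```bash
# controller and speaker pods should be Running
kubectl -n metallb-system get pods
kubectl -n metallb-system get configmap config -o yaml
```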
Test TensorFlow

```bash
kubectl apply -f tensorflow.yaml
```

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tensorflow-gpu
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: tensorflow-gpu
    spec:
      containers:
      - name: tensorflow-gpu
        image: tensorflow/tensorflow:1.15.0-py3-jupyter
        imagePullPolicy: Never
        resources:
          limits:
            aliyun.com/gpu-mem: 1024
        ports:
        - containerPort: 8888
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-gpu
spec:
  ports:
  - port: 8888
    targetPort: 8888
    nodePort: 30888
    name: jupyter
  selector:
    name: tensorflow-gpu
  type: NodePort
```

Check the cluster's GPU usage:
```
[root@master ~]# kubectl inspect gpushare
NAME    IPADDRESS   GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU Memory(MiB)
master  10.18.5.20  1024/15109             0/15109                1024/30218
node    10.18.5.21  0/15109                0/15109                0/30218
------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
1024/60436 (1%)
[root@master ~]#
```

You can exercise the GPUs by dynamically scaling the number of tensorflow-gpu replicas and by changing the GPU memory size requested by a single replica:
```bash
kubectl scale --current-replicas=1 --replicas=100 deployment/tensorflow-gpu
```
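While the scale-out runs, you can watch how many replicas actually land; a simple sketch:

```bash
# count tensorflow-gpu pods that reached Running, then re-check GPU allocation
kubectl get pods | grep tensorflow-gpu | grep -c Running
kubectl inspect gpushare
```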
Testing produced the following results.

Environment

| Node   | GPUs | GPU memory           |
| ------ | ---- | -------------------- |
| master | 2    | 15109M * 2 = 30218M  |
| node   | 2    | 15109M * 2 = 30218M  |
Test results

| Per-pod gpu-mem | Pods scheduled | GPU memory utilization |
| --------------- | -------------- | ---------------------- |
| 256M            | 183            | 77%                    |
| 512M            | 116            | 98%                    |
| 1024M           | 56             | 94%                    |
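The utilization column lines up with pods × per-pod gpu-mem over the 60436 MiB cluster total from the inspect output above; checking the arithmetic:

```bash
echo $((183 * 256))    # 46848 MiB / 60436 MiB ≈ 77%
echo $((116 * 512))    # 59392 MiB / 60436 MiB ≈ 98%
echo $((56 * 1024))    # 57344 MiB / 60436 MiB ≈ 95% (reported as 94%)
```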
Install Kubeflow (v1.0.RC)

Install ks:

```bash
tar -vxf ks_0.12.0_linux_amd64.tar.gz
cp ks_0.12.0_linux_amd64/* /usr/local/bin/
```

Install Kubeflow

Download the installation package, kfctl_v1.0-rc.3-1-g24b60e8_linux.tar.gz, then:
```bash
tar -zxvf kfctl_v1.0-rc.3-1-g24b60e8_linux.tar.gz
cp kfctl /usr/bin/
```
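A quick check that kfctl is usable (it prints its build tag):

```bash
kfctl version
```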
Preparation: create PVs and PVCs, using NFS as the file store.

Create a StorageClass:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  namespace: kubeflow
#provisioner: example.com/nfs
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
```

```bash
kubectl create -f storage.yml
yum install nfs-utils rpcbind
# create the NFS export directory (at least four are needed)
mkdir -p /data/nfs
vim /etc/exports
# add the export directory above:
#   /data/nfs 192.168.122.0/24(rw,sync)
systemctl restart nfs-server.service
```

Create the PVs. Because files mounted by different pods may have the same names, it is best to create several PVs for pods to choose among (at least four, mounted by katib-mysql, metadata-mysql, minio, and mysql respectively).
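Before creating the PVs, it's worth confirming the export is actually visible (showmount ships with nfs-utils):

```bash
# list the directories exported by this NFS server
showmount -e localhost
```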
```
[root@master pv]# cat mysql-pv.yml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-path   # change for each PV
spec:
  capacity:
    storage: 200Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: local-path
  nfs:
    path: /data/nfs   # change for each PV
    server: 10.18.5.20
```
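A hedged sketch for stamping out the four PVs from this template; the names local-path-1..4, files pv-1.yml..pv-4.yml, and directories /data/nfs1..4 are illustrative, and each directory also needs its own entry in /etc/exports:

```bash
for i in 1 2 3 4; do
  mkdir -p /data/nfs${i}
  # rename the PV and point it at its own NFS path
  sed -e "s|name: local-path|name: local-path-${i}|" \
      -e "s|path: /data/nfs|path: /data/nfs${i}|" mysql-pv.yml > pv-${i}.yml
  kubectl create -f pv-${i}.yml
done
```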
Create the kubeflow-anonymous namespace:

```bash
kubectl create namespace kubeflow-anonymous
```

Download the Kubeflow 1.0.RC YAML file from https://github.com/kubeflow/manifests/blob/v1.0-branch/kfdef/kfctl_k8s_istio.yaml
```
[root@master 2020-0219]# cat kfctl_k8s_istio.yaml
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  clusterName: kubernetes
  creationTimestamp: null
  name: 2020-0219
  namespace: kubeflow
spec:
  applications:
  - kustomizeConfig:
      parameters:
      - name: namespace
        value: istio-system
      repoRef:
        name: manifests
        path: istio/istio-crds
    name: istio-crds
  - kustomizeConfig:
      parameters:
      - name: namespace
        value: istio-system
      repoRef:
        name: manifests
        path: istio/istio-install
    name: istio-install
  - kustomizeConfig:
      parameters:
      - name: namespace
        value: istio-system
      repoRef:
        name: manifests
        path: istio/cluster-local-gateway
    name: cluster-local-gateway
  - kustomizeConfig:
      parameters:
      - name: clusterRbacConfig
        value: "OFF"
      repoRef:
        name: manifests
        path: istio/istio
    name: istio
  - kustomizeConfig:
      parameters:
      - name: namespace
        value: istio-system
      repoRef:
        name: manifests
        path: istio/add-anonymous-user-filter
    name: add-anonymous-user-filter
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: application/application-crds
    name: application-crds
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: application/application
    name: application
  - kustomizeConfig:
      parameters:
      - name: namespace
        value: cert-manager
      repoRef:
        name: manifests
        path: cert-manager/cert-manager-crds
    name: cert-manager-crds
  - kustomizeConfig:
      parameters:
      - name: namespace
        value: kube-system
      repoRef:
        name: manifests
        path: cert-manager/cert-manager-kube-system-resources
    name: cert-manager-kube-system-resources
  - kustomizeConfig:
      overlays:
      - self-signed
      - application
      parameters:
      - name: namespace
        value: cert-manager
      repoRef:
        name: manifests
        path: cert-manager/cert-manager
    name: cert-manager
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: metacontroller
    name: metacontroller
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: argo
    name: argo
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: kubeflow-roles
    name: kubeflow-roles
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: common/centraldashboard
    name: centraldashboard
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: admission-webhook/bootstrap
    name: bootstrap
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: admission-webhook/webhook
    name: webhook
  - kustomizeConfig:
      overlays:
      - istio
      - application
      parameters:
      - name: userid-header
        value: kubeflow-userid
      repoRef:
        name: manifests
        path: jupyter/jupyter-web-app
    name: jupyter-web-app
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: spark/spark-operator
    name: spark-operator
  - kustomizeConfig:
      overlays:
      - istio
      - application
      - db
      repoRef:
        name: manifests
        path: metadata
    name: metadata
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: jupyter/notebook-controller
    name: notebook-controller
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pytorch-job/pytorch-job-crds
    name: pytorch-job-crds
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pytorch-job/pytorch-operator
    name: pytorch-operator
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: usageId
        value: <randomly-generated-id>
      - name: reportUsage
        value: "true"
      repoRef:
        name: manifests
        path: common/spartakus
    name: spartakus
  - kustomizeConfig:
      overlays:
      - istio
      repoRef:
        name: manifests
        path: tensorboard
    name: tensorboard
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: tf-training/tf-job-crds
    name: tf-job-crds
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: tf-training/tf-job-operator
    name: tf-job-operator
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: katib/katib-crds
    name: katib-crds
  - kustomizeConfig:
      overlays:
      - application
      - istio
      repoRef:
        name: manifests
        path: katib/katib-controller
    name: katib-controller
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/api-service
    name: api-service
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: minioPvcName
        value: minio-pv-claim
      repoRef:
        name: manifests
        path: pipeline/minio
    name: minio
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: mysqlPvcName
        value: mysql-pv-claim
      repoRef:
        name: manifests
        path: pipeline/mysql
    name: mysql
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/persistent-agent
    name: persistent-agent
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/pipelines-runner
    name: pipelines-runner
  - kustomizeConfig:
      overlays:
      - istio
      - application
      repoRef:
        name: manifests
        path: pipeline/pipelines-ui
    name: pipelines-ui
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/pipelines-viewer
    name: pipelines-viewer
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/scheduledworkflow
    name: scheduledworkflow
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: pipeline/pipeline-visualization-service
    name: pipeline-visualization-service
  - kustomizeConfig:
      overlays:
      - application
      - istio
      parameters:
      - name: admin
        value: johnDoe@acme.com
      repoRef:
        name: manifests
        path: profiles
    name: profiles
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: seldon/seldon-core-operator
    name: seldon-core-operator
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: namespace
        value: knative-serving
      repoRef:
        name: manifests
        path: knative/knative-serving-crds
    name: knative-crds
  - kustomizeConfig:
      overlays:
      - application
      parameters:
      - name: namespace
        value: knative-serving
      repoRef:
        name: manifests
        path: knative/knative-serving-install
    name: knative-install
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: kfserving/kfserving-crds
    name: kfserving-crds
  - kustomizeConfig:
      overlays:
      - application
      repoRef:
        name: manifests
        path: kfserving/kfserving-install
    name: kfserving-install
  repos:
  - name: manifests
    uri: https://github.com/kubeflow/manifests/archive/master.tar.gz
    version: master
status:
  reposCache:
  - localPath: '"../.cache/manifests/manifests-master"'
    name: manifests
[root@master 2020-0219]#
```

From inside your kubeflowapp directory, run:

```bash
kfctl apply -V -f kfctl_k8s_istio.yaml
```

The installer downloads configuration files from GitHub and may fail partway; retry on failure. A kustomize folder is generated in the directory alongside kubeflowapp. To keep image pulls from failing after a restart, change every image pull policy in it to IfNotPresent, for example as sketched below,
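One hedged way to flip the pull policies across the generated kustomize tree (take a backup or review a diff first; manifests that omit imagePullPolicy are untouched):

```bash
grep -rl "imagePullPolicy: Always" kustomize/ \
  | xargs sed -i "s/imagePullPolicy: Always/imagePullPolicy: IfNotPresent/g"
```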
then run the apply again:

```bash
kfctl apply -V -f kfctl_k8s_istio.yaml
```
Check the running status:

```bash
kubectl get all -n kubeflow
```

Access the Kubeflow UI through the Istio ingress
```bash
# change the ingress gateway's service type to LoadBalancer
kubectl -n istio-system edit svc istio-ingressgateway
```

```yaml
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
    release: istio
  sessionAffinity: None
  type: LoadBalancer   # change this line to LoadBalancer
```

Save, then inspect the service again:
```
[root@master 2020-0219]# kubectl -n istio-system get svc istio-ingressgateway
NAME                   TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                                                                                                                                      AGE
istio-ingressgateway   LoadBalancer   10.98.19.247   10.18.5.30    15020:32230/TCP,80:31380/TCP,443:31390/TCP,31400:31400/TCP,15029:31908/TCP,15030:31864/TCP,15031:31315/TCP,15032:30372/TCP,15443:32631/TCP   42h
[root@master 2020-0219]#
```

EXTERNAL-IP is the external access address; open http://10.18.5.30 to reach the Kubeflow home page.
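If no LoadBalancer address is assigned, the PORT(S) column above shows port 80 mapped to NodePort 31380, so the UI is also reachable directly through any node:

```bash
# NodePort fallback, using the 80:31380/TCP mapping shown above
curl -I http://10.18.5.20:31380
```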
About image pulls: gcr.io images cannot be pulled from inside China. You can pull them like this:

```bash
curl -s https://zhangguanzhang.github.io/bash/pull.sh | bash -s -- <image-name>
```

If that also fails, you can build the images manually with Alibaba Cloud's image-build service, using an overseas build machine.
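A hypothetical invocation of the mirror script above (the centraldashboard image tag is illustrative; substitute whatever image is actually failing to pull):

```bash
curl -s https://zhangguanzhang.github.io/bash/pull.sh | bash -s -- gcr.io/kubeflow-images-public/centraldashboard:v1.0.0
```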