Kaggle Graduate Admissions (Part 2)
For this dataset we first applied regression algorithms and then recast the problem as classification; this time we use clustering algorithms.

Clustering algorithms (unsupervised machine learning)
```python
import numpy as np
import pandas as pd

df = pd.read_csv("../input/Admission_Predict.csv", sep=",")
df = df.rename(columns={'Chance of Admit ': 'ChanceOfAdmit'})
serial = df["Serial No."]
df.drop(["Serial No."], axis=1, inplace=True)

# min-max normalization of every column to [0, 1]
df = (df - np.min(df)) / (np.max(df) - np.min(df))

y = df.ChanceOfAdmit
x = df.drop(["ChanceOfAdmit"], axis=1)
```

All features (x) are then collapsed into a single feature using principal component analysis (PCA).
PCA
```python
# reduce x to one component for visualization
from sklearn.decomposition import PCA

pca = PCA(n_components=1, whiten=True)  # whiten = normalize the component
pca.fit(x)
x_pca = pca.transform(x)
x_pca = x_pca.reshape(400,)

dictionary = {"x": x_pca, "y": y}
data = pd.DataFrame(dictionary)
print("data:")
print(data.head())
print("\ndf:")
print(df.head())
```

K-means
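Before trusting a one-component projection, it is worth checking how much variance that single component actually retains. A minimal sketch (not from the original post, and run on random stand-in data of roughly the same shape, since the Kaggle CSV may not be available):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 400 rows x 7 columns, roughly the shape of the admissions features
X = rng.random((400, 7))

pca = PCA(n_components=1, whiten=True)
pca.fit(X)
# fraction of total variance captured by the first principal component
ratio = pca.explained_variance_ratio_[0]
print(f"variance explained by the first component: {ratio:.2f}")
```

On the real, highly correlated admissions features this ratio is typically much higher than on random data.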
Using the elbow method, the optimal number of clusters for K-means is found to be 3.
```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df["Serial No."] = serial

# elbow method: plot within-cluster sum of squares (WCSS) against k
wcss = []
for k in range(1, 15):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 15), wcss)
plt.xlabel("k values")
plt.ylabel("WCSS")
plt.show()

# fit with the chosen k = 3
kmeans = KMeans(n_clusters=3)
clusters_knn = kmeans.fit_predict(x)
df["label_kmeans"] = clusters_knn

# clusters against the candidate index
plt.scatter(df[df.label_kmeans == 0]["Serial No."], df[df.label_kmeans == 0].ChanceOfAdmit, color="red")
plt.scatter(df[df.label_kmeans == 1]["Serial No."], df[df.label_kmeans == 1].ChanceOfAdmit, color="blue")
plt.scatter(df[df.label_kmeans == 2]["Serial No."], df[df.label_kmeans == 2].ChanceOfAdmit, color="green")
plt.title("K-means Clustering")
plt.xlabel("Candidates")
plt.ylabel("Chance of Admit")
plt.show()

# clusters against the single PCA feature
plt.scatter(data.x[df.label_kmeans == 0], data[df.label_kmeans == 0].y, color="red")
plt.scatter(data.x[df.label_kmeans == 1], data[df.label_kmeans == 1].y, color="blue")
plt.scatter(data.x[df.label_kmeans == 2], data[df.label_kmeans == 2].y, color="green")
plt.title("K-means Clustering")
plt.xlabel("X")
plt.ylabel("Chance of Admit")
plt.show()
```
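A self-contained version of the elbow loop (using synthetic blobs rather than the admissions CSV, which is not bundled here) shows why WCSS alone cannot pick k: it always falls as k grows, so we look for the point where the drop flattens out:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three well-separated synthetic clusters stand in for the real data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

wcss = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

# WCSS keeps decreasing with k; the "elbow" is where the drop
# flattens out (here, around k = 3)
print([round(w, 1) for w in wcss])
```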
Hierarchical clustering

The dendrogram method likewise gives 3 as the optimal number of clusters for hierarchical clustering.
```python
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering

df["Serial No."] = serial

# dendrogram to pick the number of clusters
merg = linkage(x, method="ward")
dendrogram(merg, leaf_rotation=90)
plt.xlabel("data points")
plt.ylabel("euclidean distance")
plt.show()

# ward linkage uses euclidean distance
hiyerartical_cluster = AgglomerativeClustering(n_clusters=3, linkage="ward")
clusters_hiyerartical = hiyerartical_cluster.fit_predict(x)
df["label_hiyerartical"] = clusters_hiyerartical

# clusters against the candidate index
plt.scatter(df[df.label_hiyerartical == 0]["Serial No."], df[df.label_hiyerartical == 0].ChanceOfAdmit, color="red")
plt.scatter(df[df.label_hiyerartical == 1]["Serial No."], df[df.label_hiyerartical == 1].ChanceOfAdmit, color="blue")
plt.scatter(df[df.label_hiyerartical == 2]["Serial No."], df[df.label_hiyerartical == 2].ChanceOfAdmit, color="green")
plt.title("Hierarchical Clustering")
plt.xlabel("Candidates")
plt.ylabel("Chance of Admit")
plt.show()

# clusters against the single PCA feature
plt.scatter(data[df.label_hiyerartical == 0].x, data.y[df.label_hiyerartical == 0], color="red")
plt.scatter(data[df.label_hiyerartical == 1].x, data.y[df.label_hiyerartical == 1], color="blue")
plt.scatter(data[df.label_hiyerartical == 2].x, data.y[df.label_hiyerartical == 2], color="green")
plt.title("Hierarchical Clustering")
plt.xlabel("X")
plt.ylabel("Chance of Admit")
plt.show()
```
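The post reads k = 3 off the dendrogram by eye; the same `ward` linkage can also be cut programmatically with scipy's `fcluster`, which was not used in the original. A hedged sketch on synthetic data:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# synthetic stand-in for the admissions features
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=1)

merg = linkage(X, method="ward")
# cut the tree so that at most 3 flat clusters remain
labels = fcluster(merg, t=3, criterion="maxclust")
print(sorted(set(labels)))
```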
Comparing the clustering algorithms

K-means clustering and hierarchical clustering give similar results on this data.
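That similarity can be quantified rather than judged from the plots: the adjusted Rand index (ARI) measures agreement between two labelings while ignoring how the cluster ids are permuted. A sketch on synthetic blobs (not the admissions data, which is not bundled here):

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

labels_km = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_hc = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# ARI is 1.0 for identical partitions and near 0.0 for unrelated ones
ari = adjusted_rand_score(labels_km, labels_hc)
print(f"ARI between K-means and hierarchical labels: {ari:.2f}")
```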
Summary

- Any dataset can be tackled with regression, classification, or even clustering algorithms; the key is how you prepare the data for each kind of algorithm.