Benign/Malignant Breast Cancer Tumor Prediction
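1. Python's standard-library module itertools provides very useful functions for working with iterable objects:
- itertools.count(): creates an infinite iterator; a loop over it only ends when interrupted with Ctrl+C.
- itertools.cycle(): repeats the sequence passed to it endlessly, so it does not stop on its own either.
- itertools.repeat('A', 10): repeats a single element endlessly by default; a second argument limits the number of repetitions (10 here).

An infinite sequence only iterates endlessly when it is consumed in a for loop. Creating the iterator does not generate the elements in advance, and holding infinitely many elements in memory would be impossible anyway. Although an infinite sequence can be iterated forever, a finite sequence is usually cut out of it with a condition, using a function such as takewhile(), which stops at the first element that fails the test:

    ns = itertools.takewhile(lambda x: x <= 10, natuals)

- itertools.chain(): joins a group of iterables end to end into one larger iterator.
- itertools.groupby(): picks out adjacent repeated elements in an iterator and groups them together.

For details, see: itertools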
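The functions listed above can be tried out with a short sketch like the following. It is only an illustration: the input strings and the use of itertools.islice() to bound the infinite iterators are choices made here, and natuals is assumed to be itertools.count(1), which the text does not state explicitly.

```python
import itertools

# count() is infinite, so islice() is used to take only the first five values.
first_five = list(itertools.islice(itertools.count(1), 5))

# cycle() repeats its input forever; islice() bounds it again.
cycled = list(itertools.islice(itertools.cycle("ABC"), 7))

# repeat() with a second argument stops by itself after that many items.
repeated = list(itertools.repeat("A", 3))

# takewhile() stops at the first element that fails the predicate.
natuals = itertools.count(1)  # assumed definition of natuals
ns = list(itertools.takewhile(lambda x: x <= 10, natuals))

# chain() concatenates several iterables lazily.
chained = list(itertools.chain("ABC", "XYZ"))

# groupby() groups only adjacent equal elements, so 'A' appears twice.
grouped = [(key, len(list(group)))
           for key, group in itertools.groupby("AAABBBCCAAA")]

print(first_five)  # [1, 2, 3, 4, 5]
print(cycled)      # ['A', 'B', 'C', 'A', 'B', 'C', 'A']
print(repeated)    # ['A', 'A', 'A']
print(ns)          # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(chained)     # ['A', 'B', 'C', 'X', 'Y', 'Z']
print(grouped)     # [('A', 3), ('B', 3), ('C', 2), ('A', 3)]
```

Note that groupby() does not sort its input: to group all equal elements regardless of position, the data would have to be sorted first.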
2. Using loc in pandas
    import itertools
    import pandas as pd

    # Build a two-level row index (Student, Course) and a two-level column index.
    index = list(itertools.product(['Ada', 'Quinn', 'Violet'], ['Comp', 'Math', 'Sci']))
    headr = list(itertools.product(['Exams', 'Labs'], ['I', 'II']))
    indx = pd.MultiIndex.from_tuples(index, names=['Student', 'Course'])
    cols = pd.MultiIndex.from_tuples(headr)
    data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
    df = pd.DataFrame(data, indx, cols)
    print(df, '\n', '........................................')

    # slice(None) selects everything on one index level.
    All = slice(None)
    print(All, '\n', '......................................')
    print(df.loc['Violet'], '\n', '..............................')
    print(df.loc[(All, 'Math'), All], '\n', '..............................')
    print(df.loc[(slice('Ada', 'Quinn'), 'Math'), All], '\n', '..............................')
    print(df.loc[(All, 'Math'), ('Exams')], '\n', '..............................')
    print(df.loc[(All, 'Math'), (All, 'II')], '\n', '..............................')

Output:
                   Exams     Labs
                       I  II    I  II
    Student Course
    Ada     Comp      70  71   72  73
            Math      71  73   75  74
            Sci       72  75   75  75
    Quinn   Comp      73  74   75  76
            Math      74  76   78  77
            Sci       75  78   78  78
    Violet  Comp      76  77   78  79
            Math      77  79   81  80
            Sci       78  81   81  81
     ........................................
    slice(None, None, None)
     ......................................
           Exams     Labs
               I  II    I  II
    Course
    Comp      76  77   78  79
    Math      77  79   81  80
    Sci       78  81   81  81
     ..............................
                   Exams     Labs
                       I  II    I  II
    Student Course
    Ada     Math      71  73   75  74
    Quinn   Math      74  76   78  77
    Violet  Math      77  79   81  80
     ..............................
                   Exams     Labs
                       I  II    I  II
    Student Course
    Ada     Math      71  73   75  74
    Quinn   Math      74  76   78  77
     ..............................
                     I  II
    Student Course
    Ada     Math    71  73
    Quinn   Math    74  76
    Violet  Math    77  79
     ..............................
                   Exams Labs
                      II   II
    Student Course
    Ada     Math      73   74
    Quinn   Math      76   77
    Violet  Math      79   80
     ..............................

3. Complete code:
    import pandas as pd  # import the pandas package under the alias pd

    # Plot the data distribution of the benign/malignant breast cancer tumor test set.
    # Note the path separator: Windows and Linux use different ones. Always write the
    # path as a raw string, r'path', so that backslashes are not treated as escapes.
    df_train = pd.read_csv(r'C:\Users\LiLong\Desktop\kaggle_lea\Breast-Cancer\breast_cancer_train.csv')
    df_test = pd.read_csv(r'C:\Users\LiLong\Desktop\kaggle_lea\Breast-Cancer\breast_cancer_test.csv')
    # print(df_train)

    # Select 'Clump Thickness' and 'Cell Size' as features and build the
    # negative and positive samples of the test set.
    df_test_negative = df_test.loc[df_test['Type'] == 0][['Clump Thickness', 'Cell Size']]
    df_test_positive = df_test.loc[df_test['Type'] == 1][['Clump Thickness', 'Cell Size']]
    # print(df_test_negative)

    import matplotlib.pyplot as plt
    plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
    plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
    plt.xlabel('Clump Thickness')
    plt.ylabel('Cell Size')
    plt.show()

    # A binary classifier with random parameters
    import numpy as np
    # Randomly sample the intercept and the coefficients of a line.
    intercept = np.random.random([1])
    coef = np.random.random([2])
    lx = np.arange(0, 12)
    ly = (-intercept - lx * coef[0]) / coef[1]
    # Draw the random line.
    plt.plot(lx, ly, c='yellow')
    # Draw the scatter plot of the samples.
    plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
    plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
    plt.xlabel('Clump Thickness')
    plt.ylabel('Cell Size')
    plt.show()

Output:

    [ 0.2154933 0.95448152]

    # Import the logistic regression classifier from sklearn.
    from sklearn.linear_model import LogisticRegression
    lr = LogisticRegression()
    # Learn the coefficients and intercept of the line from the first 10 training samples.
    lr.fit(df_train[['Clump Thickness', 'Cell Size']][:10], df_train['Type'][:10])
    print('Testing accuracy (10 training samples):', lr.score(df_test[['Clump Thickness', 'Cell Size']], df_test['Type']))

Output:

    Testing accuracy (10 training samples): 0.868571428571  # accuracy
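To make it clearer how a line given by intercept and coef acts as a binary classifier, here is a minimal sketch in plain Python. The six toy points and both parameter sets are hypothetical and are not taken from the breast cancer data; the decision rule (predict 1 when intercept + coef[0]*x1 + coef[1]*x2 is positive) is the usual one for a linear classifier and is assumed here.

```python
# Hypothetical toy points: (clump_thickness, cell_size, label), label 1 = malignant.
samples = [(1, 1, 0), (2, 1, 0), (3, 2, 0), (8, 7, 1), (5, 10, 1), (10, 9, 1)]


def predict(intercept, coef, x1, x2):
    """Return 1 if the point lies on the positive side of the line, else 0."""
    return 1 if intercept + coef[0] * x1 + coef[1] * x2 > 0 else 0


def accuracy(intercept, coef, data):
    """Fraction of points whose predicted label equals the true label."""
    hits = sum(predict(intercept, coef, x1, x2) == label for x1, x2, label in data)
    return hits / len(data)


# np.random.random() draws from [0, 1), so a random intercept and random
# coefficients are all non-negative: every point with positive features
# then falls on the positive side of the line.
random_like = accuracy(0.5, [0.2, 0.9], samples)

# A hand-picked line x1 + x2 - 10 = 0 separates the toy points.
hand_picked = accuracy(-10.0, [1.0, 1.0], samples)

print(random_like, hand_picked)  # 0.5 1.0
```

The random-style line labels every toy point as positive, so it is right only for the three malignant ones. This is the kind of poor starting point that fitting on labeled data is meant to improve.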
    # The binary classifier obtained from 10 training samples.
    # The intercept and coefficients are determined by LogisticRegression
    # from sklearn.linear_model.
    intercept = lr.intercept_  # get the intercept
    print(intercept)
    coef = lr.coef_[0, :]
    print(coef)
    ly = (-intercept - lx * coef[0]) / coef[1]  # from ax + by + c = 0  =>  y = -(a/b)x - c/b
    plt.plot(lx, ly, c='green')
    plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
    plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
    plt.xlabel('Clump Thickness')
    plt.ylabel('Cell Size')
    plt.show()

Output:

    [-1.51522787]
    [-0.10721332 0.48314152]

    # The binary classifier obtained from all training samples.
    lr2 = LogisticRegression()
    # The original listing omits the fit and score calls for lr2; they are
    # restored here to match the recorded output below.
    lr2.fit(df_train[['Clump Thickness', 'Cell Size']], df_train['Type'])
    print('Testing accuracy (all training samples):', lr2.score(df_test[['Clump Thickness', 'Cell Size']], df_test['Type']))

Output:

    Testing accuracy (all training samples): 0.937142857143

    intercept = lr2.intercept_
    coef = lr2.coef_[0, :]
    ly = (-intercept - lx * coef[0]) / coef[1]
    plt.plot(lx, ly, c='blue')
    plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
    plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
    plt.xlabel('Clump Thickness')
    plt.ylabel('Cell Size')
    plt.show()

The code here was run in Jupyter.
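As a worked example of the conversion from ax + by + c = 0 to y = -(a/b)x - c/b, the sketch below plugs in the intercept and coefficients recorded in the output of the 10-sample classifier. The variable names slope and y_intercept are introduced here for illustration only.

```python
# Values copied from the recorded output of the classifier trained on 10 samples.
intercept = -1.51522787
coef = [-0.10721332, 0.48314152]

# coef[0]*x + coef[1]*y + intercept = 0
#   =>  y = -(coef[0] / coef[1]) * x - intercept / coef[1]
slope = -coef[0] / coef[1]
y_intercept = -intercept / coef[1]

print(round(slope, 3), round(y_intercept, 3))  # 0.222 3.136

# Sanity check: a point computed from the slope form lies on the original line.
x = 5.0
y = slope * x + y_intercept
residual = coef[0] * x + coef[1] * y + intercept
```

Because coef[1] is positive, points above this line (larger Cell Size for the same Clump Thickness) fall on the positive side of the boundary.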
Source code: from the book 《Python機器學習及實踐》 (Python Machine Learning and Practice)