Data Binning
- Data Binning
- Importing the Libraries
- Getting the Data
- Binning the Data
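Importing the Libraries
    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame
Getting the Data
Generate 20 random exam scores. Because the upper bound of np.random.randint is exclusive, the values range from 25 to 99.
    score_list = np.random.randint(25, 100, size=20)
    score_list
    array([66, 40, 32, 55, 81, 91, 49, 57, 36, 96, 83, 55, 98, 38, 36, 82, 71,
           39, 73, 60])
Binning the Data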
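Treat scores of 0-59 as failing, 60-70 as OK, 70-80 as good, and 80-100 as excellent, and define a list of bin edges for these ranges:
    bins = [0, 59, 70, 80, 100]
Call pd.cut. The first argument is the data to bin and the second is the list of bin edges. Each resulting interval is open on the left and closed on the right, so a score of exactly 70 falls in (59, 70] and a score of exactly 80 falls in (70, 80].
    pd.cut(score_list, bins)
    [(59, 70], (0, 59], (0, 59], (0, 59], (80, 100], ..., (80, 100], (70, 80], (0, 59], (70, 80], (59, 70]]
    Length: 20
    Categories (4, interval[int64]): [(0, 59] < (59, 70] < (70, 80] < (80, 100]]
    score_cat = pd.cut(score_list, bins)
Count how many scores fall into each range. Note that pd.value_counts is deprecated in newer pandas releases, where pd.Series(score_cat).value_counts() is the suggested replacement.
    pd.value_counts(score_cat)
    (0, 59]      10
    (80, 100]     6
    (70, 80]      2
    (59, 70]      2
    dtype: int64
Create an empty DataFrame:
    df = DataFrame()
Assign score_list to the 'score' column:
    df['score'] = score_list
    df
| score |
| --- |
| 66 |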
| 40 |
| 32 |
| 55 |
| 81 |
| 91 |
| 49 |
| 57 |
| 36 |
| 96 |
| 83 |
| 55 |
| 98 |
| 38 |
| 36 |
| 82 |
| 71 |
| 39 |
| 73 |
| 60 |
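The cut-and-count steps above depend on random data, so the numbers change on every run. As a minimal sketch, the scores printed in the array output earlier can be hard-coded to make the result reproducible. The variable names `scores` and `counts` are illustrative, and `sort=False` is used here so that the counts are listed in bin order rather than by frequency.

```python
import pandas as pd

# The 20 scores shown in the array output above, hard-coded for reproducibility.
scores = [66, 40, 32, 55, 81, 91, 49, 57, 36, 96,
          83, 55, 98, 38, 36, 82, 71, 39, 73, 60]
bins = [0, 59, 70, 80, 100]

# pd.cut on a plain list returns a Categorical of right-closed intervals.
score_cat = pd.cut(scores, bins)

# Wrap the Categorical in a Series and count per bin, keeping bin order.
counts = pd.Series(score_cat).value_counts(sort=False)
print(counts)
```

The totals match the frequency-sorted output shown earlier: 10 scores in the lowest bin, 2 each in the two middle bins, and 6 in the highest.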
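Generate 20 random three-character strings to serve as student names. Older pandas versions shipped pd.util.testing.rands(3) for this purpose; that module has since been removed, so the standard library is used here instead:
    import random
    import string
    df['student'] = [''.join(random.choices(string.ascii_letters + string.digits, k=3)) for _ in range(20)]
    df
| score | student |
| --- | --- |
| 66 | mDG |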
| 40 | pCe |
| 32 | Iuv |
| 55 | sWt |
| 81 | eaR |
| 91 | 5Gw |
| 49 | 8Xc |
| 57 | Tu2 |
| 36 | IRS |
| 96 | VQU |
| 83 | lsc |
| 55 | nek |
| 98 | cFQ |
| 38 | ZeB |
| 36 | Lfi |
| 82 | jYv |
| 71 | x6q |
| 39 | t9I |
| 73 | CJg |
| 60 | hF4 |
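As an alternative to filling an empty DataFrame column by column, both columns can be passed to the constructor in one call. The sketch below is only one way to do it: the helper `random_name` is hypothetical, the seed `0` is arbitrary, and the scores are the ones printed earlier, so the generated names will differ from those in the table above.

```python
import random
import string

import pandas as pd

# Scores copied from the output shown earlier in the article.
scores = [66, 40, 32, 55, 81, 91, 49, 57, 36, 96,
          83, 55, 98, 38, 36, 82, 71, 39, 73, 60]

# A seeded generator makes the random names repeatable (seed chosen arbitrarily).
rng = random.Random(0)


def random_name(length=3):
    """Hypothetical helper: a random string of letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choices(alphabet, k=length))


# Build both columns at once instead of starting from an empty DataFrame.
df = pd.DataFrame({
    "score": scores,
    "student": [random_name() for _ in scores],
})
print(df.head())
```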
Apply cut to the 'score' column of the DataFrame:
    pd.cut(df['score'], bins)
    0      (59, 70]
    1       (0, 59]
    2       (0, 59]
    3       (0, 59]
    4     (80, 100]
    5     (80, 100]
    6       (0, 59]
    7       (0, 59]
    8       (0, 59]
    9     (80, 100]
    10    (80, 100]
    11      (0, 59]
    12    (80, 100]
    13      (0, 59]
    14      (0, 59]
    15    (80, 100]
    16     (70, 80]
    17      (0, 59]
    18     (70, 80]
    19     (59, 70]
    Name: score, dtype: category
    Categories (4, interval[int64]): [(0, 59] < (59, 70] < (70, 80] < (80, 100]]
Assign the result of cut to a new 'Categories' column of the DataFrame:
    df['Categories'] = pd.cut(df['score'], bins)
    df
| score | student | Categories |
| --- | --- | --- |
| 66 | mDG | (59, 70] |
| 40 | pCe | (0, 59] |
| 32 | Iuv | (0, 59] |
| 55 | sWt | (0, 59] |
| 81 | eaR | (80, 100] |
| 91 | 5Gw | (80, 100] |
| 49 | 8Xc | (0, 59] |
| 57 | Tu2 | (0, 59] |
| 36 | IRS | (0, 59] |
| 96 | VQU | (80, 100] |
| 83 | lsc | (80, 100] |
| 55 | nek | (0, 59] |
| 98 | cFQ | (80, 100] |
| 38 | ZeB | (0, 59] |
| 36 | Lfi | (0, 59] |
| 82 | jYv | (80, 100] |
| 71 | x6q | (70, 80] |
| 39 | t9I | (0, 59] |
| 73 | CJg | (70, 80] |
| 60 | hF4 | (59, 70] |
Interval notation is hard to read at a glance. Pass a list to the labels parameter of cut, one entry per bin, to give each category a human-readable name:
    df['Categories'] = pd.cut(df['score'], bins, labels=['Low', 'OK', 'Good', 'Great'])
    df
| score | student | Categories |
| --- | --- | --- |
| 66 | mDG | OK |
| 40 | pCe | Low |
| 32 | Iuv | Low |
| 55 | sWt | Low |
| 81 | eaR | Great |
| 91 | 5Gw | Great |
| 49 | 8Xc | Low |
| 57 | Tu2 | Low |
| 36 | IRS | Low |
| 96 | VQU | Great |
| 83 | lsc | Great |
| 55 | nek | Low |
| 98 | cFQ | Great |
| 38 | ZeB | Low |
| 36 | Lfi | Low |
| 82 | jYv | Great |
| 71 | x6q | Good |
| 39 | t9I | Low |
| 73 | CJg | Good |
| 60 | hF4 | OK |
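How scores that sit exactly on a bin edge are classified depends on which side of each interval is closed. The sketch below uses a few hand-picked edge scores (an assumption for illustration, not data from the article) to compare the default right-closed bins with two variants controlled by the `include_lowest` and `right` parameters of `pd.cut`. The alternative edge list `[0, 60, 70, 80, 101]` is likewise an illustrative choice.

```python
import pandas as pd

# Hand-picked scores that sit exactly on the bin edges (illustrative only).
edge_scores = [0, 59, 60, 70, 80, 100]
labels = ["Low", "OK", "Good", "Great"]

# Default: right-closed intervals (0, 59], (59, 70], (70, 80], (80, 100].
# A score of 0 lies outside (0, 59], so it becomes NaN.
default_cut = pd.cut(edge_scores, bins=[0, 59, 70, 80, 100], labels=labels)

# include_lowest=True also closes the first interval on the left, keeping 0.
with_lowest = pd.cut(edge_scores, bins=[0, 59, 70, 80, 100],
                     labels=labels, include_lowest=True)

# right=False gives left-closed intervals [0, 60), [60, 70), [70, 80), [80, 101),
# so 70 counts as "Good" and 80 as "Great".
left_closed = pd.cut(edge_scores, bins=[0, 60, 70, 80, 101],
                     labels=labels, right=False)

print(list(default_cut))
print(list(with_lowest))
print(list(left_closed))
```

Which variant is appropriate depends on the grading rule intended. With the left-closed edges, 60, 70, and 80 each open a new grade band, whereas the default right-closed bins place 70 in "OK" and 80 in "Good".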