ML Programming Hacks That Every Data Engineer Should Know, Part 2
DS IN THE REAL WORLD
This post continues Part 1 of this series.
In the previous post, I presented some important programming takeaways to keep in mind when doing machine learning work, so that your implementations are faster and more effective. This post covers more of these hacks. Let us begin.
11. Manipulating Wide & Long DataFrames:
The most direct way to convert wide data to long is pandas.melt(), and the most direct way to convert long data back to wide is pandas.pivot_table(). Between them, these two functions cover most reshaping between the two layouts. Note that pivot_table() aggregates rows that share the same index/column combination (using the mean by default), so it also works when the long data contains repeated entries.
a. Wide to Long (Melt)
>>> import pandas as pd
>>> # create wide dataframe
>>> df_wide = pd.DataFrame(
...     {"student": ["Andy", "Bernie", "Cindy", "Deb"],
...      "school":  ["Z", "Y", "Z", "Y"],
...      "english": [66, 98, 61, 67],  # eng grades
...      "math":    [87, 48, 88, 47],  # math grades
...      "physics": [50, 30, 59, 54]   # physics grades
...     }
... )
>>> df_wide
   student school  english  math  physics
0     Andy      Z       66    87       50
1   Bernie      Y       98    48       30
2    Cindy      Z       61    88       59
3      Deb      Y       67    47       54
>>> df_wide.melt(id_vars=["student", "school"],
...              var_name="subject",  # rename
...              value_name="score")  # rename
    student school  subject  score
0      Andy      Z  english     66
1    Bernie      Y  english     98
2     Cindy      Z  english     61
3       Deb      Y  english     67
4      Andy      Z     math     87
5    Bernie      Y     math     48
6     Cindy      Z     math     88
7       Deb      Y     math     47
8      Andy      Z  physics     50
9    Bernie      Y  physics     30
10    Cindy      Z  physics     59
11      Deb      Y  physics     54
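The melt call above turns every non-id column into rows. As a small sketch (the names df_subset and the choice of columns are only illustrative), value_vars can restrict the melt to some of the columns:

```python
import pandas as pd

# Same toy data as the wide example above.
df_wide = pd.DataFrame({
    "student": ["Andy", "Bernie", "Cindy", "Deb"],
    "school": ["Z", "Y", "Z", "Y"],
    "english": [66, 98, 61, 67],
    "math": [87, 48, 88, 47],
    "physics": [50, 30, 59, 54],
})

# value_vars limits the melt to the listed columns;
# "school" and "physics" are left out of the result entirely.
df_subset = df_wide.melt(id_vars=["student"],
                         value_vars=["english", "math"],
                         var_name="subject",
                         value_name="score")
print(df_subset.shape)
```

Columns that are neither in id_vars nor in value_vars are simply dropped, which is handy when only a few measurements are needed in long form.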
b. Long to Wide (Pivot Table)
>>> import pandas as pd
>>> # create long dataframe
>>> df_long = pd.DataFrame({
...     "student":
...         ["Andy", "Bernie", "Cindy", "Deb",
...          "Andy", "Bernie", "Cindy", "Deb",
...          "Andy", "Bernie", "Cindy", "Deb"],
...     "school":
...         ["Z", "Y", "Z", "Y",
...          "Z", "Y", "Z", "Y",
...          "Z", "Y", "Z", "Y"],
...     "class":
...         ["english", "english", "english", "english",
...          "math", "math", "math", "math",
...          "physics", "physics", "physics", "physics"],
...     "grade":
...         [66, 98, 61, 67,
...          87, 48, 88, 47,
...          50, 30, 59, 54]
... })
>>> df_long
   student school    class  grade
0     Andy      Z  english     66
1   Bernie      Y  english     98
2    Cindy      Z  english     61
3      Deb      Y  english     67
4     Andy      Z     math     87
5   Bernie      Y     math     48
6    Cindy      Z     math     88
7      Deb      Y     math     47
8     Andy      Z  physics     50
9   Bernie      Y  physics     30
10   Cindy      Z  physics     59
11     Deb      Y  physics     54
>>> df_long.pivot_table(index=["student", "school"],
...                     columns='class',
...                     values='grade')
class           english  math  physics
student school
Andy    Z            66    87       50
Bernie  Y            98    48       30
Cindy   Z            61    88       59
Deb     Y            67    47       54
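To see that the two functions really are inverses of each other, here is a minimal round-trip sketch. The use of aggfunc="first" and the helper names df_back and round_trip_ok are choices made for this illustration, not part of the original example:

```python
import pandas as pd

df_wide = pd.DataFrame({
    "student": ["Andy", "Bernie", "Cindy", "Deb"],
    "school": ["Z", "Y", "Z", "Y"],
    "english": [66, 98, 61, 67],
    "math": [87, 48, 88, 47],
    "physics": [50, 30, 59, 54],
})

# Wide -> long.
df_long = df_wide.melt(id_vars=["student", "school"],
                       var_name="subject", value_name="score")

# Long -> wide. aggfunc="first" keeps the single value in each cell as-is;
# the default (mean) would average duplicate rows instead.
df_back = (
    df_long.pivot_table(index=["student", "school"],
                        columns="subject", values="score",
                        aggfunc="first")
    .reset_index()              # turn the index levels back into columns
    .rename_axis(columns=None)  # drop the leftover "subject" axis label
)

# Compare the values column by column with the original wide frame.
round_trip_ok = (df_back[df_wide.columns].to_numpy() == df_wide.to_numpy()).all()
print(round_trip_ok)
```

The reset_index() and rename_axis() calls are only there to get back a flat frame; without them the result keeps the two-level index shown in the output above.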
12. Cross Tabulation:
When you need to summarise data, cross tabulation is a convenient way to combine two or more factors and compute a frequency table of their values. It is implemented by the pandas.crosstab() function, whose normalize parameter returns proportions instead of raw counts.
>>> import numpy as np
>>> import pandas as pd
>>> p = np.array(["s1", "s1", "s1", "s1", "b1", "b1",
...               "b1", "b1", "s1", "s1", "s1"], dtype=object)
>>> q = np.array(["one", "one", "one", "two", "one", "one",
...               "one", "two", "two", "two", "one"], dtype=object)
>>> r = np.array(["x", "x", "y", "x", "x", "y",
...               "y", "x", "y", "y", "y"], dtype=object)
>>> pd.crosstab(p, [q, r], rownames=['p'], colnames=['q', 'r'])
q  one    two
r    x  y   x  y
p
b1   1  2   1  0
s1   2  2   1  2
>>> # get normalized output values
>>> pd.crosstab(p, [q, r], rownames=['p'], colnames=['q', 'r'], normalize=True)
q        one                 two
r          x         y         x         y
p
b1  0.090909  0.181818  0.090909  0.000000
s1  0.181818  0.181818  0.090909  0.181818
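normalize=True divides by the grand total. As a hedged sketch using the same p and q arrays (the variable names counts and row_share are made up for this example), crosstab can also append totals with margins=True or normalize each row separately with normalize="index":

```python
import numpy as np
import pandas as pd

p = np.array(["s1", "s1", "s1", "s1", "b1", "b1",
              "b1", "b1", "s1", "s1", "s1"], dtype=object)
q = np.array(["one", "one", "one", "two", "one", "one",
              "one", "two", "two", "two", "one"], dtype=object)

# Raw counts, with row and column totals added under the label "All".
counts = pd.crosstab(p, q, rownames=["p"], colnames=["q"], margins=True)

# Row-wise proportions: every row sums to 1.
row_share = pd.crosstab(p, q, rownames=["p"], colnames=["q"],
                        normalize="index")

print(counts)
print(row_share)
```

Row-wise shares are usually the easier ones to read when the groups in p have very different sizes.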
13. Jupyter Themes:
One of the handiest Python libraries for notebook users is jupyterthemes, which lets you change and control the look of the Jupyter Notebook interface that most ML practitioners work in. Many programmers prefer a dark mode, a light mode, or custom styling, and jupyterthemes makes all of these available in Jupyter notebooks.
# pip install
$ pip install jupyterthemes
# conda install
$ conda install -c conda-forge jupyterthemes
# list available themes
$ jt -l
Available Themes:
   chesterish
   grade3
   gruvboxd
   gruvboxl
   monokai
   oceans16
   onedork
   solarizedd
   solarizedl
# apply a theme
$ jt -t chesterish
# reset to the default theme (prefix the command with ! to run it from a notebook cell)
$ jt -r
You can find more about it on GitHub: https://github.com/dunovank/jupyter-themes
14. Convert Categorical to Dummy Variable:
Using the pandas.get_dummies() function, you can convert the categorical features of a DataFrame directly into dummy variables. Passing drop_first=True drops the first level of each feature, which is redundant because it is implied whenever all the remaining dummy columns are 0. On pandas 2.0 and later the dummy columns are boolean by default; pass dtype=int to get the 0/1 values shown below.
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
...                    'C': [1, 2, 3]})
>>> df
   A  B  C
0  a  b  1
1  b  a  2
2  a  c  3
>>> pd.get_dummies(df[['A','B']])
   A_a  A_b  B_a  B_b  B_c
0    1    0    0    1    0
1    0    1    1    0    0
2    1    0    0    0    1
>>> dummy = pd.get_dummies(df[['A','B']], drop_first=True)
>>> dummy
   A_b  B_b  B_c
0    0    1    0
1    1    0    0
2    0    0    1
>>> # concat dummy features to existing df
>>> df = pd.concat([df, dummy], axis=1)
>>> df
   A  B  C  A_b  B_b  B_c
0  a  b  1    0    1    0
1  b  a  2    1    0    0
2  a  c  3    0    0    1
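The example above encodes a slice of the frame and concatenates the result back. A possible shortcut, sketched here with an illustrative variable name (encoded), is to pass the whole frame together with the columns argument:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"],
                   "B": ["b", "a", "c"],
                   "C": [1, 2, 3]})

# columns= selects the features to encode and leaves the rest ("C") untouched,
# so no separate concat step is needed.
# dtype=int forces 0/1 output; recent pandas releases default to booleans.
encoded = pd.get_dummies(df, columns=["A", "B"], drop_first=True, dtype=int)
print(encoded)
```

Unlike the concat approach, this replaces the original A and B columns with their dummies rather than keeping both.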
15. Convert into Numeric:
When loading a dataset into pandas, a numeric column is sometimes read in as object dtype, and numeric operations cannot be performed on it. To convert such a column, use the pandas.to_numeric() function and assign the result back to the Series or DataFrame column.
>>> import pandas as pd
>>> s = pd.Series(['1.0', '2', -3, '12', 5])
>>> s
0    1.0
1      2
2     -3
3     12
4      5
dtype: object
>>> pd.to_numeric(s)
0     1.0
1     2.0
2    -3.0
3    12.0
4     5.0
dtype: float64
>>> pd.to_numeric(s, downcast='signed')
0     1
1     2
2    -3
3    12
4     5
dtype: int8
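Real columns often contain a few entries that cannot be parsed at all. A minimal sketch of how to_numeric can deal with them, using made-up sample values and the errors="coerce" option:

```python
import pandas as pd

# Hypothetical raw column with two unparseable entries.
raw = pd.Series(["1.0", "2", "n/a", "12", "unknown"])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising a ValueError.
clean = pd.to_numeric(raw, errors="coerce")
print(clean)
print(clean.isna().sum())
```

Counting the NaN values afterwards is a quick way to check how many entries were lost in the conversion.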
16. Stratified Sampling/Splitting:
When splitting a dataset, we sometimes need each split to be representative of the whole population. This matters most when the classes in the dataset are imbalanced. In sklearn.model_selection.train_test_split(), the stratify parameter can be set to the target class column so that every split keeps the same class proportions as the unsplit dataset.
from sklearn.model_selection import train_test_split

# X is the feature matrix and y the class labels, both defined beforehand
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25)
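The snippet above assumes X and y already exist. For a self-contained check of what stratify does, here is a small sketch on synthetic data; the 90/10 class split, the test_size of 0.2 and the random_state are arbitrary choices for the illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 samples of class 0 and 10 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# With stratify=y both splits keep the minority share of the full dataset.
print(y.mean(), y_train.mean(), y_test.mean())
```

Without stratify, a small test split can easily end up with very few or no minority samples, which makes evaluation metrics unreliable.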
17. Selecting Features By Type:
Most datasets contain both numerical and non-numerical columns. We often need to extract only the numerical or only the categorical columns in order to visualize them or apply custom manipulations. In pandas, the DataFrame.select_dtypes() function returns the subset of columns whose dtype matches (include) or does not match (exclude) the specified types.
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2] * 3,
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df
   a      b    c
0  1   True  1.0
1  2  False  2.0
2  1   True  1.0
3  2  False  2.0
4  1   True  1.0
5  2  False  2.0
>>> df.select_dtypes(include='bool')
       b
0   True
1  False
2   True
3  False
4   True
5  False
>>> df.select_dtypes(include=['float64'])
     c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0
>>> df.select_dtypes(exclude=['int64'])
       b    c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
5  False  2.0
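A common use of select_dtypes is splitting the column names into numeric and non-numeric groups before preprocessing. A brief sketch on a made-up frame (the column names and values are hypothetical), using the generic "number" selector:

```python
import pandas as pd

# Hypothetical mixed-type dataset.
df = pd.DataFrame({"age": [23, 35, 41],
                   "height": [1.70, 1.82, 1.65],
                   "city": ["Pune", "Oslo", "Lima"],
                   "member": [True, False, True]})

# "number" matches all integer and float dtypes, but not booleans.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

The two lists can then be handed to separate preprocessing steps, for example scaling for the numeric columns and encoding for the rest.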
18. RandomizedSearchCV:
RandomizedSearchCV is a class in sklearn.model_selection used to tune the hyperparameters of a learning algorithm. Instead of trying every combination, it samples a fixed number of settings (n_iter) from the lists or distributions you provide for each hyperparameter, cross-validates each sampled setting, and reports the best one according to the chosen scoring metric.
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from scipy.stats import uniform
>>> iris = load_iris()
>>> logistic = LogisticRegression(solver='saga', tol=1e-2,
...                               max_iter=300, random_state=12)
>>> distributions = dict(C=uniform(loc=0, scale=4),
...                      penalty=['l2', 'l1'])
>>> clf = RandomizedSearchCV(logistic, distributions, random_state=0)
>>> search = clf.fit(iris.data, iris.target)
>>> search.best_params_
{'C': 2..., 'penalty': 'l1'}
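The search budget and evaluation setup can be made explicit. The following is a minimal sketch, not the article's configuration: the values of n_iter, cv and scoring, the range for C, and the plain LogisticRegression estimator are all assumptions chosen for illustration.

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    # C is drawn uniformly from [0.01, 4.01]; starting above 0 avoids C == 0.
    param_distributions={"C": uniform(loc=0.01, scale=4)},
    n_iter=8,            # number of sampled settings (assumed budget)
    cv=5,                # 5-fold cross-validation per setting
    scoring="accuracy",  # metric used to rank the settings
    random_state=0,      # makes the sampling reproducible
)
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # one entry per sampled setting
```

The cv_results_ dictionary holds the scores for every sampled setting, which is useful for checking whether the best score is clearly ahead or just noise.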
19. Magic function %history:
The commands previously run in a notebook can be retrieved with the %history magic function. By default it lists everything executed in the current session, and it accepts options to select specific entries; run %history? in a Jupyter notebook to see them. In the example below, -n prints line numbers and 1-3 selects the range of input lines.
In [1]: import math
In [2]: math.sin(2)
Out[2]: 0.9092974268256817
In [3]: math.cos(2)
Out[3]: -0.4161468365471424
In [16]: %history -n 1-3
   1: import math
   2: math.sin(2)
   3: math.cos(2)
20. Underscore Shortcuts (_):
In the standard Python interpreter, the underscore variable _ holds the last result that was echoed, so print(_) prints it again. On its own that is of limited use, but IPython (and therefore Jupyter Notebook) extends the idea: _, __ and ___ hold the last, second-to-last and third-to-last outputs respectively. Commands that produced no output are skipped, so print(__) gives the second-to-last actual output.
In addition, an underscore followed by a line number, such as _2, returns the output of that input line (the same value as Out[2]).
In [1]: import math
In [2]: math.sin(2)
Out[2]: 0.9092974268256817
In [3]: math.cos(2)
Out[3]: -0.4161468365471424
In [4]: print(_)
-0.4161468365471424
In [5]: print(__)
0.9092974268256817
In [6]: _2
Out[6]: 0.9092974268256817
That's all for now. I will present more of these important hacks and functions that every data engineer should know in the next few parts.
Stay tuned.
Photo by Howie R on Unsplash
Thanks for reading. You can find my other Machine Learning related posts here.
I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can message me here or on LinkedIn.
Source: https://towardsdatascience.com/ml-programming-hacks-that-every-data-engineer-should-know-part-2-61c0df0f215c