5 Concepts Every Data Scientist Should Know
Opinion
Introduction
I have written about common skills that Data Scientists can expect to use in their professional careers, so now I want to highlight some key Data Science concepts that are worth knowing and applying later. Some of these you may already know and some you may not; my goal is to explain, from a professional perspective, why these concepts are useful regardless of what you know now. Multicollinearity, one-hot encoding, undersampling and oversampling, error metrics, and, lastly, storytelling are the concepts I think of first when I picture a professional Data Scientist's day-to-day work. The last one is perhaps a combination of a skill and a concept, but I still wanted to highlight its importance in your everyday work life as a Data Scientist. I expand on each of these concepts below.
Multicollinearity
Photo by The Creative Exchange on Unsplash [2].
Although the word is somewhat long and hard to say, multicollinearity is simple once you break it down: "multi" means many, and "collinearity" means linearly related. Multicollinearity describes the situation in which two or more explanatory variables in a regression model are highly correlated with each other and therefore carry similar information. This can be a concern for a few reasons.
For some modeling techniques, it can cause overfitting and ultimately a decline in model performance.
The data becomes redundant, so not every feature or attribute is needed in your model. There are a few ways to find out which features contribute to multicollinearity and are therefore candidates for removal:
variance inflation factor (VIF)
correlation matrices
Both techniques are commonly used by Data Scientists. Correlation matrices and plots, usually visualized as a heatmap of some sort, are especially common, while VIF is less well known.
The higher a feature's VIF, the more of its variance is already explained by the other features, and the less useful it is for your regression model. A VIF of 1 means no correlation with the other features; values above roughly 5 or 10 are commonly treated as a warning sign.
A great, simple resource for VIF is [3].
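To make the two techniques concrete, here is a minimal sketch, assuming NumPy is available. The feature names (height_cm, height_in, age), the synthetic data, and the helper variance_inflation_factors are all made up for illustration. The helper relies on the identity that each feature's VIF, 1 / (1 - R²) from regressing it on the other features, equals the matching diagonal entry of the inverse correlation matrix.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    # VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
    # of the inverse of the feature correlation matrix.
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Hypothetical data: two features that are nearly copies of each other,
# plus one unrelated feature.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
height_in = height_cm / 2.54 + rng.normal(0, 0.5, size=200)
age = rng.normal(40, 12, size=200)

X = np.column_stack([height_cm, height_in, age])

corr = np.corrcoef(X, rowvar=False)   # the correlation matrix behind a heatmap
vifs = variance_inflation_factors(X)  # one value per feature

for name, vif in zip(["height_cm", "height_in", "age"], vifs):
    print(f"{name}: VIF = {vif:.1f}")
```

In this sketch the two height columns should show large VIF values while age stays close to 1, which suggests dropping one of the height columns. In practice you might prefer a tested library routine such as statsmodels' variance_inflation_factor over a hand-written helper.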
One-Hot Encoding
One-hot encoding is a form of feature transformation that represents categorical features numerically. Categorical features hold text values; one-hot encoding pivots that information so that each distinct value becomes its own feature, and each observation (row) is marked with a 0 or a 1 for that feature. For example, the categorical variable gender would be replaced by two columns, male and female, after one-hot encoding:
Before and after one-hot encoding (gender before; male/female after). Screenshot by Author [4].
This transformation is useful when you are not working only with numerical features and need a numerical representation of your text/categorical features.
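The gender example can be sketched in a few lines of plain Python. This is only an illustration under assumed data: the rows, the id column, and the helper one_hot_encode are hypothetical and not taken from the original screenshot.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row (a dict) with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical "before" table with a single categorical feature.
people = [
    {"id": 1, "gender": "male"},
    {"id": 2, "gender": "female"},
    {"id": 3, "gender": "female"},
]

encoded = one_hot_encode(people, "gender")
for row in encoded:
    print(row)
```

Each row now carries a female and a male column holding 0 or 1 instead of the text value. On real projects, pandas.get_dummies or scikit-learn's OneHotEncoder would be the more usual choice.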
Sampling
When one class does not have enough data, oversampling may be suggested as a form of compensation. Say you are working on a classification problem and you have a minority class, as in the example below:
class_1 = 100 rows
class_2 = 1000 rows
class_3 = 1100 rows
As you can see, class_1 has far fewer rows than the other classes, which means your dataset is imbalanced; class_1 is referred to as the minority class. There are several oversampling techniques. One of them is SMOTE [5], which stands for Synthetic Minority Over-sampling Technique. SMOTE takes a minority-class sample, finds its k nearest neighbors within the same class, and creates synthetic samples by interpolating between the sample and one of those neighbors. Undersampling techniques work in the reverse direction: they reduce the number of rows in the larger classes instead.
These techniques are beneficial when you have outliers in your classification data, or even in your regression data, and you want your sample to be the best possible representation of the data your model will run on in the future.
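The following is a simplified, SMOTE-style sketch in plain Python, not the reference SMOTE implementation. The function names, the toy 2-D points, and the parameter defaults are assumptions chosen for illustration. It shows the core idea of interpolating between a minority sample and one of its nearest minority neighbors, alongside plain random undersampling of a majority class.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)
        # k nearest minority neighbors of `point`, excluding the point itself.
        others = [p for p in minority if p is not point]
        neighbors = sorted(others, key=lambda p: math.dist(point, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment point -> neighbor
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(point, neighbor))
        )
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class, without replacement."""
    return random.Random(seed).sample(majority, n_keep)


# Hypothetical toy data: a small minority class and a larger majority class.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i), float(i % 7)) for i in range(40)]

new_points = smote_like_oversample(minority, n_new=6)
kept = random_undersample(majority, n_keep=10)

print(len(minority) + len(new_points), "minority rows after oversampling")
print(len(kept), "majority rows after undersampling")
```

Because every synthetic point lies on a segment between two real minority points, the new rows stay inside the region the minority class already occupies. For real work, the imbalanced-learn package provides a maintained SMOTE class along with several undersampling strategies.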
Error Metrics
There are plenty of error metrics for both classification and regression models in Data Science. According to sklearn [6], here are some that you can use specifically for regression models:
metrics.explained_variance_score
metrics.max_error
metrics.mean_absolute_error
metrics.mean_squared_error
metrics.mean_squared_log_error
metrics.median_absolute_error
metrics.r2_score
metrics.mean_poisson_deviance
metrics.mean_gamma_deviance
Two of the most popular error metrics for regression are MSE and RMSE, both based on metrics.mean_squared_error from the list above:
MSE: the concept is → mean squared error regression loss (sklearn)
RMSE: the concept is → the square root of the mean squared error, expressed in the same units as the target
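To show how these quantities relate, here is a small sketch in plain Python with made-up target and prediction values. The function names mirror the sklearn metric names for readability, but these are hand-written helpers, not the sklearn functions themselves.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Hypothetical targets and model predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)

print(f"MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```

Squaring penalizes the single error of 2 more heavily than the two errors of 1, which is why MSE reacts more strongly to outliers than MAE does.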
For classification, you can expect to evaluate your model's performance with accuracy and AUC (Area Under the Curve, usually the ROC curve).
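As a rough illustration of both classification metrics, the sketch below uses plain Python and hypothetical labels and scores. It assumes binary labels (0 and 1) and a 0.5 decision threshold, and it computes ROC AUC through its rank interpretation: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.5]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 threshold

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)

print(f"accuracy = {acc:.3f}, AUC = {auc:.3f}")
```

Accuracy depends on the chosen threshold, while AUC only depends on how the scores rank the examples, so the two can disagree about which model is better.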
Storytelling
Photo by Nong Vang on Unsplash [7].
I wanted to add a concept that is unique within Data Science: storytelling. I cannot stress enough how important it is. It can be seen as a concept or a skill, but the label is not important; what matters is how well you articulate your problem-solving techniques in a business setting. A lot of Data Scientists focus solely on model accuracy and then fail to understand the entire business process. That process includes:
what is the business?
what is the problem?
why do we need Data Science?
what is the goal of Data Science here?
when will we get usable results?
how can we apply our results?
what is the impact of our results?
how do we share our results and overall process?
As you can see, none of these points is about the model itself or an improvement in accuracy. The focus here is how you will use data to solve your company's problems. It is beneficial to become acquainted with the stakeholders and non-technical coworkers you will ultimately be working with. You will also work with Product Managers, who will assess the problem alongside you, and with Data Engineers, who will help collect the data before you even run a base model. At the end of your modeling process, you will share your results with key individuals, who will usually want to see the impact in some type of visual representation (Tableau, a Google Slides deck, etc.), so being able to present and communicate is beneficial as well.
Summary
There are plenty of key concepts that Data Scientists, as well as Machine Learning Engineers, should know. The five discussed in this article were:
Multicollinearity
One-hot encoding
Sampling
Error metrics
Storytelling
Please feel free to comment below with Data Science concepts that you focus on daily or that you think others should know about. Thank you for reading my article; I hope you enjoyed it!
Below are some references and links that can provide more information on the topics discussed in this article.
I also want to highlight two other stories I have written that are related to this article, [8] and [9]:
These two articles highlight key skills and projects that you will need to know or become familiar with, and that you can expect to eventually use as a professional Data Scientist.
Translated from: https://towardsdatascience.com/5-concepts-every-data-scientist-should-know-16c74d080a83