Data cleaning is finally being automated
APPLE | GOOGLE | SPOTIFY | OTHERS
Editor’s note: The Towards Data Science podcast’s “Climbing the Data Science Ladder” series is hosted by Jeremie Harris. Jeremie helps run a data science mentorship startup called SharpestMinds. You can listen to the podcast below:
It’s cliché to say that data cleaning accounts for 80% of a data scientist’s job, but it’s directionally true.
That’s too bad, because fun things like data exploration, visualization and modelling are the reason most people get into data science. So it’s a good thing that there’s a major push underway in industry to automate data cleaning as much as possible.
One of the leaders of that effort is Ihab Ilyas, a professor at the University of Waterloo and founder of two companies, Tamr and Inductiv, both of which are focused on the early stages of the data science lifecycle: data cleaning and data integration. Ihab knows an awful lot about data cleaning and data engineering, and has some really great insights to share about the future direction of the space — including what work is left for data scientists, once you automate away data cleaning.
Here are some of my biggest takeaways from the conversation:
- Data cleaning involves a lot of things, one of which is dealing with missing values. Historically, missing values have often been filled in manually by subject matter experts who can make educated guesses about the data, but automated techniques can work well (and usually do better) at scale.
- These automated strategies can range from fairly naive approaches (e.g. replacing a missing value with the median or average value of other points in the dataset), to more sophisticated techniques (e.g. using a predictive model to guess at missing values).
- The distinctions between different parts of the data science lifecycle are often arbitrary, but clearly defining the boundaries between data cleaning, data exploration and modelling is nonetheless essential to ensure that problems can be solved in a contained and modular fashion. This idea is one part of the data science best practices that make up DataOps, a topic we’ve discussed on the podcast before.
- It’s clear that data cleaning, like modelling, is not immune to automation. As a result, it’s likely that data scientists will find themselves leaning more and more into their subject matter expertise, communication and engineering skills in the future, rather than spending their time on dealing with missing values, hyperparameter optimization or model selection.
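To make the difference between the naive and the model-based strategies mentioned above concrete, here is a minimal sketch in plain Python. It is not taken from the podcast or from Tamr or Inductiv: the function names, the toy data and the choice of a simple least-squares line as the "predictive model" are all assumptions made for illustration.

```python
from statistics import median


def impute_median(values):
    """Naive approach: replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_with_model(features, targets):
    """Model-based approach: predict None targets from a line fit on the complete rows."""
    complete = [(x, y) for x, y in zip(features, targets) if y is not None]
    slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [slope * x + intercept if y is None else y
            for x, y in zip(features, targets)]


# Hypothetical toy data: years of experience vs. salary (in thousands).
# The last salary is missing.
years = [1, 2, 3, 4, 10]
salary = [40.0, 50.0, 60.0, 70.0, None]

median_filled = impute_median(salary)            # ignores the other column
model_filled = impute_with_model(years, salary)  # uses years to predict salary

print(median_filled[-1], model_filled[-1])
```

On this toy data the median fill ignores the fact that the incomplete row has far more experience than the others, while the fitted line extrapolates from it. Real pipelines would typically use a richer model and validate the imputed values, but the contrast between the two strategies is the same.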
You can follow Ihab on Twitter here and you can follow me on Twitter here.
Translated from: https://towardsdatascience.com/data-cleaning-is-finally-being-automated-8cc964ea2e12