Aligning Your Analysts and Data Scientists Around Data Manipulation
According to a recent survey conducted by Dimensional Research, only 50 percent of data analysts’ time is actually spent analyzing data. What’s the other half spent on? Data cleanup — that tedious and repetitive work that must be done before you can dig into the fancy data science stuff. I’m talking about deduplication, fuzzy matching, replacing invalid characters — basically, all the data wrangling and munging you need to do to make the data easier to understand and work with.
Typically, data manipulation is accomplished in one of two ways, each of which has pros and cons. The first method relies primarily on SQL, which is great for doing the joins, unions, and deduplications that are the bread and butter of data cleanup. For those specific actions that SQL is unable to perform, for example extracting word counts from unstructured text, you simply embed user-defined functions (UDFs) written in a general-purpose programming language, usually Python.
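As a rough illustration of the SQL-first pattern with an embedded UDF, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a real warehouse. The `reviews` table, its columns, and the `word_count` function are hypothetical names chosen for this example, and the way you register a UDF will differ on systems like Snowflake or BigQuery.

```python
import sqlite3


def word_count(text):
    """Count whitespace-separated words; a NULL input counts as 0."""
    if text is None:
        return 0
    return len(text.split())


conn = sqlite3.connect(":memory:")

# Register the Python function as a one-argument SQL function.
conn.create_function("word_count", 1, word_count)

# Hypothetical table of unstructured text, for illustration only.
conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO reviews (id, body) VALUES (?, ?)",
    [(1, "great product, fast shipping"), (2, "meh"), (3, None)],
)

# SQL stays the point of entry; Python fills in the one step SQL lacks.
rows = conn.execute(
    "SELECT id, word_count(body) AS n_words FROM reviews ORDER BY id"
).fetchall()
print(rows)  # [(1, 4), (2, 1), (3, 0)]
```

The query itself stays plain SQL, so an analyst can read and reuse it without knowing how `word_count` is implemented.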
The second approach uses a general-purpose programming language, such as Python or Scala, as the “point of entry” for working with data. Operations that you would do in SQL, like joins, are provided by a data frame library like Pandas. Many data scientists naturally gravitate to this approach because they have more experience with Python or Scala, and they view SQL as a lesser tool primarily for business analysts. However, they are missing out on some big benefits of the SQL-first approach:
- The most common data-cleanup operations produce simpler code in SQL. Simpler code makes it easier for others to understand and harder for you to make mistakes;
- SQL is ubiquitous among data analysts, so it’s easier to share code with analysts;
- It’s easier to hire for SQL expertise than Python or Scala.
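To make the first point concrete, the sketch below shows a deduplication plus a join expressed as a single SQL statement, again run through Python's built-in `sqlite3` module. The `signups` and `plans` tables and their contents are invented for illustration; this is one possible way to write such a cleanup, not a prescribed one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical messy input: the same email appears with different
# casing, stray whitespace, and exact repeats.
conn.executescript("""
    CREATE TABLE signups (email TEXT, plan_id INTEGER);
    CREATE TABLE plans (plan_id INTEGER PRIMARY KEY, plan_name TEXT);
    INSERT INTO signups VALUES
        ('Ann@Example.com ', 1),
        ('ann@example.com', 1),
        ('bob@example.com', 2),
        ('bob@example.com', 2);
    INSERT INTO plans VALUES (1, 'free'), (2, 'pro');
""")

# Normalize, join, and deduplicate in one declarative statement.
clean_rows = conn.execute("""
    SELECT DISTINCT lower(trim(s.email)) AS clean_email, p.plan_name
    FROM signups AS s
    JOIN plans AS p ON p.plan_id = s.plan_id
    ORDER BY clean_email
""").fetchall()
print(clean_rows)  # [('ann@example.com', 'free'), ('bob@example.com', 'pro')]
```

The whole cleanup reads as a description of the desired result, which is part of what makes it easy to hand to an analyst for review.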
The benefits I just described are “human-focused,” but there is also a very important infrastructure benefit. Massively parallel processing (MPP) systems, like Snowflake and BigQuery, will automatically distribute your code across an arbitrarily large compute cluster if you write it in SQL.
On the other hand, if you use Python or Scala dataframes as your primary programming model, you will often need to specify data distributions and other details of how the system spreads your computation across nodes. The resulting execution plan is usually less efficient than what a SQL-based system would have produced, thanks to write barriers as well as extra serialization and deserialization steps. This last point is increasingly important when you’re working with larger data sets. That’s not to say it’s impossible to distribute your workload effectively when using a dataframe-based system, but you’ll be doing infrastructure work that doesn’t add value instead of spending your time getting insights from data.
Lastly and most importantly, by making SQL your foundation, you can avoid creating two competing camps within your organization, data scientists versus analysts. With everyone in alignment about how data manipulation is accomplished, your team can focus on the deep data analysis that’s increasingly important in business today.
Source: https://towardsdatascience.com/aligning-your-analysts-and-data-scientists-around-data-manipulation-fefe80d46c51