The Hands-On Road to Becoming a Data Scientist

Published: 2024/8/23

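Kaggle recently conducted a survey to assess the current state of data science and machine learning. It received nearly 17,000 responses and used them to produce a large body of analysis. I am not interested in the published analysis reports; I only want to see whether the survey results are useful to someone like me who wants to know how to become a data scientist.

If you are not interested in the analysis itself and only want to know what 17,000 industry professionals have to say, skip to the conclusion in the last section. Otherwise, read on to see how I reached those conclusions.

1. Importing and Preprocessing

1.1. Importing the Data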

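    library(data.table)
    library(dplyr, warn.conflicts = FALSE)
    library(ggplot2)
    library(tibble)

    # Read the multiple-choice responses and convert them to a tibble
    # (as_tibble() replaces the deprecated as.tibble())
    results <- as_tibble(suppressWarnings(fread("../input/multipleChoiceResponses.csv")))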

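1.2. Preprocessing the Data

Many of the job titles and majors that appear in this analysis have long names, so I shortened some of them to make the charts more readable. I also removed awkward characters from two column names and converted job satisfaction to a number.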

    # Shorten long job titles
    results$CurrentJobTitleSelect[results$CurrentJobTitleSelect == "Software Developer/Software Engineer"] <- "Software Engineer"
    results$CurrentJobTitleSelect[results$CurrentJobTitleSelect == "Operations Research Practitioner"] <- "Operations Research"

    # Shorten long major names
    results$MajorSelect[results$MajorSelect == "Engineering (non-computer focused)"] <- "Engineering"
    results$MajorSelect[results$MajorSelect == "Information technology, networking, or system administration"] <- "Information technology"
    results$MajorSelect[results$MajorSelect == "Management information systems"] <- "Information systems"

    # Rename the columns whose names contain "/" or "-"
    results_names <- names(results)
    results_names[results_names == "WorkMethodsFrequencyA/B"] <- "WorkMethodsFrequencyABTesting"
    results_names[results_names == "WorkMethodsFrequencyCross-Validation"] <- "WorkMethodsFrequencyCrossValidation"
    names(results) <- results_names

    # Convert job satisfaction to a number using the first two characters of each answer
    results$JobSatisfaction <- suppressWarnings(as.numeric(substr(results$JobSatisfaction, start = 1, stop = 2)))
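
To illustrate the last preprocessing step, the sketch below applies the same `substr` plus `as.numeric` conversion to a handful of sample answers. The answer strings are assumed for illustration and may not match the survey's exact wording.

```r
# Sample answers; the exact wording is assumed for illustration
answers <- c("10 - Highly Satisfied", "7", "1 - Highly Dissatisfied",
             "I prefer not to share", "")

# Keep the first two characters and coerce them to a number.
# Anything that is not numeric becomes NA, hence suppressWarnings().
satisfaction <- suppressWarnings(as.numeric(substr(answers, start = 1, stop = 2)))

print(satisfaction)
```

Answers that start with a number are converted to that number, while free-text and empty answers become `NA` and can be dropped later with `na.rm = TRUE`.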

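1.3. Theme

I display many charts throughout this analysis, so it is worth introducing the ggplot2 theme that makes them look nicer. I use this theme in most of the charts.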

    jack_theme <- theme(
      plot.background = element_rect(fill = "#eeeeee"),
      panel.background = element_rect(fill = "#eeeeee"),
      legend.background = element_rect(fill = "#eeeeee"),
      legend.title = element_text(size = 12, family = "Helvetica", face = "bold"),
      legend.text = element_text(size = 9, family = "Helvetica"),
      panel.grid.major = element_line(size = 0.4, linetype = "solid", color = "#cccccc"),
      panel.grid.minor = element_line(size = 0),
      plot.title = element_text(size = 20, family = "Helvetica", face = "bold", hjust = 0.5, margin = margin(b = 20)),
      axis.title = element_text(size = 14, family = "Helvetica", face = "bold"),
      axis.title.x = element_text(margin = margin(t = 20)),
      axis.title.y = element_text(margin = margin(r = 20)),
      axis.ticks = element_blank(),
      plot.margin = unit(c(1, 1, 1, 1), "cm")
    )

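2. Getting Started

2.1. Programming Languages

The first question many newcomers ask themselves is which programming language they should learn. This chart shows, for each job title, how many respondents recommend Python and how many recommend R.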

    results %>%
      rename(Language = LanguageRecommendationSelect, title = CurrentJobTitleSelect) %>%
      filter(Language == "R" | Language == "Python") %>%
      filter(title != "") %>%
      group_by(title, Language) %>%
      count() %>%
      ggplot(aes(title, n, color = Language)) +
      ggtitle("Python vs. R by Job Title") +
      labs(x = "Job Title", y = "Count") +
      geom_point(size = 4) +
      jack_theme +
      theme(
        axis.text.x = element_text(angle = 330, hjust = 0),
        axis.title.x = element_text(margin = margin(t = 12))
      )

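As we can see, Python is recommended to beginners over R in almost every case, especially by data scientists, software developers, and machine learning engineers. Statisticians, however, prefer R. Researchers also prefer R, but I think that is because the sample is too small.

2.2. Majors and Likely Job Titles

I have not started university yet, so I have not chosen a major. My intuition, though, is that some majors are more likely than others to lead to certain jobs. The chart below can confirm or refute that: for each job title, it shows what percentage of respondents came from each major.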

    results %>%
      rename(Major = MajorSelect, title = CurrentJobTitleSelect) %>%
      filter(title != "", Major != "") %>%
      group_by(title, Major) %>%
      summarize(n = n()) %>%
      mutate(freq = n / sum(n) * 100) %>%
      ggplot(aes(x = title, y = freq, fill = Major, label = ifelse(freq > 8, round(freq), ""))) +
      ggtitle("Major vs. Job Title") +
      labs(x = "Job Title", y = "Frequency (%)") +
      geom_bar(stat = "identity", position = position_stack()) +
      geom_text(position = position_stack(vjust = 0.5)) +
      scale_y_continuous(expand = c(0, 0)) +
      jack_theme +
      theme(
        axis.title.x = element_text(margin = margin(t = 8)),
        axis.text.x = element_text(angle = 320, hjust = 0)
      )

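My intuition was largely correct: a person's major is a strong indicator of the job they end up in. Some of the trends are obvious: computer science majors tend to become computer scientists, programmers, and software engineers, while mathematics majors tend to become statisticians and predictive modelers. Most physics majors went into research, and people who majored in engineering other than computer science call themselves engineers.

I personally like this chart a lot: every major is represented in every job title. To me, this shows that whatever you studied in school, with enough passion you can do the job you want.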
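
The percentages in the chart are computed within each job title, so the shares of all majors for one title add up to 100. The sketch below shows the same normalization in base R on a toy data set. The respondents are made up for illustration, and `prop.table` is used here in place of the article's dplyr pipeline.

```r
# Toy respondents; titles and majors are made up for illustration
toy <- data.frame(
  title = c("Data Scientist", "Data Scientist", "Data Scientist", "Data Scientist",
            "Statistician", "Statistician"),
  major = c("Computer Science", "Computer Science", "Mathematics", "Physics",
            "Mathematics", "Mathematics"),
  stringsAsFactors = FALSE
)

# Cross-tabulate titles against majors
counts <- table(toy$title, toy$major)

# Divide each row (job title) by its row total and express it as a percentage
freq <- prop.table(counts, margin = 1) * 100

print(round(freq))
```

Using `margin = 1` normalizes by row, which corresponds to grouping by job title; `margin = 2` would instead answer the reverse question of where the people from each major end up.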

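3. What to Learn

3.1. Learning Resources

One survey question asked how useful each learning resource was for learning data science. In the next chart, I plot each resource's popularity against its usefulness. Popularity is the number of people who answered the question for that resource; usefulness is the weighted average of their answers. An answer of "very useful" has a weight of 1, "somewhat useful" a weight of 0.5, and "not useful" a weight of 0. The chart shows not only which resources are most useful, but also which are overused or underused.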
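
As a worked example of this scoring rule, the sketch below computes popularity and usefulness for a single resource. The answer counts are hypothetical and the answer labels are assumed; only the weights (1, 0.5, 0) come from the description above.

```r
# Hypothetical answer counts for one learning resource
counts <- c("Not Useful" = 20, "Somewhat useful" = 100, "Very useful" = 80)

# Weights from the text, attached to the answers by name
weights <- c("Not Useful" = 0, "Somewhat useful" = 0.5, "Very useful" = 1)

# Popularity = number of people who answered the question
popularity <- sum(counts)

# Usefulness = weighted average of the answers
usefulness <- sum(counts * weights[names(counts)]) / popularity

cat("Popularity:", popularity, "\n")
cat("Usefulness:", usefulness, "\n")
```

Here 200 people answered and the score is (20 × 0 + 100 × 0.5 + 80 × 1) / 200 = 0.65. Looking weights up by name rather than by position keeps the calculation independent of the order in which the answers happen to be listed.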

    # Get all column names that begin with "LearningPlatformUsefulness"
    platforms <- grep("^LearningPlatformUsefulness", names(results), value = TRUE)

    platform_names <- c()
    popularities <- c()
    scores <- c()

    for (platform in platforms) {
      # Count the answers per level. The rows come back sorted by answer,
      # and row 1 is assumed to hold the blank (unanswered) responses.
      # group_by(.data[[...]]) replaces the deprecated group_by_().
      usefulness <- results %>%
        group_by(.data[[platform]]) %>%
        count()

      # Popularity = the number of people who responded to this question
      popularity <- usefulness[[2]][2] + usefulness[[2]][3] + usefulness[[2]][4]

      # Usefulness = a weighted average determining the usefulness of this platform
      score <- (usefulness[[2]][2] * 0 + usefulness[[2]][3] * 0.5 + usefulness[[2]][4] * 1) / popularity

      platform_names <- c(platform_names, gsub("LearningPlatformUsefulness", "", platform))
      popularities <- c(popularities, popularity)
      scores <- c(scores, score)
    }

    scores_df <- data.frame(
      Popularity = popularities,
      Usefulness = scores,
      Name = platform_names
    )

    ggplot(scores_df, aes(x = Usefulness, y = Popularity)) +
      ggtitle("Effectiveness of Learning Methods") +
      geom_point() +
      geom_text(aes(label = Name, family = "Helvetica"), nudge_y = 200) +
      jack_theme

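A few straightforward results stand out in the chart above. Podcasts, newsletters, and conferences score lowest on usefulness, while Kaggle competitions, Stack Overflow, online courses, and projects score highest. We can also see that although many people like watching YouTube tutorials and reading blog posts, these may not be the most effective ways to learn data science.

3.2. Important Job Skills

To find out which skills matter most on the job, we can build a new chart using a method similar to the previous one. Here I again plot popularity against usefulness, to see which skills are used most in the real world.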

    # Get all column names that begin with "JobSkillImportance" and end in a letter
    # ([A-Za-z] is used because [A-z] also matches a few punctuation characters)
    platforms <- grep("^JobSkillImportance.*[A-Za-z]$", names(results), value = TRUE)

    skill_names <- c()
    popularities <- c()
    scores <- c()

    for (platform in platforms) {
      # Count the answers per level. The rows come back sorted by answer,
      # and row 1 is assumed to hold the blank (unanswered) responses.
      usefulness <- results %>%
        group_by(.data[[platform]]) %>%
        count()

      # Popularity = the number of people who responded to this question
      popularity <- usefulness[[2]][2] + usefulness[[2]][3] + usefulness[[2]][4]

      # Usefulness = a weighted average determining the importance of this skill
      score <- (usefulness[[2]][2] * 1 + usefulness[[2]][3] * 0.5 + usefulness[[2]][4] * 0) / popularity

      skill_names <- c(skill_names, gsub("JobSkillImportance", "", platform))
      popularities <- c(popularities, popularity)
      scores <- c(scores, score)
    }

    scores_df <- data.frame(
      Popularity = popularities,
      Usefulness = scores,
      Name = skill_names
    )

    ggplot(scores_df, aes(x = Usefulness, y = Popularity)) +
      ggtitle("Important Skills on the Job") +
      geom_point() +
      geom_text(aes(label = Name, family = "Helvetica"), nudge_y = 12) +
      jack_theme

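Interestingly, MOOCs scored lowest on usefulness here, even though online courses had the second-highest score in the previous question. So while courses are helpful for learning data science, other learning resources are also valuable for building your job skills. We can see that knowledge of Python, advanced statistics, and visualization tools ranks highest among the skills for data science work.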
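
The column selection in this section keeps only names that end in a letter. The sketch below compares two character classes that are often used for that purpose. The column names are hypothetical, and `perl = TRUE` is set so that ranges follow code-point order.

```r
# Hypothetical column names, for illustration only
cols <- c("JobSkillImportancePython", "JobSkillImportanceStats",
          "JobSkillImportanceOther1", "JobSkillImportanceOther_")

# [A-z] spans every code point from "A" to "z", which also includes [ \ ] ^ _ `
loose <- grep("^JobSkillImportance.*[A-z]$", cols, value = TRUE, perl = TRUE)

# [A-Za-z] matches ASCII letters only
strict <- grep("^JobSkillImportance.*[A-Za-z]$", cols, value = TRUE, perl = TRUE)

print(loose)
print(strict)
```

The loose pattern lets the name ending in an underscore through, while the strict one keeps only the two names that end in a letter. For column names made of letters and digits the two behave the same.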

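3.3. Tools in the Real World

Another survey question asked respondents how often they use particular technologies. Instead of a scatter plot, I put this data in a table with one column per job title, because the answers already reflect popularity by their nature.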

    # Get all column names that begin with "WorkToolsFrequency" and end in a letter
    platforms <- grep("^WorkToolsFrequency.*[A-Za-z]$", names(results), value = TRUE)
    positions <- c("All", "Data Scientist", "Software Engineer", "Researcher", "Machine Learning Engineer")

    # One column per position, one row per rank (top 10)
    technologies <- matrix("", 10, length(positions))
    colnames(technologies) <- positions
    rownames(technologies) <- 1:10

    i <- 1
    for (position in positions) {
      tool_names <- c()
      popularities <- c()

      if (position == "All") {
        position_results <- results
      } else {
        position_results <- results %>% filter(CurrentJobTitleSelect == position)
      }

      for (platform in platforms) {
        # Count the answers per level. The rows come back sorted by answer,
        # so the positional weights below depend on that order.
        usefulness <- position_results %>%
          group_by(.data[[platform]]) %>%
          count()

        # Popularity = the number of people who responded to this question
        popularity <- usefulness[[2]][2] + usefulness[[2]][3] + usefulness[[2]][4] + usefulness[[2]][5]

        # Usefulness = a weighted average determining how much this tool was used
        score <- (usefulness[[2]][2] * 1 + usefulness[[2]][3] * 0.67 + usefulness[[2]][4] * 0.33 + usefulness[[2]][5] * 0) / popularity

        tool_names <- c(tool_names, as.character(gsub("WorkToolsFrequency", "", platform)))
        # Rank by weighted usage (respondents * score)
        popularities <- c(popularities, popularity * score)
      }

      scores_df <- data.frame(Popularity = popularities, Name = tool_names)
      technologies[, i] <- head(as.character((scores_df %>% arrange(desc(Popularity)))$Name), n = 10)
      i <- i + 1
    }

    technologies

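The table shows that Python, SQL, R, Jupyter, Unix, and TensorFlow are everywhere. Spark, Hadoop, and Tableau are specific to data scientists, NoSQL is specific to software engineers, and MATLAB is used only by researchers and machine learning engineers. Some of these results are easy to guess, but the table is still useful for anyone who hopes to work in one of these roles.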
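
The ranking behind such a table can be sketched on toy data as follows. The counts are hypothetical and the answer labels are assumed; the weights are attached to labels by name here, whereas the article's code assigns them by position in the sorted answers, so the two may not line up exactly.

```r
# Assumed answer labels, from most to least frequent use
answer_levels <- c("Most of the time", "Often", "Sometimes", "Rarely")

# Hypothetical answer counts: one row per tool, one column per answer
counts <- matrix(
  c(50, 30, 15,  5,
    30, 30, 25, 15,
     5, 10, 25, 60,
    10, 20, 30, 40),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("Python", "SQL", "MATLAB", "Tableau"), answer_levels)
)

# Weights attached to answer labels rather than to positions
weights <- c("Most of the time" = 1, "Often" = 0.67, "Sometimes" = 0.33, "Rarely" = 0)

# Weighted usage = sum over answers of count * weight
weighted_usage <- as.vector(counts %*% weights[colnames(counts)])
names(weighted_usage) <- rownames(counts)

# Rank the tools from most to least used and keep the top three
top_tools <- names(sort(weighted_usage, decreasing = TRUE))[1:3]
print(top_tools)
```

Repeating this for each job title and binding the resulting vectors as columns gives a table of the same shape as the one discussed above.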

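3.4. Important Work Methods

We use the same approach as for the previous table, but replace the tools used at work with methods. Here we can see the top 10 methods that respondents use.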

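    # Get all column names that begin with "WorkMethodsFrequency" and end in a letter
    methods <- grep("^WorkMethodsFrequency.*[A-Za-z]$", names(results), value = TRUE)
    positions <- c("All", "Data Scientist", "Software Engineer", "Researcher", "Machine Learning Engineer")

    # One column per position, one row per rank (top 10)
    technologies <- matrix("", 10, length(positions))
    colnames(technologies) <- positions
    rownames(technologies) <- 1:10

    i <- 1
    for (position in positions) {
      method_names <- c()
      popularities <- c()

      if (position == "All") {
        position_results <- results
      } else {
        position_results <- results %>% filter(CurrentJobTitleSelect == position)
      }

      for (method in methods) {
        # Count the answers per level. The rows come back sorted by answer,
        # so the positional weights below depend on that order.
        usefulness <- position_results %>%
          group_by(.data[[method]]) %>%
          count()

        # Popularity = the number of people who responded to this question
        popularity <- usefulness[[2]][2] + usefulness[[2]][3] + usefulness[[2]][4] + usefulness[[2]][5]

        # Usefulness = a weighted average determining how much this method was used
        score <- (usefulness[[2]][2] * 1 + usefulness[[2]][3] * 0.67 + usefulness[[2]][4] * 0.33 + usefulness[[2]][5] * 0) / popularity

        method_names <- c(method_names, as.character(gsub("WorkMethodsFrequency", "", method)))
        # Rank by weighted usage (respondents * score)
        popularities <- c(popularities, popularity * score)
      }

      scores_df <- data.frame(Popularity = popularities, Name = method_names)
      technologies[, i] <- head(as.character((scores_df %>% arrange(desc(Popularity)))$Name), n = 10)
      i <- i + 1
    }

    technologies

Rank	All	Data Scientist	Software Engineer	Researcher	Machine Learning Engineer
1	Data visualization	Data visualization	Data visualization	Data visualization	Cross-validation
2	Cross-validation	Cross-validation	Cross-validation	Cross-validation	Neural networks
3	Logistic regression	Logistic regression	Neural networks	Neural networks	Data visualization
4	Decision trees	Random forests	Logistic regression	Logistic regression	Convolutional neural networks
5	Random forests	Decision trees	Decision trees	Principal component analysis	Natural language processing
6	Time series analysis	Time series analysis	Time series analysis	Time series analysis	Logistic regression
7	Neural networks	Text analysis	Text analysis	Convolutional neural networks	Random forests
8	Principal component analysis	Ensemble methods	Random forests	Decision trees	Decision trees
9	Text analysis	Principal component analysis	Natural language processing	Support vector machines	Principal component analysis
10	Nearest neighbors	GBM	A/B testing	Natural language processing	Ensemble methods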

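The table shows that everyone uses data visualization, cross-validation, logistic regression, and decision trees. Natural language processing and neural networks are used more often by machine learning engineers, and software engineers are the only group that uses A/B testing often.

4. Conclusion

If you are new to data science and want to get started, the analysis above offers the following advice:

Learn Python. Python and R have both been around for decades, but as the first chart shows, Python comes out clearly ahead. The fourth chart supports this: the vast majority of respondents consider Python the most important skill on the job. I believe you would have a hard time finding a company that does not use Python at all, so it is well worth learning.

Major in computer science or mathematics. As I noted after the major and job title chart, every major is represented in every role. Judging by the proportions in that chart, however, computer science and mathematics majors are the largest groups in almost every role. It is not required, but a degree in either of these majors, or in both, will give you a significant advantage.

Do projects, take courses, and compete on Kaggle. The learning resources chart makes it clear that projects, online courses, and Kaggle competitions are the three most useful resources for learning data science.

Know the most popular tools. There are a great many tools and libraries related to data science, but this survey shows which ones are considered most important. The most recommended tools are Python, SQL, R, Jupyter, and Unix, and the most recommended methods are data visualization, cross-validation, logistic regression, decision trees, and random forests.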

Original article: "How to Become a Data Scientist", by Jack Cook. Translated by 夏天, reviewed by 主題曲.
