6 Ways to Make Your Data Science Projects Stand Out
The global COVID-19 pandemic has left many people with a lot of time on their hands to work on their data science project portfolios. With everyone applying for jobs, how can you make sure that yours stands out? Read on to find out.
1. Use more unique data
Photo by Randy Fath on Unsplash
Iris, Galton, Titanic, Northwind Traders, Superstore, Go Data Warehouse. While you were studying data science in school, you no doubt came across at least one of these data sets. There is a reason for that: they help demonstrate concepts like clustering, regression, logistic regression, database structures, data visualization, and building reports. Each data set is clean and small, but that isn’t all they have in common: everyone has worked with these data sets. No new or exciting projects are being built on these training data sets. No recruiter is going to look at your Titanic project (one of the most popular data sets on Kaggle) and say, “We need this person on our team.”
We live in the data age, and that means there is no shortage of data sets that are easily available for download. Get your data from somewhere more exciting than Kaggle or the data sets you learned machine learning on. A good place to branch out to is Data.gov. In 2013 President Obama signed an executive order making open and machine-readable data the new default for government information. This means that there is a wealth of searchable information ready to download right from Data.gov. Federal student loan program data, federal aid to states data, and accidental drug-related deaths are just a few of the over 200,000 data sets available for your use. Just make sure to look at the metadata provided with the file so you understand what you are working with.
Want to make things a little more personal? Use your own data! Anything you do can be turned into data. Many gyms are closed during stay-at-home orders, so maybe you’re working out from home. All of those exercises you are doing can be tracked. Look at how many reps you are doing, which muscle groups you are working on, and which days you are working out. The best part about using your own data is that you are the subject matter expert. You may end up with smaller data sets to work with, but you will have a much deeper understanding of how the data was captured and have control over adding new variables or dimensions to it.
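As a rough illustration of turning workouts into data, the sketch below summarizes a small workout log by muscle group and by weekday. The log entries, field names, and function names are all hypothetical; a real log would come from whatever app or spreadsheet you track your workouts in.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical workout log: one record per set performed.
workout_log = [
    {"day": date(2020, 5, 4), "muscle_group": "legs", "exercise": "squat", "reps": 12},
    {"day": date(2020, 5, 4), "muscle_group": "legs", "exercise": "lunge", "reps": 10},
    {"day": date(2020, 5, 6), "muscle_group": "chest", "exercise": "push-up", "reps": 20},
    {"day": date(2020, 5, 8), "muscle_group": "legs", "exercise": "squat", "reps": 15},
]


def reps_by_muscle_group(log):
    """Total reps recorded for each muscle group."""
    totals = defaultdict(int)
    for entry in log:
        totals[entry["muscle_group"]] += entry["reps"]
    return dict(totals)


def workouts_by_weekday(log):
    """Number of distinct workout days that fell on each weekday."""
    distinct_days = {entry["day"] for entry in log}  # count each date once
    counts = defaultdict(int)
    for day in distinct_days:
        counts[WEEKDAYS[day.weekday()]] += 1
    return dict(counts)


rep_totals = reps_by_muscle_group(workout_log)
weekday_counts = workouts_by_weekday(workout_log)
print(rep_totals)
print(weekday_counts)
```

Because you recorded the data yourself, adding a new dimension (weight used, rest time, how you felt) is just one more key in each record.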
None of these sound advanced enough? Take a look at web scraping. Web scraping is the automated process of collecting unstructured data from the internet. You will have to write code in a language such as R or Python to capture the data. You will also have to do your own research about the values you capture and how the website you are scraping got those values. The end result will be far more distinctive, but it will take more work to learn about the data and clean what you collected.
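To give a feel for what scraping code looks like, here is a minimal sketch that pulls table rows out of HTML using only Python’s standard library. The HTML snippet, city names, and figures are made up for illustration; a real scraper would first download the page, and you should check a site’s terms of use before scraping it.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []            # finished rows, each a list of cell strings
        self._current_row = None  # row being built, or None outside a <tr>
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag == "td" and self._current_row is not None:
            self._in_cell = True
            self._current_row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._current_row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._current_row is not None:
            if self._current_row:  # rows with only <th> cells stay empty
                self.rows.append(self._current_row)
            self._current_row = None


# Hypothetical page content standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30,720</td></tr>
  <tr><td>Shelbyville</td><td>28,105</td></tr>
</table>
"""


def scrape_rows(html):
    """Return the table's data rows as lists of strings."""
    parser = TableCellParser()
    parser.feed(html)
    return parser.rows


rows = scrape_rows(SAMPLE_HTML)
print(rows)
```

Note that the values come back as strings such as "30,720", which is exactly the kind of cleanup work scraped data tends to require.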
2. Do a data cleaning project
Photo by Kelly Sikkema on Unsplash
Speaking of data cleaning, real-world data is disgusting, so be sure to wear your face mask while working with it. Jokes aside, when someone asks for a model that uses data to predict customer churn, there is almost never a clean, ready-to-use data source to build that model from. Most classes will not prepare you to handle the sorts of dirty data that organizations have available. This is a critical skill that you need to showcase in at least one of your projects.
There are many tasks that can be associated with cleaning data. A good place to start is understanding the data. Government and publicly available data will often have a data dictionary containing descriptions of each dimension, measure, observation, and table in the data. This will help you understand what data was collected, when it was collected, and who collected it.
Understanding what you are looking at is key to data validation. Once you know what a variable is, you may be able to use the data dictionary, common sense, or a subject matter expert to determine which values don’t make sense. For example, temperatures should fall in a certain range of values. If it is temperature data and the data dictionary specifies the units of measurement as Kelvin, any 0 or negative values would be suspect. If it is temperature data from Bermuda, warmer temperatures would make sense, and anything too hot or too cold would be suspect. For something like manufacturing welding temperatures, you may want to look to a professor or engineering student to give you more guidance. The key in this step is to find values that don’t look right.
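The range checks described above can be sketched in a few lines of Python. The Kelvin rule (nothing at or below 0) follows from the text; the Bermuda bounds are an assumption made up for this example and should really come from a data dictionary or a subject matter expert.

```python
# Plausible (exclusive) bounds per kind of measurement.
# The Bermuda range is a hypothetical placeholder, not an official figure.
PLAUSIBLE_RANGES = {
    "kelvin": (0.0, float("inf")),
    "bermuda_celsius": (0.0, 45.0),
}


def flag_suspect(values, kind):
    """Return (index, value) pairs that fall outside the plausible range."""
    low, high = PLAUSIBLE_RANGES[kind]
    return [(i, v) for i, v in enumerate(values) if not (low < v < high)]


kelvin_readings = [293.15, 0.0, -5.0, 301.2]
bermuda_readings = [24.5, 27.1, -3.0, 98.6, 26.0]

suspect_kelvin = flag_suspect(kelvin_readings, "kelvin")
suspect_bermuda = flag_suspect(bermuda_readings, "bermuda_celsius")
print(suspect_kelvin)
print(suspect_bermuda)
```

Flagging rather than deleting keeps the decision with you: a reading of 98.6 in a Celsius column may turn out to be a Fahrenheit value that can be converted instead of dropped.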
Another area to look into is how to handle missing values. Like data validation, context matters when handling missing values. If you are looking at financial loan data for cars and a loan is in good standing and has a missing value for repossession status, you won’t be worried about that value being missing. If your project involves psychological assessments and you are missing answers to a lot of questions, you may take a different course of action, like eliminating the observation. Sometimes missing values make sense in your context. As with data validation, work with your subject matter experts and peers to understand what to do with missing values.
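A minimal sketch of the two situations above: a missing value that is expected in context, and observations with so many gaps that you might drop them. The field names, records, and the 50% threshold are assumptions chosen for illustration, not a recommended standard.

```python
def missing_is_expected(loan):
    """A missing repossession status is fine for a loan in good standing."""
    return loan["repossession_status"] is None and loan["standing"] == "good"


def drop_sparse_responses(responses, max_missing_share=0.5):
    """Keep only responses whose share of unanswered questions is acceptable."""
    kept = []
    for answers in responses:
        missing = sum(1 for answer in answers if answer is None)
        if missing / len(answers) <= max_missing_share:
            kept.append(answers)
    return kept


# Hypothetical car loan records.
loans = [
    {"id": 1, "standing": "good", "repossession_status": None},
    {"id": 2, "standing": "delinquent", "repossession_status": None},
    {"id": 3, "standing": "delinquent", "repossession_status": "repossessed"},
]

# Only missing values that are NOT explained by context need a closer look.
needs_review = [
    loan["id"]
    for loan in loans
    if loan["repossession_status"] is None and not missing_is_expected(loan)
]

# Hypothetical assessment answers; None marks an unanswered question.
assessments = [
    [4, None, 3, 5],
    [None, None, None, 2],
    [1, 2, 3, 4],
]
usable = drop_sparse_responses(assessments)

print(needs_review)
print(usable)
```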
3. Use version control
Photo by Yancy Min on Unsplash
When doing data science projects, you likely spend a lot of time working with others, and this is just one of the many reasons to use version control. If you are not familiar with version control, it keeps all of the code in one place and allows multiple people to work on it at the same time. Everyone can add their contributions and review others’ code. If someone leaves the company, there is no fight with the IT department to move the most recent code to someone else’s machine. All code lives in a central, organized repository.
Another issue that plagues students is naming versions of files. Never again will you have 20 versions of a file and have to wonder how “project_02_final.py” is different from “project_12_final_done_finished.py.” With version control, every version has a comment, and you can do a line-by-line comparison with any other version of your code to see what was deleted and added. You don’t even have to worry about an old version getting deleted; you can always roll back to an older version of the code.
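To show what a line-by-line comparison means, the sketch below uses Python’s standard `difflib` module to list the lines added and removed between two versions of a script. This is only an illustration of the idea: a version control system does this comparison for you, and the two script versions here are invented.

```python
import difflib

# Two hypothetical versions of the same short script, as lists of lines.
old_version = [
    "import csv",
    "rows = load('data.csv')",
    "print(rows)",
]
new_version = [
    "import csv",
    "rows = load('data.csv')",
    "rows = clean(rows)",
    "print(len(rows))",
]


def changed_lines(old, new):
    """Return (added, removed) lines between two versions."""
    added, removed = [], []
    for line in difflib.ndiff(old, new):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
        # Lines starting with "  " are unchanged; "? " lines are hints.
    return added, removed


added, removed = changed_lines(old_version, new_version)
print("added:", added)
print("removed:", removed)
```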
The best part about version control software is that it is easy to use once you get started. It can come as a standalone program, be integrated into your favorite integrated development environment (IDE), or be used through a command line interface. Many systems have additional features that allow you to create a website based on your repository, test builds of your code, and share your code by embedding it in other places. It is not only a method to keep your project organized, but it can also be used to start presenting your projects.
4. Build your presentation layer
Photo by Alex Litvin on Unsplash
The term data scientist is still relatively new, and it isn’t uncommon for executives to not fully understand data concepts. Regardless of whether or not they have a deep understanding of data science and machine learning, they need a way to quickly get the most important information your projects have to offer. This is your presentation layer.
Most aspiring data scientists don’t focus enough on communication skills. Explaining how you used parallel computing in the cloud to train and test a model that you then deployed using a custom API for someone else to use won’t hold anyone’s attention. Management won’t always be as interested in how you did it as in why you did it. That isn’t to say the technical aspects are unimportant, but keep your audience in mind. The business side will want to see how the bottom line is impacted by your model, while the IT side will care more about its actual implementation.
Your presentation layer can take many different forms. Once again, keep your audience in mind. Some people only want spreadsheets and tables while others want graphs and colors. Some expect it to be delivered by email and others want to see all of their figures from different areas on the same page. Take the tools you have access to and create different presentation layers with different audiences in mind.
5. Experiment with various technologies
Photo by Elevate on Unsplash
One of the most difficult things I experienced right out of college was seeing job openings at companies I wanted to work for and not having any of the relevant technology skills from the posting. A lot of companies struggle with onboarding and internal training. It is so much easier for them to hire someone who already knows what they need. Being familiar with a broader range of technologies will help in your search. Many organizations will even use multiple technologies together to piece together their data pipelines and data science environments.
There is an old joke that data science is 80% prepping data and 20% building models. There is truth to this. I have worked with data scientists who spent the vast majority of their time on a project getting code to run correctly on a server instead of their desktop. I have interviewed at smaller companies where data scientists are expected to work very closely with the data engineers to build the data structures that gather data and support their models. There is a huge benefit to using multiple languages and software packages to build different parts of your data science projects.
There is a wealth of resources out there, and learning a new technology or programming language doesn’t have to be difficult or expensive. Sure, there are professionally led trainings that can cost upwards of $1,500, but there are also free videos for almost anything, and students can get licenses and training for many expensive software packages for free just for having a .edu email address. Search each vendor’s website to see what is available. Once you learn one type of software or language, it usually isn’t too hard to get into another similar one. If you can use Power BI, Tableau isn’t too difficult to pick up.
6. Combine everything into a pipeline
Photo by roman pentin on Unsplash
Speaking of multiple tools, why not use a bunch of them together? As a student, you often get your data in a simple format, a .csv for example. It isn’t impossible to get .csv files when working for an organization, but data isn’t typically stored that way. Data will come from multiple sources, have multiple layers for transformation and storage, and touch many different technologies along the way. A small pipeline for your project shows that you understand how these parts work together.
A pipeline can be as simple as loading files into a database, selecting the data to bring it into your analysis tool, and sending the results to your presentation layer. It doesn’t have to be pretty or fully automated; it just needs to demonstrate your understanding of how data pipelines work. Everyone knows how to read a .csv file, but not everyone can integrate the data, analysis, and presentation processes into a coherent project that reflects how organizations do it.
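The three steps above (load into a database, select for analysis, send to a presentation layer) can be sketched with nothing but Python’s standard library. Everything specific here is an assumption for illustration: the sales data, the table and column names, and the use of an in-memory SQLite database in place of whatever database an organization actually runs.

```python
import csv
import io
import sqlite3

# Hypothetical raw file contents; in practice this would be read from disk.
RAW_CSV = """region,amount
north,120.50
south,80.00
north,40.25
"""


def load(conn, csv_text):
    """Step 1: load the raw file into a database table."""
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    reader = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (?, ?)",
        [(row["region"], float(row["amount"])) for row in reader],
    )


def analyze(conn):
    """Step 2: select and aggregate the data for analysis."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


def present(totals):
    """Step 3: format the results for a (very plain) presentation layer."""
    return "\n".join(f"{region}: {total:.2f}" for region, total in totals)


conn = sqlite3.connect(":memory:")
load(conn, RAW_CSV)
totals = analyze(conn)
report = present(totals)
conn.close()
print(report)
```

Keeping each stage in its own function makes it easy to swap one part out later, for example replacing `present` with a dashboard export while the load and analysis steps stay the same.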
Summary
This is by no means a comprehensive list, but it gives you ideas about how to get your data science projects noticed! Keep in mind that your data science projects are a direct reflection of your skills. Showcase what you are good at by applying those skills to projects that reflect the complexity of real-world data science. Be sure to use interesting data, clean it up, use version control to stay organized, effectively communicate your message, use varied technologies, and combine everything together for a truly well-rounded data science project.
Keep an eye out for more articles about how I applied these ideas in a recent data science pipeline that I created! They will include my motivations for choosing a particular problem and issues I encountered along the way.
Translated from: https://towardsdatascience.com/6-ways-to-make-your-data-science-projects-stand-out-1eca46f5f76f