Data Science Process: Summary
Once you begin studying data science, you will hear about something called the ‘data science process’. The term refers to a five-stage process that data scientists usually follow when working on a project. In this post I will walk through each stage, describing what is involved and which technologies are normally used.
1. Data Acquisition
When you are just studying data science, your data may already be given to you by your instructors. You can also find a lot of beautiful datasets on Kaggle.com or Google Dataset Search. In this case data acquisition is pretty simple: just download the dataset and you’re all set to go.
In real life it is a little trickier. To obtain data in the format you need, you will probably be using APIs or web scraping, along with a basic knowledge of HTML. In one of my earlier posts I described how I obtained data about beauty products from Sephora.com using Selenium and BeautifulSoup.
Technologies used: HTML, SQL, Selenium, BeautifulSoup.
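To give a feel for what scraping involves, here is a minimal sketch that pulls product names and prices out of a small piece of HTML. It is not the code from the Sephora post: the markup, the class names `product`, `name`, and `price`, and the `ProductParser` class are all made up for illustration, and it uses Python's built-in `html.parser` instead of BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded product page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Lip Balm</span><span class="price">$12.00</span></li>
  <li class="product"><span class="name">Face Mask</span><span class="price">$8.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect the text inside <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # field the parser is currently inside, if any
        self.rows = []             # one dict per product

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})   # a new product starts here
        elif tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self.current_field and self.rows:
            self.rows[-1][self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


parser = ProductParser()
parser.feed(SAMPLE_HTML)
products = parser.rows
print(products)
```

On a real site the HTML would come from an HTTP request or a browser driven by Selenium, and the tags and classes would have to be read off the page itself.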
2. Data Cleaning
Again, if the dataset was given to you by your instructors, or you got it from one of the websites mentioned above, there’s a good chance that your data is already clean. In most cases, however, some cleaning will be required. You need to handle the missing values (and be smart about it), make sure that all the columns have the correct data types (date-time, integers, floats, strings, etc.), and check that no column names contain spaces (especially important if you’re using NLP to perform analysis and modeling). Check out my post Beginner’s guide to data cleaning for more information.
Technologies used: Pandas, NumPy
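The three cleaning steps mentioned above can be sketched in a few lines of Pandas. The table and its column names are invented for the example, and filling a missing price with the median is only one of several reasonable choices; the right strategy depends on your data.

```python
import pandas as pd

# Hypothetical raw table showing the problems described above:
# spaces in column names, numbers and dates stored as text, a missing value.
raw = pd.DataFrame({
    "Product Name": ["Lip Balm", "Face Mask", "Serum"],
    "List Price": ["12.00", None, "30.50"],
    "Release Date": ["2020-01-15", "2020-03-02", "2020-06-30"],
})

clean = raw.copy()

# 1. Column names: lowercase, with underscores instead of spaces.
clean.columns = [c.strip().lower().replace(" ", "_") for c in clean.columns]

# 2. Data types: turn text into real numbers and dates.
clean["list_price"] = pd.to_numeric(clean["list_price"])
clean["release_date"] = pd.to_datetime(clean["release_date"])

# 3. Missing values: fill the missing price with the median (an assumption).
clean["list_price"] = clean["list_price"].fillna(clean["list_price"].median())

print(clean.dtypes)
print(clean)
```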
3. EDA
EDA stands for Exploratory Data Analysis. At this stage of the process you need to get to know your data. What is the shape of the table? How many rows and columns are there? What are the data types (to make sure you cleaned properly)? How are the numeric values distributed? Is there some sort of correlation or multicollinearity? Is there class imbalance, if you want to perform classification? You need to answer all these questions and more before you get to the next stage. I would just write down all the questions I have and try to answer them one by one. This stage is also very important if you are about to present the results to a non-technical audience. While exploring your data in a meaningful way, you will create beautiful visualizations. And someone with no background in math and coding will respond better to an interactive 3D map than to you saying “My adjusted R² is 0.92!”.
Screenshot from one of my project presentations
Technologies used: Pandas, NumPy, Matplotlib, Seaborn, Plotly (GO and Express)
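Several of the EDA questions above can be answered with one line of Pandas each. The sketch below uses a tiny made-up table (the columns `price`, `rating`, and `bestseller` are assumptions for illustration) and leaves out the plotting step.

```python
import pandas as pd

# Hypothetical cleaned dataset; "bestseller" is the class to predict.
df = pd.DataFrame({
    "price": [12.0, 8.5, 30.5, 22.0, 15.0, 9.0],
    "rating": [4.5, 3.9, 4.8, 4.6, 4.1, 3.8],
    "bestseller": [1, 0, 0, 0, 1, 0],
})

# What is the shape of the table?
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# What are the data types?
print(df.dtypes)

# How are the numeric values distributed?
print(df[["price", "rating"]].describe())

# Is there correlation between the numeric columns?
corr = df[["price", "rating"]].corr()
print(corr)

# Is there class imbalance? Share of each class, as fractions of 1.
class_share = df["bestseller"].value_counts(normalize=True)
print(class_share)
```

With real data, each of these outputs is a natural starting point for a Matplotlib, Seaborn, or Plotly chart.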
4. Modeling
This is the most fun part (IMO). After all the preparation you get to create a machine learning/deep learning model that will make some sort of predictions. This can be a simple linear regression, multiple regression, classification, time series, NLP analysis, or a huge computer vision project with image recognition. Describing how each and every one of these works is beyond the scope of this post, but check out my earlier post about how to talk about regression with babies and I’m-really-bad-at-math people.
Technologies used: Scikit-Learn, SciPy, NumPy, Keras, TensorFlow, PyTorch, XGBoost, and many, many more (it really depends on what you’re trying to model).
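As a taste of the simplest case, here is a sketch of a simple linear regression fitted with NumPy's least-squares solver. The data is synthetic (generated from y = 3x + 2 plus a little hand-picked noise), and in a real project you would more likely reach for Scikit-Learn or one of the other libraries listed above.

```python
import numpy as np

# Synthetic data: y = 3x + 2 with small, fixed noise (an assumption for the demo).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
noise = np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])
y = 3.0 * x + 2.0 + noise

# Design matrix: a column of ones for the intercept, then the feature itself.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: find the coefficients that minimize squared error.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Predictions of the fitted line on the training points.
predictions = X @ coef

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

The fitted slope and intercept land close to the true values of 3 and 2, which is the whole point of the exercise: the model recovers the pattern hidden in noisy data.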
5. Model Interpretation and Applications
The results of your model are probably going to look something like this:
Screenshot of my project: binary classification with XGBoost
What the heck does this all mean? You can’t just go to the investors and the marketing department and say something like ‘my validation accuracy reached 93% after I handled the class imbalance’ or ‘the independent variables X explain the variance in the dependent variable y with an R-squared of 0.75’. You will immediately hear back: “English, please!”
The goal of the final stage of the data science process is to learn how to translate back from Math to English. It doesn’t matter how high or low your adjusted R² or validation accuracy is if you can’t explain what it means in real life.
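To make that translation concrete, here is a small sketch that computes R² and adjusted R² from scratch and then turns the number into a sentence a non-technical audience can follow. The sample values and the helper names (`r_squared`, `adjusted_r_squared`, `plain_english`) are invented for the example; the adjusted R² formula is the standard one, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions account for."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """R² penalized for the number of features used by the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def plain_english(r2):
    """Hypothetical helper: phrase the metric for a non-technical audience."""
    return f"The model explains about {r2:.0%} of the variation in the outcome."


# Made-up actual values and model predictions.
y_true = [10, 12, 14, 16, 18]
y_pred = [10.5, 11.5, 14, 16.5, 17.5]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=len(y_true), n_features=1)

print(f"R² = {r2:.3f}, adjusted R² = {adj_r2:.3f}")
print(plain_english(r2))
```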
The results of this whole data science process can be wrapped up in a presentation or they can be used to build a useful web application or some other sort of software. You will need basic knowledge of web development to make it happen, but if I built an app in four days, you certainly can too! Here’s a post about how I did it.
Technologies used: Your knowledge of math for data interpretation, Flask and Dash for creating a front-end.
This is a quick summary of what a data science process looks like. Of course, there’s more to it in real life, but if you’re just learning, it’s a nice structure to stick to. Enjoy your data!
Translated from: https://medium.com/the-innovation/data-science-process-summary-865abd16183d