Data: The Starting Point of a Data Science Journey
What is Data Science?
The interesting thing about data science is that it is a young field, and its definition can differ from textbooks to newspapers to whitepapers. The general definition is that data science is a mixture of tools, algorithms, and machine learning principles used to discover hidden patterns in data. How is this different from statistics, which has existed and been used for years? The answer lies in the difference between explanation and prediction.
The data science process
Data science is composed of seven main steps. Each one of them is important for the accuracy of the model. Let’s see what is contained in each step.
Business understanding
If we want to create a data science project, we need to understand the problem that we are trying to solve. So, in this step we have to get answers to the following questions:
- How many?
- Which category?
- Which group?
- Is this strange?
- Which option should be considered?
Based on the answers to these questions, we can conclude which variable or variables should be predicted.
Data Mining
The next step is finding the right data. Data mining is a process of finding and collecting data from different sources. We need to answer the following questions:
- Which data is needed for the project?
- Where can I find that data?
- How can I obtain the data?
- Which is the most effective way of storing and accessing the data?
If all the data is in one place, this process will be easy for us. Usually, this is not the case.
Data Cleaning
This is often the most complicated step, and it is commonly estimated to take 50 to 80 percent of the project time. After the data is collected, we must clean it. The data might contain missing values, or the values in a column might be inconsistent. That is why we need to clean and organize our data.
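As a rough illustration of the two problems named above (missing values and inconsistent values in one column), here is a minimal Python sketch. The records, the column names `city` and `score`, and the choice of filling gaps with the median are all assumptions made for this example, not something prescribed by the process.

```python
from statistics import median

# Made-up raw records: one missing score and inconsistent city spellings.
raw_rows = [
    {"city": "Skopje ", "score": 7.0},
    {"city": "skopje", "score": None},
    {"city": "BITOLA", "score": 9.0},
    {"city": "Bitola", "score": 5.0},
]


def clean_rows(rows):
    """Normalize the 'city' labels and fill missing scores with the median."""
    known_scores = [r["score"] for r in rows if r["score"] is not None]
    fill_value = median(known_scores)
    cleaned = []
    for r in rows:
        cleaned.append({
            # Strip stray whitespace and unify capitalization.
            "city": r["city"].strip().title(),
            # Replace a missing value with the median of the known ones.
            "score": fill_value if r["score"] is None else r["score"],
        })
    return cleaned


cleaned_rows = clean_rows(raw_rows)
for row in cleaned_rows:
    print(row)
```

Filling with the median is only one option; dropping incomplete rows or using a domain-specific default may suit a real project better.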
Data Exploration
After the data is cleaned, we try to find hidden patterns in it. This step includes extracting a subset, then analyzing and visualizing that subset. After this, we get a more complete picture of what lies behind the data points.
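The exploration step described above (extract a subset, analyze it, visualize it) could look roughly like this in Python. The student records, the field names, and the 7-hour cut-off are invented for illustration, and the text bar chart is only a stand-in for a real plotting library.

```python
from collections import defaultdict
from statistics import mean

# Made-up cleaned records: hours of sleep and exam score per student.
students = [
    {"group": "A", "sleep_hours": 8.0, "score": 82},
    {"group": "A", "sleep_hours": 6.0, "score": 70},
    {"group": "B", "sleep_hours": 5.0, "score": 61},
    {"group": "B", "sleep_hours": 7.0, "score": 77},
    {"group": "B", "sleep_hours": 4.0, "score": 55},
]


def mean_score_by_group(rows):
    """Analyze: average score for each group."""
    scores = defaultdict(list)
    for r in rows:
        scores[r["group"]].append(r["score"])
    return {group: mean(values) for group, values in scores.items()}


# Extract a subset: students who slept fewer than 7 hours.
short_sleepers = [r for r in students if r["sleep_hours"] < 7.0]

summary = mean_score_by_group(students)

# Visualize: a crude text bar chart, one '#' per 5 points.
for group, avg in sorted(summary.items()):
    print(f"{group}: {'#' * round(avg / 5)} {avg:.1f}")
```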
Feature Engineering
In machine learning, a feature is an attribute of an observed phenomenon. For example, if we are observing a student's results, a possible feature might be the amount of sleep that the student gets. This step is divided into two sub-steps. The first is feature selection. Here we can remove some features to reduce the dimensionality, which otherwise adds complexity to the model. The features that we remove usually bring more noise than useful information. The second sub-step is feature construction: we build a new feature from the ones that we already have.
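The two sub-steps can be sketched in a few lines of Python. This is a simplified example under stated assumptions: the records and feature names are made up, dropping near-constant columns is just one simple selection heuristic, and the constructed ratio `sleep_per_study_hour` is a hypothetical feature chosen to match the student example.

```python
from statistics import pvariance

# Made-up student records; "school_id" is constant, so it carries no information here.
records = [
    {"sleep_hours": 8.0, "study_hours": 2.0, "school_id": 1.0},
    {"sleep_hours": 6.0, "study_hours": 4.0, "school_id": 1.0},
    {"sleep_hours": 5.0, "study_hours": 5.0, "school_id": 1.0},
]


def select_features(rows, min_variance=1e-9):
    """Feature selection: keep only columns whose variance exceeds min_variance."""
    names = list(rows[0].keys())
    return [n for n in names if pvariance([r[n] for r in rows]) > min_variance]


def add_sleep_ratio(rows):
    """Feature construction: derive a new feature from two existing ones."""
    return [
        {**r, "sleep_per_study_hour": r["sleep_hours"] / r["study_hours"]}
        for r in rows
    ]


kept = select_features(records)
engineered = add_sleep_ratio(records)
print(kept)
print(engineered[0])
```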
Predictive modeling
This is the step where we finally build the model. Here we decide which model to use, based on the answers that we obtained in the first step. This is not an easy decision, and there is not always a single answer. The model and its accuracy depend on the data: how big it is, what type it is, and what quality it has. After the model is trained, we must evaluate its accuracy and determine whether the model is successful.
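To make "train, then evaluate" concrete, here is a small sketch that fits a straight line by ordinary least squares and measures its error on held-out points. The data is made up, a linear model is just one possible choice, and mean absolute error is an assumed evaluation metric; a real project would pick both based on the business question.

```python
from statistics import mean

# Made-up (sleep_hours, exam_score) pairs, split into training and test sets.
train = [(4.0, 55.0), (5.0, 60.0), (6.0, 65.0), (7.0, 70.0)]
test = [(8.0, 76.0), (5.5, 62.0)]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_absolute_error(points, slope, intercept):
    """Average absolute gap between the predicted and the actual values."""
    return mean(abs(y - (slope * x + intercept)) for x, y in points)


slope, intercept = fit_line(train)
test_error = mean_absolute_error(test, slope, intercept)
print(f"score = {slope:.2f} * sleep_hours + {intercept:.2f}")
print(f"mean absolute error on the test set: {test_error:.2f}")
```

Evaluating on points that were not used for fitting is what lets us judge whether the model is successful rather than merely memorizing the training data.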
Data Visualization
After we have obtained the information from the model, we need to visualize it in different ways so that it can be understood by everyone involved in the project.
Business understanding
Once everything is done, we return to the first step and check whether the model meets the initial requirements. If we came across new insights during the first iteration of the life cycle (and I am sure that we will), we can now feed that knowledge into the next iteration to generate even more powerful insights and get better results for the project.
What is data?
We can see that almost every step needs data: most of the seven steps in the previous part are data related. So, we can assume that data plays a crucial role in a data science project. What is data? How is data defined? This might seem like an unimportant definition to look at, but it matters. Whenever we use the word “data,” we refer to a collection of information in either an organized or unorganized format.
Basic types of data
There are two types of formats based on the definition in the previous part:
- Structured (organized) data: Data that is sorted into a row/column structure, where every row represents a single observation and the columns represent the characteristics of that observation.
- Unstructured (unorganized) data: Data that is in a free form, usually text or raw audio/signals that must be parsed further to become organized.
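The phrase "must be parsed further to become organized" can be illustrated with a short Python sketch that turns free-form text into rows and columns. The note format, the regular expression, and the column names are hypothetical; real unstructured data is rarely this regular.

```python
import re

# Made-up free-form notes (unstructured text).
notes = [
    "Order 1001: 3 espresso for Ana",
    "Order 1002: 1 latte for Marko",
    "customer asked about opening hours",
]

# Assumed note format: "Order <id>: <quantity> <item> for <customer>".
ORDER_PATTERN = re.compile(r"Order (\d+): (\d+) (\w+) for (\w+)")


def parse_orders(lines):
    """Turn matching lines into structured rows; skip lines that do not match."""
    rows = []
    for line in lines:
        match = ORDER_PATTERN.match(line)
        if match:
            order_id, quantity, item, customer = match.groups()
            rows.append({
                "order_id": int(order_id),
                "quantity": int(quantity),
                "item": item,
                "customer": customer,
            })
    return rows


orders = parse_orders(notes)
for row in orders:
    print(row)
```

Each resulting dictionary is one observation (a row), and its keys are the characteristics (the columns).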
When we talk about data, the first thing that we need to answer is whether the data is quantitative or qualitative. When we talk about quantitative data, we usually think about a structured dataset. These two data types can be defined as follows:
- Quantitative data: Data that can be described using numbers, and on which basic mathematical operations, including addition, are possible.
- Qualitative data: Data that cannot be described using numbers and basic mathematics. It is generally described using natural categories and language.
Quantitative data
Quantitative data can be:
- Discrete data: This describes data that is counted. It can only take on certain values. Examples of discrete quantitative data include a die roll, because it can only take on six values, and the number of customers in a coffee shop, because you cannot have a fractional number of people.
- Continuous data: This describes data that is measured. It can take any value within a range, so the set of possible values is infinite. A person's height is an example.
The four levels of data
It is generally understood that a specific characteristic (feature/column) of structured data can be broken into four levels of data. These levels are the following:
- The nominal level
- The ordinal level
- The interval level
- The ratio level
Let’s go deeper into each level and explain each one of them.
The nominal level
This level contains data that is described by name or category, for example gender, name, or species. The data cannot be described using numbers, so it is qualitative data, and because of this we cannot perform mathematical operations such as addition or division on it. The operations that we can perform at this level are equality and set membership. The measure of center is also limited. A measure of center is a number that shows what the data tends toward, and it is sometimes called the balance point of the data. Usually we use the mode, the median, or the mean for this. At the nominal level we can neither order the values nor do arithmetic on them, so the median and the mean do not make sense; only the mode, the most frequent category, can still be found. In conclusion, this level is composed of categorical data, and we must be careful with it, since it might contain very useful insights for us.
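A short Python sketch of what is possible with nominal data: equality, set membership, and counting categories. The mode (the most frequent category) is still well defined for categorical values, because it needs only counting and no arithmetic on the values themselves. The species list is made up for illustration.

```python
from collections import Counter

# Made-up nominal data: categories with no order and no arithmetic.
species = ["cat", "dog", "cat", "bird", "dog", "cat"]

# Equality: are two observations in the same category?
same_category = species[0] == species[2]

# Set membership: does a category occur in the data at all?
has_fish = "fish" in set(species)

# Counting categories gives the mode, the most frequent category.
counts = Counter(species)
mode_value, mode_count = counts.most_common(1)[0]

print(same_category, has_fish)
print(f"mode: {mode_value} ({mode_count} occurrences)")
```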
The ordinal level
The nominal level is not very flexible when it comes to mathematical operations. Data at the ordinal level provides a rank order, but we still cannot use operations such as addition or subtraction and get a meaningful result. For example, grades from 1 to 10 are ordinal data: adding them does not give us any useful information. Another example is a survey result. At this level, we have more freedom with mathematical operations than at the nominal level. The operations from the nominal level (equality and set membership) are inherited, and the additional operations that are allowed are ordering and comparison. At the ordinal level, the median is usually an appropriate way of defining the center of the data, but we can use the mode as well. The mean, however, is not meaningful, because addition and division are not allowed at this level.
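Ordering, comparison, and the median at the ordinal level can be sketched as follows. The four-step survey scale and the responses are hypothetical. The example uses `statistics.median_low`, which always returns an actual data point, so no averaging of ranks is needed.

```python
from statistics import median_low, mode

# Hypothetical survey scale, listed from lowest to highest.
SCALE = ["poor", "fair", "good", "excellent"]
RANK = {label: position for position, label in enumerate(SCALE)}

# Made-up survey responses (ordinal data).
responses = ["good", "poor", "excellent", "good", "fair"]

# Ordering: sort the responses by their rank on the scale.
ordered = sorted(responses, key=RANK.__getitem__)

# Comparison: is "fair" lower than "good"?
fair_below_good = RANK["fair"] < RANK["good"]

# Median: the middle response once ordered; median_low avoids averaging ranks.
median_response = SCALE[median_low(RANK[r] for r in responses)]

# The mode is inherited from the nominal level.
most_common_response = mode(responses)

print(ordered)
print(median_response, most_common_response, fair_below_good)
```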
The interval level
Now we are at a level where the data can be described by the mean and we can use more complicated mathematical formulas. Data at the interval level supports subtraction between data points. For example, temperature data belongs to the interval level. The operations from the lower levels (ordering, comparison, and so on) are inherited, and the additional operations that are allowed are addition and subtraction. When we talk about the measure of center, we can use the median, the mode, or the mean, and usually the most accurate description of the center is the arithmetic mean. Let’s look at an example. We are trying to find the measure of center using data that contains temperatures of a fridge in which vaccines are stored. The temperature must stay under 29 degrees. After finding the mean and the median, we see that both of them are close to 31, which is not acceptable. This is the point where we need another measure: the variance or the standard deviation. We can use this measure if we want to see how our data is spread out. To find the standard deviation, we calculate the mean, subtract the mean from each point, square each difference, find the average of the squared differences, and take the square root. Here is the formula:
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
where $x_i$ are the data points, $\bar{x}$ is their mean, and $n$ is the number of points.
If we apply this formula to the temperature example, we can calculate the standard deviation of our dataset, and based on this measure we can see how far below the mean the temperature might go (the mean minus the standard deviation).
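The standard deviation steps described above can be followed literally in Python. The readings below are made up so that the mean and median come out near 31, as in the fridge example; they are not real measurements. The sketch divides by the number of points (the population form, matching "the average of the squared differences"); the sample form would divide by one less.

```python
from math import sqrt
from statistics import mean, median, pstdev

# Made-up fridge temperature readings, in degrees.
temps = [31.0, 32.0, 29.0, 33.0, 30.0, 31.0, 31.0]


def population_std(values):
    """Mean, differences from the mean, squares, their average, square root."""
    center = mean(values)
    squared_differences = [(v - center) ** 2 for v in values]
    return sqrt(mean(squared_differences))


center = mean(temps)
spread = population_std(temps)

print(f"mean = {center:.2f}, median = {median(temps):.2f}")
print(f"standard deviation = {spread:.2f}")
print(f"mean minus standard deviation = {center - spread:.2f}")

# Cross-check against the standard library's population standard deviation.
print(f"statistics.pstdev = {pstdev(temps):.2f}")
```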
The ratio level
The last level is called the ratio level. There are not many differences between the ratio and the interval level, and sometimes it is confusing which one applies. At the interval level we do not have a natural starting point or a natural zero, but at the ratio level we do. The mathematical operations from the lower levels are inherited, and the additional ones are multiplication and division. For example, money in a bank account belongs to this level, because an account balance has a natural zero. As a measure of center, we can use the geometric mean: the n-th root of the product of all n values. Data at this level is usually expected to be non-negative, which is why this level is not always the preferred one.
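For n values, the geometric mean is the n-th root of their product. The sketch below computes it directly and compares the result with `statistics.geometric_mean` from the Python standard library (available from Python 3.8). The account balances are made up for illustration.

```python
from math import prod
from statistics import geometric_mean

# Made-up account balances: ratio-level data with a natural zero.
balances = [100.0, 400.0, 200.0]

# Division is meaningful here: one balance is four times another.
times_larger = balances[1] / balances[0]


def nth_root_of_product(values):
    """Geometric mean computed directly from its definition."""
    return prod(values) ** (1.0 / len(values))


direct = nth_root_of_product(balances)
library = geometric_mean(balances)

print(f"ratio of balances: {times_larger:.1f}")
print(f"geometric mean (direct): {direct:.4f}")
print(f"geometric mean (statistics): {library:.4f}")
```

Both forms require positive values, which is consistent with the non-negativity expectation mentioned above.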
Conclusion
Data science can add value to any business; the important thing is to use the data well. Data science can also help us make better decisions based on measurable evidence. Data should always be available to us when making decisions. Using data science methodologies, we can research historical data, make comparisons with the competition, analyze the market, and most importantly, make recommendations on how the product or service would perform best. These analyses, which are part of data science, provide deep knowledge and understanding of the market as well as its feedback on the product or service. It is estimated that about 2.5 billion gigabytes of data are generated daily. With this increase in the amount of data, getting what is important for the target group can be difficult. Every piece of data that a company collects from customers, whether it is social media likes, website visits, or email surveys, can be analyzed to understand customers more effectively. This means that services and products can be customized for certain groups. For example, finding correlations between age and income can help a company create new promotions or offers for groups that it may not have targeted before.
If you are interested in this topic, do not hesitate to contact me.
LinkedIn profile: https://www.linkedin.com/in/ceftimoska/
Translated from: https://towardsdatascience.com/data-the-starting-point-of-a-data-science-journey-f7880f9f0eb7