The Maturity of Data Engineers
DATA SCIENCE AND MACHINE LEARNING
What does a data engineer do?
Let's start with three big terms that we need to understand before we can understand what a data engineer does.
Data mining, big data, and data pipelines.
Data mining means pre-processing data and extracting knowledge from it. Big data refers to datasets with very large numbers of records and variables. These datasets are so large that they need to run on multiple computers or on a cloud platform such as AWS [1], Azure [2], or Google Cloud [3], which provide many machines and plenty of storage.
Normally, big data is not stored on one machine, because the dataset has grown too large. Keeping the data in a database such as MySQL or Postgres becomes complicated when everything sits on a single machine. New technologies such as Hadoop [4] and NoSQL [5] databases were invented to solve this problem.
A data pipeline is essentially what a data engineer builds. In fact, you need data mining to extract information from this data, so data engineers build a pipeline that lets data flow from large, unexplored collections into a more useful form.
The pipeline collects information from many sources: IoT devices, mobile applications, web apps, cameras, cars, and anything else that collects data and stores or logs it on servers or in the cloud.
Data engineers gather all this information into well-organized databases and storage engines so that different parts of the company can create visualizations, monitor the performance of their product, get business insights, and make business decisions from the data. The company can even use the data in its apps, for example for user profiles.
Before a company looks for a data scientist, machine learning expert, business intelligence specialist, or data analyst, it needs to hire a data engineer to build the pipeline. Data engineers handle data collection and deliver the information in an organized form that is ready for data modeling. As a result, a machine learning engineer or data scientist usually doesn't have to be concerned about the data pipeline.
In practice, data engineers start with a process called data ingestion, which collects data from various sources and ingests it into what we call a data lake. A data lake is a collection of raw data. However, we don't want the lake to overflow or dry up. We need to perform data transformation, which converts data from one format to another and usually moves it into what we call a data warehouse.
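To make the ingestion step concrete, here is a minimal sketch in Python. It assumes the lake is just an in-memory list of JSON lines; in a real system it would be object storage or a similar service. The `ingest` function, the source names, and the record fields are all hypothetical and chosen only for illustration.

```python
import json
from datetime import datetime, timezone

def ingest(lake, source, payload):
    """Append one raw record to the lake without cleaning or filtering it."""
    record = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,  # stored as received, structured or not
    }
    lake.append(json.dumps(record))
    return record

lake = []  # hypothetical stand-in for real lake storage

ingest(lake, "iot_sensor", {"device_id": "t-17", "temp_c": 21.4})
ingest(lake, "web_app", {"user": "ana", "event": "login"})
ingest(lake, "mobile_app", "unparsed log line: battery=83%")

sources = [json.loads(line)["source"] for line in lake]
print(sources)
```

Note that nothing is rejected or reshaped on the way in. Deciding which fields matter is left to the later transformation step.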
A data warehouse is a place that stores accessible data that is useful to the business. Before placing data into a data warehouse, data engineers look through the raw data, select the parts that are useful, and load those parts so that other parts of the business can use them. A data lake, by contrast, is a pool of raw data, so it is usually less organized and less filtered than a data warehouse.
The question is, why would businesses want to do that?
It's a lot easier to analyze data when it's organized. The data lake may hold data that we don't need. We save storage space in the data warehouse because we store only the structured data that is useful rather than all of it. Building data warehouse infrastructure is expensive, so this kind of data management saves money.
To review, a data engineer uses data engineering practices to build a pipeline that takes data from production and capture through to storage, so that data scientists and data analysts can analyze it.
What kind of tools do data engineers use?
You may have heard of Apache Kafka [6], Hadoop [4], Amazon S3 [1], or Azure Data Lake [2]. Engineers built these tools to move and store large amounts of data, as in a data lake. There are also tools like Google BigQuery [3], Amazon Redshift [1], and Amazon Athena [1]. These are data warehouses and query services that allow engineers to query and analyze structured data.
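To give a feel for what querying structured data looks like, here is a small sketch. It uses Python's built-in `sqlite3` module as a local stand-in for a warehouse; it is not BigQuery, Redshift, or Athena, and the `orders` table and its columns are made up for this example.

```python
import sqlite3

# In-memory SQLite database standing in for a warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (country, amount) VALUES (?, ?)",
    [("DE", 40.0), ("DE", 60.0), ("FR", 25.0)],
)

# A typical analyst question: revenue per country, highest first.
revenue_by_country = conn.execute(
    "SELECT country, SUM(amount) AS revenue "
    "FROM orders GROUP BY country ORDER BY revenue DESC"
).fetchall()
print(revenue_by_country)
```

Because the data already has a fixed schema, the question can be answered with a single declarative query and no custom parsing.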
So far, we've seen that the data engineer creates this entire system for the business, using different tools and programs to ingest data and then put it into a data lake or a data warehouse. As a data scientist or machine learning expert, which data do you use? Most of the time, you would work with a data lake, because in machine learning, the more information you have, the better.
With machine learning, you can go into a data lake and grab structured or unstructured data for your models, whether in CSV or any other form. Data warehouses are usually used by business intelligence people, business analysts, or data analysts to build visualizations or analyze data, because the data warehouse usually holds structured data that has already been cleaned.
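As a rough illustration of pulling model inputs out of a lake that holds more than one format, the sketch below reads a CSV file and a JSON-lines file into one list of rows. The file contents are inlined as strings so the example runs on its own, and the `user_id` and `sessions` fields are hypothetical.

```python
import csv
import io
import json

# Inlined stand-ins for two files sitting in a data lake (assumption).
CSV_FILE = "user_id,sessions\n1,12\n2,3\n"
JSONL_FILE = '{"user_id": 3, "sessions": 7}\n{"user_id": 4, "sessions": 1}\n'

def read_csv(text):
    """Parse CSV text and convert the numeric columns from strings to ints."""
    return [
        {"user_id": int(row["user_id"]), "sessions": int(row["sessions"])}
        for row in csv.DictReader(io.StringIO(text))
    ]

def read_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

training_rows = read_csv(CSV_FILE) + read_jsonl(JSONL_FILE)
features = [[row["sessions"]] for row in training_rows]  # one feature per row
print(features)
```

The data scientist does the format handling here, which is the trade-off for having access to everything in the lake.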
As a data scientist, you can also use data from a data warehouse. There is no strict rule; you use whatever data is useful to you, and data scientists generally use as much valuable data as they can. In contrast, a business intelligence person or a data analyst works with data that a data engineer has already cleaned and processed, and uses something like a data warehouse to analyze it.
Something like Google BigQuery [3] does precisely that. It allows somebody without much engineering or programming experience to analyze data in a data warehouse. Typically, software engineers, software developers, app developers, and mobile developers build the programs and apps that users and customers use. A data engineer then builds the pipelines that ingest the resulting data and store it in services like Hadoop [4] or Google BigQuery [3], so that the rest of the business can access it.
We also have data scientists, who use the data lake and the data warehouse to extract information and deliver business value. Finally, we have data analysts and business intelligence people, who use the data warehouse or other structured data to derive business value.
Nowadays, the industry is evolving fast, and there's some overlap between roles. Job descriptions may differ from one company to another, but these are general, simplified rules that you can use to understand how each role fits into a company.
Conclusion
Data engineers have three main tasks. First, they build an extract, transform, load pipeline, also known as ETL. Unlike data ingestion, which just means moving data from one place to another, an ETL pipeline extracts the data generated by all of these systems, transforms it into a useful form, and loads it into a data warehouse so that the rest of the company can use it. Data engineers use programming languages like Python, Go, Scala, and Java to write these ETL jobs.
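The three ETL stages can be sketched in a few lines of Python. This is a toy version under stated assumptions: the raw data is an inlined CSV string, the warehouse is an in-memory SQLite database, and the `users` table, its columns, and the cleaning rules are invented for the example.

```python
import csv
import io
import sqlite3

# Raw export as it might arrive from an application (hypothetical data).
RAW_CSV = """user_id,signup_date,country,debug_blob
1,2021-03-01,de,xx
2,,fr,yy
3,2021-03-05,US,zz
"""

def extract(text):
    """Extract: read the raw rows exactly as they were produced."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows, fix types, keep only useful columns."""
    clean = []
    for row in rows:
        if not row["signup_date"]:
            continue  # a missing signup date makes the row unusable here
        clean.append((int(row["user_id"]), row["signup_date"], row["country"].upper()))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT user_id, signup_date, country FROM users ORDER BY user_id"
).fetchall()
print(loaded)
```

Keeping the stages as separate functions makes each one easy to test and replace, for example by swapping the SQLite connection for a real warehouse client.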
Next, data engineers build analysis tools to understand how company systems work. A data engineer needs to make sure that when any part of the system breaks, someone is notified. These tools let data scientists, data analysts, and business intelligence people analyze the data, and they confirm that the system in place is running correctly.
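One simple way to get notified when a part of the pipeline breaks is to wrap each step and report the first failure. The sketch below is an assumption-laden illustration: `run_with_alerts`, the step names, and the use of a plain list as the notification channel are all hypothetical, and a real setup would send the message to email, chat, or a paging service.

```python
def run_with_alerts(steps, notify):
    """Run each (name, function) step in order; notify and stop on the first failure."""
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            notify(f"pipeline step '{name}' failed: {exc}")
            return False
    return True

alerts = []  # stand-in for a real notification channel

def extract_ok():
    pass  # pretend the extract step succeeded

def load_broken():
    raise ConnectionError("warehouse unreachable")

ok = run_with_alerts([("extract", extract_ok), ("load", load_broken)], alerts.append)
print(ok, alerts)
```

Stopping at the first failure avoids loading partial results, which is usually the safer default for a batch job.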
Finally, their third main task is to maintain the data warehouse and data lakes, making sure that everything in them is accessible for other parts of the company to use.
Now you have a high-level overview of what a data engineer does. However, this landscape is changing fast because new tools are always popping up. So my advice is not to treat these tools as absolute must-knows for all data engineers; instead, be aware that they exist. Read some of their documentation, and only learn or use them once the need arises, because they're regularly updated and the world of data engineering is fast-paced right now. It also looks as though the role of the data engineer may increasingly be taken on by data scientists.
About the Author
Wie Kiang is a researcher who is responsible for collecting, organizing, and analyzing opinions and data to solve problems, explore issues, and predict trends.
He is working in almost every sector of Machine Learning and Deep Learning. He is carrying out experiments and investigations in a range of areas, including Convolutional Neural Networks, Natural Language Processing, and Recurrent Neural Networks.
[1] Amazon Web Services
[2] Microsoft Azure
[3] Google Cloud
[4] Apache Hadoop
[5] NoSQL
[6] Apache Kafka
Translated from: https://towardsdatascience.com/the-maturity-of-data-engineers-2c9e2bfcee09