什么是数据仓库,何时以及为什么要考虑一个
The term “Data Warehouse” is widely used in the data analytics world, however, it’s quite common for people who are new with data analytics to ask the above question.
術(shù)語“數(shù)據(jù)倉庫”在數(shù)據(jù)分析領(lǐng)域中被廣泛使用,但是,對于數(shù)據(jù)分析新手來說,問上述問題非常普遍。
This post attempts to help explain the definition of a data warehouse, when, and why to consider setting up one.
這篇文章試圖幫助解釋數(shù)據(jù)倉庫的定義,何時以及為什么考慮建立一個數(shù)據(jù)倉庫。
Ps: This is a section of a guidebook our team is writing, The Analytics Set-up Guidebook. If you are interested to learn more about the high-level or best practices of modern BI stacks, feel free to check out the link to see our progress.
附:這是我們團隊正在編寫的指南的一部分, 即Google Analytics(分析)設(shè)置指南 。 如果您想了解有關(guān)現(xiàn)代BI堆棧的高級或最佳實踐的更多信息,請隨時查看鏈接以了解我們的進展。
什么是數(shù)據(jù)倉庫? (What is a data warehouse?)
A data warehouse is a type of analytics database that stores and processes your data for the purpose of analytics. Your data warehouse will handle two main functions of your analytics: store your analytical data & process your analytical data.
數(shù)據(jù)倉庫是一種分析數(shù)據(jù)庫,用于存儲和處理數(shù)據(jù)以進行分析。 數(shù)據(jù)倉庫將處理分析的兩個主要功能: 存儲分析數(shù)據(jù)和處理分析數(shù)據(jù) 。
Why do you need one? You will need a data warehouse for two main purposes:
為什么需要一個? 您將需要一個數(shù)據(jù)倉庫來實現(xiàn)兩個主要目的:
Your data warehouse is the centerpiece of every step of your analytics pipeline process, and it serves three main purposes:
數(shù)據(jù)倉庫是分析管道流程中每個步驟的核心,它具有三個主要目的:
Storage: In the consolidate (Extract & Load) step, your data warehouse will receive and store data coming from multiple sources.
存儲:在合并(提取和加載)步驟中,您的數(shù)據(jù)倉庫將接收和存儲來自多個來源的數(shù)據(jù) 。
Process: In the process (Transform & Model) step, your data warehouse will handle most (if not all) of the intensive processing generated from the transform step.
流程:在流程(“轉(zhuǎn)換和模型”)步驟中,您的數(shù)據(jù)倉庫將處理從轉(zhuǎn)換步驟生成的大部分(如果不是全部) 密集處理 。
Access: In the reporting (Visualize & Delivery) step, reports are being gathered within the data-warehouse first, then visualized and delivered to end-users.
訪問:在報告(可視化和交付)步驟中,首先在數(shù)據(jù)倉庫中收集報告,然后將其可視化并交付給最終用戶。
At the moment, most data warehouses use SQL as their primary querying language.
目前,大多數(shù)數(shù)據(jù)倉庫都使用SQL作為其主要查詢語言。
什么時候是合適的時間來獲取數(shù)據(jù)倉庫? (When is the right time to get a data warehouse?)
The TL;DR answer is that it depends. It depends on the stage of your company, the amount of data you have, your budget, and so on.
TL; DR的答案取決于它。 這取決于公司的階段,擁有的數(shù)據(jù)量,預算等。
At an early stage, you can probably get by without a data warehouse, and connect a business intelligence (BI) tool directly to your production database (A simple BI setup for people just starting out).
在早期階段,您可能不需要數(shù)據(jù)倉庫就可以解決問題,并將商業(yè)智能(BI)工具直接連接到生產(chǎn)數(shù)據(jù)庫( 針對剛開始的人們的簡單BI設(shè)置 )。
However, if you are still not sure if a data warehouse is the right thing for your company, consider the below pointers:
但是,如果仍然不確定數(shù)據(jù)倉庫是否適合您的公司,請考慮以下幾點:
First, do you need to analyze data from different sources?
首先,您需要分析來自不同來源的數(shù)據(jù)嗎?
At some point in your company’s life, you would need to combine data from different internal tools in order to make better, more informed business decisions.
在公司生命中的某個時刻,您需要合并來自不同內(nèi)部工具的數(shù)據(jù),以便做出更好,更明智的業(yè)務決策。
For instance, if you’re a restaurant and want to analyze orders/waitress ratio efficiency (which hour of the week the staff is most busy vs most free), you need to combine your sales data (from POS system) with your staff duty data (from HR system).
例如,如果您是一家餐館,并且想分析訂單/女服務員的工作效率(工作人員在一周中的哪一小時最忙與最空閑),則需要將銷售數(shù)據(jù)(來自POS系統(tǒng))與工作人員的職責結(jié)合起來數(shù)據(jù)(來自HR系統(tǒng))。
For those analyses, it is a lot easier to do if your data is located in one central location.
對于這些分析,如果您的數(shù)據(jù)位于一個中央位置,則操作會容易得多。
Second, do you need to separate your analytical data from your transactional data?
其次,您是否需要將分析數(shù)據(jù)與交易數(shù)據(jù)分開?
As mentioned, your transactional systems are not designed for analytical purposes. So if you collect activity logs or other potentially useful pieces of information in your app, it’s probably not a good idea to store this data in your app’s database and have your analysts work on the production database directly.
如前所述,您的交易系統(tǒng)并非為分析目的而設(shè)計。 因此,如果您在應用程序中收集活動日志或其他可能有用的信息,將這些數(shù)據(jù)存儲在應用程序的數(shù)據(jù)庫中并讓分析師直接在生產(chǎn)數(shù)據(jù)庫上工作可能不是一個好主意。
Instead, it’s a much better idea to purchase a data warehouse — one that’s designed for complex querying — and transfer the analytical data there instead. That way, the performance of your app isn’t affected by your analytics work.
取而代之的是,購買一個數(shù)據(jù)倉庫(一個專門用于復雜查詢的數(shù)據(jù)倉庫)并在那里傳輸分析數(shù)據(jù)是一個更好的主意。 這樣,您的應用程序的性能就不會受到分析工作的影響。
Third, is your original data source not suitable for querying?
第三,您的原始數(shù)據(jù)源是否不適合查詢?
For example, the vast majority of BI tools do not work well with NoSQL data stores like MongoDB. This means that applications that use MongoDB on the backend need their analytical data to be transferred to a data warehouse, in order for data analysts to work effectively with it.
例如,絕大多數(shù)BI工具不能與MongoDB等NoSQL數(shù)據(jù)存儲一起很好地工作。 這意味著在后端使用MongoDB的應用程序需要將其分析數(shù)據(jù)傳輸?shù)綌?shù)據(jù)倉庫,以便數(shù)據(jù)分析人員有效地使用它。
Fourth, do you want to increase the performance of your analytical queries?
第四,您是否想提高分析查詢的性能?
If your transactional data consists of hundreds of thousands of rows, it’s probably a good idea to create summary tables that aggregate that data into a more queryable form. Not doing so will cause queries to be incredibly slow — not to mention having them being an unnecessary burden on your database.
如果您的事務數(shù)據(jù)包含成千上萬的行,那么創(chuàng)建匯總表以將該數(shù)據(jù)聚合為更可查詢的形式可能是一個好主意。 如果不這樣做,將導致查詢異常緩慢-更不用說使查詢成為數(shù)據(jù)庫的不必要負擔。
If you answered yes to any of the above questions, then chances are good that you should just get a data warehouse.
如果您對以上任何一個問題的回答為“是”,那么很可能就應該獲得一個數(shù)據(jù)倉庫。
That said, in our opinion, it’s usually a good idea to just go get a data warehouse, as data warehouses are not expensive in the cloud era.
也就是說,在我們看來,最好是先獲得一個數(shù)據(jù)倉庫,因為在云時代,數(shù)據(jù)倉庫并不昂貴。
我應該選擇哪個數(shù)據(jù)倉庫? (Which Data Warehouse Should I Pick?)
Here are some common data warehouses that you may pick from:
以下是一些常見的數(shù)據(jù)倉庫,您可以從中選擇:
- Amazon Redshift 亞馬遜Redshift
- Google BigQuery Google BigQuery
- Snowflake 雪花
- ClickHouse (self-hosted) ClickHouse(自托管)
- Presto (self-hosted) Presto(自托管)
If you’re just getting started and don’t have a strong preference, we suggest that you go with Google BigQuery for the following reasons:
如果您只是入門而又沒有強烈的偏好,建議您使用Google BigQuery,原因如下:
BigQuery is free for the first 10GB storage and first 1TB of queries. After that it’s pay-per-usage.
BigQuery對前10GB的存儲空間和前1TB的查詢免費 。 之后是按使用量付費。
BigQuery is fully managed (serverless): There is no physical (or virtual) server to spin up or manage.
BigQuery是完全托管(無服務器)的 :沒有物理(或虛擬)服務器可以啟動或管理。
As a result of its architecture, BigQuery auto-scales: BigQuery will automatically determine the right amount of computing resources to allocate to each query, depending on the query’s complexity and the amount of data you scan, without you having to manually fine-tune it.
作為其架構(gòu)的結(jié)果, BigQuery會自動擴展: BigQuery將自動確定要分配給每個查詢的正確計算資源量,具體取決于查詢的復雜性和您掃描的數(shù)據(jù)量,而無需手動進行微調(diào)。
(Note: we don’t have any affiliation with Google, and we don’t get paid to promote BigQuery).
(請注意:我們與Google沒有任何隸屬關(guān)系,并且沒有獲得宣傳BigQuery的報酬)。
However, if you have a rapidly increasing volume of data, or if you have complex/special use cases, you will need to carefully evaluate your options.
但是,如果您的數(shù)據(jù)量Swift增加,或者您有復雜/特殊的用例,則需要仔細評估您的選擇。
Below, we present a table of the most popular data warehouses. Our intention here is to give you a high-level understanding of the most common choices in the data warehouse space. This is by no means comprehensive, nor is it sufficient to help you make an informed decision.
下面,我們列出了最受歡迎的數(shù)據(jù)倉庫。 我們的目的是讓您對數(shù)據(jù)倉庫空間中最常見的選擇有一個高層次的了解。 這絕不是全面的,也不足以幫助您做出明智的決定。
But it is, we think, a good start:
但是,我們認為這是一個不錯的開始:
是什么使數(shù)據(jù)倉庫與普通SQL數(shù)據(jù)庫不同? (What makes a data warehouse different from normal SQL database?)
At this point some of you might be asking:
此時,有些人可能會問:
“Hey isn’t a data warehouse just like a relational database that stores data for analytics? Can’t I just use something like MySQL, PostgreSQL, MSSQL or Oracle as my data warehouse?”
“嘿,數(shù)據(jù)倉庫不像存儲數(shù)據(jù)以供分析的關(guān)系數(shù)據(jù)庫嗎? 我不能只使用MySQL,PostgreSQL,MSSQL或Oracle之類的數(shù)據(jù)倉庫嗎?”
The short answer is: yes you can.
簡短的答案是:是的,您可以。
The long answer is: it depends. First, we need to understand a few concepts.
長的答案是:這取決于。 首先,我們需要了解一些概念。
事務性工作量與分析性工作量 (Transactional Workloads vs Analytical Workloads)
It is important to understand the difference between two kinds of database workloads: transactional workloads and analytical workloads.
重要的是要了解兩種數(shù)據(jù)庫工作負載之間的區(qū)別:事務性工作負載和分析性工作負載。
A transactional workload is the querying workload that serves normal business applications. When a visitor loads a product page in a web app, a query is sent to the database to fetch this product, and return the result to the application for processing.
事務性工作負載是為正常業(yè)務應用程序提供服務的查詢工作負載。 當訪問者在Web應用程序中加載產(chǎn)品頁面時,查詢將發(fā)送到數(shù)據(jù)庫以獲取此產(chǎn)品,并將結(jié)果返回給應用程序進行處理。
SELECT * FROM products WHERE id = 123(the query above retrieves information for a single product with ID 123)
(上面的查詢檢索ID為123的單個產(chǎn)品的信息)
Here are several common attributes of transactional workloads:
以下是事務性工作負載的幾個常見屬性:
Each query usually retrieves a single record or a small number of records (e.g. get the first 10 blog posts in a category)
每個查詢通常檢索單個記錄或少量記錄(例如,獲取類別中的前10個博客帖子)
Transactional workloads typically involve simple queries that take a very short time to run (less than 1 second)
事務性工作負載通常涉及需要很短時間 (不到1秒) 運行的簡單查詢
Lots of concurrent queries at any point in time, limited by the number of concurrent visitors of the application. For big websites, this can go to the thousands or hundreds of thousands.
在任何時間點都有很多并發(fā)查詢 ,受應用程序并發(fā)訪問者數(shù)量的限制。 對于大型網(wǎng)站,這可以達到成千上萬。
Usually interested in the whole data record (e.g. every column in the product table).
通常對整個數(shù)據(jù)記錄感興趣(例如,產(chǎn)品表中的每一列)。
Analytical workloads, on the other hand, refer to workload for analytical purposes, the kind of workload that this book talks about. When a data report is run, a query will be sent to DB to calculate the results, and then displayed to end-users.
分析工作負載 ,而另一方面,是指工作量分析的目的,那種工作量是這本書的會談。 運行數(shù)據(jù)報告時,查詢將發(fā)送到數(shù)據(jù)庫以計算結(jié)果,然后顯示給最終用戶。
SELECT category_name, count(*) as num_products FROM products GROUP BY 1(The above query scans the entire products table to count how many products are there in each category)
(上面的查詢會掃描整個產(chǎn)品表,以計算每個類別中有多少個產(chǎn)品)
Analytical workloads, on the other hand, have the following attributes:
另一方面,分析工作負載具有以下屬性:
Each query typically scans a large number of rows in the table.
每個查詢通常會掃描表中的大量行 。
Each query is heavy and takes a long time (minutes, or even hours) to finish
每個查詢都很繁瑣,需要很長時間 (幾分鐘甚至幾小時)才能完成
Not a lot of concurrent queries happen, limited by the number of reports or internal staff members using the analytics system.
并發(fā)查詢不會很多 ,受使用分析系統(tǒng)的報告或內(nèi)部員工數(shù)量的限制。
Usually interested in just a few columns of data.
通常只對幾列數(shù)據(jù)感興趣。
Below is a comparison table between transactional vs analytical workload/databases.
下表是事務性與分析性工作負載/數(shù)據(jù)庫之間的比較表。
Transactional workloads have many simple queries, whereas analytical workloads have few heavy queries.
事務性工作負載具有許多簡單查詢,而分析工作負載則具有很少的繁重查詢。
Analytics(分析)資料庫的后端不同 (The Backend for Analytics Databases is Different)
Because of the drastic difference between the two workloads above, the underlying backend design of the database for the two workloads are very different. Transactional databases are optimized for fast, short queries with high concurrent volume, while analytical databases are optimized for long-running, resource-intensive queries.
由于上述兩個工作負載之間的巨大差異,因此兩個工作負載的數(shù)據(jù)庫基礎(chǔ)后端設(shè)計非常不同。 事務數(shù)據(jù)庫針對并發(fā)量大的快速,短查詢進行了優(yōu)化,而分析數(shù)據(jù)庫針對長時間運行的資源密集型查詢進行了優(yōu)化。
What are the differences in architecture you ask? This will take a dedicated section to explain, but the gist of it is that analytical databases use the following techniques to guarantee superior performance:
您要求的架構(gòu)有什么區(qū)別? 這將用專門的部分進行解釋,但要點是分析數(shù)據(jù)庫使用以下技術(shù)來保證卓越的性能:
Columnar storage engine: Instead of storing data row by row on disk, analytical databases group columns of data together and store them.
列式存儲引擎:分析數(shù)據(jù)庫不是將數(shù)據(jù)逐行存儲在磁盤上,而是將數(shù)據(jù)列分組在一起并進行存儲。
Compression of columnar data: Data within each column is compressed for smaller storage and faster retrieval.
壓縮列數(shù)據(jù):每列中的數(shù)據(jù)都經(jīng)過壓縮,以減少存儲量并加快檢索速度。
Parallelization of query executions: Modern analytical databases are typically run on top of thousands of machines. Each analytical query can thus be split into multiple smaller queries to be executed in parallel amongst those machines (divide and conquer strategy)
查詢執(zhí)行的并行化:現(xiàn)代分析數(shù)據(jù)庫通常在數(shù)千臺計算機上運行。 因此,每個分析查詢都可以拆分為多個較小的查詢,以在這些機器之間并行執(zhí)行(分而治之策略)
As you can probably guess by now, MySQL, PostgreSQL, MSSQL, and Oracle databases are designed to handle transactional workloads, whereas data warehouses are designed to handle analytical workloads.
您可能現(xiàn)在已經(jīng)猜到了,MySQL,PostgreSQL,MSSQL和Oracle數(shù)據(jù)庫旨在處理事務性工作負載,而數(shù)據(jù)倉庫旨在處理分析性工作負載。
那么,我可以使用普通SQL數(shù)據(jù)庫作為數(shù)據(jù)倉庫嗎? (So, can I use a normal SQL database as my data warehouse?)
Like we’ve said earlier, yes you can, but it depends.
就像我們之前說過的,可以,但是要視情況而定。
If you’re just starting out with small set of data and few analytical use cases, it’s perfectly fine to pick a normal SQL database as your data warehouse (most popular ones are MySQL, PostgreSQL, MSSQL or Oracle). If you’re relatively big with lots of data, you still can, but it will require proper tuning and configuring.
如果您只是從少量數(shù)據(jù)和很少的分析用例開始,那么最好選擇一個普通SQL數(shù)據(jù)庫作為您的數(shù)據(jù)倉庫(最受歡迎的是MySQL,PostgreSQL,MSSQL或Oracle)。 如果您的數(shù)據(jù)量相對較大,您仍然可以,但是需要進行適當?shù)恼{(diào)整和配置。
That said, with the advent of low-cost data warehouse like BigQuery, Redshift above, we would recommend you go ahead with a data warehouse.
也就是說,隨著BigQuery,Redshift等低成本數(shù)據(jù)倉庫的出現(xiàn),我們建議您繼續(xù)使用數(shù)據(jù)倉庫。
However, if you must choose a normal SQL-based database (for example your business only allows you to host it on-premise, within your own network) we recommend going with PostgreSQL as it has the most features supported for analytics. We’ve also written a detailed blog post discussing this topic here: Why you should use PostgreSQL over MySQL for analytics purpose.
但是,如果必須選擇一個基于SQL的常規(guī)數(shù)據(jù)庫(例如,您的企業(yè)只允許您在自己的網(wǎng)絡(luò)中內(nèi)部托管該數(shù)據(jù)庫),我們建議使用PostgreSQL,因為它具有分析支持的最多功能。 我們還在此處寫了一篇詳細的博客文章,討論了該主題: 為什么要在分析上使用PostgreSQL over MySQL 。
摘要 (Summary)
In this post, we zoomed in into data warehouse and spoke about:
在這篇文章中,我們放大了數(shù)據(jù)倉庫并談到了:
- Data warehouse is the central analytics database that stores & processes your data for analytics 數(shù)據(jù)倉庫是中央分析數(shù)據(jù)庫,用于存儲和處理數(shù)據(jù)以進行分析
- The 4 trigger points when you should get a data warehouse 獲取數(shù)據(jù)倉庫的4個觸發(fā)點
- A simple list of data warehouse technologies you can choose from 您可以選擇的簡單數(shù)據(jù)倉庫技術(shù)列表
- How a data warehouse is optimized for analytical workload vs traditional database for transactional workload. 與分析事務處理的傳統(tǒng)數(shù)據(jù)庫相比,如何優(yōu)化數(shù)據(jù)倉庫以進行分析工作負載。
Originally published at The Analytics Setup Guidebook by Holistics: Understanding The Data Warehouse.
最初發(fā)布于 Holistics的《 Analytics設(shè)置指南》中:了解數(shù)據(jù)倉庫 。
翻譯自: https://towardsdatascience.com/what-is-a-data-warehouse-when-and-why-to-consider-one-2e826be68e95
總結(jié)
以上是生活随笔為你收集整理的什么是数据仓库,何时以及为什么要考虑一个的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 刚认识女孩说不要浪费时间_不要浪费时间寻
- 下一篇: 梦到一大盆肉怎么回事