A Comprehensive Guide to the Data Warehouse

Published: 2023/12/15

As a data scientist, it’s valuable to have some idea of fundamental data warehouse concepts. Most of the work we do involves adding enterprise value on top of datasets that need to be clean and readily comprehensible. For a dataset to reach that stage of its lifecycle, it has already passed through many components of data architecture and, hopefully, many data quality filters. This is how we avoid the unfortunate situation wherein the data scientist ends up spending 80% of their time on data wrangling.

Let’s take a moment to deepen our appreciation of the data architecture process by learning about various considerations relevant to setting up a data warehouse.

The data warehouse is a specific infrastructure element that provides down-the-line users, including data analysts and data scientists, access to data that has been shaped to conform to business rules and is stored in an easy-to-query format.

The data warehouse typically connects information from multiple “source-of-truth” transactional databases, which may exist within individual business units. In contrast to information stored in a transactional database, the contents of a data warehouse are reformatted for speed and ease of querying.

The data must conform to specific business rules that validate quality. Then it is stored in a denormalized structure — that means storing together pieces of information that will likely be queried together. This serves to increase performance by decreasing the complexity of queries required to get data out of the warehouse (i.e., by reducing the number of data joins).

In this guide:

Architecting the Data Warehouse
Enhancing Performance and Adjusting for Size
Related Data Storage Options
Working with Big Data
Extract, Transform, Load (ETL)
Getting Data Out of the Warehouse
Data Archiving
Summary

Architecting the Data Warehouse

In the process of developing the dimensional model for the data warehouse, the design will typically pass through three stages: (1) the business model, which generalizes the data based on business requirements, (2) the logical model, which sets the column types, and (3) the physical model, which represents the actual design blueprint of the relational data warehouse.

Because the data warehouse will contain information from across all aspects of the business, stakeholders must agree in advance to the grain (i.e. level of granularity) of the data that will be stored.

Remember to validate the model with the various stakeholder groups before implementation.

[Figure: A sample star schema for a hypothetical safari tours business.]

The underlying structure in the data warehouse is commonly referred to as the star schema. It classifies information as either a dimension or a fact (i.e., a measure). The fact table stores observations or events (e.g., sales, orders, stock balances). The dimension tables contain descriptive information about those facts (e.g., dates, locations).
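
To make the fact/dimension split concrete, here is a minimal sketch of a star schema for the safari tours example, using Python's built-in sqlite3 module. The table and column names (dim_date, dim_location, fact_tour_sales, and so on) are assumptions made up for illustration; they are not taken from the figure.

```python
import sqlite3


def build_star_schema(conn):
    """Create two dimension tables and one fact table, then load sample rows."""
    conn.executescript(
        """
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY,
            full_date TEXT,
            month TEXT
        );
        CREATE TABLE dim_location (
            location_key INTEGER PRIMARY KEY,
            park_name TEXT,
            country TEXT
        );
        -- The fact table holds measures plus a key into each dimension.
        CREATE TABLE fact_tour_sales (
            date_key INTEGER REFERENCES dim_date (date_key),
            location_key INTEGER REFERENCES dim_location (location_key),
            tickets_sold INTEGER,
            revenue REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(1, "2024-06-01", "2024-06"), (2, "2024-07-01", "2024-07")],
    )
    conn.executemany(
        "INSERT INTO dim_location VALUES (?, ?, ?)",
        [(1, "Serengeti", "Tanzania"), (2, "Kruger", "South Africa")],
    )
    conn.executemany(
        "INSERT INTO fact_tour_sales VALUES (?, ?, ?, ?)",
        [(1, 1, 10, 1500.0), (2, 1, 4, 600.0), (2, 2, 6, 900.0)],
    )


def revenue_by_country(conn):
    """Aggregate a fact measure, grouped by a dimension attribute (one join)."""
    return conn.execute(
        """
        SELECT l.country, SUM(f.revenue)
        FROM fact_tour_sales AS f
        JOIN dim_location AS l ON l.location_key = f.location_key
        GROUP BY l.country
        ORDER BY l.country
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
star_result = revenue_by_country(conn)
```

Each analytical question needs at most one join per dimension it touches, which is the property that keeps star-schema queries simple.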

There are three different types of fact tables: (1) transactional, for records at the standardized grain; (2) periodic, for records that fall within a given time frame; and (3) cumulative, for records that fall within a given business process.

In addition to the star schema, there’s also the option to arrange data into the snowflake schema. The difference here is that each dimension is normalized.

Normalization is a database design technique for creating records that contain an atomic level of information.

However, the snowflake schema adds unnecessary complexity to the dimension model — usually the star schema will suffice.

Enhancing Performance and Adjusting for Size

In addition to understanding how to structure the data, the person designing the data warehouse should also be familiar with how to improve performance.

One performance-enhancing technique is to create a clustered index on the data in the order it is typically queried. For example, we might choose to organize the fact table by TourDate descending, so that the most recent tour dates appear first in the table. A clustered index determines the order in which the records are physically stored, which speeds up retrieval for queries that follow that order. In addition to an optional, single clustered index, a table can also have multiple non-clustered indices. These don't affect how the table is physically stored; instead, each one is a separate structure that holds the indexed column values along with pointers back to the table rows.
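
As a rough illustration of what an index changes, the sketch below compares the query plan before and after adding an index on the tour date. It uses SQLite through Python's sqlite3 module, which is an assumption of convenience: SQLite has no CLUSTERED keyword, so this shows a secondary (non-clustered) index only, and the syntax for clustered indexes varies by database engine. The table and index names are hypothetical.

```python
import sqlite3


def plan_for_date_filter(conn):
    """Return SQLite's query plan text for a date-range filter."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT tour_id FROM fact_tour WHERE tour_date >= ?",
        ("2024-07-01",),
    ).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_tour ("
    "tour_id INTEGER PRIMARY KEY, tour_date TEXT, guests INTEGER)"
)

# Without an index, the filter has to scan the whole table.
plan_before = plan_for_date_filter(conn)

# A secondary index ordered by tour date, latest first.
conn.execute("CREATE INDEX idx_tour_date ON fact_tour (tour_date DESC)")
plan_after = plan_for_date_filter(conn)
```

Printing plan_before and plan_after shows the planner switching from a full scan to a search that uses idx_tour_date; the exact wording of the plan text depends on the SQLite version.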

Another performance enhancement involves splitting up very large tables into multiple smaller parts. This is called partitioning. Because each part is stored separately, queries that need access to only a fraction of the data can run faster. Partitioning can be either vertical (splitting up columns) or horizontal (splitting up rows). The original article links to a downloadable .rtf file containing a partitioning script for SQL, along with other database architecture resources such as a project launch and management checklist.
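
The idea behind horizontal partitioning can be sketched in plain Python, without any database at all. The example below is only an analogy, under the assumption that rows are partitioned by the year of a hypothetical tour_date field; a real warehouse would declare partitions in the database engine itself.

```python
from collections import defaultdict
from datetime import date


def partition_by_year(rows):
    """Horizontally partition rows: each year gets its own list of rows."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["tour_date"].year].append(row)
    return dict(partitions)


def total_guests_in_year(partitions, year):
    """Answer a single-year query by reading only that year's partition."""
    return sum(row["guests"] for row in partitions.get(year, []))


bookings = [
    {"tour_date": date(2023, 11, 5), "guests": 4},
    {"tour_date": date(2024, 6, 1), "guests": 6},
    {"tour_date": date(2024, 7, 15), "guests": 2},
]
partitions = partition_by_year(bookings)
guests_2024 = total_guests_in_year(partitions, 2024)
```

A query for 2024 never touches the 2023 rows, which is where the speed-up comes from when the table is large.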

Taking total database size into account is another crucial component of tuning performance. Estimating the size of the resulting database when designing a data warehouse will help align performance with application requirements according to the service level agreement (SLA). Moreover, it will provide insight into the budgeted demand for physical disk space or the cost of cloud storage.

To conduct this calculation, simply aggregate the size of each table, which depends largely on the indexes. If database size is significantly larger than expected, you may need to normalize aspects of the database. Conversely, if your database ends up smaller, you can get away with more denormalization, which will increase query performance.
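
A back-of-the-envelope version of this calculation might look like the following sketch. The row counts and byte sizes are invented numbers, and the formula (rows multiplied by average row size plus per-row index overhead) is a simplifying assumption; real engines add page overhead, fill factors, and compression.

```python
def estimate_table_bytes(row_count, avg_row_bytes, index_bytes_per_row=0):
    """Rough table size: data bytes plus index bytes for every row."""
    return row_count * (avg_row_bytes + index_bytes_per_row)


def estimate_database_bytes(tables):
    """Aggregate the per-table estimates into a database total."""
    return sum(estimate_table_bytes(**spec) for spec in tables.values())


# Hypothetical sizing inputs for the safari tours warehouse.
tables = {
    "fact_tour_sales": {
        "row_count": 10_000_000,
        "avg_row_bytes": 40,
        "index_bytes_per_row": 16,
    },
    "dim_date": {"row_count": 3_650, "avg_row_bytes": 30},
}
total_bytes = estimate_database_bytes(tables)
```

With these inputs the fact table dominates the total, which is typical: the indexes on the largest table are usually where the size estimate is won or lost.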

Working with Big Data

To handle big data, a data architect might choose to implement a tool such as Apache Hadoop. Hadoop was based on the MapReduce technique developed by Google to index the world wide web and was released to the public in 2006. In contrast to the highly structured environment of the data warehouse, where information has already been validated upstream to conform to business rules, Hadoop is a software library that accepts a variety of data types and allows for distributed processing across clusters of computers. Hadoop is often used to process streaming data.
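
For readers who haven't seen MapReduce, the sketch below walks through its three phases (map, shuffle, reduce) on a word-count task. It runs in a single Python process and does not use Hadoop's API at all; it is only meant to show the shape of the technique, which a real cluster would spread across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle(pairs))


counts = word_count(["lion zebra lion", "zebra giraffe"])
```

Because each map call looks at one document and each reduce call looks at one key, both phases can be run in parallel on separate nodes.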

While Hadoop is able to quickly process streaming data, it struggles with query speed, complexity of queries, security, and orchestration. In recent years, Hadoop has been falling out of favor as cloud-based solutions (e.g., Amazon Kinesis) have risen to prominence — offering the same gains in terms of speed for processing unstructured data while integrating with other tools in the cloud ecosystem that address these potential weaknesses.

Extract, Transform, Load (ETL)

Extraction, transformation, and load define the process of moving the data out of its original location (E), applying some form of transformation (T), then loading it (L) into the data warehouse. Rather than approach the ETL pipeline in an ad hoc, piecemeal fashion, the database architect should implement a systematic approach that takes into account best practices around design considerations, operational issues, failure points, and recovery methods. The original article also links to a resource for setting up an ETL pipeline.

Documentation for ETL includes creating the source-to-target mapping: the set of transformation instructions on how to convert the structure and content of data in the source system to the structure and content of the target system. The original article links to a sample template for this step.
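
One way to picture a source-to-target mapping is as a small lookup table that drives the transform step. The sketch below assumes a made-up source system with abbreviated column names (cust_nm, bk_dt, amt) and a hypothetical target table fact_booking; none of these names come from the template mentioned above.

```python
import sqlite3

# Source column -> (target column, conversion function). All names are
# hypothetical and exist only for this illustration.
SOURCE_TO_TARGET = {
    "cust_nm": ("customer_name", str.strip),
    "bk_dt": ("booking_date", lambda s: s.replace("/", "-")),
    "amt": ("revenue", float),
}


def transform(source_row):
    """Apply the mapping to one extracted row."""
    return {
        target: convert(source_row[source])
        for source, (target, convert) in SOURCE_TO_TARGET.items()
    }


def run_etl(source_rows, conn):
    """Transform the extracted rows, then load them into the warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_booking ("
        "customer_name TEXT, booking_date TEXT, revenue REAL)"
    )
    rows = [transform(row) for row in source_rows]
    conn.executemany(
        "INSERT INTO fact_booking "
        "VALUES (:customer_name, :booking_date, :revenue)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
extracted = [{"cust_nm": " Ada ", "bk_dt": "2024/06/01", "amt": "150.50"}]
loaded_count = run_etl(extracted, conn)
loaded_rows = conn.execute("SELECT * FROM fact_booking").fetchall()
```

Keeping the mapping in one data structure means the documentation and the code can be reviewed side by side, and a new column is a one-line change.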

Your organization might also consider ELT — loading the data without any transformations, then using the power of the destination system (usually a cloud-based tool) to conduct the transform step.
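
The difference from ETL is easiest to see in code: in the sketch below, raw rows are loaded untouched into a staging table, and the cleanup happens afterwards as SQL run inside the destination. SQLite stands in for the cloud warehouse here, and the staging_booking and fact_booking tables are hypothetical.

```python
import sqlite3


def load_raw(conn, raw_rows):
    """Load step: copy source rows into staging exactly as they arrive."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_booking (cust_nm TEXT, amt TEXT)"
    )
    conn.executemany("INSERT INTO staging_booking VALUES (?, ?)", raw_rows)


def transform_in_warehouse(conn):
    """Transform step: let the destination's SQL engine do the cleanup."""
    conn.execute(
        """
        CREATE TABLE fact_booking AS
        SELECT TRIM(cust_nm) AS customer_name,
               CAST(amt AS REAL) AS revenue
        FROM staging_booking
        WHERE amt IS NOT NULL
        """
    )


conn = sqlite3.connect(":memory:")
load_raw(conn, [(" Ada ", "150.5"), ("Bo", None)])
transform_in_warehouse(conn)
elt_rows = conn.execute("SELECT * FROM fact_booking").fetchall()
```

The raw staging table is kept, so a transformation that turns out to be wrong can be rerun without going back to the source system.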

Getting Data Out of the Warehouse

Once the data warehouse is set up, users should be able to easily query data out of the system. A little education might be required to optimize queries, focusing on:

Tuning a complex query
Using an execution plan
Understanding join mechanisms
Understanding memory / disk / I/O usage considerations
Using parallelism
Writing hierarchical queries
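
As an example of the last item in the list, a hierarchical query can be written with a recursive common table expression. The sketch below assumes a hypothetical employee table with a self-referencing manager_id column and uses SQLite's WITH RECURSIVE syntax; other engines offer similar constructs with slightly different syntax.

```python
import sqlite3


def reporting_chain(conn, root_id):
    """Return (name, depth) for everyone at or below the given employee."""
    return conn.execute(
        """
        WITH RECURSIVE chain (employee_id, name, depth) AS (
            -- Anchor row: the employee we start from.
            SELECT employee_id, name, 0
            FROM employee
            WHERE employee_id = ?
            UNION ALL
            -- Recursive step: anyone whose manager is already in the chain.
            SELECT e.employee_id, e.name, c.depth + 1
            FROM employee AS e
            JOIN chain AS c ON e.manager_id = c.employee_id
        )
        SELECT name, depth FROM chain ORDER BY depth, name
        """,
        (root_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        (1, "Director", None),
        (2, "Lead Guide", 1),
        (3, "Guide A", 2),
        (4, "Guide B", 2),
    ],
)
hierarchy = reporting_chain(conn, 1)
```

The depth column makes it easy to indent or filter the result by level without a second query.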

Data Archiving

Finally, let’s talk about optimizing your organization’s data archiving strategy. Archived data remains important to the organization and is of particular interest to data scientists looking to conduct regression using historical trends.

The data architect should plan for this demand by relocating historical data that is no longer actively used into a separate storage system with higher latency but also robust search capabilities. Moving the data to a less costly storage tier is an obvious benefit of this process. The organization can also gain from removing write access from the archived data, protecting it from modification.
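
A minimal sketch of the relocation step is shown below, assuming a hypothetical fact_tour table and a matching fact_tour_archive table in the same SQLite database. In practice the archive would live in a separate, cheaper storage system, and write access would be removed through that system's permissions, which this example does not attempt to show.

```python
import sqlite3


def setup_tables(conn):
    """Create an active table and an archive table with the same columns."""
    conn.executescript(
        """
        CREATE TABLE fact_tour (
            tour_id INTEGER PRIMARY KEY, tour_date TEXT, guests INTEGER
        );
        CREATE TABLE fact_tour_archive (
            tour_id INTEGER PRIMARY KEY, tour_date TEXT, guests INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO fact_tour VALUES (?, ?, ?)",
        [(1, "2019-03-10", 4), (2, "2020-08-22", 6), (3, "2024-06-01", 2)],
    )


def archive_before(conn, cutoff_date):
    """Move rows older than cutoff_date (ISO format) into the archive."""
    # The 'with' block makes the copy and the delete one transaction,
    # so a failure cannot leave a row in both tables or in neither.
    with conn:
        conn.execute(
            "INSERT INTO fact_tour_archive "
            "SELECT * FROM fact_tour WHERE tour_date < ?",
            (cutoff_date,),
        )
        cursor = conn.execute(
            "DELETE FROM fact_tour WHERE tour_date < ?", (cutoff_date,)
        )
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
setup_tables(conn)
moved = archive_before(conn, "2021-01-01")
active_left = conn.execute("SELECT COUNT(*) FROM fact_tour").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM fact_tour_archive").fetchone()[0]
```

Storing dates as ISO-formatted text is what makes the plain string comparison against the cutoff work here.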

Summary

This article covers tried and true practices for setting up a data warehouse. Let me know how you’re using this information in your work by dropping a comment.

If you found this article helpful, follow me on Medium, LinkedIn, and Twitter for more ideas to advance your data science skills.

Translated from: https://towardsdatascience.com/data-warehouse-68ec63eecf78