Data Profiling in Data Science
Data Profiling
Data profiling is the process of examining data from an existing source and summarizing information about that data. You profile data to determine its accuracy, completeness, and validity. Data profiling is done for many reasons, but it is most commonly part of assessing data quality within a larger project. It is often combined with an ETL (Extract, Transform, and Load) process that moves data from one system to another. Done properly, ETL and data profiling together cleanse, enrich, and move quality data to a target location.
For example, you may need to perform data profiling when migrating from a legacy system to a new system. Data profiling helps identify data quality problems that must be handled in code when you move data into the new system. You may also need to profile data as you move it into a data warehouse for business analytics. Typically, when data is moved to a data warehouse, ETL tools are used to move it. Data profiling helps identify which data quality problems must be fixed in the source and which can be fixed during the ETL process.
Why profile data?
Data profiling lets you answer the following questions about your data:
Is the data complete? Are there blank or null values?
Is the data unique? How many distinct values are there? Is the data duplicated?
Are there anomalous patterns in your data? What is the distribution of patterns in your data?
Are these the patterns you expect?
What range of values exists, and is it expected? What are the maximum, minimum, and average values for given data? Are these the ranges you expect?
Answering these questions helps you make sure that you are maintaining quality data, which companies increasingly recognize as the cornerstone of a thriving business.
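The questions above can be turned into a small per-column summary. The following is a minimal sketch using only the Python standard library; the function name `profile_column`, the sample `ages` column, and the choice to treat `None` and empty strings as missing are assumptions made for illustration, not part of the original article.

```python
from statistics import mean

def profile_column(values):
    """Summarize completeness, uniqueness, and range for one column."""
    # Assumption: None and empty/whitespace-only strings count as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    # Only real numbers take part in min/max/mean (bool is excluded on purpose).
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "duplicated": len(present) - len(set(present)),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": mean(numeric) if numeric else None,
    }

# Hypothetical sample column with two missing entries and one repeated value.
ages = [34, 29, None, 41, 29, "", 57]
summary = profile_column(ages)
print(summary)
```

Comparing `missing`, `distinct`, and the min/max/mean figures against what you expect for the column is one simple way to answer the completeness, uniqueness, and range questions.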
How does one profile data?
Data profiling can be performed in several ways, but there are roughly three base methods used to analyze the data.
Column profiling counts the number of times each value appears within each column of a table. This method helps uncover the patterns within your data.
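As a rough illustration of column profiling, the sketch below counts value frequencies per column and also reduces each value to a character "shape" so unusual formats stand out. The helper names `column_value_counts` and `pattern_of`, the 9/A shape notation, and the sample rows are all assumptions for this example.

```python
from collections import Counter

def column_value_counts(rows):
    """Count how often each value appears in every column of a table."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, Counter())[value] += 1
    return counts

def pattern_of(text):
    """Reduce a string to a shape: digits become 9, letters become A."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in text
    )

# Hypothetical table represented as a list of dictionaries.
rows = [
    {"country": "US", "zip": "94105"},
    {"country": "US", "zip": "10001"},
    {"country": "DE", "zip": "1O115"},  # letter O typed instead of a zero
]

counts = column_value_counts(rows)
zip_patterns = Counter(pattern_of(r["zip"]) for r in rows)

print(counts["country"])  # frequency of each country value
print(zip_patterns)       # frequency of each zip-code shape
```

A shape that occurs only rarely, such as the one produced by the mistyped zip code here, is a candidate for closer inspection.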
Cross-column profiling looks across columns to perform key and dependency analysis. Key analysis scans collections of values in a table to find a possible primary key. Dependency analysis determines the dependent relationships within a data set. Together, these analyses determine the relationships and dependencies within a table.
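Key and dependency analysis can be sketched as two small checks: one searches for column combinations whose values are unique across all rows, and the other tests whether one column determines another. This is a brute-force illustration under assumed names (`candidate_keys`, `determines`, the `orders` sample); it only reports what holds in the sampled rows, which suggests but does not prove a real key or dependency.

```python
from itertools import combinations

def candidate_keys(rows, max_width=2):
    """Return column combinations whose values are unique in every row."""
    columns = list(rows[0].keys())
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            # Skip supersets of a key already found; they are trivially unique.
            if any(set(key) <= set(combo) for key in found):
                continue
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):
                found.append(combo)
    return found

def determines(rows, lhs, rhs):
    """True if each lhs value maps to exactly one rhs value (lhs -> rhs)."""
    mapping = {}
    for row in rows:
        if mapping.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

# Hypothetical sample table.
orders = [
    {"order_id": 1, "zip": "94105", "city": "San Francisco"},
    {"order_id": 2, "zip": "10001", "city": "New York"},
    {"order_id": 3, "zip": "94105", "city": "San Francisco"},
]

keys = candidate_keys(orders)
zip_determines_city = determines(orders, "zip", "city")
city_determines_order = determines(orders, "city", "order_id")
print(keys, zip_determines_city, city_determines_order)
```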
Cross-table profiling looks across tables to identify potential foreign keys. It also attempts to determine the similarities and differences in syntax and data types between tables, to establish which data might be redundant and which could be mapped together.
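One simple heuristic for spotting potential foreign keys is an inclusion check: a child column is a candidate if every non-null value it holds also appears in the parent table's key column. The sketch below assumes hypothetical `customers` and `orders` tables and a helper named `foreign_key_candidates`; inclusion alone can produce false positives on small samples, so the result is a hint rather than a conclusion.

```python
def foreign_key_candidates(child_rows, parent_rows, parent_key):
    """List child columns whose non-null values all appear in parent_key."""
    parent_values = {row[parent_key] for row in parent_rows}
    candidates = []
    for column in child_rows[0].keys():
        values = {row[column] for row in child_rows if row[column] is not None}
        # A non-empty set fully contained in the parent key is a candidate.
        if values and values <= parent_values:
            candidates.append(column)
    return candidates

# Hypothetical parent and child tables.
customers = [{"id": 10}, {"id": 11}, {"id": 12}]
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 250},
    {"order_id": 2, "customer_id": 12, "amount": 99},
    {"order_id": 3, "customer_id": 10, "amount": 10},
]

fk = foreign_key_candidates(orders, customers, "id")
print(fk)
```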
Rule validation is sometimes considered the final step in data profiling. It is a proactive step of adding rules that check the correctness and integrity of the data entered into the system.
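Rule validation can be pictured as a list of named checks applied to every record. The rules below (a loose email shape and an age range), the `validate` helper, and the sample records are assumptions chosen for illustration; real rules would come from the business requirements of the system being profiled.

```python
import re

# Loose illustrative pattern: something@something.something, no spaces.
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Each rule is a (name, predicate) pair; a predicate returns True when the row passes.
RULES = [
    ("email has a basic shape",
     lambda r: EMAIL_SHAPE.fullmatch(r["email"] or "") is not None),
    ("age is between 0 and 120",
     lambda r: isinstance(r["age"], int) and 0 <= r["age"] <= 120),
]

def validate(rows, rules=RULES):
    """Return (row_index, rule_name) for every rule a row violates."""
    return [
        (i, name)
        for i, row in enumerate(rows)
        for name, check in rules
        if not check(row)
    ]

# Hypothetical records entering the system.
records = [
    {"email": "ana@example.com", "age": 31},
    {"email": "not-an-email", "age": 27},
    {"email": "li@example.org", "age": 430},
]

violations = validate(records)
print(violations)
```

Keeping rules as data makes it easy to add a new check without touching the validation loop.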
These methods can be performed manually by an analyst, or by a service that automates the queries.
Data profiling challenges
Data profiling is often difficult because of the sheer volume of data you need to profile. This is especially true when you are looking at a legacy system, which might hold years of older data with thousands of errors. Experts recommend that you segment your data as part of your data profiling process so that you can see the forest for the trees.
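Segmenting can be as simple as computing the same metric per group instead of over the whole table. The sketch below assumes the data is segmented by a `year` column and measures the share of missing `phone` values in each segment; the column names, the helper `missing_rate_by_segment`, and the sample rows are hypothetical.

```python
from collections import defaultdict

def missing_rate_by_segment(rows, segment_column, target_column):
    """Share of rows with a missing target value, computed per segment."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        segment = row[segment_column]
        totals[segment] += 1
        # Assumption: None and the empty string both count as missing.
        if row[target_column] in (None, ""):
            missing[segment] += 1
    return {segment: missing[segment] / totals[segment] for segment in totals}

# Hypothetical legacy rows: older records are less complete.
legacy_rows = [
    {"year": 2009, "phone": None},
    {"year": 2009, "phone": ""},
    {"year": 2009, "phone": "555-0100"},
    {"year": 2021, "phone": "555-0101"},
    {"year": 2021, "phone": "555-0102"},
]

rates = missing_rate_by_segment(legacy_rows, "year", "phone")
print(rates)
```

A per-segment view like this shows where the errors are concentrated, which a single table-wide figure would hide.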
If you perform data profiling manually, you need someone with the expertise to run numerous queries and sift through the results to gain meaningful insights about your data, which can eat up precious resources. In addition, you will likely only be able to check a subset of your overall data, because going through the entire data set takes too long.
Translated from: https://www.includehelp.com/data-science/data-profiling.aspx