DataCleaner (4.5), Chapter 1
Part 1. Introduction to DataCleaner
- What is data quality (DQ)?
- What is data profiling?
- What is a datastore?
  - Composite datastore
- What is data monitoring?
- What is master data management (MDM)?
What is data quality (DQ)?
Data quality (DQ) is a concept and a business term covering the quality of the data used for a particular purpose. The term is often applied to the quality of data used in business decisions, but it may also refer to the quality of data used in research, campaigns, processes and more.
Working with data quality typically varies a lot from project to project, just as the data quality issues themselves vary a lot.
A less technical definition of high-quality data is that data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran).
Data quality analysis (DQA) is the (human) process of examining the quality of data for a particular process or organization. A DQA includes both technical and non-technical elements. For example, to do a good DQA you will probably need to talk to users, business people, partner organizations and maybe customers. This is needed to assess what the goal of the DQA should be.
From a technical viewpoint, the main task in a DQA is the data profiling activity, which will help you discover and measure the current state of affairs in the data.
What is data profiling?
Data profiling is the activity of investigating a datastore to create a "profile" of it. With a profile of your datastore you will be much better equipped to actually use and improve it.
The way you do profiling often depends on whether you already have some ideas about the quality of the data or whether you are new to the datastore at hand. Either way we recommend an explorative approach. Even if you think there is only a certain number of issues you need to look for, it is our experience (and the reasoning behind a lot of the features of DataCleaner) that it is just as important to check those items in the data that you think are correct!
Typically it is cheap to include a bit more data in your analysis, and the results just might surprise you and save you time!
DataCleaner comprises (amongst other things) a desktop application for doing data profiling on just about any kind of datastore.
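To make the idea of a "profile" concrete, here is a minimal sketch in plain Python. It is not DataCleaner's API; the function name `profile_column` and the sample values are assumptions made up for illustration. It only shows the kind of measurements a profile typically contains.

```python
from collections import Counter


def profile_column(values):
    """Return a small profile of one column: row, null and distinct counts plus top values."""
    # Treat None and blank strings as missing values.
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "most_common": counts.most_common(3),
    }


# Hypothetical sample column, including values you might assume are "correct".
countries = ["DK", "dk", "Denmark", "DK", "", None, "NL"]
profile = profile_column(countries)
print(profile)
```

Even on this tiny sample the profile hints at a uniformity issue: "DK", "dk" and "Denmark" probably mean the same thing, which is the kind of surprise an explorative approach tends to uncover.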
What is a datastore?
A datastore is the place where data is stored. Enterprise data usually lives in relational databases, but there are numerous exceptions to that rule.
To cover different sources of data, such as databases, spreadsheets, XML files and even standard business applications, we employ the umbrella term "datastore".
DataCleaner is capable of retrieving data from a very wide range of datastores. Furthermore, DataCleaner can update the data in most of these datastores as well.
A datastore can be created in the UI or via the configuration file. You can create a datastore from many kinds of sources, such as CSV, Excel, Oracle Database and MySQL.
Composite datastore
A composite datastore contains multiple datastores. The main advantage of a composite datastore is that it allows you to analyze and process data from multiple sources in the same job.
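The following sketch illustrates the concept of a composite datastore in plain Python, using only the standard library. It does not use DataCleaner itself; the helper names, the CSV text and the `orders` table are all hypothetical. A CSV source and a relational source are read into one structure so that a single "job" can work across both.

```python
import csv
import io
import sqlite3


def read_csv_datastore(text):
    """Read rows from CSV text as a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def read_sqlite_datastore(conn, table):
    """Read all rows of one table as a list of dicts (table name is trusted here)."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Hypothetical source 1: a CSV file, held in memory for the example.
csv_text = "id,name\n1,Alice\n2,Bob\n"

# Hypothetical source 2: a relational database, here an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (1, 20.0), (3, 5.0)])

# The "composite" simply exposes both sources side by side under one name each.
composite = {
    "customers": read_csv_datastore(csv_text),
    "orders": read_sqlite_datastore(conn, "orders"),
}

# One job across both sources: find orders that refer to an unknown customer.
known_ids = {int(c["id"]) for c in composite["customers"]}
orphans = [o for o in composite["orders"] if o["customer_id"] not in known_ids]
print(orphans)
```

The check for orphaned orders is only possible because both sources are visible in the same job, which is the advantage the paragraph above describes.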
What is data monitoring?
We've argued that data profiling is ideally an explorative activity. Data monitoring typically isn't! The measurements that you make when profiling often need to be checked continuously so that your improvements are enforced over time. This is what data monitoring is typically about.
Data monitoring solutions come in different shapes and sizes. You can set up your own batch of scheduled jobs that run every night. You can build alerts around them that send you emails if a particular measure goes beyond its allowed thresholds. In some cases you can attempt to rule out the issue entirely by applying First-Time-Right (FTR) principles that validate data at entry time, e.g. in data registration forms.
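As a rough illustration of the two approaches just described, the sketch below shows a threshold check that a nightly job could run and a First-Time-Right style validation that a registration form could apply. The metric names, threshold values, field names and the simple email pattern are assumptions chosen for the example, not part of DataCleaner.

```python
import re

# Hypothetical thresholds: allowed (min, max) range per metric.
THRESHOLDS = {"null_ratio": (0.0, 0.05), "duplicate_ratio": (0.0, 0.01)}


def check_metrics(metrics, thresholds=THRESHOLDS):
    """Return an alert message for every metric outside its allowed range."""
    alerts = []
    for name, value in metrics.items():
        low, high = thresholds[name]
        if not (low <= value <= high):
            alerts.append(f"{name}={value} outside [{low}, {high}]")
    return alerts


# Deliberately simple pattern; real email validation is more involved.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_at_entry(record):
    """First-Time-Right style check: collect errors before the record is stored."""
    errors = []
    if not record.get("name", "").strip():
        errors.append("name is required")
    if not EMAIL_PATTERN.match(record.get("email", "")):
        errors.append("email is not well-formed")
    return errors


# Hypothetical results of last night's profiling run.
nightly = {"null_ratio": 0.08, "duplicate_ratio": 0.004}
alerts = check_metrics(nightly)

# Hypothetical form submission with a malformed email address.
entry_errors = validate_at_entry({"name": "Alice", "email": "alice-at-example"})
print(alerts, entry_errors)
```

In a real setup the alert messages would be sent by email or another channel; here they are only collected in a list.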
As of version 3, DataCleaner also includes a monitoring web application, dubbed "DataCleaner monitor". The monitor is a server application that supports orchestrating and scheduling jobs, as well as exposing metrics through web services and through interactive timelines and reports. It also supports the configuration and job-building process through wizards and management pages for all the components of the solution. As such, we like to say that the DataCleaner monitor provides a good foundation for the infrastructure needed in a Master Data Management hub.
What is master data management (MDM)?
Master data management (MDM) is a very broad term that materializes in a variety of ways. Within the scope of this document it serves more as a context for data quality than as an activity that we actually target with DataCleaner per se.
The overall goal of MDM is to manage the important data of an organization. By "master data" we refer to "a single version of the truth", i.e. not the data of a particular system, but for example all the customer data or product data of a company. Usually this data is dispersed over multiple datastores, so an important part of MDM is the process of unifying the data into a single model.
Another very important issue to handle in MDM is the quality of data. If you simply gather, say, "all customer data" from all systems in an organization, you will most likely see a lot of data quality issues. There will be many duplicate entries, there will be variances in the way that customer data is filled in, and there will be different identifiers and even different levels of granularity for defining "what is a customer?". In the context of MDM, DataCleaner can serve as the engine to cleanse, transform and unify data from multiple datastores into the single view of the master data.
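The cleanse, transform and unify steps can be sketched in a few lines of plain Python. This is only an illustration of the idea under simplifying assumptions: the `crm` and `billing` sources are hypothetical, the email address is assumed to identify a customer, and the first record seen for an email wins. Real matching and merging rules are usually far more involved.

```python
def normalize(record):
    """Map a source-specific record to a shared model with comparable values."""
    name = " ".join(record["name"].split()).title()  # collapse whitespace, fix casing
    email = record["email"].strip().lower()  # the email acts as the matching key
    return {"name": name, "email": email}


def unify(*sources):
    """Merge records from several sources, keeping the first record per email."""
    master = {}
    for source in sources:
        for record in source:
            clean = normalize(record)
            master.setdefault(clean["email"], clean)
    return list(master.values())


# Two hypothetical systems that hold overlapping customer data.
crm = [{"name": "alice  smith", "email": "Alice@Example.com"}]
billing = [
    {"name": "Alice Smith", "email": "alice@example.com "},
    {"name": "bob jones", "email": "bob@example.com"},
]

master_customers = unify(crm, billing)
print(master_customers)
```

The two spellings of Alice collapse into one master record only because both sources are normalized before they are compared, which is why cleansing comes before unification.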
Reposted from: https://www.cnblogs.com/xiaotao726/p/8519993.html