go 捕获数据库新增数据_更改数据捕获的经验教训
go 捕獲數(shù)據(jù)庫新增數(shù)據(jù)
The article summarizes experiences from various projects with a log-based change data capture (CDC). There are many use cases for which CDC is beneficial. Some DBs even have CDC functionality integrated without requiring a separate tool.
本文總結(jié)了使用基于日志的變更數(shù)據(jù)捕獲(CDC)的各種項目的經(jīng)驗。 CDC有益于很多用例。 一些DB甚至集成了CDC功能,而無需單獨的工具。
變更檢測 (Change detection)
Detecting changes in transactional source systems often turn out as slow and not scalable. Lightweight, high-performance change detection is required. Well-known approaches are: change columns, triggers and log-based CDC.
檢測事務性源系統(tǒng)中的更改通常會很慢且不可擴展。 需要輕巧,高性能的變化檢測。 眾所周知的方法是:更改列,觸發(fā)器和基于日志的CDC。
Change columns
變更欄
Tables in the source system get created with an additional column, such as a timestamp. Such a column can be used to identify new or changed data records.
源系統(tǒng)中的表是使用附加列(例如時間戳)創(chuàng)建的。 這樣的列可用于標識新的或更改的數(shù)據(jù)記錄。
However, this solution does not work for deleted records. Another disadvantage is that the application needs to be changed with update logic for the column.
但是,此解決方案不適用于已刪除的記錄。 另一個缺點是應用程序需要使用列的更新邏輯進行更改。
Triggers
扳機
Database triggers can protocol any data changes to a log table. A process transfers the data from the log table to the target.
數(shù)據(jù)庫觸發(fā)器可以將任何數(shù)據(jù)更改記錄到日志表中。 進程將數(shù)據(jù)從日志表傳輸?shù)侥繕恕?
This approach recognizes deletes. Additionally, the application does not need to be changed.
這種方法識別刪除。 此外,不需要更改應用程序。
However, triggers put a strain on the source database from a performance point of view. If a transaction commits, it must wait until the trigger finished, too.
但是,從性能的角度來看,觸發(fā)器對源數(shù)據(jù)庫造成了壓力。 如果事務提交,它也必須等到觸發(fā)器完成。
Log-based CDC
基于日志的CDC
Databases write changes into their transaction log. Backup and recovery need transaction logs. Additionally, sequential writes into a transaction log are much faster compared to random writes into data files.
數(shù)據(jù)庫將更改寫入其事務日志。 備份和恢復需要事務日志。 此外,與對數(shù)據(jù)文件的隨機寫入相比,對事務日志的順序?qū)懭胍斓枚唷?
Open source tools like Debezium or commercial tools like Oracle Goldengate, IBM IIDR read changes from the log and replicate the changes to the target system.
開源工具(如Debezium)或商業(yè)工具(如Oracle Goldengate,IBM IIDR)從日志中讀取更改,并將更改復制到目標系統(tǒng)。
基于日志的CDC的優(yōu)缺點 (Pros and Cons of log-based CDC)
Log-based CDC has fundamental advantages over the other change detection methods listed:
與其他列出的變更檢測方法相比,基于日志的CDC具有根本優(yōu)勢:
The additional performance impact on the source system is low.
對源系統(tǒng)的附加性能影響很小。
CDC enables the implementation of near real-time architectures with the possibility of faster and more accurate decisions based on the target systems.
CDC可以實現(xiàn)近實時架構(gòu),并有可能根據(jù)目標系統(tǒng)做出更快,更準確的決策。
No significant changes to the application in the source system are necessary. Changes relate to, for example, the setting of database parameters to generate extended transaction log entries, or installing the CDC software.
無需對源系統(tǒng)中的應用程序進行重大更改 。 更改涉及例如設置數(shù)據(jù)庫參數(shù)以生成擴展的事務日志條目或安裝CDC軟件。
CDC reduces the amount of data transmitted over the network compared to the other detection methods described above.
與上述其他檢測方法相比,CDC 減少了通過網(wǎng)絡傳輸的數(shù)據(jù)量 。
The coupling between source and target is asynchronous. A slow consumer does not affect the source database.
源與目標之間的耦合是異步的 。 緩慢的使用者不會影響源數(shù)據(jù)庫。
Log-based CDC also has some disadvantages you must be aware of:
基于日志的CDC還具有一些必須意識到的缺點:
Many commercial tools require an additional license.
許多商業(yè)工具需要額外的許可證 。
Avoid no-logging transactions as these bypasses the transaction log. However, most databases allow force logging through suitable parameter settings.
避免不記錄事務,因為它們會繞過事務日志。 但是,大多數(shù)數(shù)據(jù)庫都允許通過適當?shù)膮?shù)設置進行強制日志記錄。
Separate tools require operations and require additional know-how.
單獨的工具需要操作,并且需要其他專門知識 。
Primary or unique keys are required for many log-based CDC tools — but a good database design will guarantee this requirement anyway.
許多基于日志的CDC工具都需要主鍵或唯一鍵 -但是良好的數(shù)據(jù)庫設計將始終保證這一要求。
- If the target system is down, transaction logs must be kept until the target absorbed the changes. 如果目標系統(tǒng)關(guān)閉,則必須保留事務日志,直到目標吸收了更改為止。
基于日志的CDC如何工作? (How does log-based CDC work?)
The diagram above shows several uses of log-based CDC. Users or applications change data in the source database, e.g. insert, update, or delete data. The database writes all changes into the transaction log (or Write-ahead log) before integrating the changes into the data files.
上圖顯示了基于日志的CDC的幾種用法。 用戶或應用程序更改源數(shù)據(jù)庫中的數(shù)據(jù),例如插入,更新或刪除數(shù)據(jù)。 在將更改集成到數(shù)據(jù)文件之前,數(shù)據(jù)庫將所有更改寫入事務日志(或預寫日志)。
The RDBMS example shows a user inserting data and a second user updating the same data in the next transaction. The transaction log contains the change protocol until some retention-interval clears the entries. The data file contains just the current state.
RDBMS示例顯示了一個用戶插入數(shù)據(jù),而第二個用戶在下一個事務中更新了相同的數(shù)據(jù)。 事務日志包含更改協(xié)議,直到某個保留間隔清除條目為止。 數(shù)據(jù)文件僅包含當前狀態(tài)。
Databases use transaction logs primarily for backup and recovery. But the data can also be used to replicate changes into a target system. CDC tools use the database API to read from the transaction logs (or from the memory if the data is still in the DB cache) and transfer the data to a target system.
數(shù)據(jù)庫將事務日志主要用于備份和恢復。 但是數(shù)據(jù)還可以用于將更改復制到目標系統(tǒng)中。 CDC工具使用數(shù)據(jù)庫API從事務日志中讀取(如果數(shù)據(jù)仍在數(shù)據(jù)庫高速緩存中,則從內(nèi)存中讀取)并將數(shù)據(jù)傳輸?shù)侥繕讼到y(tǒng)。
The example shows a logical view of a CDC tool reading data from the transaction log of the source system and submitting the changes to target databases or a target data lake. There is no persistent storing of the data stream within the CDC software. Kafka could be used as a log-centric approach to capture the changes long-time and submitting the changes to further target systems.
該示例顯示了CDC工具的邏輯視圖,該工具從源系統(tǒng)的事務日志中讀取數(shù)據(jù),并將更改提交給目標數(shù)據(jù)庫或目標數(shù)據(jù)湖。 CDC軟件中沒有持久存儲數(shù)據(jù)流。 Kafka可以用作以日志為中心的方法,以長期捕獲更改并將更改提交給其他目標系統(tǒng)。
The diagram shows two cases of data propagation:
該圖顯示了兩種數(shù)據(jù)傳播情況:
Replication-mode (“Target Mirror DB”)
復制模式(“目標鏡像數(shù)據(jù)庫”)
Replication creates a copy of the source: updates will change the data in the target. Deletes will remove data from the target.
復制將創(chuàng)建源的副本:更新將更改目標中的數(shù)據(jù)。 刪除將從目標中刪除數(shù)據(jù)。
The target system does not necessarily have the same structure as the source system. It is possible to only choose a subset of tables or to replicate some selected columns within a table or filter data.
目標系統(tǒng)不一定具有與源系統(tǒng)相同的結(jié)構(gòu)。 可以僅選擇表的子集,也可以復制表中的某些選定列或過濾數(shù)據(jù)。
The example shows the target DB, which contains the current state of the changes.
該示例顯示了目標數(shù)據(jù)庫,其中包含更改的當前狀態(tài)。
Auditing-mode (“Target DWH DB” and “Target Data Lake”)
審核模式(“ Target DWH DB”和“ Target Data Lake”)
Auditing-mode keeps the whole data history. The CDC tool converts updates and deletes into inserts in the “Target DWH DB” or “Target Data Lake”. A flag indicates the kind of operation (insert, update, delete) and a timestamp field tracks when the event did happen.
審核模式保留整個數(shù)據(jù)歷史記錄。 CDC工具將更新和刪除轉(zhuǎn)換為“ Target DWH DB”或“ Target Data Lake”中的插入。 標志指示操作的類型(插入,更新,刪除),時間戳字段跟蹤事件的發(fā)生時間。
The example shows the “Target DWH DB” and “Target Data Lake”, which contain the complete protocol of changes, including operation (insert, update, delete) and timestamp.
該示例顯示了“ Target DWH DB”和“ Target Data Lake”,其中包含完整的更改協(xié)議,包括操作(插入,更新,刪除)和時間戳。
用例 (Use Cases)
There are a variety of use cases for CDC:
CDC有多種用例:
Replicate changes into a DWH or data lake.
將更改復制到DWH或數(shù)據(jù)湖中 。
Replicate changes into Kafka in a microservices architecture.
將更改復制到微服務架構(gòu)中的Kafka中 。
Upgrade a database to a higher or the latest version with minimal downtime, e.g. upgrade PostgreSQL 11.9 to PostgreSQL 12.4.
以最少的停機時間將數(shù)據(jù)庫升級到更高版本或最新版本,例如將PostgreSQL 11.9升級到PostgreSQL 12.4。
Migrate data from database X to database Y with minimal downtime, e.g. migrate MariaDB to PostgreSQL or vice versa.
將數(shù)據(jù)從數(shù)據(jù)庫X 遷移到數(shù)據(jù)庫Y的停機時間最少,例如,將MariaDB遷移到PostgreSQL,反之亦然。
Transfer data into the cloud over a secure connection.
通過安全連接 將數(shù)據(jù)傳輸到云中。
工裝 (Tooling)
Commercial tools like Oracle Golden Gate, IBM IIDR, and many others gained a high maturity, and the performance is excellent even during peak loads. These tools can replicate data between heterogeneous database systems.
諸如Oracle Golden Gate,IBM IIDR等許多商業(yè)工具獲得了很高的成熟度,即使在高峰負載期間,其性能也非常出色。 這些工具可以在異構(gòu)數(shù)據(jù)庫系統(tǒng)之間復制數(shù)據(jù)。
Additionally, there are some other interesting products:
此外,還有其他一些有趣的產(chǎn)品:
Debezium is an open-source product built on top of Kafka. Debezium has connectors to pull a change stream from DBs like PostgreSQL, MySQL, MongoDB, Cassandra, etc. and send the data to Kafka. There are some exciting activities in the open-source community going on.
Debezium 是建立在Kafka之上的開源產(chǎn)品。 Debezium具有連接器,可從PostgreSQL,MySQL,MongoDB,Cassandra等數(shù)據(jù)庫中提取更改流,并將數(shù)據(jù)發(fā)送到Kafka。 開源社區(qū)中正在進行一些激動人心的活動。
DynamoDB Streams can be enabled for the NoSQL DB DynamoDB. DynamoDB Streams captures table activities and provides JSON streams.
可以為NoSQL DB DynamoDB啟用DynamoDB流 。 DynamoDB Streams捕獲表活動并提供JSON流。
Tables in VoltDB can be declared as export targets. The declaration means that VoltDB will send a stream of changes into Kafka for such tables.
VoltDB中的表可以聲明為導出目標。 該聲明意味著VoltDB將針對此類表的更改流發(fā)送到Kafka中。
結(jié)論 (Conclusion)
CDC is an essential component in modern architectures for transferring transactional data from systems into a data stream.
CDC是現(xiàn)代體系結(jié)構(gòu)中必不可少的組件,用于將事務數(shù)據(jù)從系統(tǒng)傳輸?shù)綌?shù)據(jù)流中。
CDC enables the provisioning of transactional data in real-time without causing a significant load to the source system, requires no changes in the source application and reduces the transferred amount of data to a minimum.
CDC可以實時提供事務數(shù)據(jù),而不會給源系統(tǒng)造成很大的負擔,不需要更改源應用程序,并將傳輸?shù)臄?shù)據(jù)量減少到最低限度。
Keeping data in only one system is not always possible. Be cautious with personal data. You have to know all appearances of personal data for deletion requirements according to GDPR, for example. A data catalog can help to keep track of replicated sensitive data.
并非始終只能將數(shù)據(jù)保存在一個系統(tǒng)中。 注意個人數(shù)據(jù)。 例如,您必須了解所有出現(xiàn)的個人數(shù)據(jù),以符合GDPR的刪除要求。 數(shù)據(jù)目錄可以幫助跟蹤復制的敏感數(shù)據(jù)。
翻譯自: https://medium.com/daimler-tss-tech/change-data-capture-lessons-learnt-7976391cf78d
go 捕獲數(shù)據(jù)庫新增數(shù)據(jù)
總結(jié)
以上是生活随笔為你收集整理的go 捕获数据库新增数据_更改数据捕获的经验教训的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 【项目管理】项目工作绩效域管理
- 下一篇: keras 香草编码器_完善纯香草jav