I have heard the need for change data capture from several customers in the past few months.
Traditionally, ETL processes have run periodically, on a monthly or weekly basis, using a bulk approach that moves and integrates the entire data set from the operational source systems to the target data warehouse. Data integration requirements have now changed. Customers want to move the changes made to enterprise data while the operational systems are running, without the need for a downtime window, and without degrading the performance and service levels of those operational systems.
Business conditions have changed over time, and they demand a new way of integrating data in a real-time, efficient manner:
- Business globalization and 24x7 operations. In the past, enterprises could stop online systems during the night or over the weekend to provide a window of time for running bulk ETL processes. Today, running a global business with 24x7 operations means smaller downtime windows, or none at all.
- Need for up-to-date, current data. In today's competitive environment, organizations cannot afford to have their managers work with last week's or yesterday's data. Decision-makers need data that is updated several times a day or even in real time.
- Increasing data volumes. Data is doubling every nine months. The larger data volumes become, the more resources and time the ETL processes require. This trend collides with bulk extract windows that keep getting smaller.
- Cost reduction. Bulk ETL operations are costly and inefficient, as they require more processing power, more memory and more network bandwidth. In addition, as bulk ETL processes run for long periods of time, they also require more administration and IT resources to manage.
The first step in change data capture is detecting the changes! There are four main ways to detect changes:
- Audit columns. In most cases, the source system contains audit columns. Audit columns are appended to the end of each table to store the date and time a record was added or modified. They are usually populated by database triggers that fire automatically as records are inserted or updated (a watermark-style extraction sketch based on these columns follows this list).
- Database log scraping. Log scraping effectively takes a snapshot of the database redo log at a scheduled point in time (usually midnight) and scours it for transactions that affect the tables you care about for your ETL load. Log sniffing, by contrast, involves polling the redo log and capturing transactions on the fly. Scraping the log for transactions is probably the messiest of all techniques. It is not uncommon for transaction logs to "blow out," meaning they fill up and prevent new transactions from occurring. If you have exhausted all other techniques and find log scraping is your last resort for finding new or changed records, persuade the DBA to create a special log to meet your specific needs.
- Timed extracts. With a timed extract you typically select all of the rows where the date in the create or modified date fields equals SYSDATE-1, meaning you get all of yesterday's records. Sounds perfect, right? Wrong. Loading records based purely on time is a common mistake among beginning ETL developers, and the process is horribly unreliable. Time-based selection loads duplicate rows when it is restarted after a mid-process failure, so manual intervention and data cleanup are required whenever the process fails. Meanwhile, if the nightly load fails to run and misses a day, there is a risk that the missed data will never make it into the data warehouse.
- Full database "diff compare." A full diff compare keeps a complete snapshot of yesterday's database and compares it, record by record, against today's database to find what changed. The good news is that this technique is fully general: you are guaranteed to find every change. The obvious bad news is that it is often very resource intensive. If you must do a full diff compare, try to run it on the source machine so that you do not have to transfer the whole database into the ETL environment. Also, investigate using CRC (cyclic redundancy check) algorithms to quickly tell whether a complex record has changed (a row-hash sketch also follows this list).
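To make the audit-column approach concrete, below is a minimal sketch, in Python, of a watermark-based extraction. The orders table, its last_modified audit column and the use of SQLite are all illustrative assumptions; the point is that persisting the exact high-water mark, rather than selecting on SYSDATE-1, avoids duplicate rows after a restart and gaps after a missed run.

```python
import sqlite3

# Illustrative only: a hypothetical "orders" table with a "last_modified"
# audit column, queried through SQLite so the sketch is self-contained.
WATERMARK_FILE = "orders.watermark"

def read_watermark(default="1970-01-01 00:00:00"):
    try:
        with open(WATERMARK_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return default

def write_watermark(value):
    with open(WATERMARK_FILE, "w") as f:
        f.write(value)

def extract_changes(conn):
    """Pull every row modified since the last persisted high-water mark."""
    watermark = read_watermark()
    rows = conn.execute(
        "SELECT order_id, status, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (watermark,),
    ).fetchall()
    if rows:
        # Persist the exact maximum timestamp loaded, not "yesterday":
        # a restart then re-reads from this point instead of duplicating
        # or skipping records.
        write_watermark(rows[-1][2])
    return rows
```

Because the watermark is the maximum timestamp actually loaded, a failed or skipped run simply picks up where the previous successful one stopped.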
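The CRC idea can be sketched just as simply: checksum each record, keep yesterday's checksums, and compare them against today's to classify inserts, updates and deletes. The record layout below is a hypothetical example.

```python
import zlib

def row_crc(row):
    """Checksum a record so wide rows can be compared cheaply."""
    return zlib.crc32("|".join(str(v) for v in row).encode("utf-8"))

def diff_compare(yesterday, today):
    """
    yesterday/today: dicts mapping a primary key to the full record.
    Returns (inserts, updates, deletes) found by comparing CRCs.
    """
    old = {k: row_crc(v) for k, v in yesterday.items()}
    new = {k: row_crc(v) for k, v in today.items()}
    inserts = [k for k in new if k not in old]
    deletes = [k for k in old if k not in new]
    updates = [k for k in new if k in old and new[k] != old[k]]
    return inserts, updates, deletes

# Example: record 2 changed, record 3 is new, record 1 was deleted.
y = {1: ("alice", 10), 2: ("bob", 20)}
t = {2: ("bob", 25), 3: ("carol", 5)}
print(diff_compare(y, t))   # ([3], [2], [1])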
CDC solutions are designed to maximize the efficiency of ETL processes, minimize resource usage by replicating or moving only the changes to the data (i.e., the deltas) and minimize the latency in delivering timely business information to its consumers. A change data capture solution comprises the following key components:
- Change Capture Agents
- Change Data Services
- Change Delivery
Change Capture Agents
Change capture agents are the software components responsible for identifying and capturing changes to the source operational data store. A change capture agent sits on the source system and consumes only a small share of its resources, typically 1 to 2 percent of the source system's processing power. Agents can be optimized for and dedicated to the source system (typically using database journals, triggers or exit hooks), or they can use generic methods such as data log comparison.
Change Data Services
Change data services provide a set of functions critical to achieving successful CDC, including but not limited to: filtering (e.g., receiving only committed changes); sequencing (e.g., receiving changes based on transaction/unit-of-work boundaries, by table or by timestamp); change data enrichment (e.g., adding reference data to the delivered change for further processing); life-cycle management (i.e., how long the changes remain available to consuming applications); and auditing, which enables monitoring of the system's end-to-end behavior as well as examination of trends over time.
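As a rough illustration of the filtering and sequencing services, the sketch below (with hypothetical change-record fields txn, table, op and key) delivers only changes that belong to committed transactions and groups them by unit of work, so a consumer receives whole transactions in commit order.

```python
from collections import OrderedDict

def committed_units_of_work(change_stream):
    """
    Filter: pass through only changes whose transaction committed.
    Sequence: yield changes grouped per transaction, in commit order.
    Each event is a dict such as
    {"txn": "T1", "table": "orders", "op": "INSERT", "key": 1};
    commit/rollback markers look like {"txn": "T1", "op": "COMMIT"}.
    """
    pending = OrderedDict()              # txn id -> buffered changes
    for event in change_stream:
        txn = event["txn"]
        if event["op"] == "COMMIT":
            yield pending.pop(txn, [])   # deliver the whole unit of work
        elif event["op"] == "ROLLBACK":
            pending.pop(txn, None)       # discard uncommitted changes
        else:
            pending.setdefault(txn, []).append(event)

stream = [
    {"txn": "T1", "table": "orders",    "op": "INSERT", "key": 1},
    {"txn": "T2", "table": "customers", "op": "UPDATE", "key": 7},
    {"txn": "T2", "op": "ROLLBACK"},
    {"txn": "T1", "table": "orders",    "op": "UPDATE", "key": 1},
    {"txn": "T1", "op": "COMMIT"},
]
for unit in committed_units_of_work(stream):
    print(unit)   # only T1's two committed changes are delivered
```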
Change Delivery
Change delivery mechanisms are responsible for the reliable delivery of changed data to change consumers, typically an ETL program. They can support one or more consumers and provide flexible ways to deliver changes, including push and pull models. In a pull model, the change consumer asks for the changes on a periodic basis (as frequently as needed, typically every few minutes or hours), preferably through a standard interface such as ODBC or JDBC. In a push model, the change consumer listens and waits for changes, which are delivered as soon as they are captured, typically over messaging middleware. Another important function of change delivery is the ability to go back dynamically and ask for older changes for repeated, additional or recovery processing.
Following are two sample scenarios that highlight how organizations can leverage CDC.
Sample Scenario 1: Batch-Oriented CDC (pull CDC)
In this scenario, an ETL tool periodically requests the changes, each time receiving a batch of records that represents all the changes captured since the last request cycle. Change delivery requests can run at low or high frequency (e.g., twice a day or every 15 minutes). For many organizations, the preferred method of providing extracted changes is to expose them as records of a data source table. This approach enables the ETL tool to seamlessly access the changed records using standard interfaces such as ODBC. The CDC solution must maintain the position of the last change delivery and deliver only new changes on each cycle.
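A minimal sketch of this pull cycle follows, assuming the CDC solution exposes captured changes through a hypothetical CDC_ORDERS change table with a monotonically increasing change_seq column; SQLite stands in here for whatever ODBC/JDBC-accessible source the CDC product actually provides.

```python
import sqlite3
import time

POLL_INTERVAL_SECONDS = 900        # e.g., request changes every 15 minutes

def pull_changes(conn, since):
    """Fetch all changes captured after the last delivered position."""
    return conn.execute(
        "SELECT change_seq, op, order_id, status FROM CDC_ORDERS "
        "WHERE change_seq > ? ORDER BY change_seq",
        (since,),
    ).fetchall()

def run_pull_loop(conn, apply_to_warehouse, start_position=0):
    """Periodically request a batch of changes and remember where we stopped."""
    position = start_position
    while True:
        batch = pull_changes(conn, position)
        if batch:
            apply_to_warehouse(batch)      # the ETL load step
            position = batch[-1][0]        # position of the last delivered change
        time.sleep(POLL_INTERVAL_SECONDS)
```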
This scenario is very similar to traditional bulk ETL, except that it processes only the changes to the data instead of the entire source data store. This approach greatly reduces the required resources and eliminates the need for a downtime window for ETL operations.
When should organizations use this approach? This batch-oriented approach is very easy to implement, as it is similar to traditional ETL processes and capitalizes on existing skill sets. Organizations should use this method when their latency requirements are measured in hours or minutes.
Sample Scenario 2: Live/Real-Time CDC (push CDC)
In this scenario, which accommodates near real-time or real-time latency requirements, the change delivery mechanism pushes the changes to the ETL program as soon as changes are captured. This is typically done using a reliable transport such as an event-delivery mechanism or messaging middleware. Some CDC solutions use proprietary event delivery mechanisms, and some support standard messaging middleware (e.g., MQ Series).
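The push model can be sketched with Python's built-in queue module standing in for the messaging middleware; change_queue, apply_change and the change record layout are assumptions made only for illustration.

```python
import queue
import threading

change_queue = queue.Queue()   # stand-in for the messaging middleware

def apply_change(change):
    # In a real system this would be the ETL program updating the warehouse.
    print("applying", change)

def change_listener():
    """Wait for pushed changes and apply them as soon as they arrive."""
    while True:
        change = change_queue.get()       # blocks until a change is pushed
        if change is None:                # sentinel used to stop the sketch
            break
        apply_change(change)

listener = threading.Thread(target=change_listener)
listener.start()

# The capture agent (simulated here) pushes changes the moment they occur.
change_queue.put({"table": "orders", "op": "UPDATE", "key": 42})
change_queue.put(None)
listener.join()
```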
Note that while message-oriented or event-driven integration is more common in EAI products (i.e., using tools such as Integration Brokers), many of the leading ETL tool vendors are offering such capabilities in their solutions to accommodate the demands of high-end, real-time BI applications. This real-time approach is required when the BI applications demand zero latency and the most up-to-date data.
Change Data Capture Technical Considerations
While CDC seems to offer significant advantages, there are several factors that need to be considered and evaluated, including:
Change Capture Technique. Change capture methods vary, and each has different implications for the overall solution's latency, scalability and level of intrusion. Common techniques for capturing changes include reading database journals or log files, using database triggers or exit hooks, comparing data and programming custom event notifications within enterprise programs.
Level of Intrusion. All CDC solutions have a certain degree of system impact, making intrusion a critical evaluation factor. The highest degree of intrusion is "source code" intrusion, which requires changes to the enterprise applications that modify the data stores. A lesser degree is "in-process" or "address space" intrusion, which means the change capture solution consumes operational system resources; this is the case with database triggers and exit hooks, because they run as part of the operational system and share its resources. Using database journals or archive logs is the least intrusive approach, as it does not affect the operational data sources or applications.
Capture Latency. This factor is a key driver for choosing CDC in the first place. Latency is affected by the change capture method, the processing done to the changes and the choice of change delivery mechanism. As a result, changes can be streamed periodically, in high frequency or in real time. One should note that the more real-time the solution is, the more intrusive it typically is as well. Yet another point to consider is that different BI applications will have different latency requirements, and thus enterprises should look for CDC solutions that support a wide range of configurations.
Filtering and Sequencing Services. CDC solutions should provide various services to facilitate the filtering and sequencing of delivered changes. Filtering helps guarantee that only the needed changes are delivered; for example, an ETL process will typically need only the committed changes. Another example is the ability to discard redundant changes and deliver only the last change, to further reduce processing. Sequencing defines the order in which changes are delivered; for example, some ETL applications may need changes on a table-by-table basis, while others may want them based on units of work (i.e., across multiple tables).
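One way to picture the "discard redundant changes and deliver the last change" service is a simple coalescing step, sketched below with hypothetical change records: each (table, key) pair is delivered at most once, carrying its final state.

```python
def coalesce_changes(changes):
    """
    Keep only the net (last) change per (table, key), preserving the
    order in which keys were first seen. Intermediate changes that were
    superseded before delivery are discarded.
    """
    latest = {}
    order = []
    for change in changes:
        ident = (change["table"], change["key"])
        if ident not in latest:
            order.append(ident)
        latest[ident] = change
    return [latest[i] for i in order]

changes = [
    {"table": "orders", "key": 1, "op": "UPDATE", "status": "packed"},
    {"table": "orders", "key": 1, "op": "UPDATE", "status": "shipped"},
    {"table": "orders", "key": 2, "op": "INSERT", "status": "new"},
]
print(coalesce_changes(changes))   # key 1 appears once, with "shipped"
```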
Supporting Multiple Consumers. Captured changes may need to be delivered to more than one consumer, such as multiple ETL processes, data synchronization applications and business activity monitoring. CDC solutions need to support multiple consumers, each of which may have different latency requirements.
Failover and Recoverability. CDC solutions need to guarantee that changes will be delivered correctly even when system, network or process failures occur. Recovery means that a change delivery stream can continue from its last position and that the solution maintains the transactional integrity of the changes throughout the delivery cycle.
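A small sketch of the recoverability idea, with hypothetical file and function names: the consumer records its delivery position only after a batch has loaded successfully and writes it atomically, so after a failure the change stream resumes from the last confirmed position.

```python
import json
import os

POSITION_FILE = "cdc_position.json"

def load_position():
    """Return the last confirmed delivery position (0 if none recorded yet)."""
    if os.path.exists(POSITION_FILE):
        with open(POSITION_FILE) as f:
            return json.load(f)["last_seq"]
    return 0

def save_position(last_seq):
    # Write-then-rename so the position file is never left half-written.
    tmp = POSITION_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_seq": last_seq}, f)
    os.replace(tmp, POSITION_FILE)

def process_batch(batch, load_into_warehouse):
    """Apply a batch, recording the position only after the load succeeds."""
    if not batch:
        return
    load_into_warehouse(batch)         # if this raises, the position is unchanged
    save_position(batch[-1]["seq"])    # so the batch is re-delivered on restart
```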
Mainframe and Legacy Data Sources. BI is only as good as the data it relies on. Analysts estimate that mainframe systems still store approximately 70 percent of corporate business information, and mainframes still process most of the business transactions in the world. Mainframe data sources also typically store higher volumes of data, further increasing the need for a more efficient approach to moving data such as change data capture. In addition, popular mainframe data sources such as VSAM, which are non-relational, present additional challenges when incorporating that data into BI solutions. As ETL and DW tools expect relational data, the non-relational data needs to somehow be mapped to a relational data model.
Seamless integration with ETL tools. When choosing a standalone CDC solution, enterprises should consider the ease of interoperability with their ETL program (off-the-shelf or homegrown). Standard interfaces and plug-ins can reduce risk and speed up the data integration project.
Change data capture allows organizations to deliver real-time business intelligence based on timely data while, at the same time, reducing the cost of data integration.
For organizations looking for ways to meet these demanding business needs, create an event-driven enterprise and provide real-time business intelligence, change data capture is a key component in the data integration architecture.
More and more organizations have started adopting change data capture solutions, and CDC has become an integral part of the data integration architecture.