Kafka Connect JDBC vs Debezium CDC - sql

What are the differences between the JDBC Connector and the Debezium SQL Server CDC Connector (or any other relational database connector), and when should I choose one over the other when looking for a solution to sync two relational databases?
I'm not sure whether this discussion should be about CDC vs. the JDBC Connector in general rather than the Debezium SQL Server CDC Connector specifically, or even just Debezium; I may edit the question later depending on the answers (though my case is about a SQL Server sink).
I'm sharing the research that led me to this question as an answer below.

This explanation focuses on the differences between the Debezium SQL Server CDC Connector and the JDBC Connector, with some more general notes about Debezium and CDC.
tl;dr- scroll down :)
Debezium
Debezium is used only as a source connector; it records all row-level changes.
Debezium Documentation says:
Debezium is a set of distributed services to capture changes in your
databases so that your applications can see those changes and respond
to them. Debezium records all row-level changes within each database
table in a change event stream, and applications simply read these
streams to see the change events in the same order in which they
occurred.
The Debezium Connector for SQL Server first takes a snapshot of the database and then sends records of row-level changes to Kafka, with each table going to a different Kafka topic.
Debezium Connector for SQL Server Documentation says:
Debezium’s SQL Server Connector can monitor and record the row-level
changes in the schemas of a SQL Server database.
The first time it connects to a SQL Server database/cluster, it reads
a consistent snapshot of all of the schemas. When that snapshot is
complete, the connector continuously streams the changes that were
committed to SQL Server and generates corresponding insert, update and
delete events. All of the events for each table are recorded in a
separate Kafka topic, where they can be easily consumed by
applications and services.
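A practical note: the Debezium SQL Server connector sits on top of SQL Server's own change data capture feature, so CDC has to be enabled on the database and on every table you want streamed before the connector can capture anything. A minimal sketch (the database, schema and table names here are just placeholders):

    -- enable CDC at the database level (run inside the source database)
    USE MyDatabase;
    EXEC sys.sp_cdc_enable_db;

    -- enable CDC for each table the connector should capture
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'orders',
        @role_name     = NULL;  -- NULL = no gating role; lock this down as needed

SQL Server's capture job then copies changes from the transaction log into change tables, which is what the connector reads.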
Kafka Connect JDBC
Kafka Connect JDBC can be used either as a source or a sink connector to Kafka, and it supports any database with a JDBC driver.
JDBC Connector Documentation says:
You can use the Kafka Connect JDBC source connector to import data
from any relational database with a JDBC driver into Apache Kafka®
topics. You can use the JDBC sink connector to export data from Kafka
topics to any relational database with a JDBC driver. The JDBC
connector supports a wide variety of databases without requiring
custom code for each one.
They have some notes about installing it for Microsoft SQL Server, which I find irrelevant to this discussion.
So, since the JDBC Connector supports both source and sink while Debezium supports only source (not sink), we can conclude that in order to write data from Kafka to a database with a JDBC driver (a sink), the JDBC Connector is the way to go (including for SQL Server).
Now the comparison can be narrowed to the source side only.
The JDBC Source Connector documentation doesn't say much more at first sight:
Data is loaded by periodically executing a SQL query and creating an
output record for each row in the result set. By default, all tables
in a database are copied, each to its own output topic. The database
is monitored for new or deleted tables and adapts automatically. When
copying data from a table, the connector can load only new or modified
rows by specifying which columns should be used to detect new or
modified data.
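To make the "periodically executing a SQL query" part concrete: in its timestamp+incrementing mode the connector effectively runs something like the query below on every poll. This is only a sketch of the idea; the table and column names are hypothetical, and the exact SQL the connector generates differs per database dialect.

    -- roughly what the JDBC source connector polls for new or modified rows,
    -- using a timestamp column plus an incrementing id to detect changes
    -- (@last_seen_ts / @last_seen_id stand for the offsets stored by Kafka Connect)
    SELECT *
    FROM dbo.orders
    WHERE last_modified > @last_seen_ts
       OR (last_modified = @last_seen_ts AND order_id > @last_seen_id)
    ORDER BY last_modified, order_id;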
Searching a little further to understand their differences, I found this Debezium blog post, which uses the Debezium MySQL Connector as a source and the JDBC Connector as a sink. It explains the difference between the two: generally, Debezium provides records with as much information as possible about the database changes, while the JDBC Connector provides records that are focused on converting the changes into simple insert/upsert commands:
The Debezium MySQL Connector was designed to specifically capture
database changes and provide as much information as possible about
those events beyond just the new state of each row. Meanwhile, the
Confluent JDBC Sink Connector was designed to simply convert each
message into a database insert/upsert based upon the structure of the
message. So, the two connectors have different structures for the
messages, but they also use different topic naming conventions and
behavior of representing deleted records.
Moreover, they use different topic naming conventions and represent deleted records differently:
Debezium uses fully qualified naming for target topics representing
each table it manages. The naming follows the pattern
[logical-name].[database-name].[table-name]. Kafka Connect JDBC
Connector works with simple names [table-name].
...
When the Debezium connector detects a row is deleted, it creates two
event messages: a delete event and a tombstone message. The delete
message has an envelope with the state of the deleted row in the
before field, and an after field that is null. The tombstone message
contains same key as the delete message, but the entire message value
is null, and Kafka’s log compaction utilizes this to know that it can
remove any earlier messages with the same key. A number of sink
connectors, including the Confluent’s JDBC Sink Connector, are not
expecting these messages and will instead fail if they see either kind
of message.
This Confluent blog post explains in more detail how CDC and the JDBC Connector work: the JDBC Connector executes queries against the source database at a fixed interval, which is not a very scalable solution, while CDC operates at a much higher frequency, streaming from the database transaction log:
The connector works by executing a query, over JDBC, against the
source database. It does this to pull in all rows (bulk) or those that
changed since previously (incremental). This query is executed at the
interval defined in poll.interval.ms. Depending on the volumes of data
involved, the physical database design (indexing, etc.), and other
workload on the database, this may not prove to be the most scalable
option.
...
Done properly, CDC basically enables you to stream every single event
from a database into Kafka. Broadly put, relational databases use a
transaction log (also called a binlog or redo log depending on DB
flavour), to which every event in the database is written. Update a
row, insert a row, delete a row – it all goes to the database’s
transaction log. CDC tools generally work by utilising this
transaction log to extract at very low latency and low impact the
events that are occurring on the database (or a schema/table within
it).
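For SQL Server specifically, that transaction-log plumbing is exposed through CDC change tables, which a capture job fills from the log and which Debezium then reads. You can inspect the same change stream yourself with the CDC table-valued functions; in this sketch the capture instance dbo_orders is the hypothetical one enabled earlier, and the function name depends on the capture instance:

    -- read every change recorded for the dbo.orders capture instance
    DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_orders');
    DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

    SELECT __$operation,  -- 1 = delete, 2 = insert, 3 = update (before), 4 = update (after)
           __$start_lsn,
           *
    FROM cdc.fn_cdc_get_all_changes_dbo_orders(@from_lsn, @to_lsn, N'all update old');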
The same blog post also describes the differences between CDC and the JDBC Connector, mainly that the JDBC Connector doesn't support syncing deleted records and is therefore a good fit for prototyping, while CDC fits more mature systems:
The JDBC Connector cannot fetch deleted rows. Because, how do you
query for data that doesn’t exist?
...
My general steer on CDC vs JDBC is that JDBC is great for prototyping,
and fine low-volume workloads. Things to consider if using the JDBC
connector:
Doesn’t give true CDC (capture delete records, want before/after record versions)
Latency in detecting new events
Impact of polling the source database continually (and balancing this with the desired latency)
Unless you’re doing a bulk pull from a table, you need to have an ID and/or timestamp that you can use to spot new records. If you don’t own the schema, this becomes a problem.
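If you do own the schema, meeting that last requirement usually means adding a last-modified column and keeping it up to date so the connector has something to filter on. A rough T-SQL sketch with hypothetical table and column names (the default covers inserts, the trigger covers updates):

    -- add a column the JDBC source connector can use in timestamp mode
    ALTER TABLE dbo.orders
        ADD last_modified DATETIME2 NOT NULL
            CONSTRAINT df_orders_last_modified DEFAULT SYSUTCDATETIME();
    GO

    -- the default only fires on insert; keep the column fresh on updates
    CREATE TRIGGER dbo.trg_orders_last_modified
    ON dbo.orders
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE o
        SET last_modified = SYSUTCDATETIME()
        FROM dbo.orders AS o
        JOIN inserted AS i ON i.order_id = o.order_id;
    END;
    GO

Note that this still doesn't help with deletes, which is exactly the gap CDC closes.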
tl;dr Conclusion
The main differences between Debezium and JDBC Connector are:
Debezium can be used only as a Kafka source, while the JDBC Connector can be used as both a Kafka source and a sink.
For sources:
The JDBC Connector doesn't support syncing deleted records, while Debezium does.
The JDBC Connector queries the database at a fixed interval, which is not a very scalable solution, while CDC operates at a higher frequency, streaming from the database transaction log.
Debezium provides records with more information about the database changes, while the JDBC Connector provides records that are focused on converting the changes into simple insert/upsert commands.
They use different topic naming conventions.

Simply put, CDC is log-based streaming, while the Kafka Connect JDBC source connector is query-based streaming. :)

In "JDBC Connector" you cannot capture DDL changes like new tables, columns etc. With Debezium connector you can track data structure changes so you also adjust the sink connector if it necesary.

Related

Creating Feeds between local SQL servers and Azure SQL servers?

We want to use Azure servers to run our Power Apps applications; however, we have local SQL servers which contain our data warehouse. We want only certain tables to be on Azure and want to create data feeds between the two, with information going from one to the other.
Does anyone have any insight into how I can achieve this?
I have googled but there doesn't appear to be a wealth of information on this topic.
It depends on how quickly after a change in your source (the on-premises SQL Server) you need that change reflected in your sink (Azure SQL).
If you can tolerate a few minutes of delay, or only need to update once a day, I would suggest a basic Data Factory pipeline (search on Google for "data factory upsert"). How exactly you achieve this depends on your data.
If you need it faster, or it is impossible to extract an incremental update from your source, you would need to either use triggers to write the changes from one database to the other, or use a change data capture tool that does that for you.
It looks like you just want to sync the data in some tables between a local SQL Server and an Azure SQL database.
You can use the Azure SQL Data Sync.
Summary:
SQL Data Sync is a service built on Azure SQL Database that lets you synchronize the data you select bi-directionally across multiple SQL databases and SQL Server instances.
With Data Sync, you can keep data synchronized between your on-premises databases and Azure SQL databases to enable hybrid applications.
A Sync Group has the following properties:
The Sync Schema describes which data is being synchronized.
The Sync Direction can be bi-directional or can flow in only one
direction. That is, the Sync Direction can be Hub to Member, or
Member to Hub, or both.
The Sync Interval describes how often synchronization occurs.
The Conflict Resolution Policy is a group level policy, which can be
Hub wins or Member wins.
Next, you need to learn how to configure Data Sync. Please refer to this Azure document: Tutorial: Set up SQL Data Sync between Azure SQL Database and SQL Server on-premises.
In this tutorial, you learn how to set up Azure SQL Data Sync by creating a sync group that contains both Azure SQL Database and SQL Server instances. The sync group is custom configured and synchronizes on the schedule you set.
Hope this helps.
The most robust solution here is Transactional Replication. You can also use SSIS or Azure Data Factory for copying tables to/from Azure SQL Database. And Azure SQL Data Sync also exists.

How to transfer Data from One SQL server to another with out transactional replication

I have a database connected to a website, and data from the website is inserted into that database. I need to transfer data from that database to another primary database (SQL Server) on another server in real time (with minimum latency).
I cannot use transactional replication in this case. What are the other alternatives to achieve this? Can I integrate data streams like Apache Kafka with SQL Server?
Without more detail it's hard to give a full answer. There's what's technically possible, and there's architecturally what actually makes sense :)
Yes, you can stream from an RDBMS to Kafka, and from Kafka to an RDBMS. You can use the Kafka Connect JDBC source and sink connectors. There are also CDC tools (e.g. Attunity, GoldenGate, etc.) that support integration with MS SQL and other RDBMSs.
BUT…it depends on why you want the data in the second database. Do you need an exact replica of the first? If so, DB-to-DB replication may be a better option. Kafka is a great option if you want to process the data elsewhere and/or persist it in another store. But if you just want MS SQL to MS SQL…Kafka itself may be overkill.

Managing data in two relational databases in a single location

Background: We currently have our data split between two relational databases (Oracle and Postgres). There is a need to run ad-hoc queries that involve tables in both databases. Currently we are doing this in one of two ways:
ETL from one database to another. This requires a lot of developer time.
An Oracle foreign data wrapper on our Postgres server. This is working, but the queries run extremely slowly.
We already use Google Cloud Platform (for the project that uses the Postgres server). We are familiar with Google BigQuery (BQ).
What we want to do:
We want most of our tables from both these databases (as-is) available at a single location, so querying them is easy and fast. We are thinking of copying over the data from both DB servers into BQ, without doing any transformations.
It looks like we need to take full dumps of our tables on a periodic basis (daily) and update BQ since BQ is append-only. The recent availability of DML in BQ seems to be very limited.
We are aware that loading the tables as is to BQ is not an optimal solution and we need to denormalize for efficiency, but this is a problem we have to solve after analyzing the feasibility.
My question is whether BQ is a good solution for us, and if yes, how to efficiently keep BQ in sync with our DB data, or whether we should look at something else (like say, Redshift)?
WePay has been publishing a series of articles on how they solve these problems. Check out https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka.
To keep everything synchronized, they do the following:
The flow of data starts with each microservice’s MySQL database. These
databases run in Google Cloud as CloudSQL MySQL instances with GTIDs
enabled. We’ve set up a downstream MySQL cluster specifically for
Debezium. Each CloudSQL instance replicates its data into the Debezium
cluster, which consists of two MySQL machines: a primary (active)
server and secondary (passive) server. This single Debezium cluster is
an operational trick to make it easier for us to operate Debezium.
Rather than having Debezium connect to dozens of microservice
databases directly, we can connect to just a single database. This
also isolates Debezium from impacting the production OLTP workload
that the master CloudSQL instances are handling.
And then:
The Debezium connectors feed the MySQL messages into Kafka (and add
their schemas to the Confluent schema registry), where downstream
systems can consume them. We use our Kafka connect BigQuery connector
to load the MySQL data into BigQuery using BigQuery’s streaming API.
This gives us a data warehouse in BigQuery that is usually less than
30 seconds behind the data that’s in production. Other microservices,
stream processors, and data infrastructure consume the feeds as well.

wso2cep : Data Storage in addition to display

I was wondering whether, in addition to processing and displaying data on a dashboard in wso2cep, I can store it somewhere for a long period of time to get further information later. I have read that there are two types of tables used in wso2cep: in-memory and RDBMS tables.
Which one should I choose?
There is one more option, which is to switch to wso2das. Is that a good approach?
Is the default database fine for that purpose, or should I move to another supported database like SQL Server, Oracle, etc.?
In-memory or RDBMS?
In-memory tables internally use Java collection structures, so the data is destroyed once the JVM terminates (after a server restart, the data won't be available). On the other hand, RDBMS tables persist data permanently. For your scenario, I think you should proceed with RDBMS tables.
CEP or DAS?
CEP only provides real-time analytics, whereas DAS provides batch analytics (with Spark SQL) in addition to real-time analytics. If you have a scenario which requires batch processing, incremental processing, etc., you can go ahead with DAS. Note that migrating from CEP to DAS is quite simple (since the artifacts are identical).
Default (H2) DB or other DB?
By default, WSO2 products use the embedded H2 DB as the data source. However, it's recommended to use MySQL or Oracle in production environments.

How can i see the sql statements in EJB 2.1 with CMP

I have an old EJB 2.1 project using DB2 as the database.
I want to see the SQL queries sent by the program to the database. How can I do that? I am using DB2, and persistence is container-managed (CMP).
In Hibernate there is something like <property name="hibernate.show_sql" value="true"/>; I want to have the same effect. :-)
If you don't find a suitable option for capturing SQL from your persistence layer, DB2 offers some powerful tracing options at the driver level and on the database server. Each approach has its pros and cons.
Since you've described the EJB project as being old, it's possible that your persistence layer is using IBM's JDBC Type 2 driver, which is essentially a wrapper around DB2's Call Level Interface, in which case you'd be looking at enabling tracing options through the db2cli.ini file.
The newer and more popular driver is the JDBC Type 4 "universal driver", db2jcc.jar, which handles tracing through properties that can be appended to the connection string and/or set at runtime by the application.
Since I work more with databases than with application servers, my personal preference for capturing SQL is to define a statement event monitor, which captures SQL statements and detailed statistics to a flat file or a set of dedicated tables. Event monitors offer a variety of filtering mechanisms that make it possible to collect detailed trace records for only a small portion of your total workload. Another attractive aspect of event monitors is that the DBA can start or stop them without disrupting the application server. Since event monitors can quickly collect a lot of data, I prefer using tables as event monitor targets because I can easily analyze the results with a few SQL queries.
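For reference, a minimal sketch of that last approach on DB2 for LUW; the monitor name is arbitrary, and the generated target table and column names are derived from it and may differ slightly between DB2 versions:

    -- create a statement event monitor that writes its records to tables
    CREATE EVENT MONITOR cmp_sql_mon FOR STATEMENTS WRITE TO TABLE;

    -- the DBA can switch capture on and off without touching the application server
    SET EVENT MONITOR cmp_sql_mon STATE 1;  -- start capturing
    SET EVENT MONITOR cmp_sql_mon STATE 0;  -- stop capturing

    -- analyze the captured SQL (the statements land in a table named STMT_<monitor-name>)
    SELECT stmt_text
    FROM stmt_cmp_sql_mon;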