A presentation at Code.Talks in in Hamburg, Germany by Robin Moffatt
Integrating Databases and Apache Kafka ® @rmoff #codetalks
@rmoff #codetalks Photo by Emily Morter on Unsplash No More Silos: Integrating Databases and Apache Kafka
Analytics - Database Offload RDBMS @rmoff #codetalks HDFS / S3 / BigQuery etc No More Silos: Integrating Databases and Apache Kafka
Real-time Event Stream Enrichment @rmoff #codetalks order events customer orders C D C RDBMS <y> customer Stream Processing No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Evolve processing from old systems to new Existing App New App <x> RDBMS Stream Processing No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks “ But streaming…I’ve just got data in a database…right? @rmoff / No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks “ Bold claim: all your data is event streams @rmoff / No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks A Customer Experience No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks A Sale No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks A Sensor Reading No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks An Application Log Entry No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Databases No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Do you think that’s a table you are querying? No More Silos: Integrating Databases and Apache Kafka
The Stream Table Duality @rmoff #codetalks Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka
Time The Stream Table Duality Account ID Amount 12345 + €50 @rmoff #codetalks Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka
Time The Stream Table Duality Account ID Amount 12345 + €50 12345
Time The Stream Table Duality @rmoff #codetalks Account ID Amount 12345 + €50 12345
Time The Stream Table Duality Stream @rmoff #codetalks Table Account ID Amount 12345 + €50 12345
@rmoff #codetalks The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf No More Silos: Integrating Databases and Apache Kafka Photo by Bobby Burch on Unsplash
@rmoff #codetalks Photo by Vadim Sherbakov on Unsplash No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Streaming Integration with Kafka Connect syslog Sources Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Streaming Integration with Kafka Connect Amazon S3 Sinks Google BigQuery Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Streaming Integration with Kafka Connect Amazon S3 syslog Google BigQuery Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
Look Ma, No Code! @rmoff #codetalks { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “connection.url”: “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } No More Silos: Integrating Databases and Apache Kafka
Kafka Connect basics Source Kafka Connect @rmoff #codetalks Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Connectors Connector Source Kafka Connect Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Converters Connector Converter bytes[] Source Kafka Connect Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Serialisation & Schemas Avro -> Confluent Schema Registry Protobuf JSON CSV https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf No More Silos: Integrating Databases and Apache Kafka
The Confluent Schema Registry Avro Schema @rmoff #codetalks Schema Registry Target Source Kafka Connect Avro Message Avro Message Kafka Connect No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Single Message Transforms Connector Source Transform(s) Converter Kafka Connect Kafka No More Silos: Integrating Databases and Apache Kafka
Extensible Connector @rmoff #codetalks Transform(s) Converter https://kafka.apache.org/23/javadoc/ No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Confluent Hub hub.confluent.io No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Change-Data-Capture (CDC) Query-based Log-based No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Log-based CDC #051024 17:24:13 server id 1 # Position Timestamp
end_log_pos 98 Type Master ID 0f 01 00 00 00 Size 5e 00 00 00 Master Pos Flags 62 00 00 00 00 00
2d 64 65 62 75 67 2d 6c |..5.0.15.debug.l|
00 00 00 00 00 00 00 00 |og…………..|
00 00 00 00 00 00 00 00 |…………….|
13 38 0d 00 08 00 12 00 |…….C.8……|
00 04 1a |…….K…| No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks Transaction log No More Silos: Integrating Databases and Apache Kafka
Demo Time! https://rmoff.dev/codetalks19-code @rmoff #codetalks No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks ✅ Usually easier to setup ✅ Requires fewer permissions 🛑 Needs specific columns in source schema to track changes 🛑 Can’t track deletes 🛑 Can’t track multiple events between polling interval Read more: http://cnfl.io/kafka-cdc Photo by Matese Fields on Unsplash 🛑 Impact of polling the DB (or higher latencies tradeoff) No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks SELECT * CREATE TABLE my_table ( ID INT, FOO VARCHAR, BAR VARCHAR, WIBBLE VARCHAR, TS_COL TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ) FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC INSERT @rmoff #codetalks SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC INSERT SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks UPDATE No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC UPDATE SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks DELETE No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC DELETE SELECT * FROM my_table WHERE ts_col > previous ts e p o N No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address updateTS SELECT * FROM WHERE orders updateTS > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM WHERE orders updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” No More Silos: Integrating Databases and Apache Kafka
10:54:00 PENDING INSERT INTO orders (orderID, status) VALUES (42, ‘PENDING’); No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address 42 1640 Riverside Drive 10:54:20 PENDING updateTS UPDATE orders SET address = ‘1640 Riverside Drive’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:25 PENDING UPDATE orders SET address = ‘29 Acacia Road’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED UPDATE orders SET status = ‘SHIPPED’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status orderID address status 42 SHIPPED 42 address 29 PENDING Acacia Road — updateTS updateTS 10:54:29 10:54:00 42 PENDING 1640 Riverside Drive 10:54:20 42 PENDING 29 Acacia Road 10:54:25 42 SHIPPED 29 Acacia Road 10:54:29 No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM WHERE orders updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Event-driven app Query-based CDC Log-based CDC
Log-based CDC @rmoff #codetalks Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks UPDATE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks UPDATE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Log-based CDC Immutable event log No More Silos: Integrating Databases and Apache Kafka
Photo by Sebastian Pociecha on Unsplash @rmoff #codetalks Log-based CDC ✅ Greater data fidelity ✅ Lower latency ✅ Lower impact on source 🛑 More setup steps 🛑 Higher system privileges required 🛑 For propriatory databases, usually $$$ Read more: http://cnfl.io/kafka-cdc No More Silos: Integrating Databases and Apache Kafka
tl;dr : which tool do I use? @rmoff #codetalks • If you’re wanting to do query-based CDC, the choice is simple: confluent.io/hub No More Silos: Integrating Databases and Apache Kafka
Which Log-Based CDC Tool? • Open Source RDBMS, e.g. MySQL, PostgreSQL • Debezium • (+ paid options) @rmoff #codetalks • Proprietory RDBMS, e.g. Oracle, MS SQL, DB2 • Oracle GoldenGate • Debezium + XStream • Attunity • Mainframe e.g. VSAM, IMS • IBM InfoSphere Data Replication • SQData • Attunity • HVR • SQData • tcVISION • tcVISION • Etc See also: https://rmoff.net/2018/12/12/streaming-data-from-oracle-into-kafka-december-2018/ No More Silos: Integrating Databases and Apache Kafka
Real-time Event Stream Enrichment @rmoff #codetalks ratings Customer ratings C D C RDBMS <y> customer Stream Processing No More Silos: Integrating Databases and Apache Kafka
Demo Time! https://rmoff.dev/codetalks19-code @rmoff #codetalks No More Silos: Integrating Databases and Apache Kafka
Fully Managed Kafka as a Service
http://cnfl.io/book-bundle
@rmoff #codetalks #EOF 💬 Join the Confluent Community Slack group at http://cnfl.io/slack https://talks.rmoff.net
Companies new and old are all recognising the importance of a low-latency, scalable, fault-tolerant data backbone, in the form of the Apache Kafka streaming platform. With Kafka, developers can integrate multiple sources and systems, which enables low latency analytics, event-driven architectures and the population of multiple downstream systems.
In this talk, we’ll look at one of the most common integration requirements - connecting databases to Kafka. We’ll consider the concept that all data is a stream of events, including that residing within a database. We’ll look at why we’d want to stream data from a database, including driving applications in Kafka from events upstream. We’ll discuss the different methods for connecting databases to Kafka, and the pros and cons of each. Techniques including Change-Data-Capture (CDC) and Kafka Connect will be covered, as well as an exploration of the power of KSQL for performing transformations such as joins on the inbound data.
The following resources were mentioned during the presentation or are useful additional information.
Here’s what was said about this presentation on social media.
Here for @rmoff’s #codetalks talk, of course. pic.twitter.com/2xyeNgyjXG
— David G. Simmons 🥑 ✈️ Code.Talks (@davidgsIoT) October 24, 2019
#codetalks started with an awesome talk about integrating databases with #kafka by @rmoff, thanks for that!
— Dominik Liebler 🕵️ (@domnikl) October 24, 2019
— sudo -u -#1 !! (@invad0r) October 24, 2019
Great talk @rmoff ! Loved the demo and memes 😁 Glad you didn't get cut short.#codetalks
— sudo -u -#1 !! (@invad0r) October 24, 2019
I recommend you all have a look at @debezium to enable CDC for @MySQL, @PostgreSQL, @MongoDB, @SQLServer, @cassandra (beta), @Oracle DB.
— René Kerner (@rk3rn3r) October 24, 2019
It's awesome!
The big question! Query-based or log-based CDC (Change Data Capture)??? 🤔
— René Kerner (@rk3rn3r) October 24, 2019
I'll prefer log based CDC/Change Data Capture. And it sounds that it has advantages according to @rmoff. https://t.co/nKIuRIJVUo pic.twitter.com/bJqWYSfVYA
He definitely has the biggest terminal console right now. pic.twitter.com/nyc44T6Xae
— René Kerner (@rk3rn3r) October 24, 2019
It's awesome pic.twitter.com/Ea7v7NFrdK
— sudo -u -#1 !! (@invad0r) October 24, 2019
Great introduction to Kafka Connect by @rmoff from @confluentinc at #codetalks #codetalkshh pic.twitter.com/DEuRJW1YbC
— René Kerner (@rk3rn3r) October 24, 2019
Not too often you'll have your slides on a cinema screen. 😮😮😮 pic.twitter.com/Fa179GZsiC
— René Kerner (@rk3rn3r) October 24, 2019
.@rmoff speaking at #codetalks about how to escape data silos with @apachekafka. pic.twitter.com/6Nuo7GiDXb
— René Kerner (@rk3rn3r) October 24, 2019