Integrating Databases and Apache Kafka ® @rmoff #codetalks
A presentation at Code.Talks in October 2019 in Hamburg, Germany by Robin Moffatt
Integrating Databases and Apache Kafka ® @rmoff #codetalks
@rmoff #codetalks Photo by Emily Morter on Unsplash No More Silos: Integrating Databases and Apache Kafka
Analytics - Database Offload RDBMS @rmoff #codetalks HDFS / S3 / BigQuery etc No More Silos: Integrating Databases and Apache Kafka
Real-time Event Stream Enrichment @rmoff #codetalks order events customer orders C D C RDBMS <y> customer Stream Processing No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Evolve processing from old systems to new Existing App New App <x> RDBMS Stream Processing No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks “ But streaming…I’ve just got data in a database…right? @rmoff / No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks “ Bold claim: all your data is event streams @rmoff / No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks A Customer Experience No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks A Sale No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks A Sensor Reading No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks An Application Log Entry No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Databases No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Do you think that’s a table you are querying? No More Silos: Integrating Databases and Apache Kafka
The Stream Table Duality @rmoff #codetalks Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka
Time The Stream Table Duality Account ID Amount 12345 + €50 @rmoff #codetalks Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka
Time The Stream Table Duality Account ID Amount 12345 + €50 12345
Time The Stream Table Duality @rmoff #codetalks Account ID Amount 12345 + €50 12345
Time The Stream Table Duality Stream @rmoff #codetalks Table Account ID Amount 12345 + €50 12345
@rmoff #codetalks The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf No More Silos: Integrating Databases and Apache Kafka Photo by Bobby Burch on Unsplash
@rmoff #codetalks Photo by Vadim Sherbakov on Unsplash No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Streaming Integration with Kafka Connect syslog Sources Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Streaming Integration with Kafka Connect Amazon S3 Sinks Google BigQuery Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Streaming Integration with Kafka Connect Amazon S3 syslog Google BigQuery Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
Look Ma, No Code! @rmoff #codetalks { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “connection.url”: “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } No More Silos: Integrating Databases and Apache Kafka
Kafka Connect basics Source Kafka Connect @rmoff #codetalks Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Connectors Connector Source Kafka Connect Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Converters Connector Converter bytes[] Source Kafka Connect Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Serialisation & Schemas Avro -> Confluent Schema Registry Protobuf JSON CSV https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf No More Silos: Integrating Databases and Apache Kafka
The Confluent Schema Registry Avro Schema @rmoff #codetalks Schema Registry Target Source Kafka Connect Avro Message Avro Message Kafka Connect No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Single Message Transforms Connector Source Transform(s) Converter Kafka Connect Kafka No More Silos: Integrating Databases and Apache Kafka
Extensible Connector @rmoff #codetalks Transform(s) Converter https://kafka.apache.org/23/javadoc/ No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Confluent Hub hub.confluent.io No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Change-Data-Capture (CDC) Query-based Log-based No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Log-based CDC #051024 17:24:13 server id 1 # Position Timestamp
end_log_pos 98 Type Master ID 0f 01 00 00 00 Size 5e 00 00 00 Master Pos Flags 62 00 00 00 00 00
2d 64 65 62 75 67 2d 6c |..5.0.15.debug.l|
00 00 00 00 00 00 00 00 |og…………..|
00 00 00 00 00 00 00 00 |…………….|
13 38 0d 00 08 00 12 00 |…….C.8……|
00 04 1a |…….K…| No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks Transaction log No More Silos: Integrating Databases and Apache Kafka
Demo Time! https://rmoff.dev/codetalks19-code @rmoff #codetalks No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks ✅ Usually easier to setup ✅ Requires fewer permissions 🛑 Needs specific columns in source schema to track changes 🛑 Can’t track deletes 🛑 Can’t track multiple events between polling interval Read more: http://cnfl.io/kafka-cdc Photo by Matese Fields on Unsplash 🛑 Impact of polling the DB (or higher latencies tradeoff) No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks SELECT * CREATE TABLE my_table ( ID INT, FOO VARCHAR, BAR VARCHAR, WIBBLE VARCHAR, TS_COL TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ) FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC INSERT @rmoff #codetalks SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC INSERT SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks UPDATE No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC UPDATE SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #codetalks DELETE No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC DELETE SELECT * FROM my_table WHERE ts_col > previous ts e p o N No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address updateTS SELECT * FROM WHERE orders updateTS > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM WHERE orders updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” No More Silos: Integrating Databases and Apache Kafka
10:54:00 PENDING INSERT INTO orders (orderID, status) VALUES (42, ‘PENDING’); No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address 42 1640 Riverside Drive 10:54:20 PENDING updateTS UPDATE orders SET address = ‘1640 Riverside Drive’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:25 PENDING UPDATE orders SET address = ‘29 Acacia Road’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED UPDATE orders SET status = ‘SHIPPED’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status orderID address status 42 SHIPPED 42 address 29 PENDING Acacia Road — updateTS updateTS 10:54:29 10:54:00 42 PENDING 1640 Riverside Drive 10:54:20 42 PENDING 29 Acacia Road 10:54:25 42 SHIPPED 29 Acacia Road 10:54:29 No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM WHERE orders updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Event-driven app Query-based CDC Log-based CDC
Log-based CDC @rmoff #codetalks Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks UPDATE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks UPDATE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #codetalks DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka
@rmoff #codetalks Log-based CDC Immutable event log No More Silos: Integrating Databases and Apache Kafka
Photo by Sebastian Pociecha on Unsplash @rmoff #codetalks Log-based CDC ✅ Greater data fidelity ✅ Lower latency ✅ Lower impact on source 🛑 More setup steps 🛑 Higher system privileges required 🛑 For propriatory databases, usually $$$ Read more: http://cnfl.io/kafka-cdc No More Silos: Integrating Databases and Apache Kafka
tl;dr : which tool do I use? @rmoff #codetalks • If you’re wanting to do query-based CDC, the choice is simple: confluent.io/hub No More Silos: Integrating Databases and Apache Kafka
Which Log-Based CDC Tool? • Open Source RDBMS, e.g. MySQL, PostgreSQL • Debezium • (+ paid options) @rmoff #codetalks • Proprietory RDBMS, e.g. Oracle, MS SQL, DB2 • Oracle GoldenGate • Debezium + XStream • Attunity • Mainframe e.g. VSAM, IMS • IBM InfoSphere Data Replication • SQData • Attunity • HVR • SQData • tcVISION • tcVISION • Etc See also: https://rmoff.net/2018/12/12/streaming-data-from-oracle-into-kafka-december-2018/ No More Silos: Integrating Databases and Apache Kafka
Real-time Event Stream Enrichment @rmoff #codetalks ratings Customer ratings C D C RDBMS <y> customer Stream Processing No More Silos: Integrating Databases and Apache Kafka
Demo Time! https://rmoff.dev/codetalks19-code @rmoff #codetalks No More Silos: Integrating Databases and Apache Kafka
Fully Managed Kafka as a Service
http://cnfl.io/book-bundle
@rmoff #codetalks #EOF 💬 Join the Confluent Community Slack group at http://cnfl.io/slack https://talks.rmoff.net