@rmoff #kafkasummit Integrating Databases and Apache Kafka ®
A presentation at Kafka Summit New York 2019 in April 2019 in New York, NY, USA by Robin Moffatt
@rmoff #kafkasummit Integrating Databases and Apache Kafka ®
@rmoff #kafkasummit Photo by Emily Morter on Unsplash No More Silos: Integrating Databases and Apache Kafka
Analytics - Database Offload RDBMS @rmoff #kafkasummit HDFS / S3 / BigQuery etc No More Silos: Integrating Databases and Apache Kafka
Real-time Event Stream Enrichment @rmoff #kafkasummit order events customer orders C D C RDBMS <y> customer Stream Processing No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Evolve processing from old systems to new Existing App New App <x> RDBMS Stream Processing No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit “ But streaming…I’ve just got data in a database…right? @rmoff / No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit “ Bold claim: all your data is event streams @rmoff / No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit A Customer Experience No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit A Sale No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit A Sensor Reading No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit An Application Log Entry No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Databases No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Do you think that’s a table you are querying? No More Silos: Integrating Databases and Apache Kafka
The Stream Table Duality @rmoff #kafkasummit Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka
Time The Stream Table Duality Account ID Amount 12345 + €50 @rmoff #kafkasummit Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka
Time The Stream Table Duality Account ID Amount 12345 + €50 12345
Time The Stream Table Duality @rmoff #kafkasummit Account ID Amount 12345 + €50 12345
Time The Stream Table Duality Stream @rmoff #kafkasummit Table Account ID Amount 12345 + €50 12345
@rmoff #kafkasummit The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf No More Silos: Integrating Databases and Apache Kafka Photo by Bobby Burch on Unsplash
@rmoff #kafkasummit Photo by Vadim Sherbakov on Unsplash No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Streaming Integration with Kafka Connect syslog flat file CSV JSON Sources MQTT Tasks Workers Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Streaming Integration with Kafka Connect Amazon S3 Sinks MQTT Tasks Workers Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Streaming Integration with Kafka Connect Amazon S3 syslog flat file CSV JSON Sources Sinks MQTT MQTT Tasks Workers Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
Kafka Connect basics Source Kafka Connect @rmoff #kafkasummit Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Connectors Connector Source Kafka Connect Kafka No More Silos: Integrating Databases and Apache Kafka
Easy to configure @rmoff #kafkasummit { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } https://docs.confluent.io/current/connect/ “connection.url”: No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Converters Connector Source Converter Kafka Connect https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Serialisation & Schemas Avro -> Confluent Schema Registry Protobuf JSON CSV https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf No More Silos: Integrating Databases and Apache Kafka
The Confluent Schema Registry Avro Schema @rmoff #kafkasummit Schema Registry Target Source Kafka Connect Avro Message Avro Message Kafka Connect No More Silos: Integrating Databases and Apache Kafka
Single Message Transforms Connector Source Transform(s) Kafka Connect @rmoff #kafkasummit Converter Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Extensible Connector Transform(s) Converter hub.confluent.io No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Change-Data-Capture (CDC) Query-based Log-based No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #kafkasummit SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #kafkasummit SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #kafkasummit SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Log-based CDC #051024 17:24:13 server id 1 # Position Timestamp
end_log_pos 98 Type Master ID 0f 01 00 00 00 Size 5e 00 00 00 Master Pos Flags 62 00 00 00 00 00
2d 64 65 62 75 67 2d 6c |..5.0.15.debug.l|
00 00 00 00 00 00 00 00 |og…………..|
00 00 00 00 00 00 00 00 |…………….|
13 38 0d 00 08 00 12 00 |…….C.8……|
00 04 1a |…….K…| No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #kafkasummit Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #kafkasummit Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #kafkasummit Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #kafkasummit Transaction log No More Silos: Integrating Databases and Apache Kafka
Demo Time! @rmoff #kafkasummit No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Try it yourself: https://github.com/confluentinc/demo-scene/tree/master/no-more-silos No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit should I use?” Photo by Tyler Nix on Unsplash “Which one No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit It Depends! No More Silos: Integrating and on Apache Kafka Photo by Databases Trevor Cole Unsplash
Query-based CDC @rmoff #kafkasummit ✅ Usually easier to setup ✅ Requires fewer permissions 🛑 Needs specific columns in source schema to track changes 🛑 Can’t track deletes 🛑 Can’t track multiple events between polling interval Read more: http://cnfl.io/kafka-cdc Photo by Matese Fields on Unsplash 🛑 Impact of polling the DB (or higher latencies tradeoff) No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #kafkasummit SELECT * CREATE TABLE my_table ( ID INT, FOO VARCHAR, BAR VARCHAR, WIBBLE VARCHAR, TS_COL TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ) FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC INSERT @rmoff #kafkasummit SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC INSERT SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #kafkasummit UPDATE No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC UPDATE SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #kafkasummit DELETE No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC DELETE SELECT * FROM my_table WHERE ts_col > previous ts e p o N No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC orderID status address updateTS SELECT * FROM WHERE orders updateTS > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM WHERE orders updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” No More Silos: Integrating Databases and Apache Kafka
10:54:00 PENDING INSERT INTO orders (orderID, status) VALUES (42, ‘PENDING’); No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC orderID status address 42 1640 Riverside Drive 10:54:20 PENDING updateTS UPDATE orders SET address = ‘1640 Riverside Drive’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:25 PENDING UPDATE orders SET address = ‘29 Acacia Road’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED UPDATE orders SET status = ‘SHIPPED’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC orderID status orderID address status 42 SHIPPED 42 address 29 PENDING Acacia Road — updateTS updateTS 10:54:29 10:54:00 42 PENDING 1640 Riverside Drive 10:54:20 42 PENDING 29 Acacia Road 10:54:25 42 SHIPPED 29 Acacia Road 10:54:29 No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM WHERE orders updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Event-driven app Query-based CDC Log-based CDC
Log-based CDC @rmoff #kafkasummit Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #kafkasummit UPDATE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #kafkasummit UPDATE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #kafkasummit DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #kafkasummit DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #kafkasummit DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #kafkasummit Immutable event log No More Silos: Integrating Databases and Apache Kafka
Photo by Sebastian Pociecha on Unsplash @rmoff #kafkasummit Log-based CDC ✅ Greater data fidelity ✅ Lower latency ✅ Lower impact on source 🛑 More setup steps 🛑 Higher system privileges required 🛑 For propriatory databases, usually $$$ Read more: http://cnfl.io/kafka-cdc No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Change-Data-Capture (CDC) Query-based Log-based No More Silos: Integrating Databases and Apache Kafka
tl;dr : which tool do I use? @rmoff #kafkasummit • Query-based CDC confluent.io/connector/kafka-connect-jdbc No More Silos: Integrating Databases and Apache Kafka
Which Log-Based CDC Tool? • Open Source RDBMS, e.g. MySQL, PostgreSQL • Debezium • (+ paid options) @rmoff #kafkasummit • Proprietory RDBMS, e.g. Oracle, MS SQL, DB2 • Oracle GoldenGate • Debezium + XStream • Attunity • Mainframe e.g. VSAM, IMS • IBM InfoSphere Data Replication • SQData • Attunity • HVR • SQData • tcVISION • Etc See also: https://rmoff.net/2018/12/12/streaming-data-from-oracle-into-kafka-december-2018/ No More Silos: Integrating Databases and Apache Kafka
Real-time Event Stream Enrichment @rmoff #kafkasummit ratings Customer ratings C D C RDBMS <y> customer Stream Processing No More Silos: Integrating Databases and Apache Kafka
Demo Time! @rmoff #kafkasummit No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Try it yourself: https://github.com/confluentinc/demo-scene/tree/master/no-more-silos No More Silos: Integrating Databases and Apache Kafka
Confluent Community Components @rmoff #kafkasummit Apache Kafka with a bunch of cool stuff! For free! Log Events Database Changes loT Data Web Events … Confluent Platform Data Integration Real-time Applications Monitoring & Administration Confluent Control Center | Security Confluent Platform Transformations Hadoop Operations Replicator | Auto Data Balancing Custom Apps Database Data Compatibility Schema Registry SQL Stream Processing KSQL Analytics Data Warehouse Development and Connectivity Clients | Connectors | REST Proxy | CLI CRM Monitoring Apache Kafka® Core | Connect API | Streams API … CUSTOMER SELF-MANAGED Datacenter Public Cloud … CONFLUENT FULLY-MANAGED Confluent Cloud No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit http://cnfl.io/book-bundle No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit https://www.confluent.io/download/ http://cnfl.io/kafka-cdc http://cnfl.io/slack @rmoff [email protected] No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit #EOF No More Silos: Integrating Databases and Apache Kafka