No More Silos: Integrating Databases and Apache Kafka

A presentation at KScope 19 in June 2019 in Seattle, WA, USA by Robin Moffatt

Slide 1

Slide 1

@rmoff #KScope19 Integrating Databases and Apache Kafka ®

Slide 2

Slide 2

@rmoff #KScope19 Photo by Emily Morter on Unsplash No More Silos: Integrating Databases and Apache Kafka

Slide 3

Slide 3

Analytics - Database Offload RDBMS @rmoff #KScope19 HDFS / S3 / BigQuery etc No More Silos: Integrating Databases and Apache Kafka

Slide 4

Slide 4

Real-time Event Stream Enrichment @rmoff #KScope19 order events customer orders C D C RDBMS <y> customer Stream Processing No More Silos: Integrating Databases and Apache Kafka

Slide 5

Slide 5

@rmoff #KScope19 Evolve processing from old systems to new Existing App New App <x> RDBMS Stream Processing No More Silos: Integrating Databases and Apache Kafka

Slide 6

Slide 6

@rmoff #KScope19 “ But streaming…I’ve just got data in a database…right? @rmoff / No More Silos: Integrating Databases and Apache Kafka

Slide 7

Slide 7

@rmoff #KScope19 “ Bold claim: all your data is event streams @rmoff / No More Silos: Integrating Databases and Apache Kafka

Slide 8

Slide 8

@rmoff #KScope19 A Customer Experience No More Silos: Integrating Databases and Apache Kafka

Slide 9

Slide 9

@rmoff #KScope19 A Sale No More Silos: Integrating Databases and Apache Kafka

Slide 10

Slide 10

@rmoff #KScope19 A Sensor Reading No More Silos: Integrating Databases and Apache Kafka

Slide 11

Slide 11

@rmoff #KScope19 An Application Log Entry No More Silos: Integrating Databases and Apache Kafka

Slide 12

Slide 12

@rmoff #KScope19 Databases No More Silos: Integrating Databases and Apache Kafka

Slide 13

Slide 13

The Stream Table Duality @rmoff #KScope19 Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka

Slide 14

Slide 14

Time The Stream Table Duality Account ID Amount 12345 + €50 @rmoff #KScope19 Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka

Slide 15

Slide 15

Time The Stream Table Duality Account ID Amount 12345 + €50 12345

  • €25 @rmoff #KScope19 Account ID Balance 12345 €75 No More Silos: Integrating Databases and Apache Kafka

Slide 16

Slide 16

Time The Stream Table Duality @rmoff #KScope19 Account ID Amount 12345 + €50 12345

  • €25 12345 -€60 Account ID Balance 12345 €15 No More Silos: Integrating Databases and Apache Kafka

Slide 17

Slide 17

Time The Stream Table Duality Stream @rmoff #KScope19 Table Account ID Amount 12345 + €50 12345

  • €25 12345 -€60 Account ID Balance 12345 €15 No More Silos: Integrating Databases and Apache Kafka

Slide 18

Slide 18

@rmoff #KScope19 The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf No More Silos: Integrating Databases and Apache Kafka Photo by Bobby Burch on Unsplash

Slide 19

Slide 19

@rmoff #KScope19 Photo by Vadim Sherbakov on Unsplash No More Silos: Integrating Databases and Apache Kafka

Slide 20

Slide 20

@rmoff #KScope19 Streaming Integration with Kafka Connect syslog Sources Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka

Slide 21

Slide 21

@rmoff #KScope19 Streaming Integration with Kafka Connect Amazon S3 Sinks Google BigQuery Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka

Slide 22

Slide 22

@rmoff #KScope19 Streaming Integration with Kafka Connect Amazon S3 syslog Google BigQuery Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka

Slide 23

Slide 23

Kafka Connect basics Source Kafka Connect @rmoff #KScope19 Kafka No More Silos: Integrating Databases and Apache Kafka

Slide 24

Slide 24

@rmoff #KScope19 Connectors Connector Source Kafka Connect Kafka No More Silos: Integrating Databases and Apache Kafka

Slide 25

Slide 25

Easy to configure @rmoff #KScope19 { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } https://docs.confluent.io/current/connect/ “connection.url”: No More Silos: Integrating Databases and Apache Kafka

Slide 26

Slide 26

@rmoff #KScope19 Converters Connector Source Converter Kafka Connect https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained Kafka No More Silos: Integrating Databases and Apache Kafka

Slide 27

Slide 27

@rmoff #KScope19 Serialisation & Schemas Avro -> Confluent Schema Registry Protobuf JSON CSV https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf No More Silos: Integrating Databases and Apache Kafka

Slide 28

Slide 28

The Confluent Schema Registry Avro Schema @rmoff #KScope19 Schema Registry Target Source Kafka Connect Avro Message Avro Message Kafka Connect No More Silos: Integrating Databases and Apache Kafka

Slide 29

Slide 29

@rmoff #KScope19 Single Message Transforms Connector Source Transform(s) Kafka Connect Converter Kafka No More Silos: Integrating Databases and Apache Kafka

Slide 30

Slide 30

@rmoff #KScope19 Extensible Connector Transform(s) Converter hub.confluent.io No More Silos: Integrating Databases and Apache Kafka

Slide 31

Slide 31

@rmoff #KScope19 Change-Data-Capture (CDC) Query-based Log-based No More Silos: Integrating Databases and Apache Kafka

Slide 32

Slide 32

Query-based CDC @rmoff #KScope19 SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka

Slide 33

Slide 33

Query-based CDC @rmoff #KScope19 SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka

Slide 34

Slide 34

@rmoff #KScope19 Query-based CDC SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka

Slide 35

Slide 35

Query-based CDC @rmoff #KScope19 SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka

Slide 36

Slide 36

@rmoff #KScope19 Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka

Slide 37

Slide 37

@rmoff #KScope19 Log-based CDC #051024 17:24:13 server id 1 # Position Timestamp

00000004 9d fc 5c 43

end_log_pos 98 Type Master ID 0f 01 00 00 00 Size 5e 00 00 00 Master Pos Flags 62 00 00 00 00 00

00000017 04 00 35 2e 30 2e 31 35

2d 64 65 62 75 67 2d 6c |..5.0.15.debug.l|

00000027 6f 67 00 00 00 00 00 00

00 00 00 00 00 00 00 00 |og…………..|

00000037 00 00 00 00 00 00 00 00

00 00 00 00 00 00 00 00 |…………….|

00000047 00 00 00 00 9d fc 5c 43

13 38 0d 00 08 00 12 00 |…….C.8……|

00000057 04 04 04 04 12 00 00 4b

00 04 1a |…….K…| No More Silos: Integrating Databases and Apache Kafka

Slide 38

Slide 38

Log-based CDC @rmoff #KScope19 Transaction log No More Silos: Integrating Databases and Apache Kafka

Slide 39

Slide 39

Log-based CDC @rmoff #KScope19 Transaction log No More Silos: Integrating Databases and Apache Kafka

Slide 40

Slide 40

Log-based CDC @rmoff #KScope19 Transaction log No More Silos: Integrating Databases and Apache Kafka

Slide 41

Slide 41

Log-based CDC @rmoff #KScope19 Transaction log No More Silos: Integrating Databases and Apache Kafka

Slide 42

Slide 42

Demo Time! @rmoff #KScope19 No More Silos: Integrating Databases and Apache Kafka

Slide 43

Slide 43

@rmoff #KScope19 Try it yourself: https://github.com/confluentinc/demo-scene/tree/master/no-more-silos No More Silos: Integrating Databases and Apache Kafka

Slide 44

Slide 44

@rmoff #KScope19 should I use?” Photo by Tyler Nix on Unsplash “Which one No More Silos: Integrating Databases and Apache Kafka

Slide 45

Slide 45

@rmoff #KScope19 It Depends! No More Silos: Integrating and on Apache Kafka Photo by Databases Trevor Cole Unsplash

Slide 46

Slide 46

Query-based CDC @rmoff @rmoff #kafkasummit #KScope19 ✅ Usually easier to setup ✅ Requires fewer permissions 🛑 Needs specific columns in source schema to track changes 🛑 Can’t track deletes 🛑 Can’t track multiple events between polling interval Read more: http://cnfl.io/kafka-cdc Photo by Matese Fields on Unsplash 🛑 Impact of polling the DB (or higher latencies tradeoff) No More Silos: Integrating Databases and Apache Kafka

Slide 47

Slide 47

Query-based CDC @rmoff #KScope19 SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka

Slide 48

Slide 48

Query-based CDC @rmoff #KScope19 SELECT * CREATE TABLE my_table ( ID INT, FOO VARCHAR, BAR VARCHAR, WIBBLE VARCHAR, TS_COL TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ) FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka

Slide 49

Slide 49

Query-based CDC INSERT @rmoff #KScope19 SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka

Slide 50

Slide 50

@rmoff #KScope19 Query-based CDC INSERT SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka

Slide 51

Slide 51

Query-based CDC @rmoff #KScope19 UPDATE No More Silos: Integrating Databases and Apache Kafka

Slide 52

Slide 52

@rmoff #KScope19 Query-based CDC UPDATE SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka

Slide 53

Slide 53

Query-based CDC @rmoff #KScope19 DELETE No More Silos: Integrating Databases and Apache Kafka

Slide 54

Slide 54

@rmoff #KScope19 Query-based CDC DELETE SELECT * FROM my_table WHERE ts_col > previous ts e p o N No More Silos: Integrating Databases and Apache Kafka

Slide 55

Slide 55

@rmoff #KScope19 Query-based CDC orderID status address updateTS SELECT * FROM WHERE orders updateTS > previous ts No More Silos: Integrating Databases and Apache Kafka

Slide 56

Slide 56

@rmoff #KScope19 Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM WHERE orders updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” No More Silos: Integrating Databases and Apache Kafka

Slide 57

Slide 57

@rmoff #KScope19 Query-based CDC orderID status address updateTS 42

10:54:00 PENDING INSERT INTO orders (orderID, status) VALUES (42, ‘PENDING’); No More Silos: Integrating Databases and Apache Kafka

Slide 58

Slide 58

@rmoff #KScope19 Query-based CDC orderID status address 42 1640 Riverside Drive 10:54:20 PENDING updateTS UPDATE orders SET address = ‘1640 Riverside Drive’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka

Slide 59

Slide 59

@rmoff #KScope19 Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:25 PENDING UPDATE orders SET address = ‘29 Acacia Road’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka

Slide 60

Slide 60

@rmoff #KScope19 Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED UPDATE orders SET status = ‘SHIPPED’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka

Slide 61

Slide 61

@rmoff #KScope19 Query-based CDC orderID status orderID address status 42 SHIPPED 42 address 29 PENDING Acacia Road — updateTS updateTS 10:54:29 10:54:00 42 PENDING 1640 Riverside Drive 10:54:20 42 PENDING 29 Acacia Road 10:54:25 42 SHIPPED 29 Acacia Road 10:54:29 No More Silos: Integrating Databases and Apache Kafka

Slide 62

Slide 62

@rmoff #KScope19 Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM WHERE orders updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” No More Silos: Integrating Databases and Apache Kafka

Slide 63

Slide 63

@rmoff #kafkasummit Event-driven app Query-based CDC Log-based CDC

Slide 64

Slide 64

Log-based CDC @rmoff #KScope19 Transaction log No More Silos: Integrating Databases and Apache Kafka

Slide 65

Slide 65

Log-based CDC @rmoff #KScope19 UPDATE Transaction log No More Silos: Integrating Databases and Apache Kafka

Slide 66

Slide 66

Log-based CDC @rmoff #KScope19 UPDATE Transaction log No More Silos: Integrating Databases and Apache Kafka

Slide 67

Slide 67

Log-based CDC @rmoff #KScope19 DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka

Slide 68

Slide 68

Log-based CDC @rmoff #KScope19 DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka

Slide 69

Slide 69

Log-based CDC @rmoff #KScope19 DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka

Slide 70

Slide 70

@rmoff #KScope19 Log-based CDC Immutable event log No More Silos: Integrating Databases and Apache Kafka

Slide 71

Slide 71

Photo by Sebastian Pociecha on Unsplash @rmoff #KScope19 Log-based CDC ✅ Greater data fidelity ✅ Lower latency ✅ Lower impact on source 🛑 More setup steps 🛑 Higher system privileges required 🛑 For propriatory databases, usually $$$ Read more: http://cnfl.io/kafka-cdc No More Silos: Integrating Databases and Apache Kafka

Slide 72

Slide 72

@rmoff #KScope19 Change-Data-Capture (CDC) Query-based Log-based No More Silos: Integrating Databases and Apache Kafka

Slide 73

Slide 73

tl;dr : which tool do I use? @rmoff #KScope19 • Query-based CDC confluent.io/connector/kafka-connect-jdbc No More Silos: Integrating Databases and Apache Kafka

Slide 74

Slide 74

Which Log-Based CDC Tool? • Open Source RDBMS, e.g. MySQL, PostgreSQL • Debezium • (+ paid options) @rmoff #KScope19 • Proprietory RDBMS, e.g. Oracle, MS SQL, DB2 • Oracle GoldenGate • Debezium + XStream • Attunity • Mainframe e.g. VSAM, IMS • IBM InfoSphere Data Replication • SQData • Attunity • HVR • SQData • tcVISION • tcVISION • Etc See also: https://rmoff.net/2018/12/12/streaming-data-from-oracle-into-kafka-december-2018/ No More Silos: Integrating Databases and Apache Kafka

Slide 75

Slide 75

@rmoff #KScope19 Oracle and Kafka integration •Oracle GoldenGate for Big Data—Requires the OGGBD licence, not just OGG •Debezium—Open source, Oracle support in Beta • currently uses XStream— which requires OGG licence •Attunity, IBM IIDR, HVR, SQData, tcVision, StreamSets—all offer commerical CDC integration into Kafka with support for Schema Registry • DBVisit Replicate—no longer under development •JDBC Connector—Open source, but not “true” CDC https://rmoff.net/2018/12/12/streaming-data-from-oracle-into-kafka-december-2018/ No More Silos: Integrating Databases and Apache Kafka

Slide 76

Slide 76

https://twitter.com/gunnarmorling/status/1143534042832392192 Debezium and Oracle @rmoff #KScope19 No More Silos: Integrating Databases and Apache Kafka

Slide 77

Slide 77

Real-time Event Stream Enrichment @rmoff #KScope19 ratings Customer ratings C D C RDBMS <y> customer Stream Processing No More Silos: Integrating Databases and Apache Kafka

Slide 78

Slide 78

Demo Time! @rmoff #KScope19 No More Silos: Integrating Databases and Apache Kafka

Slide 79

Slide 79

@rmoff #KScope19 Try it yourself: https://github.com/confluentinc/demo-scene/tree/master/no-more-silos No More Silos: Integrating Databases and Apache Kafka

Slide 80

Slide 80

Confluent Community Components @rmoff #KScope19 Apache Kafka with a bunch of cool stuff! For free! Log Events Database Changes loT Data Web Events … Confluent Platform Data Integration Real-time Applications Monitoring & Administration Confluent Control Center | Security Confluent Platform Transformations Hadoop Operations Replicator | Auto Data Balancing Custom Apps Database Data Compatibility Schema Registry SQL Stream Processing KSQL Analytics Data Warehouse Development and Connectivity Clients | Connectors | REST Proxy | CLI CRM Monitoring Apache Kafka® Core | Connect API | Streams API … CUSTOMER SELF-MANAGED Datacenter Public Cloud … CONFLUENT FULLY-MANAGED Confluent Cloud No More Silos: Integrating Databases and Apache Kafka

Slide 81

Slide 81

@rmoff #KScope19 http://cnfl.io/book-bundle No More Silos: Integrating Databases and Apache Kafka

Slide 82

Slide 82

@rmoff #KScope19 CONFLUENT COMMUNITY DISCOUNT CODE KS19Meetup. 25% OFF* *Standard Priced Conference pass No More Silos: Integrating Databases and Apache Kafka

Slide 83

Slide 83

@rmoff #KScope19 https://www.confluent.io/download/ http://cnfl.io/kafka-cdc http://cnfl.io/slack @rmoff [email protected] No More Silos: Integrating Databases and Apache Kafka

Slide 84

Slide 84

@rmoff #KScope19 #EOF No More Silos: Integrating Databases and Apache Kafka