Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline!

A presentation at DataXDays in June 2020 in by Robin Moffatt

Slide 1

Slide 1

Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline! Robin Moffatt | #DataXDays | @rmoff

Slide 2

Slide 2

$ whoami • Robin Moffatt (@rmoff) • Senior Developer Advocate at Confluent (Apache Kafka, not Wikis 😉) • Working in data & analytics since 2001 • Oracle ACE Director (Alumnus) http://rmoff.dev/talks · http://rmoff.dev/blog · http://rmoff.dev/youtube @rmoff | #DataXDays | @confluentinc

Slide 3

Slide 3

Kafka is an Event Streaming Platform App App App App request-response changelogs App messaging OR stream processing App KAFKA App App DWH Hadoop @rmoff | streaming data pipelines #DataXDays | @confluentinc

Slide 4

Slide 4

What is an Event Streaming Platform? Producer Connectors Consumer The Log Connectors Streaming Engine @rmoff | #DataXDays | @confluentinc

Slide 5

Slide 5

Immutable Event Log Old New Messages are added at the end of the log @rmoff | #DataXDays | @confluentinc

Slide 6

Slide 6

Topics Clicks Orders Customers Topics are similar in concept to tables in a database @rmoff | #DataXDays | @confluentinc

Slide 7

Slide 7

Partitions Clicks p0 P1 P2 Messages are guaranteed to be strictly ordered within a partition @rmoff | #DataXDays | @confluentinc

Slide 8

Slide 8

Messages are just K/V bytes plus headers + timestamp Clicks Header Timestamp Key Value @rmoff | #DataXDays | @confluentinc

Slide 9

Slide 9

Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf @rmoff | #DataXDays | @confluentinc

Slide 10

Slide 10

Consumers have a position all of their own Old New Sally Scan is here @rmoff | #DataXDays | @confluentinc

Slide 11

Slide 11

Consumers have a position all of their own Old New Fred Sally Scan is here Scan is here @rmoff | #DataXDays | @confluentinc

Slide 12

Slide 12

Consumers have a position all of their own George Scan is here Old New Fred Sally Scan is here Scan is here @rmoff | #DataXDays | @confluentinc

Slide 13

Slide 13

Free Books! https://rmoff.dev/dataxdays @rmoff | #DataXDays | @confluentinc

Slide 14

Slide 14

@rmoff | #DataXDays | @confluentinc Photo by Victor Garcia on Unsplash Streaming Data Pipelines

Slide 15

Slide 15

Database Offload Amazon S3 RDBMS Kafka Kafka Connect Connect HDFS @rmoff | #DataXDays | @confluentinc

Slide 16

Slide 16

Stream Processing with Apache Kafka and ksqlDB order events CDC RDBMS customer orders customer Stream Processing @rmoff | #DataXDays | @confluentinc

Slide 17

Slide 17

Real-time Event Stream Enrichment order events customer orders C D C RDBMS <y> customer Stream Processing @rmoff | #DataXDays | @confluentinc

Slide 18

Slide 18

Building a streaming data pipeline @rmoff | #DataXDays | @confluentinc

Slide 19

Slide 19

Stream Integration + Processing @rmoff | #DataXDays | @confluentinc

Slide 20

Slide 20

Integration @rmoff | #DataXDays | @confluentinc

Slide 21

Slide 21

Streaming Integration with Kafka Connect syslog Sources Kafka Connect @rmoff | Kafka Brokers #DataXDays | @confluentinc

Slide 22

Slide 22

Streaming Integration with Kafka Connect Amazon S3 Sinks Google BigQuery Kafka Connect @rmoff | Kafka Brokers #DataXDays | @confluentinc

Slide 23

Slide 23

Streaming Integration with Kafka Connect Amazon S3 syslog Google BigQuery Kafka Connect @rmoff | Kafka Brokers #DataXDays | @confluentinc

Slide 24

Slide 24

Look Ma, No Code! { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “connection.url”: “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } @rmoff | #DataXDays | @confluentinc

Slide 25

Slide 25

Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf @rmoff | #DataXDays | @confluentinc

Slide 26

Slide 26

Extensible Connector Transform(s) @rmoff | Converter #DataXDays | @confluentinc

Slide 27

Slide 27

Confluent Hub hub.confluent.io @rmoff | #DataXDays | @confluentinc

Slide 28

Slide 28

Stream Integration + Processing @rmoff | #DataXDays | @confluentinc

Slide 29

Slide 29

Stream Processing @rmoff | #DataXDays | @confluentinc

Slide 30

Slide 30

} “reading_ts”: “2020-02-14T12:19:27Z”, “sensor_id”: “aa-101”, “production_line”: “w01”, “widget_type”: “acme94”, “temp_celcius”: 23, “widget_weight_g”: 100 Photo by Franck V. on Unsplash { @rmoff | #DataXDays | @confluentinc

Slide 31

Slide 31

SELECT * FROM WIDGETS WHERE WEIGHT_G > 120 { SELECT COUNT(*) FROM WIDGETS GROUP BY PRODUCTION_LINE } “reading_ts”: “2020-02-14T12:19:27Z”, “sensor_id”: “aa-101”, “production_line”: “w01”, “widget_type”: “acme94”, “temp_celcius”: 23, “widget_weight_g”: 100 Photo by Franck V. on Unsplash SELECT AVG(TEMP_CELCIUS) AS TEMP FROM WIDGETS GROUP BY SENSOR_ID HAVING TEMP>20 CREATE SINK CONNECTOR dw WITH ( Object store, ‘connector.class’ = ‘S3Connector’, data warehouse, ‘topics’ = ‘widgets’ RDBMS …); @rmoff | #DataXDays | @confluentinc

Slide 32

Slide 32

ksqlDB The event streaming database purpose-built for stream processing applications. @rmoff | #DataXDays | @confluentinc

Slide 33

Slide 33

Stream Processing with ksqlDB Source stream @rmoff | #DataXDays | @confluentinc

Slide 34

Slide 34

Stream Processing with ksqlDB Source stream @rmoff | #DataXDays | @confluentinc

Slide 35

Slide 35

Stream Processing with ksqlDB Source stream @rmoff | #DataXDays | @confluentinc

Slide 36

Slide 36

Stream Processing with ksqlDB Source stream Analytics @rmoff | #DataXDays | @confluentinc

Slide 37

Slide 37

Stream Processing with ksqlDB Source stream Applications / Microservices @rmoff | #DataXDays | @confluentinc

Slide 38

Slide 38

Stream Processing with ksqlDB …SUM(TXN_AMT) GROUP BY AC_ID AC _I D= 42 BA LA NC AC E= _I 94 D= .0 42 0 Source stream Applications / Microservices @rmoff | #DataXDays | @confluentinc

Slide 39

Slide 39

Let’s Build It! Rating events App Pro d uc e rA PI { “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } @rmoff | #DataXDays | @confluentinc

Slide 40

Slide 40

Let’s Build It! Rating events App User data Pro d uc e rA PI a k f a K t c e n n o C RDBMS @rmoff | #DataXDays | @confluentinc

Slide 41

Slide 41

Let’s Build It! Rating events App User data RDBMS Pro d uc e rA PI a k f a K t c e n n o C Kafka Connect Operational Dashboard Elasticsearch ksqlDB Join events to users, and filter @rmoff | #DataXDays | @confluentinc

Slide 42

Slide 42

Let’s Build It! Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect Operational Dashboard Elasticsearch a fk t Ka ec n RDBMS Pro d I P A r e m n Co User data Push notification Data Lake ksqlDB Join events to users, and filter @rmoff SnowflakeDB/ S3/HDFS/etc | #DataXDays | @confluentinc

Slide 43

Slide 43

Let’s Build It! Rating events App Push notification Pro d I P A r e m App u s n o C uc e rA PI Demo Time! Elasticsearch a fk t Ka ec n RDBMS Operational Dashboard n Co User data a k f a K t c e n n o C Kafka Connect Data Lake ksqlDB Join events to users, and filter @rmoff SnowflakeDB/ S3/HDFS/etc | #DataXDays | @confluentinc

Slide 44

Slide 44

Push notification Rating events App Kafka Connect Operational Dashboard Elasticsearch a fk t Ka ec n RDBMS u s n o C uc e rA PI a k f a K t c e n n o C App n Co User data Pro d I P A r e m Data Lake ksqlDB Join events to users, and filter @rmoff SnowflakeDB/ S3/HDFS/etc | #DataXDays | @confluentinc

Slide 45

Slide 45

Push notification to Slack Rating events App Pro d I P A r e m App u s n o C uc e rA PI Kafka Connect Operational Dashboard Elasticsearch n Co a fk t Ka ec n a k f a K t c e n n o C poor_ratings Data ratings ksqlDB User data RDBMS Lake Filter events @rmoff S3/HDFS/ SnowflakeDB | etc #DataXDays | @confluentinc

Slide 46

Slide 46

{ Filter all ratings where STARS<3 “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” POOR_RATINGS } Producer API CREATE STREAM POOR_RATINGS AS SELECT * FROM ratings WHERE STARS <3 @rmoff | #DataXDays | @confluentinc

Slide 47

Slide 47

Kafka Connect Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect Operational Dashboard Elasticsearch a fk t Ka ec n RDBMS Pro d I P A r e m n Co User data Push notification to Slack Join events to users, and filter @rmoff Data Lake SnowflakeDB/ S3/HDFS/etc | #DataXDays | @confluentinc

Slide 48

Slide 48

Producer API MySQL t c e n n o C a k f Ka m u i z e b e D @rmoff | #DataXDays | @confluentinc

Slide 49

Slide 49

Time The Stream/Table Duality Stream Account ID Amount 12345 + £50 12345

  • £25 12345
  • £60 Account ID Balance Table 12345 £50 Account ID Balance 12345 £75 Account ID Balance 12345 £15 @rmoff | #DataXDays | @confluentinc

Slide 50

Slide 50

The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash

Slide 51

Slide 51

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Join each rating to customer data Producer API RATINGS_WITH_CUSTOMER_DATA t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “[email protected]”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS SELECT * FROM RATINGS LEFT JOIN CUSTOMERS ON R.ID=C.ID; } @rmoff | #DataXDays | @confluentinc

Slide 52

Slide 52

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Join each rating to customer data Producer API RATINGS_WITH_CUSTOMER_DATA Filter for just PLATINUM customers t c e n n o C a k f a K UNHAPPY_PLATINUM_CUSTOMERS { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “[email protected]”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS < 3 } @rmoff | #DataXDays | @confluentinc

Slide 53

Slide 53

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” CREATE TABLE RATINGS_BY_CLUB_STATUS AS SELECT CLUB_STATUS, COUNT(*) Join each rating to customer data FROM RATINGS_WITH_CUSTOMER_DATA Producer APWINDOW I TUMBLING RATINGS_WITH_CUSTOMER_DATA (SIZE 1 MINUTES) GROUP BY CLUB_STATUS; } t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “[email protected]”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” } Aggregate per-minute by CLUB_STATUS RATINGS_BY_CLUB_STATUS_1MIN @rmoff | #DataXDays | @confluentinc

Slide 54

Slide 54

Kafka Connect → Elasticsearch @rmoff | #DataXDays | @confluentinc

Slide 55

Slide 55

on Photo by Want to learn more? CTAs, not CATs (sorry, not sorry) @rmoff | #DataXDays | @confluentinc

Slide 56

Slide 56

Free Books! https://rmoff.dev/dataxdays @rmoff | #DataXDays | @confluentinc

Slide 57

Slide 57

60 DE VA DV $50 USD off your bill each calendar month for the first three months when you sign up https://rmoff.dev/ccloud Free money! (additional $60 towards your bill 😄 ) Fully Managed Kafka as a Service * Limited availability. Activate by 11th September 2020. Expires after 90 days of activation. Any unused promo value on the expiration date will be forfeited.

Slide 58

Slide 58

Learn Kafka. Start building with Apache Kafka at Confluent Developer. developer.confluent.io

Slide 59

Slide 59

Confluent Community Slack group cnfl.io/slack @rmoff | #DataXDays | @confluentinc

Slide 60

Slide 60

Further reading / watching • Kafka as a Platform: the Ecosystem from the Ground Up http://rmoff.dev/youtube • https://rmoff.dev/kafka101 • Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline! • https://rmoff.dev/ljc-kafka-01 • From Zero to Hero with Kafka Connect • https://rmoff.dev/ljc-kafka-02 • Introduction to ksqlDB • https://rmoff.dev/ljc-kafka-03 • Integrating Oracle and Kafka • https://rmoff.dev/oracle-and-kafka • The Changing Face of ETL: Event-Driven Architectures for Data Engineers • https://rmoff.dev/oredev19-changing-face-of-etl • 🚂On Track with Apache Kafka: Building a Streaming Platform solution with Rail Data • https://rmoff.dev/oredev19-on-track-with-kafka @rmoff | #DataXDays | @confluentinc

Slide 61

Slide 61

Resources #EOF • CDC Spreadsheet • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC • #partner-engineering on Slack for questions • BD team (#partners / [email protected]) can help with introductions on a given sales op @rmoff | #DataXDays | @confluentinc