Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

A presentation at Berlin Kafka Meetup in July 2018 in Berlin, Germany by Robin Moffatt

Slide 1

Slide 1

Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! @rmoff [email protected] https://speakerdeck.com/rmoff/

Slide 2

Slide 2

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 2 • Developer Advocate @ Confluent • Working in data & analytics since 2001 • Oracle ACE Director & Dev Champion • Blogging : http://rmoff.net & http://cnfl.io/rmoff
• Twitter: @rmoff • Geek stuff • Beer & Fried Breakfasts $ whoami https://speakerdeck.com/rmoff/

Slide 3

Slide 3

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 3 App App App App search Hadoop DWH monitoring security MQ MQ cache cache A bit of a mess…

Slide 4

Slide 4

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 4 Kafka is a Streaming Platform KAFKA DWH Hadoop App App App App App App App App

request-response

messaging OR stream processing streaming data pipelines

changelogs

Slide 5

Slide 5

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 5 Analytics - Database Offload HDFS / S3 / BigQuery etc RDBMS CDC

Slide 6

Slide 6

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 6 Stream Processing with Apache Kafka and KSQL order events customer customer orders Stream Processing RDBMS CDC

Slide 7

Slide 7

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 7 Real-time Event Stream Enrichment order events customer Stream Processing customer orders RDBMS <y> CDC

Slide 8

Slide 8

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 8 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS <y> New App <x> CDC

Slide 9

Slide 9

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 9 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS <y> HDFS / S3 / etc New App <x> CDC

Slide 10

Slide 10

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 10 Rating events Join events to users, and filter Push notification to Slack Operational Dashboard Data Lake User data Let’s Build It!

Slide 11

Slide 11

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 11 RDBMS S3/HDFS/ SnowflakeDB etc Elasticsearch App App Producer API Consumer API Let’s Build It!

Rating events Join events to users, and filter Push notification to Slack Operational Dashboard Data Lake User data

Slide 12

Slide 12

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 12 RDBMS S3/HDFS/ SnowflakeDB etc Elasticsearch App App Producer API Consumer API Kafka Connect Kafka Connect Kafka Connect Kafka Connect

Rating events Join events to users, and filter Push notification to Slack Operational Dashboard Data Lake User data

Slide 13

Slide 13

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 13 Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources Sinks

Amazon S3

syslog flat file CSV JSON

Slide 14

Slide 14

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 14 ✓ Fault tolerant and automatically load balanced

✓ Extensible API

✓ Single Message Transforms

✓ Part of Apache Kafka, included in 
 Confluent Open Source Reliable and scalable integration of Kafka with other systems – no coding required. { " connector.class " : "io.confluent.connect.jdbc.JdbcSourceConnector" , " connection.url " : "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo" , " table.whitelist " : "sales,orders,customers" } https://docs.confluent.io/current/connect/ ✓ Centralized management and configuration

✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3 ✓ Supports CDC ingest of events from RDBMS ✓ Preserves data schema Kafka Connect

Slide 15

Slide 15

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 15 Kafka Connect + Schema Registry = WIN RDBMS Avro Message Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect

Slide 16

Slide 16

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 16 Kafka Connect + Schema Registry = WIN RDBMS Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect Avro Message

Slide 17

Slide 17

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 17 Confluent Hub hub.confluent.io • One-stop place to discover and download : • Connectors • Transformations • Converters

Slide 18

Slide 18

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 18 MySQL Debezium Kafka Connect Producer API Demo Time!

Slide 19

Slide 19

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 19 RDBMS S3/HDFS/ SnowflakeDB etc Elasticsearch App App Producer API Consumer API Let’s Build It! Kafka Connect Kafka Connect Kafka Connect

Rating events Join events to users, and filter Push notification to Slack Operational Dashboard Data Lake User data

Slide 20

Slide 20

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 20 Join events to users, and filter RDBMS S3/HDFS/ SnowflakeDB etc Elasticsearch App App Producer API Consumer API KSQL Kafka Connect Kafka Connect Kafka Connect KSQL

Rating events Push notification to Slack Operational Dashboard Data Lake User data

Slide 21

Slide 21

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! Declarative Stream Language Processing KSQL is a

Slide 22

Slide 22

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! KSQL is the Streaming SQL Engine for Apache Kafka

Slide 23

Slide 23

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! KSQL in Development and Production Interactive KSQL 
 for development and testing Headless KSQL 
 for Production Desired KSQL queries

have been identified REST “Hmm, let me try 
 out this idea...”

Slide 24

Slide 24

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! KSQL for Streaming ETL CREATE STREAM vip_actions AS 
 SELECT userid, page, action

FROM clickstream c

LEFT JOIN

users u

ON c.userid = u.user_id 


WHERE u.level = 'Platinum' ; Joining, filtering, and aggregating streams of event data

Slide 25

Slide 25

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! KSQL for Anomaly Detection CREATE TABLE possible_fraud AS 
 SELECT card_number, count(*) 


FROM authorization_attempts 


WINDOW TUMBLING (SIZE 5 SECONDS) 


GROUP BY card_number 


HAVING count(*) > 3; Identifying patterns or anomalies in real-time data,
surfaced in milliseconds

Slide 26

Slide 26

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data

• Sensor / IoT data CREATE STREAM SYSLOG_INVALID_USERS AS
SELECT HOST, MESSAGE

FROM SYSLOG

WHERE MESSAGE LIKE '%Invalid user%' ; http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting

Slide 27

Slide 27

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! CREATE STREAM views_by_userid WITH (PARTITIONS=6, REPLICAS=5, VALUE_FORMAT=' AVRO ', TIMESTAMP=' view_time ') AS

SELECT * FROM clickstream

PARTITION

BY user_id; KSQL for Data Transformation Make simple derivations of existing topics from the command line

Slide 28

Slide 28

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 28 Elasticsearch Kafka Connect Demo Time!

MySQL Debezium Kafka Connect Producer API

Slide 29

Slide 29

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 29 Producer API { "rating_id" : 5313 ,

"user_id" : 3 ,

"stars" : 4 ,

"route_id" : 6975 ,

"rating_time" : 1519304105213 ,

"channel" : "web" ,

"message" : "worst. flight. ever. #neveragain" } POOR_RATINGS Filter all ratings where STARS<3 CREATE STREAM POOR_RATINGS AS SELECT

FROM ratings WHERE STARS < 3

Slide 30

Slide 30

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 30 Do you think that’s a table

you are querying?

Slide 31

Slide 31

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 31 The Table Stream Duality Account ID Balance 12345 €50 Account ID Amount 12345

  • €50 12345
  • €25 12345 -€60 Account ID Balance 12345 €75 Account ID Balance 12345 €15 Time Strea m Ta b l e

Slide 32

Slide 32

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 32 The truth is the log.
The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by   Bobby Burch   on   Unsplash

Slide 33

Slide 33

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 33 Kafka Connect Producer API { "rating_id" : 5313 ,

"user_id" : 3 ,

"stars" : 4 ,

"route_id" : 6975 ,

"rating_time" : 1519304105213 ,

"channel" : "web" ,

"message" : "worst. flight. ever. #neveragain" } {

"id" : 3 ,

"first_name" : "Merilyn" ,

"last_name" : "Doughartie" ,

"email" : "[email protected]" ,

"gender" : "Female" ,

"club_status" : "platinum" ,

"comments" : "none" } RATINGS_WITH_CUSTOMER_DATA Join each rating to customer data CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS

SELECT * FROM RATINGS LEFT JOIN CUSTOMERS
ON R.ID=C.ID;

Slide 34

Slide 34

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 34 Kafka Connect Producer API { "rating_id" : 5313 ,

"user_id" : 3 ,

"stars" : 4 ,

"route_id" : 6975 ,

"rating_time" : 1519304105213 ,

"channel" : "web" ,

"message" : "worst. flight. ever. #neveragain" } {

"id" : 3 ,

"first_name" : "Merilyn" ,

"last_name" : "Doughartie" ,

"email" : "[email protected]" ,

"gender" : "Female" ,

"club_status" : "platinum" ,

"comments" : "none" } RATINGS_WITH_CUSTOMER_DATA Join each rating to customer data UNHAPPY_PLATINUM_CUSTOMERS Filter for just PLATINUM customers CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS

SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS <

3

Slide 35

Slide 35

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 35 Kafka Connect Producer API { "rating_id" : 5313 ,

"user_id" : 3 ,

"stars" : 4 ,

"route_id" : 6975 ,

"rating_time" : 1519304105213 ,

"channel" : "web" ,

"message" : "worst. flight. ever. #neveragain" } {

"id" : 3 ,

"first_name" : "Merilyn" ,

"last_name" : "Doughartie" ,

"email" : "[email protected]" ,

"gender" : "Female" ,

"club_status" : "platinum" ,

"comments" : "none" } RATINGS_WITH_CUSTOMER_DATA Join each rating to customer data RATINGS_BY_CLUB_STATUS_1MIN Aggregate per-minute by CLUB_STATUS CREATE

TABLE

RATINGS_BY_CLUB_STATUS

AS

SELECT CLUB_STATUS, COUNT ( * ) FROM RATINGS_WITH_CUSTOMER_DATA WINDOW TUMBLING (SIZE 1 MINUTES) GROUP BY CLUB_STATUS;

Slide 36

Slide 36

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 36 Confluent Open Source :
Apache Kafka with a bunch of cool stuff! For free! Database Changes Log Events loT Data Web Events … CRM Data Warehouse Database Hadoop Data 
 Integration … Monitoring Analytics Custom Apps Transformations Real-time Applications … Apache Open Source Confluent Open Source Confluent Platform Confluent Platform Apache Kafka ® Core | Connect API | Streams API Data Compatibility Schema Registry Development and Connectivity Clients | Connectors | REST Proxy | CLI Apache Open Source Confluent Open Source SQL Stream Processing KSQL

Confluent Enterprise

Monitoring & Administration Confluent Control Center | Security

Operations Replicator | Auto Data Balancing

Confluent Enterprise

Slide 37

Slide 37

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 37 Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle

Slide 38

Slide 38

@rmoff [email protected] https://www.confluent.io/download/ http://cnfl.io/slack

Slide 39

Slide 39

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 39 • Embrace the Anarchy : Apache Kafka's Role in Modern Data Architectures Recording & Slides

• Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka and KSQL

• Steps to Building a Streaming ETL Pipeline with Apache Kafka and KSQL Recording & Slides

• https://www.confluent.io/blog/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data • https://github.com/confluentinc/ksql/ Useful links

Slide 40

Slide 40

@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 40 • CDC Spreadsheet

• Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC

• #partner-engineering on Slack for questions • BD team (#partners / [email protected] ) can help with introductions on a given sales op Resources #EOF