Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! @rmoff [email protected] https://speakerdeck.com/rmoff/
A presentation at Berlin Kafka Meetup in July 2018 in Berlin, Germany by Robin Moffatt
Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! @rmoff [email protected] https://speakerdeck.com/rmoff/
@rmoff /
Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!
�
2
•
Developer Advocate @ Confluent
•
Working in data & analytics since 2001
•
Oracle ACE Director & Dev Champion
•
Blogging : http://rmoff.net & http://cnfl.io/rmoff
•
Twitter: @rmoff
•
Geek stuff
•
Beer & Fried Breakfasts
$ whoami
https://speakerdeck.com/rmoff/
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 3 App App App App search Hadoop DWH monitoring security MQ MQ cache cache A bit of a mess…
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 4 Kafka is a Streaming Platform KAFKA DWH Hadoop App App App App App App App App
request-response
messaging OR stream processing streaming data pipelines
changelogs
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 5 Analytics - Database Offload HDFS / S3 / BigQuery etc RDBMS CDC
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 6 Stream Processing with Apache Kafka and KSQL order events customer customer orders Stream Processing RDBMS CDC
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 7 Real-time Event Stream Enrichment order events customer Stream Processing customer orders RDBMS <y> CDC
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 8 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS <y> New App <x> CDC
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 9 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS <y> HDFS / S3 / etc New App <x> CDC
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 10 Rating events Join events to users, and filter Push notification to Slack Operational Dashboard Data Lake User data Let’s Build It!
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 11 RDBMS S3/HDFS/ SnowflakeDB etc Elasticsearch App App Producer API Consumer API Let’s Build It!
Rating events Join events to users, and filter Push notification to Slack Operational Dashboard Data Lake User data
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 12 RDBMS S3/HDFS/ SnowflakeDB etc Elasticsearch App App Producer API Consumer API Kafka Connect Kafka Connect Kafka Connect Kafka Connect
Rating events Join events to users, and filter Push notification to Slack Operational Dashboard Data Lake User data
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 13 Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources Sinks
Amazon S3
syslog flat file CSV JSON
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 14 ✓ Fault tolerant and automatically load balanced
✓ Extensible API
✓ Single Message Transforms
✓ Part of Apache Kafka, included in Confluent Open Source Reliable and scalable integration of Kafka with other systems – no coding required. { " connector.class " : "io.confluent.connect.jdbc.JdbcSourceConnector" , " connection.url " : "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo" , " table.whitelist " : "sales,orders,customers" } https://docs.confluent.io/current/connect/ ✓ Centralized management and configuration
✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3 ✓ Supports CDC ingest of events from RDBMS ✓ Preserves data schema Kafka Connect
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 15 Kafka Connect + Schema Registry = WIN RDBMS Avro Message Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 16 Kafka Connect + Schema Registry = WIN RDBMS Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect Avro Message
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 17 Confluent Hub hub.confluent.io • One-stop place to discover and download : • Connectors • Transformations • Converters
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 18 MySQL Debezium Kafka Connect Producer API Demo Time!
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 19 RDBMS S3/HDFS/ SnowflakeDB etc Elasticsearch App App Producer API Consumer API Let’s Build It! Kafka Connect Kafka Connect Kafka Connect
Rating events Join events to users, and filter Push notification to Slack Operational Dashboard Data Lake User data
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 20 Join events to users, and filter RDBMS S3/HDFS/ SnowflakeDB etc Elasticsearch App App Producer API Consumer API KSQL Kafka Connect Kafka Connect Kafka Connect KSQL
Rating events Push notification to Slack Operational Dashboard Data Lake User data
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! Declarative Stream Language Processing KSQL is a
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! KSQL is the Streaming SQL Engine for Apache Kafka
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! KSQL in Development and Production Interactive KSQL for development and testing Headless KSQL for Production Desired KSQL queries
have been identified REST “Hmm, let me try out this idea...”
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! KSQL for Streaming ETL CREATE STREAM vip_actions AS SELECT userid, page, action
FROM clickstream c
LEFT JOIN
users u
ON c.userid = u.user_id
WHERE u.level = 'Platinum' ; Joining, filtering, and aggregating streams of event data
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! KSQL for Anomaly Detection CREATE TABLE possible_fraud AS SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING
count(*) > 3;
Identifying patterns or anomalies in real-time data,
surfaced in milliseconds
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data
•
Sensor / IoT data
CREATE STREAM
SYSLOG_INVALID_USERS
AS
SELECT
HOST, MESSAGE
FROM SYSLOG
WHERE MESSAGE LIKE '%Invalid user%' ; http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! CREATE STREAM views_by_userid WITH (PARTITIONS=6, REPLICAS=5, VALUE_FORMAT=' AVRO ', TIMESTAMP=' view_time ') AS
SELECT * FROM clickstream
PARTITION
BY user_id; KSQL for Data Transformation Make simple derivations of existing topics from the command line
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 28 Elasticsearch Kafka Connect Demo Time!
MySQL Debezium Kafka Connect Producer API
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 29 Producer API { "rating_id" : 5313 ,
"user_id" : 3 ,
"stars" : 4 ,
"route_id" : 6975 ,
"rating_time" : 1519304105213 ,
"channel" : "web" ,
"message" : "worst. flight. ever. #neveragain" } POOR_RATINGS Filter all ratings where STARS<3 CREATE STREAM POOR_RATINGS AS SELECT
FROM ratings WHERE STARS < 3
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 30 Do you think that’s a table
you are querying?
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 31 The Table Stream Duality Account ID Balance 12345 €50 Account ID Amount 12345
@rmoff /
Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!
�
32
The truth is the log.
The database is a cache
of a subset of the log.
—Pat Helland
Immutability Changes Everything
http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
Photo by
Bobby Burch
on
Unsplash
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 33 Kafka Connect Producer API { "rating_id" : 5313 ,
"user_id" : 3 ,
"stars" : 4 ,
"route_id" : 6975 ,
"rating_time" : 1519304105213 ,
"channel" : "web" ,
"message" : "worst. flight. ever. #neveragain" } {
"id" : 3 ,
"first_name" : "Merilyn" ,
"last_name" : "Doughartie" ,
"email" : "[email protected]" ,
"gender" : "Female" ,
"club_status" : "platinum" ,
"comments" : "none" } RATINGS_WITH_CUSTOMER_DATA Join each rating to customer data CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS
SELECT
*
FROM
RATINGS
LEFT JOIN
CUSTOMERS
ON
R.ID=C.ID;
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 34 Kafka Connect Producer API { "rating_id" : 5313 ,
"user_id" : 3 ,
"stars" : 4 ,
"route_id" : 6975 ,
"rating_time" : 1519304105213 ,
"channel" : "web" ,
"message" : "worst. flight. ever. #neveragain" } {
"id" : 3 ,
"first_name" : "Merilyn" ,
"last_name" : "Doughartie" ,
"email" : "[email protected]" ,
"gender" : "Female" ,
"club_status" : "platinum" ,
"comments" : "none" } RATINGS_WITH_CUSTOMER_DATA Join each rating to customer data UNHAPPY_PLATINUM_CUSTOMERS Filter for just PLATINUM customers CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS
SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS <
3
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 35 Kafka Connect Producer API { "rating_id" : 5313 ,
"user_id" : 3 ,
"stars" : 4 ,
"route_id" : 6975 ,
"rating_time" : 1519304105213 ,
"channel" : "web" ,
"message" : "worst. flight. ever. #neveragain" } {
"id" : 3 ,
"first_name" : "Merilyn" ,
"last_name" : "Doughartie" ,
"email" : "[email protected]" ,
"gender" : "Female" ,
"club_status" : "platinum" ,
"comments" : "none" } RATINGS_WITH_CUSTOMER_DATA Join each rating to customer data RATINGS_BY_CLUB_STATUS_1MIN Aggregate per-minute by CLUB_STATUS CREATE
TABLE
RATINGS_BY_CLUB_STATUS
AS
SELECT CLUB_STATUS, COUNT ( * ) FROM RATINGS_WITH_CUSTOMER_DATA WINDOW TUMBLING (SIZE 1 MINUTES) GROUP BY CLUB_STATUS;
@rmoff /
Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!
�
36
Confluent Open Source :
Apache Kafka with a bunch of cool stuff! For free!
Database Changes
Log Events
loT Data
Web Events
…
CRM
Data Warehouse
Database
Hadoop
Data
Integration
…
Monitoring
Analytics
Custom Apps
Transformations
Real-time Applications
…
Apache Open Source
Confluent Open Source
Confluent Platform
Confluent Platform
Apache Kafka
®
Core | Connect API | Streams API
Data Compatibility
Schema Registry
Development and Connectivity
Clients | Connectors | REST Proxy | CLI
Apache Open Source
Confluent Open Source
SQL Stream Processing
KSQL
Confluent Enterprise
Monitoring & Administration Confluent Control Center | Security
Operations Replicator | Auto Data Balancing
Confluent Enterprise
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 37 Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle
@rmoff [email protected] https://www.confluent.io/download/ http://cnfl.io/slack
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 39 • Embrace the Anarchy : Apache Kafka's Role in Modern Data Architectures Recording & Slides
• Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka and KSQL
• Steps to Building a Streaming ETL Pipeline with Apache Kafka and KSQL Recording & Slides
• https://www.confluent.io/blog/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data • https://github.com/confluentinc/ksql/ Useful links
@rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! � 40 • CDC Spreadsheet
• Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC
• #partner-engineering on Slack for questions • BD team (#partners / [email protected] ) can help with introductions on a given sales op Resources #EOF