Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline! Robin Moffatt | #DataXDays | @rmoff
A presentation at DataXDays in June 2020 in by Robin Moffatt
Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline! Robin Moffatt | #DataXDays | @rmoff
$ whoami • Robin Moffatt (@rmoff) • Senior Developer Advocate at Confluent (Apache Kafka, not Wikis 😉) • Working in data & analytics since 2001 • Oracle ACE Director (Alumnus) http://rmoff.dev/talks · http://rmoff.dev/blog · http://rmoff.dev/youtube @rmoff | #DataXDays | @confluentinc
Kafka is an Event Streaming Platform App App App App request-response changelogs App messaging OR stream processing App KAFKA App App DWH Hadoop @rmoff | streaming data pipelines #DataXDays | @confluentinc
What is an Event Streaming Platform? Producer Connectors Consumer The Log Connectors Streaming Engine @rmoff | #DataXDays | @confluentinc
Immutable Event Log Old New Messages are added at the end of the log @rmoff | #DataXDays | @confluentinc
Topics Clicks Orders Customers Topics are similar in concept to tables in a database @rmoff | #DataXDays | @confluentinc
Partitions Clicks p0 P1 P2 Messages are guaranteed to be strictly ordered within a partition @rmoff | #DataXDays | @confluentinc
Messages are just K/V bytes plus headers + timestamp Clicks Header Timestamp Key Value @rmoff | #DataXDays | @confluentinc
Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf @rmoff | #DataXDays | @confluentinc
Consumers have a position all of their own Old New Sally Scan is here @rmoff | #DataXDays | @confluentinc
Consumers have a position all of their own Old New Fred Sally Scan is here Scan is here @rmoff | #DataXDays | @confluentinc
Consumers have a position all of their own George Scan is here Old New Fred Sally Scan is here Scan is here @rmoff | #DataXDays | @confluentinc
Free Books! https://rmoff.dev/dataxdays @rmoff | #DataXDays | @confluentinc
@rmoff | #DataXDays | @confluentinc Photo by Victor Garcia on Unsplash Streaming Data Pipelines
Database Offload Amazon S3 RDBMS Kafka Kafka Connect Connect HDFS @rmoff | #DataXDays | @confluentinc
Stream Processing with Apache Kafka and ksqlDB order events CDC RDBMS customer orders customer Stream Processing @rmoff | #DataXDays | @confluentinc
Real-time Event Stream Enrichment order events customer orders C D C RDBMS <y> customer Stream Processing @rmoff | #DataXDays | @confluentinc
Building a streaming data pipeline @rmoff | #DataXDays | @confluentinc
Stream Integration + Processing @rmoff | #DataXDays | @confluentinc
Integration @rmoff | #DataXDays | @confluentinc
Streaming Integration with Kafka Connect syslog Sources Kafka Connect @rmoff | Kafka Brokers #DataXDays | @confluentinc
Streaming Integration with Kafka Connect Amazon S3 Sinks Google BigQuery Kafka Connect @rmoff | Kafka Brokers #DataXDays | @confluentinc
Streaming Integration with Kafka Connect Amazon S3 syslog Google BigQuery Kafka Connect @rmoff | Kafka Brokers #DataXDays | @confluentinc
Look Ma, No Code! { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “connection.url”: “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } @rmoff | #DataXDays | @confluentinc
Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf @rmoff | #DataXDays | @confluentinc
Extensible Connector Transform(s) @rmoff | Converter #DataXDays | @confluentinc
Confluent Hub hub.confluent.io @rmoff | #DataXDays | @confluentinc
Stream Integration + Processing @rmoff | #DataXDays | @confluentinc
Stream Processing @rmoff | #DataXDays | @confluentinc
} “reading_ts”: “2020-02-14T12:19:27Z”, “sensor_id”: “aa-101”, “production_line”: “w01”, “widget_type”: “acme94”, “temp_celcius”: 23, “widget_weight_g”: 100 Photo by Franck V. on Unsplash { @rmoff | #DataXDays | @confluentinc
SELECT * FROM WIDGETS WHERE WEIGHT_G > 120 { SELECT COUNT(*) FROM WIDGETS GROUP BY PRODUCTION_LINE } “reading_ts”: “2020-02-14T12:19:27Z”, “sensor_id”: “aa-101”, “production_line”: “w01”, “widget_type”: “acme94”, “temp_celcius”: 23, “widget_weight_g”: 100 Photo by Franck V. on Unsplash SELECT AVG(TEMP_CELCIUS) AS TEMP FROM WIDGETS GROUP BY SENSOR_ID HAVING TEMP>20 CREATE SINK CONNECTOR dw WITH ( Object store, ‘connector.class’ = ‘S3Connector’, data warehouse, ‘topics’ = ‘widgets’ RDBMS …); @rmoff | #DataXDays | @confluentinc
ksqlDB The event streaming database purpose-built for stream processing applications. @rmoff | #DataXDays | @confluentinc
Stream Processing with ksqlDB Source stream @rmoff | #DataXDays | @confluentinc
Stream Processing with ksqlDB Source stream @rmoff | #DataXDays | @confluentinc
Stream Processing with ksqlDB Source stream @rmoff | #DataXDays | @confluentinc
Stream Processing with ksqlDB Source stream Analytics @rmoff | #DataXDays | @confluentinc
Stream Processing with ksqlDB Source stream Applications / Microservices @rmoff | #DataXDays | @confluentinc
Stream Processing with ksqlDB …SUM(TXN_AMT) GROUP BY AC_ID AC _I D= 42 BA LA NC AC E= _I 94 D= .0 42 0 Source stream Applications / Microservices @rmoff | #DataXDays | @confluentinc
Let’s Build It! Rating events App Pro d uc e rA PI { “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } @rmoff | #DataXDays | @confluentinc
Let’s Build It! Rating events App User data Pro d uc e rA PI a k f a K t c e n n o C RDBMS @rmoff | #DataXDays | @confluentinc
Let’s Build It! Rating events App User data RDBMS Pro d uc e rA PI a k f a K t c e n n o C Kafka Connect Operational Dashboard Elasticsearch ksqlDB Join events to users, and filter @rmoff | #DataXDays | @confluentinc
Let’s Build It! Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect Operational Dashboard Elasticsearch a fk t Ka ec n RDBMS Pro d I P A r e m n Co User data Push notification Data Lake ksqlDB Join events to users, and filter @rmoff SnowflakeDB/ S3/HDFS/etc | #DataXDays | @confluentinc
Let’s Build It! Rating events App Push notification Pro d I P A r e m App u s n o C uc e rA PI Demo Time! Elasticsearch a fk t Ka ec n RDBMS Operational Dashboard n Co User data a k f a K t c e n n o C Kafka Connect Data Lake ksqlDB Join events to users, and filter @rmoff SnowflakeDB/ S3/HDFS/etc | #DataXDays | @confluentinc
Push notification Rating events App Kafka Connect Operational Dashboard Elasticsearch a fk t Ka ec n RDBMS u s n o C uc e rA PI a k f a K t c e n n o C App n Co User data Pro d I P A r e m Data Lake ksqlDB Join events to users, and filter @rmoff SnowflakeDB/ S3/HDFS/etc | #DataXDays | @confluentinc
Push notification to Slack Rating events App Pro d I P A r e m App u s n o C uc e rA PI Kafka Connect Operational Dashboard Elasticsearch n Co a fk t Ka ec n a k f a K t c e n n o C poor_ratings Data ratings ksqlDB User data RDBMS Lake Filter events @rmoff S3/HDFS/ SnowflakeDB | etc #DataXDays | @confluentinc
{ Filter all ratings where STARS<3 “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” POOR_RATINGS } Producer API CREATE STREAM POOR_RATINGS AS SELECT * FROM ratings WHERE STARS <3 @rmoff | #DataXDays | @confluentinc
Kafka Connect Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect Operational Dashboard Elasticsearch a fk t Ka ec n RDBMS Pro d I P A r e m n Co User data Push notification to Slack Join events to users, and filter @rmoff Data Lake SnowflakeDB/ S3/HDFS/etc | #DataXDays | @confluentinc
Producer API MySQL t c e n n o C a k f Ka m u i z e b e D @rmoff | #DataXDays | @confluentinc
Time The Stream/Table Duality Stream Account ID Amount 12345 + £50 12345
The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash
{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Join each rating to customer data Producer API RATINGS_WITH_CUSTOMER_DATA t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “[email protected]”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS SELECT * FROM RATINGS LEFT JOIN CUSTOMERS ON R.ID=C.ID; } @rmoff | #DataXDays | @confluentinc
{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Join each rating to customer data Producer API RATINGS_WITH_CUSTOMER_DATA Filter for just PLATINUM customers t c e n n o C a k f a K UNHAPPY_PLATINUM_CUSTOMERS { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “[email protected]”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS < 3 } @rmoff | #DataXDays | @confluentinc
{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” CREATE TABLE RATINGS_BY_CLUB_STATUS AS SELECT CLUB_STATUS, COUNT(*) Join each rating to customer data FROM RATINGS_WITH_CUSTOMER_DATA Producer APWINDOW I TUMBLING RATINGS_WITH_CUSTOMER_DATA (SIZE 1 MINUTES) GROUP BY CLUB_STATUS; } t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “[email protected]”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” } Aggregate per-minute by CLUB_STATUS RATINGS_BY_CLUB_STATUS_1MIN @rmoff | #DataXDays | @confluentinc
Kafka Connect → Elasticsearch @rmoff | #DataXDays | @confluentinc
on Photo by Want to learn more? CTAs, not CATs (sorry, not sorry) @rmoff | #DataXDays | @confluentinc
Free Books! https://rmoff.dev/dataxdays @rmoff | #DataXDays | @confluentinc
60 DE VA DV $50 USD off your bill each calendar month for the first three months when you sign up https://rmoff.dev/ccloud Free money! (additional $60 towards your bill 😄 ) Fully Managed Kafka as a Service * Limited availability. Activate by 11th September 2020. Expires after 90 days of activation. Any unused promo value on the expiration date will be forfeited.
Learn Kafka. Start building with Apache Kafka at Confluent Developer. developer.confluent.io
Confluent Community Slack group cnfl.io/slack @rmoff | #DataXDays | @confluentinc
Further reading / watching • Kafka as a Platform: the Ecosystem from the Ground Up http://rmoff.dev/youtube • https://rmoff.dev/kafka101 • Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline! • https://rmoff.dev/ljc-kafka-01 • From Zero to Hero with Kafka Connect • https://rmoff.dev/ljc-kafka-02 • Introduction to ksqlDB • https://rmoff.dev/ljc-kafka-03 • Integrating Oracle and Kafka • https://rmoff.dev/oracle-and-kafka • The Changing Face of ETL: Event-Driven Architectures for Data Engineers • https://rmoff.dev/oredev19-changing-face-of-etl • 🚂On Track with Apache Kafka: Building a Streaming Platform solution with Rail Data • https://rmoff.dev/oredev19-on-track-with-kafka @rmoff | #DataXDays | @confluentinc
Resources #EOF • CDC Spreadsheet • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC • #partner-engineering on Slack for questions • BD team (#partners / [email protected]) can help with introductions on a given sales op @rmoff | #DataXDays | @confluentinc