The Changing Face of ETL: Event-Driven Architectures for Data Engineers
@rmoff

It used to be so simple @rmoff | #ConfluentVUG | @confluentinc

More Sources

More Targets

More Data

Batches and Buckets

Applications: Respond → an order was placed!
Analytics: Tell Us What Happened → how many orders were placed?

• It’s the same thing that happened. It’s the same piece of data. We just want different things from it.
• Apps respond to something happening (an order was placed!)
• Analytics tell us what happened (how many orders were placed?)
• Historically, the technology forced you to treat these separately. OLTP/OLAP was a compromise: you could have quick data in or quick data out. Choose one.
• Batch ETL was the inevitable sticking plaster on top of that. While you had only a few in-house systems to get data from, and one to write it to, this didn’t matter. But that’s no longer the case.
• This isn’t about a compromise, or about crowbarring everything into a shiny new technology that I’ve found.
• This is about adopting a unified platform that enables BOTH apps and analytics to be better (lower latency, more flexible architecture, more scalable).
• This is all enabled through events, implemented on a highly scalable, distributed technology with huge integration capabilities and a universally supported API.

$ whoami
• Robin Moffatt (@rmoff)
• Senior Developer Advocate at Confluent (Apache Kafka, not Wikis 😉)
• Working in data & analytics since 2001
• Oracle ACE Director (Alumnus)
http://rmoff.dev/talks · http://rmoff.dev/blog · http://rmoff.dev/youtube

Events

An event is both:
✴ Notification
✴ State transfer

A Customer Experience

A Sensor Reading

Databases

The Stream/Table Duality
Stream (Account ID, Amount) → Table (Account ID, Balance), as time passes:
12345, +€50 → 12345, €50
12345, +€25 → 12345, €75
12345, -€60 → 12345, €15
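A minimal Java sketch of the duality, using the account figures from the slide. The class and method names (`StreamTableDuality`, `Change`, `balancesOverTime`) are hypothetical and only illustrate the idea: the stream is the sequence of changes, and the table is what you get by applying them in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: fold a stream of account changes into a table of balances. */
final class StreamTableDuality {

    /** One event on the stream: an amount applied to an account. */
    static final class Change {
        final String accountId;
        final int amountEur;

        Change(String accountId, int amountEur) {
            this.accountId = accountId;
            this.amountEur = amountEur;
        }
    }

    /** Applies one event to the table, which holds only the latest balance per account. */
    static void apply(Map<String, Integer> table, Change change) {
        table.merge(change.accountId, change.amountEur, Integer::sum);
    }

    /** Replays the stream and records the affected account's balance after each event. */
    static List<Integer> balancesOverTime(List<Change> stream) {
        Map<String, Integer> table = new LinkedHashMap<>();
        List<Integer> history = new ArrayList<>();
        for (Change change : stream) {
            apply(table, change);
            history.add(table.get(change.accountId));
        }
        return history;
    }

    public static void main(String[] args) {
        List<Change> stream = Arrays.asList(
                new Change("12345", 50),
                new Change("12345", 25),
                new Change("12345", -60));
        System.out.println(balancesOverTime(stream)); // [50, 75, 15]
    }
}
```

Replaying the same stream always rebuilds the same table, which is why the log can be treated as the source of truth.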

"The truth is the log. The database is a cache of a subset of the log."
Pat Helland, Immutability Changes Everything
http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf

Events
Basket: Bread, Tinned Spaghetti

Events
ItemAdd Bread
Basket: Bread

Events
ItemAdd Bread, ItemAdd Baked Beans
Basket: Bread, Baked Beans

Events
ItemAdd Bread, ItemAdd Baked Beans, ItemRemove Baked Beans
Basket: Bread

Events
ItemAdd Bread, ItemAdd Baked Beans, ItemRemove Baked Beans, ItemAdd Tinned Spaghetti
Basket: Bread, Tinned Spaghetti

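The basket example can be sketched in a few lines of Java: the basket is never stored directly, it is derived by replaying the ItemAdd and ItemRemove events. `BasketReplay`, `Event`, and `Type` are hypothetical names for this sketch, not part of any Kafka API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration: derive the basket's state by replaying its events. */
final class BasketReplay {

    enum Type { ITEM_ADD, ITEM_REMOVE }

    /** One immutable event: something that happened to the basket. */
    static final class Event {
        final Type type;
        final String item;

        Event(Type type, String item) {
            this.type = type;
            this.item = item;
        }
    }

    /** Replays the events in order and returns the resulting basket contents. */
    static List<String> replay(List<Event> log) {
        List<String> basket = new ArrayList<>();
        for (Event event : log) {
            if (event.type == Type.ITEM_ADD) {
                basket.add(event.item);
            } else {
                basket.remove(event.item); // removes the first matching item, if present
            }
        }
        return basket;
    }

    public static void main(String[] args) {
        List<Event> log = Arrays.asList(
                new Event(Type.ITEM_ADD, "Bread"),
                new Event(Type.ITEM_ADD, "Baked Beans"),
                new Event(Type.ITEM_REMOVE, "Baked Beans"),
                new Event(Type.ITEM_ADD, "Tinned Spaghetti"));
        System.out.println(replay(log));               // [Bread, Tinned Spaghetti]
        System.out.println(replay(log.subList(0, 2))); // [Bread, Baked Beans]
    }
}
```

Replaying only a prefix of the log gives the basket as it was at that point, and the log also keeps what the final state loses: the customer considered Baked Beans and then changed their mind.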

What is an Event Streaming Platform?
Producer, Consumer, Connectors, The Log, Streaming Engine

Immutable Event Log
Old → New
Messages are added at the end of the log

Topics
Clicks, Orders, Customers
Topics are similar in concept to tables in a database

Partitions
Clicks: P0, P1, P2
Messages are guaranteed to be strictly ordered within a partition
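A simplified Java sketch of why per-partition ordering is useful in practice: messages with the same key are routed to the same partition, so they stay in order relative to each other. This is an assumption-laden toy, not Kafka's partitioner. Kafka's default partitioner hashes the serialised key bytes, whereas this sketch uses `String.hashCode()`, and `KeyedPartitions`, `partitionFor`, and `route` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration: route keyed messages to partitions by hashing the key. */
final class KeyedPartitions {

    /** Same key always maps to the same partition (simplified: uses String.hashCode). */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Appends each {key, value} message to its partition, preserving arrival order. */
    static Map<Integer, List<String>> route(List<String[]> messages, int numPartitions) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String[] message : messages) {
            int partition = partitionFor(message[0], numPartitions);
            partitions.computeIfAbsent(partition, p -> new ArrayList<>()).add(message[1]);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> clicks = Arrays.asList(
                new String[] {"user-1", "a"},
                new String[] {"user-2", "b"},
                new String[] {"user-1", "c"});
        // Values for user-1 ("a" then "c") land in one partition, in that order.
        System.out.println(route(clicks, 3));
    }
}
```

Ordering across different partitions is not guaranteed, so the choice of key decides which messages can rely on being read in order.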

Messages are just K/V bytes plus headers + timestamp
Clicks: Header, Timestamp, Key, Value

Serialisation & Schemas
Avro 👍 · Protobuf 👍 · JSON Schema 👍 · JSON, CSV 😬
https://rmoff.dev/qcon-schemas

Consumers have a position all of their own
Old → New
Scanning the log: Sally is here

Consumers have a position all of their own
Old → New
Scanning the log: Fred is here, Sally is here

Consumers have a position all of their own
Old → New
Scanning the log: George is here, Fred is here, Sally is here
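A small Java sketch of independent consumer positions: each consumer tracks its own offset into the same log, and reading never removes a message. `EventLog`, `poll`, and `position` here are hypothetical names for an in-memory toy, not the Kafka consumer API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: one append-only log, one independent position per consumer. */
final class EventLog {

    private final List<String> messages = new ArrayList<>();
    private final Map<String, Integer> positions = new HashMap<>();

    /** Messages are only ever added at the end of the log. */
    void append(String message) {
        messages.add(message);
    }

    /** Returns the consumer's next unread message, or null if it has caught up. */
    String poll(String consumer) {
        int position = position(consumer);
        if (position >= messages.size()) {
            return null;
        }
        positions.put(consumer, position + 1); // only this consumer's position moves
        return messages.get(position);
    }

    /** Offset of the next message this consumer will read. */
    int position(String consumer) {
        return positions.getOrDefault(consumer, 0);
    }

    int size() {
        return messages.size();
    }

    public static void main(String[] args) {
        EventLog log = new EventLog();
        log.append("m0");
        log.append("m1");
        log.append("m2");
        log.poll("Sally");
        log.poll("Sally");
        log.poll("Fred");
        // Sally is at 2, Fred at 1, George (who has not read yet) at 0.
        System.out.println(log.position("Sally") + " " + log.position("Fred")
                + " " + log.position("George"));
    }
}
```

Because the log is not consumed destructively, a new consumer such as George can start from the oldest message long after the others have moved on.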

The Connect API
Producer, Consumer, Connectors, The Log, Streaming Engine

Streaming Integration with Kafka Connect
Sources (syslog) → Kafka Connect (Tasks, Workers) → Kafka Brokers

Streaming Integration with Kafka Connect
Kafka Brokers → Kafka Connect (Tasks, Workers) → Sinks (Amazon S3, Google BigQuery)

Streaming Integration with Kafka Connect
syslog, Amazon S3, Google BigQuery
Kafka Connect (Tasks, Workers), Kafka Brokers

Stream Processing in Kafka
Producer, Consumer, Connectors, The Log, Streaming Engine

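Kafka Streams API
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

// stringSerde and ordersSerde are assumed to be defined elsewhere
final StreamsBuilder builder = new StreamsBuilder();
// Keep only completed orders and write them to a new topic
builder.stream("orders", Consumed.with(stringSerde, ordersSerde))
       .filter((key, order) -> order.getStatus().equals("COMPLETE"))
       .to("complete_orders", Produced.with(stringSerde, ordersSerde));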

Stream Processing with ksqlDB
CREATE STREAM completedOrders AS
  SELECT * FROM orders WHERE status='COMPLETE';

This is Something New

Events in Action
Review events → reviews

Events in Action
Review events → reviews
Consumers: Operational dashboard

Events in Action
Review events → reviews
Consumers: Operational dashboard, Data lake

Events in Action
Review events → reviews → reviews_clean
Consumers: Operational dashboard, Data lake
Filter out bad data:
CREATE STREAM reviews_clean AS
  SELECT * FROM reviews
  WHERE id IS NOT NULL;

Events in Action
Existing apps → RDBMS (txn log) → Kafka Connect → Kafka
User data → users

Events in Action
Review events → reviews → reviews_clean
User data → users
Consumers: Operational dashboard, Data lake

Events in Action
Review events → reviews → reviews_clean → enriched_reviews
User data → users
Consumers: Operational dashboard, Data lake
Join events to users, and filter:
CREATE STREAM reviews_clean AS
  SELECT * FROM reviews
  WHERE id IS NOT NULL;
CREATE STREAM enriched_reviews AS
  SELECT * FROM reviews_clean r
  INNER JOIN users u
  ON r.userid=u.userid;

Events in Action
Review events, User data
Consumers: Notification service, Operational dashboard, Data lake

Events in Action
Review events → reviews → reviews_clean → enriched_reviews → unhappy_vips
User data → users
Consumers: Notification service, Operational dashboard, Data lake
Join events to users, and filter:
CREATE STREAM unhappy_vips AS
  SELECT * FROM enriched_reviews
  WHERE rating < 3 AND status = 'Platinum';
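To make the three ksqlDB steps concrete, here is a plain-Java sketch of the same logic over in-memory collections: drop reviews with no id, inner-join each review to its user, then keep low ratings from Platinum users. It only mirrors the semantics. ksqlDB runs these continuously over unbounded streams, and `ReviewPipeline`, `Review`, and `unhappyVips` are hypothetical names, with the field names borrowed from the columns in the queries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the reviews_clean → enriched_reviews → unhappy_vips steps. */
final class ReviewPipeline {

    static final class Review {
        final String id;
        final String userId;
        final int rating;

        Review(String id, String userId, int rating) {
            this.id = id;
            this.userId = userId;
            this.rating = rating;
        }
    }

    /** reviews_clean: WHERE id IS NOT NULL. */
    static List<Review> clean(List<Review> reviews) {
        List<Review> cleaned = new ArrayList<>();
        for (Review review : reviews) {
            if (review.id != null) {
                cleaned.add(review);
            }
        }
        return cleaned;
    }

    /** Returns the ids of reviews that would reach unhappy_vips. */
    static List<String> unhappyVips(List<Review> reviews, Map<String, String> statusByUser) {
        List<String> ids = new ArrayList<>();
        for (Review review : clean(reviews)) {
            // enriched_reviews: INNER JOIN users ON userid (no matching user, no output)
            String status = statusByUser.get(review.userId);
            if (status == null) {
                continue;
            }
            // unhappy_vips: WHERE rating < 3 AND status = 'Platinum'
            if (review.rating < 3 && status.equals("Platinum")) {
                ids.add(review.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("u1", "Platinum");
        users.put("u2", "Gold");
        List<Review> reviews = Arrays.asList(
                new Review("r1", "u1", 1),
                new Review(null, "u1", 1),
                new Review("r3", "u2", 2),
                new Review("r4", "u1", 5));
        System.out.println(unhappyVips(reviews, users)); // [r1]
    }
}
```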

The Power of an Event-Driven Architecture

Not Everything is a Nail
Events → RDBMS

Not Everything is a Nail
Events → RDBMS, Elasticsearch

Not Everything is a Nail
Events → RDBMS, Elasticsearch, Graph

Side-by-Side Tech Evaluation
Events → HDFS

Side-by-Side Tech Evaluation
Events → HDFS, BigQuery

Side-by-Side Tech Evaluation
Events → HDFS, BigQuery, Snowflake

Evolve Data Sources
Producer (on-premises)
Consuming App A, Consuming App B

Evolve Data Sources
Producer (on-premises), Producer (cloud)
Consuming App A, Consuming App B

Evolve Data Sources
Producer (cloud)
Consuming App A, Consuming App B

Tight Coupling != Flexible
Orders, RDBMS

Tight Coupling != Flexible
Orders, RDBMS, HDFS

Tight Coupling != Flexible
Orders, RDBMS, HDFS, App

Loose Coupling == Freedom to Evolve
Orders, RDBMS

Loose Coupling == Freedom to Evolve
Orders, RDBMS, HDFS

Loose Coupling == Freedom to Evolve
Orders, RDBMS, HDFS, App

Transform Once, Use Many: Data Cleansing
IoT → temp_raw → App, App, RDBMS

Transform Once, Use Many: Data Cleansing
IoT → temp_raw → App, App, RDBMS
temp_raw:
sensor_id  time_epoch  reading
42         1551136074  13.05
42         1551136125  13.11
           1551136125  13.11
42         1551138129  13.04

Transform Once, Use Many: Data Cleansing
IoT → temp_raw → Cleanse → App, Cleanse → App, Cleanse → RDBMS
temp_raw:
sensor_id  time_epoch  reading
42         1551136074  13.05
42         1551136125  13.11
           1551136125  13.11
42         1551138129  13.04

Transform Once, Use Many: Data Cleansing
IoT → temp_raw → (SENSOR_ID IS NOT NULL) → temp_clean → App, App, RDBMS
temp_raw:
sensor_id  time_epoch  reading
42         1551136074  13.05
42         1551136125  13.11
           1551136125  13.11
42         1551138129  13.04
temp_clean:
sensor_id  time_epoch  reading
42         1551136074  13.05
42         1551136125  13.11
42         1551138129  13.04

Transform Once, Use Many: Data Enrichment
RDBMS, Events, Join, App 01

Transform Once, Use Many: Data Enrichment
RDBMS, Events, Join, App 01
Elasticsearch, Join, App 02

Transform Once, Use Many: Data Enrichment
RDBMS, Events, Join, App 01, Elasticsearch

Message Payload Compatibility
Producer, Consuming App

Message Payload Compatibility
Producer, Producer, Consuming App

Message Payload Compatibility
Producer, Producer, Consuming App
Triangles to Squares

Build Resilient Pipelines with Schemas
Producer → sales_csv
App 01: Apply schema (COL1 → ID INT, COL2 → NAME VARCHAR)
App 02: Apply schema (COL1 → ID INT, COL2 → NAME VARCHAR)

Build Resilient Pipelines with Schemas
Producer → sales_csv → Apply schema (COL1 → ID INT, COL2 → NAME VARCHAR) → sales
Schema Registry
App 01, App 02
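A minimal Java sketch of the "apply schema once" step: a raw CSV line is parsed a single time into a typed record (COL1 as an integer ID, COL2 as a string NAME), so downstream apps consume typed fields instead of each re-parsing text. `ApplySchema`, `Sale`, and `parse` are hypothetical names, and the naive comma split is an assumption that ignores quoted or escaped fields. In a real pipeline the schema would live in the Schema Registry rather than in application code.

```java
/** Hypothetical illustration: apply a schema to raw CSV once, at the edge of the pipeline. */
final class ApplySchema {

    /** Typed record matching the slide's schema: ID INT, NAME VARCHAR. */
    static final class Sale {
        final int id;
        final String name;

        Sale(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /**
     * Parses one line of sales_csv into a typed Sale.
     * Rejects lines that do not match the schema instead of passing bad data downstream.
     */
    static Sale parse(String csvLine) {
        String[] columns = csvLine.split(",", -1); // naive split: no quoting support
        if (columns.length != 2) {
            throw new IllegalArgumentException("Expected 2 columns, got " + columns.length);
        }
        int id = Integer.parseInt(columns[0].trim()); // throws NumberFormatException if not an INT
        String name = columns[1].trim();
        return new Sale(id, name);
    }

    public static void main(String[] args) {
        Sale sale = parse("42,Widget");
        System.out.println(sale.id + " " + sale.name); // 42 Widget
    }
}
```

Malformed input fails here, in one place, rather than in every consuming application.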

Say NO to brittle pipelines

App, App, App, App, cache, monitoring, cache, MQ, MQ, security, DWH, search, Hadoop

App, App, App, App, KAFKA, App, App, App, App, DWH, Hadoop
request-response, changelogs, messaging OR stream processing, streaming data pipelines

Events model the real world

Event streaming platform
• Native stream processing
• Data when you need it
• Data persistence
• Flexibility & scalability

Want to learn more?
CTAs, not CATs (sorry, not sorry)

Free Books!
https://rmoff.dev/q2m

Fully Managed Kafka as a Service
$50 USD off your bill each calendar month for the first three months when you sign up
https://rmoff.dev/ccloud
Free money! (additional $60 towards your bill 😄)
60 DE VA DV
* Limited availability. Activate by 11th September 2020. Expires after 90 days of activation. Any unused promo value on the expiration date will be forfeited.

Learn Kafka. Start building with Apache Kafka at Confluent Developer. developer.confluent.io

Confluent Community Slack group
cnfl.io/slack

Further reading / watching
http://rmoff.dev/youtube
• Kafka as a Platform: the Ecosystem from the Ground Up
  https://rmoff.dev/kafka101
• Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline!
  https://rmoff.dev/ljc-kafka-01
• From Zero to Hero with Kafka Connect
  https://rmoff.dev/ljc-kafka-02
• Introduction to ksqlDB
  https://rmoff.dev/ljc-kafka-03
• Integrating Oracle and Kafka
  https://rmoff.dev/oracle-and-kafka
• The Changing Face of ETL: Event-Driven Architectures for Data Engineers
  https://rmoff.dev/oredev19-changing-face-of-etl
• 🚂 On Track with Apache Kafka: Building a Streaming Platform solution with Rail Data
  https://rmoff.dev/oredev19-on-track-with-kafka

#EOF

Resources
• CDC Spreadsheet
• Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC
• #partner-engineering on Slack for questions
• BD team (#partners / [email protected]) can help with introductions on a given sales op