The Changing Face of ETL: Event-Driven Architectures for Data Engineers

A presentation at Confluent VUG in July 2020 by Robin Moffatt

Slide 1

The Changing Face of ETL: Event-Driven Architectures for Data Engineers Photo by rmoff @rmoff

Slide 2

Photo by Samuel Sianipar on Unsplash

Slide 3

Photo by Khai Sze Ong on Unsplash

Slide 4

Photo by Rainier Ridao on Unsplash

Slide 5

Photo by Rohit Tandon on Unsplash

Slide 6

Photo by Theodore Moore on Unsplash

Slide 7

Photo by Cristian Grecu on Unsplash

Slide 8

Photo by Patrick Fore on Unsplash It used to be so simple @rmoff | #ConfluentVUG | @confluentinc

Slide 9

Photo by Eugenio Mazzone on Unsplash More Sources

Slide 10

Photo by Tom Barrett on Unsplash More Targets

Slide 11

Photo by Kirill on Unsplash More Data

Slide 12

Batches and Buckets

Slide 13

Applications: Respond → an order was placed!
Analytics: Tell Us What Happened → how many orders were placed
Photo by Deva Darshan from Pexels

Slide 14

Slide 15

Photo by NASA on Unsplash
• <city view from above>
• It’s the same thing that happened. It’s the same piece of data. We just want different things from it.
• Apps -> respond to something happening (an order was placed!)
• Analytics -> tell us what happened (how many orders were placed?)
• Historically, the technology was such that you had to split the two. OLTP/OLAP was a compromise: you can have quick data in or quick data out; choose one.
• Batch ETL was the inevitable sticking plaster on top of that. Whilst you only had a few systems in-house from which to get data, and one to write it to, this didn’t matter. But that’s no longer the case.
• This isn’t about a compromise, or about crowbarring everything into a new shiny technology that I’ve found.
• This is about adopting a unified platform that enables BOTH apps and analytics to be better (lower latency, more flexible architecture, more scalable).
• This is all enabled through events, implemented on a highly scalable, distributed technology with huge integration capabilities and a universally supported API.

Slide 16

$ whoami
• Robin Moffatt (@rmoff)
• Senior Developer Advocate at Confluent (Apache Kafka, not Wikis 😉)
• Working in data & analytics since 2001
• Oracle ACE Director (Alumnus)
http://rmoff.dev/talks · http://rmoff.dev/blog · http://rmoff.dev/youtube

Slide 17

Photo by Mark Kamalov on Unsplash Events

Slide 18

An event is both: ✴ Notification ✴ State transfer

Slide 19

A Customer Experience

Slide 20

A Sensor Reading

Slide 21

Databases

Slide 22

The Stream/Table Duality
Stream (events over time):
Account ID   Amount
12345        +€50
12345        +€25
12345        -€60
Table (state after each event):
Account ID   Balance
12345        €50
12345        €75
12345        €15
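The duality on this slide can be sketched in plain Java: folding the stream of amounts from the start gives the table of balances, and folding a longer prefix gives the table at a later point in time. This is an in-memory illustration only; the class and method names (`StreamTableDuality`, `Transaction`, `toTable`) are hypothetical and not part of any Kafka API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration only; names are hypothetical, not a Kafka API.
class StreamTableDuality {

    // One event on the stream: a signed change to an account balance, in euros.
    static final class Transaction {
        final String accountId;
        final int amount;

        Transaction(String accountId, int amount) {
            this.accountId = accountId;
            this.amount = amount;
        }
    }

    // Fold the stream into a table: the running balance per account.
    static Map<String, Integer> toTable(List<Transaction> stream) {
        Map<String, Integer> balances = new LinkedHashMap<>();
        for (Transaction t : stream) {
            balances.merge(t.accountId, t.amount, Integer::sum);
        }
        return balances;
    }

    public static void main(String[] args) {
        List<Transaction> stream = List.of(
                new Transaction("12345", 50),
                new Transaction("12345", 25),
                new Transaction("12345", -60));
        // Each longer prefix of the stream yields the table one event later.
        for (int n = 1; n <= stream.size(); n++) {
            System.out.println(toTable(stream.subList(0, n)));
        }
    }
}
```

Run as written, the three printed tables show balances of 50, 75 and 15 for account 12345, matching the slide.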

Slide 23

The truth is the log. The database is a cache of a subset of the log.
Pat Helland, Immutability Changes Everything
http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
Photo by Bobby Burch on Unsplash

Slide 24

Events
Basket: Bread, Tinned Spaghetti

Slide 25

Events
Basket: Bread
ItemAdd Bread

Slide 26

Events
Basket: Bread, Baked Beans
ItemAdd Bread → ItemAdd Baked Beans

Slide 27

Events
Basket: Bread
ItemAdd Bread → ItemAdd Baked Beans → ItemRemove Baked Beans

Slide 28

Events
Basket: Bread, Tinned Spaghetti
ItemAdd Bread → ItemAdd Baked Beans → ItemRemove Baked Beans → ItemAdd Tinned Spaghetti

Slide 29

Events
Basket: Bread, Tinned Spaghetti
ItemAdd Bread → ItemAdd Baked Beans → ItemRemove Baked Beans → ItemAdd Tinned Spaghetti

Slide 30

Events
Basket: Bread, Tinned Spaghetti
ItemAdd Bread → ItemAdd Baked Beans → ItemRemove Baked Beans → ItemAdd Tinned Spaghetti

Slide 31

Events
Basket: Bread, Tinned Spaghetti
ItemAdd Bread → ItemAdd Baked Beans → ItemRemove Baked Beans → ItemAdd Tinned Spaghetti
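The basket build-up above can be sketched as a replay: the basket is just the result of applying each event in order. The final state shows only Bread and Tinned Spaghetti, while the events also record that Baked Beans were added and then removed. Only the event names come from the slides; the Java types here (`BasketReplay`, `BasketEvent`, `Type`) are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical types for illustration; only the event names come from the slides.
class BasketReplay {

    enum Type { ITEM_ADD, ITEM_REMOVE }

    static final class BasketEvent {
        final Type type;
        final String item;

        BasketEvent(Type type, String item) {
            this.type = type;
            this.item = item;
        }
    }

    // Derive the current basket (state) by applying every event in order.
    static Set<String> replay(List<BasketEvent> events) {
        Set<String> basket = new LinkedHashSet<>();
        for (BasketEvent e : events) {
            if (e.type == Type.ITEM_ADD) {
                basket.add(e.item);
            } else {
                basket.remove(e.item);
            }
        }
        return basket;
    }

    public static void main(String[] args) {
        List<BasketEvent> events = List.of(
                new BasketEvent(Type.ITEM_ADD, "Bread"),
                new BasketEvent(Type.ITEM_ADD, "Baked Beans"),
                new BasketEvent(Type.ITEM_REMOVE, "Baked Beans"),
                new BasketEvent(Type.ITEM_ADD, "Tinned Spaghetti"));
        System.out.println(replay(events)); // [Bread, Tinned Spaghetti]
    }
}
```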

Slide 32

What is an Event Streaming Platform? Producer Connectors Consumer The Log Connectors Streaming Engine

Slide 33

Immutable Event Log (Old → New): messages are added at the end of the log

Slide 34

Topics: Clicks, Orders, Customers. Topics are similar in concept to tables in a database.

Slide 35

Partitions: Clicks (P0, P1, P2). Messages are guaranteed to be strictly ordered within a partition.

Slide 36

Messages are just K/V bytes plus headers + timestamp. Clicks: Header, Timestamp, Key, Value

Slide 37

Serialisation & Schemas: Avro 👍 Protobuf 👍 JSON Schema 👍 JSON/CSV 😬 https://rmoff.dev/qcon-schemas

Slide 38

Consumers have a position all of their own (Old → New): Sally is here

Slide 39

Consumers have a position all of their own (Old → New): Fred is here, Sally is here

Slide 40

Consumers have a position all of their own (Old → New): George is here, Fred is here, Sally is here
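The log slides can be sketched as a toy, single-partition log in plain Java: messages are only ever appended at the end, and each consumer keeps its own position, so one consumer reading does not affect another. This is an illustration of the idea under simplifying assumptions, not how Kafka is implemented; `InMemoryLog` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy, single-partition log; not how Kafka is implemented.
class InMemoryLog {
    private final List<String> messages = new ArrayList<>();
    private final Map<String, Integer> positions = new HashMap<>();

    // Messages are only ever added at the end; existing entries are never changed.
    int append(String message) {
        messages.add(message);
        return messages.size() - 1; // offset of the new message
    }

    // Read up to max messages from this consumer's own position, then advance it.
    List<String> poll(String consumer, int max) {
        int from = positions.getOrDefault(consumer, 0);
        int to = Math.min(from + max, messages.size());
        positions.put(consumer, to);
        return new ArrayList<>(messages.subList(from, to));
    }

    int position(String consumer) {
        return positions.getOrDefault(consumer, 0);
    }

    public static void main(String[] args) {
        InMemoryLog log = new InMemoryLog();
        log.append("click-1");
        log.append("click-2");
        log.append("click-3");
        System.out.println(log.poll("Sally", 2));   // [click-1, click-2]
        System.out.println(log.poll("Fred", 1));    // [click-1]
        System.out.println(log.poll("George", 10)); // [click-1, click-2, click-3]
        System.out.println(log.poll("Sally", 10));  // [click-3]
    }
}
```

Because reading never removes anything, a new consumer such as George can start from the oldest message long after Sally and Fred have moved on.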

Slide 41

The Connect API Producer Connectors Consumer The Log Connectors Streaming Engine

Slide 42

Streaming Integration with Kafka Connect · Sources: syslog · Kafka Connect (Tasks, Workers) · Kafka Brokers

Slide 43

Streaming Integration with Kafka Connect · Sinks: Amazon S3, Google BigQuery · Kafka Connect (Tasks, Workers) · Kafka Brokers

Slide 44

Streaming Integration with Kafka Connect · syslog, Amazon S3, Google BigQuery · Kafka Connect (Tasks, Workers) · Kafka Brokers

Slide 45

Stream Processing in Kafka Producer Connectors Consumer The Log Connectors Streaming Engine

Slide 46

Kafka Streams API
// stringSerde and ordersSerde are assumed to be defined elsewhere
final StreamsBuilder builder = new StreamsBuilder();
// Keep only completed orders and write them to a new topic
builder.stream("orders", Consumed.with(stringSerde, ordersSerde))
       .filter((key, order) -> order.getStatus().equals("COMPLETE"))
       .to("complete_orders", Produced.with(stringSerde, ordersSerde));

Slide 47

Stream Processing with ksqlDB
CREATE STREAM completedOrders AS
  SELECT * FROM orders
  WHERE status='COMPLETE';

Slide 48

Photo by Ash from Modern Afflatus on Unsplash This is Something New

Slide 49

Events in Action Review events reviews

Slide 50

Events in Action Review events reviews Operational dashboard

Slide 51

Events in Action Review events reviews Operational dashboard Data lake

Slide 52

Events in Action Review events reviews reviews_clean Operational dashboard Data lake Filter out bad data
CREATE STREAM reviews_clean AS
  SELECT * FROM reviews
  WHERE id IS NOT NULL;

Slide 53

Events in Action Existing apps User data users Kafka Connect RDBMS txn log Kafka

Slide 54

Events in Action Review events reviews users reviews_clean Operational dashboard User data Data lake

Slide 55

Events in Action Review events reviews users reviews_clean enriched_reviews Operational dashboard User data Data lake Join events to users, and filter
CREATE STREAM reviews_clean AS
  SELECT * FROM reviews
  WHERE id IS NOT NULL;
CREATE STREAM enriched_reviews AS
  SELECT * FROM reviews_clean r
  INNER JOIN users u ON r.userid=u.userid;

Slide 56

Events in Action Notification service Review events Operational dashboard User data Data lake

Slide 57

Events in Action Review events Notification service reviews users reviews_clean enriched_reviews unhappy_vips Operational dashboard User data Data lake Join events to users, and filter
CREATE STREAM unhappy_vips AS
  SELECT * FROM enriched_reviews
  WHERE rating < 3 AND status = 'Platinum';
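The three ksqlDB statements in this walkthrough (filter, join, filter) can be sketched in plain Java over in-memory lists to show what each step keeps and drops. This is a stand-in for the logic only: it is batch code, not a stream processor, and the `ReviewPipeline` class, its field types, and the use of a simple userId-to-status map for the users table are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java stand-in for the three ksqlDB statements; names and types are assumptions.
class ReviewPipeline {

    static final class Review {
        final Integer id; // nullable: bad records arrive without an id
        final String userId;
        final int rating;

        Review(Integer id, String userId, int rating) {
            this.id = id;
            this.userId = userId;
            this.rating = rating;
        }
    }

    static final class EnrichedReview {
        final Review review;
        final String status; // taken from the matching user

        EnrichedReview(Review review, String status) {
            this.review = review;
            this.status = status;
        }
    }

    // reviews_clean: WHERE id IS NOT NULL
    static List<Review> clean(List<Review> reviews) {
        return reviews.stream()
                .filter(r -> r.id != null)
                .collect(Collectors.toList());
    }

    // enriched_reviews: INNER JOIN users ON userid (reviews without a matching user are dropped)
    static List<EnrichedReview> enrich(List<Review> reviews, Map<String, String> statusByUserId) {
        return reviews.stream()
                .filter(r -> statusByUserId.containsKey(r.userId))
                .map(r -> new EnrichedReview(r, statusByUserId.get(r.userId)))
                .collect(Collectors.toList());
    }

    // unhappy_vips: WHERE rating < 3 AND status = 'Platinum'
    static List<EnrichedReview> unhappyVips(List<EnrichedReview> enriched) {
        return enriched.stream()
                .filter(e -> e.review.rating < 3 && "Platinum".equals(e.status))
                .collect(Collectors.toList());
    }
}
```

Each intermediate result (`clean`, `enrich`) is reusable by other consumers, which is the point the dashboard and data lake boxes make on the slides.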

Slide 58

Photo by rmoff The Power of an Event-Driven Architecture

Slide 59

Not Everything is a Nail Events RDBMS

Slide 60

Not Everything is a Nail Events RDBMS

Slide 61

Not Everything is a Nail Events Elasticsearch RDBMS

Slide 62

Not Everything is a Nail Graph Events Elasticsearch RDBMS

Slide 63

Side-by-Side Tech Evaluation Events HDFS

Slide 64

Side-by-Side Tech Evaluation Events BigQuery HDFS

Slide 65

Side-by-Side Tech Evaluation Snowflake Events BigQuery HDFS

Slide 66

Evolve Data Sources Producer Consuming App A On-premises Consuming App B

Slide 67

Evolve Data Sources Producer Consuming App A On-premises Consuming App B Producer Cloud

Slide 68

Evolve Data Sources Consuming App A Consuming App B Producer Cloud

Slide 69

Tight Coupling != Flexible Orders RDBMS

Slide 70

Tight Coupling != Flexible Orders RDBMS HDFS

Slide 71

Tight Coupling != Flexible Orders RDBMS HDFS App

Slide 72

Loose Coupling == Freedom to Evolve RDBMS Orders

Slide 73

Loose Coupling == Freedom to Evolve RDBMS Orders HDFS

Slide 74

Loose Coupling == Freedom to Evolve RDBMS Orders App HDFS

Slide 75

Transform Once, Use Many: Data Cleansing temp_raw App IoT App RDBMS

Slide 76

Transform Once, Use Many: Data Cleansing
temp_raw
sensor_id  time_epoch  reading
42         1551136074  13.05
42         1551136125  13.11
(null)     1551136125  13.11
42         1551138129  13.04
App IoT App RDBMS

Slide 77

Transform Once, Use Many: Data Cleansing
temp_raw
sensor_id  time_epoch  reading
42         1551136074  13.05
42         1551136125  13.11
(null)     1551136125  13.11
42         1551138129  13.04
IoT · App (Cleanse) · App (Cleanse) · RDBMS (Cleanse)

Slide 78

Transform Once, Use Many: Data Cleansing
temp_raw
sensor_id  time_epoch  reading
42         1551136074  13.05
42         1551136125  13.11
(null)     1551136125  13.11
42         1551138129  13.04
SENSOR_ID IS NOT NULL
temp_clean
sensor_id  time_epoch  reading
42         1551136074  13.05
42         1551136125  13.11
42         1551138129  13.04
App IoT App RDBMS

Slide 79

Transform Once, Use Many: Data Enrichment RDBMS App 01 Events Join

Slide 80

Transform Once, Use Many: Data Enrichment RDBMS App 01 Events Join Elasticsearch App 02 Join

Slide 81

Transform Once, Use Many: Data Enrichment App 01 Events Elasticsearch RDBMS Join

Slide 82

Message Payload Compatibility Producer Consuming App

Slide 83

Message Payload Compatibility Producer Consuming App Producer

Slide 84

Message Payload Compatibility Producer Consuming App Producer Triangles to Squares

Slide 85

Build Resilient Pipelines with Schemas
Producer · sales_csv
App 01: Apply schema (COL1 → ID INT, COL2 → NAME VARCHAR)
App 02: Apply schema (COL1 → ID INT, COL2 → NAME VARCHAR)

Slide 86

Build Resilient Pipelines with Schemas
Schema Registry
Producer · sales_csv · Apply schema (COL1 → ID INT, COL2 → NAME VARCHAR) · sales · App 01, App 02
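The "apply schema" step can be sketched in plain Java: a raw CSV line is parsed once into a typed record, so malformed rows fail in one place and not separately in every consuming app. This shows the idea only and does not model Schema Registry. The `SalesSchema` and `Sale` names are hypothetical, the two columns follow the slide (COL1 as ID INT, COL2 as NAME VARCHAR), and the comma split is deliberately naive (no quoted fields).

```java
// Hypothetical two-column schema matching the slide: COL1 -> ID INT, COL2 -> NAME VARCHAR.
class SalesSchema {

    static final class Sale {
        final int id;
        final String name;

        Sale(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Apply the schema to one raw CSV line; malformed rows are rejected here.
    static Sale apply(String csvLine) {
        String[] cols = csvLine.split(",", -1); // naive split: quoted commas are not handled
        if (cols.length != 2) {
            throw new IllegalArgumentException("expected 2 columns, got " + cols.length);
        }
        // Integer.parseInt throws NumberFormatException if COL1 is not an integer
        int id = Integer.parseInt(cols[0].trim());
        return new Sale(id, cols[1].trim());
    }

    public static void main(String[] args) {
        Sale sale = apply("42, Widget");
        System.out.println(sale.id + " " + sale.name); // 42 Widget
    }
}
```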

Slide 87

Photo by rmoff Say NO to brittle pipelines

Slide 88

App App App App cache monitoring cache MQ MQ security DWH search Hadoop

Slide 89

App App App App request-response changelogs App App KAFKA App App DWH Hadoop messaging OR stream processing streaming data pipelines

Slide 90

Photo by rmoff Events model the real world

Slide 91

Event streaming platform: Native stream processing · Data when you need it · Data persistence · Flexibility & scalability
Photo by rmoff

Slide 92

Want to learn more? CTAs, not CATs (sorry, not sorry)

Slide 93

Free Books! https://rmoff.dev/q2m

Slide 94

Fully Managed Kafka as a Service
60 DE VA DV
$50 USD off your bill each calendar month for the first three months when you sign up
https://rmoff.dev/ccloud
Free money! (additional $60 towards your bill 😄)
* Limited availability. Activate by 11th September 2020. Expires after 90 days of activation. Any unused promo value on the expiration date will be forfeited.

Slide 95

Learn Kafka. Start building with Apache Kafka at Confluent Developer. developer.confluent.io

Slide 96

Confluent Community Slack group cnfl.io/slack

Slide 97

Further reading / watching
http://rmoff.dev/youtube
• Kafka as a Platform: the Ecosystem from the Ground Up
  • https://rmoff.dev/kafka101
• Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline!
  • https://rmoff.dev/ljc-kafka-01
• From Zero to Hero with Kafka Connect
  • https://rmoff.dev/ljc-kafka-02
• Introduction to ksqlDB
  • https://rmoff.dev/ljc-kafka-03
• Integrating Oracle and Kafka
  • https://rmoff.dev/oracle-and-kafka
• The Changing Face of ETL: Event-Driven Architectures for Data Engineers
  • https://rmoff.dev/oredev19-changing-face-of-etl
• 🚂 On Track with Apache Kafka: Building a Streaming Platform solution with Rail Data
  • https://rmoff.dev/oredev19-on-track-with-kafka

Slide 98

Resources #EOF
• CDC Spreadsheet
• Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC
• #partner-engineering on Slack for questions
• BD team (#partners / [email protected]) can help with introductions on a given sales op