I've been catching up on the last few years of data infrastructure lately and wrote up some notes, very curious what I missed and what I got wrong: https://lethain.com/from-lambda-to-kappa-dataflow-paradigms/ …
Replying to @Lethain
In my particular bubble, Beam (deployed on GCP, esp with Scio as a wrapper) is getting more mindshare than Flink, and Spark is heavily used for batch (basically just as a next-gen MapReduce) but much less so for streaming.
Replying to @avibryant @Lethain
It's also worth thinking about the rise of the distributed OLAP column store (Redshift, BigQuery, Presto) as something that everyone can use vs only enterprises paying for Vertica. Feeding Parquet files to these becomes a major function of your data systems.
Replying to @avibryant @Lethain
Finally: I remain attached to Lambda architecture at least for the reason that having a "hot storage" vs "cold storage" distinction built in can lead to big cost savings, and because I view the "parallel implementation" problem as largely solved by Summingbird etc.
Replying to @avibryant
Cost savings is an interesting point. Kappa seems like a superset of Lambda, so it could be recreated there with various tiers. On Beam/Flink, my naive sense is that Flink is interesting as a non-GCP-coupled runner for Beam, and otherwise only its native-streaming aspect stands out.
Sure, but "recreating Kappa with various tiers" feels like Lambda to me and I think I'd rather be explicit about it vs building leaky abstractions (eg by pretending that events archived on S3 are a "streaming" source).
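To make the "explicit hot vs cold" idea from the thread concrete, here is a minimal sketch in Python. It is not Summingbird's API or anyone's production design. The event shape, the function names (`aggregate`, `merge`), and the in-memory lists standing in for archived and recent events are all assumptions for illustration. The point is that one aggregation definition is reused on both paths, and the two partial results are combined with an associative merge, which is roughly how the "parallel implementation" problem gets sidestepped.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical event shape: (key, count). Not taken from the thread.
Event = Tuple[str, int]


def aggregate(events: Iterable[Event]) -> Counter:
    """Single definition of the job logic, reused by both the batch and streaming paths."""
    totals: Counter = Counter()
    for key, n in events:
        totals[key] += n
    return totals


def merge(cold: Counter, hot: Counter) -> Counter:
    """Combine partial results. Addition of counts is associative, so order of merging is irrelevant."""
    merged = Counter(cold)
    merged.update(hot)  # Counter.update adds counts rather than replacing them
    return merged


# Cold storage stand-in: archived events processed in batch (e.g. files on S3).
archived_events = [("page_a", 3), ("page_b", 1)]
# Hot storage stand-in: recent events not yet covered by the latest batch run.
recent_events = [("page_a", 2), ("page_c", 5)]

batch_view = aggregate(archived_events)
realtime_view = aggregate(recent_events)
serving_view = merge(batch_view, realtime_view)
```

The two stores stay visibly separate here, so the cheaper cold tier is never presented as a stream; only the merged view hides the split from readers.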