I've been catching up on the last few years of data infrastructure lately and wrote up some notes, very curious what I missed and what I got wrong: https://lethain.com/from-lambda-to-kappa-dataflow-paradigms/ …
-
It's also worth thinking about the rise of the distributed OLAP column store (Redshift, BigQuery, Presto) as something that everyone can use vs only enterprises paying for Vertica. Feeding Parquet files to these becomes a major function of your data systems.
-
Finally: I remain attached to Lambda architecture at least for the reason that having a "hot storage" vs "cold storage" distinction built in can lead to big cost savings, and because I view the "parallel implementation" problem as largely solved by Summingbird etc.
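To make the hot vs cold split concrete, here is a rough sketch of how a Lambda-style query might merge the two layers. The class and method names (`LambdaCounts`, `ingest`, `run_batch`, `query`) are hypothetical and not taken from Summingbird or any other library, and the in-memory counters only stand in for real storage tiers.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LambdaCounts:
    """Toy Lambda-style counter (hypothetical): a cold batch view plus a hot view."""

    # Cold layer: recomputed periodically from the full archive (cheap storage).
    batch_view: Counter = field(default_factory=Counter)
    # Hot layer: only events that arrived since the last batch run (expensive storage).
    speed_view: Counter = field(default_factory=Counter)

    def ingest(self, key: str, n: int = 1) -> None:
        # New events land in the hot layer only.
        self.speed_view[key] += n

    def run_batch(self, archive: list[tuple[str, int]]) -> None:
        # Rebuild the batch view from the archive, then drop the hot layer.
        # Assumption: the archive already contains every event the hot layer saw.
        self.batch_view = Counter()
        for key, n in archive:
            self.batch_view[key] += n
        self.speed_view.clear()

    def query(self, key: str) -> int:
        # A query merges both layers at read time.
        return self.batch_view[key] + self.speed_view[key]


counts = LambdaCounts()
counts.run_batch([("page_a", 3), ("page_b", 1)])  # cold: archived events
counts.ingest("page_a")                           # hot: a fresh event
total_a = counts.query("page_a")
```

The cost argument is that the hot layer only ever holds events since the last batch run, so it can stay small even when the archive is large.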
-
Cost savings is an interesting point. Kappa seems like a superset of Lambda, so it could be recreated there with various tiers. On Beam/Flink, my naive sense is that Flink is interesting as a non-GCP-coupled runner for Beam, and otherwise only its native-streaming aspect stands out.
-
Sure, but "recreating Kappa with various tiers" feels like Lambda to me and I think I'd rather be explicit about it vs building leaky abstractions (e.g. by pretending that events archived on S3 are a "streaming" source).
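A rough sketch of the abstraction being debated, assuming a Kappa-style setup where one processing function consumes both archived and live events. `running_totals` and `replay_then_tail` are hypothetical names, and plain Python lists stand in for an S3 archive and a live topic.

```python
from itertools import chain
from typing import Iterable, Iterator

Event = tuple[str, int]


def running_totals(events: Iterable[Event]) -> Iterator[Event]:
    # The single "streaming" job: emits the running total per key.
    totals: dict[str, int] = {}
    for key, n in events:
        totals[key] = totals.get(key, 0) + n
        yield key, totals[key]


def replay_then_tail(archived: Iterable[Event], live: Iterable[Event]) -> Iterator[Event]:
    # Kappa-style replay: present the archive as if it were the head of the stream.
    # The job cannot tell which tier an event came from, which is where the
    # abstraction can leak (latency, ordering, and cost differ per tier).
    return chain(archived, live)


archived_events = [("a", 1), ("b", 2)]  # stand-in for events archived on S3
live_events = [("a", 3)]                # stand-in for the live stream
results = list(running_totals(replay_then_tail(archived_events, live_events)))
```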