Weekly discussion thread: Nov 1st-Nov 7th

Batch vs. real-time :stopwatch::
Batch processing systems like Hadoop MapReduce paved the way for processing large volumes of data. Since then, products have demanded action on ever-fresher data, faster. Companies building features like leaderboards and real-time personalization can't wait tens of minutes or hours to query data and understand what their customers are doing. Still, it's very common to dump streaming data from sources like Kafka into these batch systems (a data lake such as S3, or a data warehouse such as Redshift) to do more complex analytics. Engineers end up squeezing these batch systems, balancing speed and performance against cost, to try to act on fresh data faster. But those batch processing technologies were never built for use cases like real-time personalization; there's a mismatch.

In short, you lose the value of the real-time-ness that the data streams provide, because by the time you act on the data with these batch processing systems, it's already tens of minutes old.
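To make the pattern concrete, here's a minimal sketch of the dump-to-the-lake approach in Python (assuming the kafka-python and boto3 libraries; the topic, broker, and bucket names are hypothetical). Events only land in S3 once a buffer fills, so anything downstream always sees data that is at least one batch old:

```python
import json
import time

import boto3                      # AWS SDK for Python
from kafka import KafkaConsumer   # kafka-python client

# Hypothetical topic, broker, and bucket names for illustration.
consumer = KafkaConsumer(
    "click-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
s3 = boto3.client("s3")

BATCH_SIZE = 10_000
buffer = []

for message in consumer:
    buffer.append(message.value)
    if len(buffer) >= BATCH_SIZE:
        # One S3 object per batch. Downstream jobs (e.g. a warehouse
        # load) only see the data after this flush, so freshness is
        # bounded by how long the buffer takes to fill.
        key = f"events/{int(time.time())}.json"
        s3.put_object(
            Bucket="my-data-lake",
            Key=key,
            Body="\n".join(json.dumps(e) for e in buffer).encode("utf-8"),
        )
        buffer.clear()
```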

Do you agree or disagree that many companies are squeezing their batch processing systems and losing the real-time-ness that data streams provide? How are you handling your streaming data?

If there were no costs involved, fully real-time streaming ELT would be the ideal world for sure. Unfortunately, that's usually not the case today: streaming ELT gets slow quickly once you add any kind of complexity to your data transformations. With batch ELT, you can still use things like micro-batching to get closer to real-time data latency. But if we all had materialized views that didn't come with significant additional cost or overhead, that would definitely be ideal.
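For example, here's roughly what that micro-batching approach can look like with Spark Structured Streaming (a sketch, not a production job: it assumes the spark-sql-kafka connector is on the classpath, and the broker, topic, and bucket paths are hypothetical). The trigger interval, rather than an hourly batch schedule, becomes the latency bound:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-elt").getOrCreate()

# Read from Kafka as a stream (hypothetical broker/topic names).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "click-events")
    .load()
)

# Trigger a micro-batch every 30 seconds: data latency is now bounded
# by the trigger interval instead of a batch job's schedule.
query = (
    events.selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .format("parquet")
    .option("path", "s3a://my-data-lake/events/")
    .option("checkpointLocation", "s3a://my-data-lake/checkpoints/events/")
    .trigger(processingTime="30 seconds")
    .start()
)

query.awaitTermination()
```

Shrinking the trigger interval buys freshness at the cost of more, smaller files and more cluster time, which is exactly the cost/latency trade-off above.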
