Batch vs. real-time:
Batch processing systems like Hadoop MapReduce paved the way for processing large volumes of data. Since then, product teams have been demanding to act on fresher data, faster. Companies building features like leaderboards and real-time personalization can't wait tens of minutes or hours to query data to understand what their customers are doing. Yet it's very common to dump streaming data from sources like Kafka into these batch processing systems: a data lake such as S3, or a data warehouse such as Redshift for more complex analytics. Engineers are squeezing these batch processing systems (balancing speed and performance with cost) to try to act on fresh data faster. But these batch processing technologies were never meant for use cases like real-time personalization: there's a mismatch.
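To make the freshness tradeoff concrete, here is a minimal sketch (hypothetical names, not any real pipeline's code) of the micro-batching pattern that sits behind most Kafka-to-data-lake dumps: records accumulate in a buffer and are only written to the sink in bulk once a batch fills up, so the newest events remain invisible to downstream queries until the next flush.

```python
class MicroBatcher:
    """Hypothetical sketch of the Kafka-to-data-lake dump pattern:
    buffer streaming records, then flush them in bulk, trading
    data freshness for cheaper, larger writes."""

    def __init__(self, flush_size, sink):
        self.flush_size = flush_size  # records per batch write
        self.sink = sink              # stand-in for an S3/Redshift bulk writer
        self.buffer = []

    def ingest(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.append(list(self.buffer))  # one bulk write per batch
            self.buffer.clear()


# Events are only queryable after their batch flushes; until then they are stale.
batches = []
b = MicroBatcher(flush_size=3, sink=batches)
b.ingest("click-1")
b.ingest("click-2")
print(batches)        # nothing flushed yet: downstream sees no events
b.ingest("click-3")   # third event fills the batch and triggers the bulk write
print(batches)
```

In real deployments the flush trigger is usually a file size or a time window measured in minutes, which is exactly the latency gap the question above is about: shrinking the batch helps freshness but multiplies write cost.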
In short, you’re losing the value of the real-time-ness that the data streams provide because, by the time you act on it with these batch processing systems, the data is tens of minutes old.
Do you agree/disagree that many companies are squeezing their batch processing system and losing the real-time-ness that the data streams provide? How are you handling your streaming data?