“Stream vs. Batch” is a false dichotomy



Oftentimes, "Stream vs. Batch" is discussed as if it’s one or the other, but to me this does not make that much sense, really.

Many streaming systems will apply batching too, i.e. processing or transferring multiple records (a "batch") at once, thus offsetting connection overhead, amortizing the cost of fanning out work to multiple threads, opening the door for highly efficient SIMD processing, etc., all to ensure high performance.

The prevailing trend towards storage/compute separation in data streaming and processing architectures (for instance, thinking of platforms such as WarpStream, and Diskless Kafka at large) further accelerates this development.

Typically, this happens transparently to users, in an opportunistic way: handling all of those records (up to some limit) which have arrived in a buffer since the last batch. This makes for a very nice self-regulating system.

High arrival rate of records: larger batches, improving throughput. Low arrival rate: smaller batches, perhaps with even just a single record, ensuring low latency. Columnar in-memory data formats like Apache Arrow are of great help for implementing such a design.
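
To make this concrete, here is a minimal sketch of such an opportunistic batching loop (plain Python, not tied to any particular engine, with made-up names): it drains whatever has accumulated in an in-memory buffer since the last batch, up to a configurable limit.

```python
import queue

def consume_in_batches(buffer: queue.Queue, process_batch, max_batch_size: int = 1024):
    """Opportunistic batching: hand off everything that has arrived in the
    buffer since the last batch (capped at max_batch_size) as one unit."""
    while True:
        # Block until at least one record is available ...
        batch = [buffer.get()]
        # ... then greedily drain whatever else is already waiting.
        while len(batch) < max_batch_size:
            try:
                batch.append(buffer.get_nowait())
            except queue.Empty:
                break
        process_batch(batch)
```

Nothing in this loop tunes a batch size explicitly; it simply adapts to the arrival rate, which is exactly what makes the pattern self-regulating.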

In contrast, what the "Stream vs. Batch" discussion in my opinion should actually be about are "Pull vs. Push" semantics: will the system query its sources for new records at a fixed interval, or will new records be pushed to the system as soon as possible?

Now, no matter how often you pull, you can’t convert a pull-based solution into a streaming one. Unless a source represents a consumable stream of changes itself (you see where this is going), a pull system may miss updates happening between fetch attempts, as well as deletes.
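
A contrived sketch of that difference, with fetch_snapshot, change_events, and the handler callbacks all being assumptions made up for illustration rather than any real API:

```python
import time

def pull_loop(fetch_snapshot, on_change, interval_s=60.0):
    """Pull: re-query the source at a fixed interval and diff against the
    previous snapshot. Updates happening between two fetches collapse into
    a single observed state, and deletes are only detectable at all because
    full snapshots are being compared; intermediate versions are gone for good."""
    previous = {}
    while True:
        current = fetch_snapshot()            # hypothetical: returns {key: row}
        changed = {k: v for k, v in current.items() if previous.get(k) != v}
        deleted = previous.keys() - current.keys()
        on_change(changed, deleted)
        previous = current
        time.sleep(interval_s)

def push_loop(change_events, on_event):
    """Push: every insert, update, and delete arrives as its own event
    (think of a change data capture feed), so nothing is lost in between."""
    for event in change_events:               # hypothetical iterable of change events
        on_event(event)
```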

This is what makes streaming so interesting and powerful: it provides you with a complete view of your data in real time. A streaming system lets you put your data into the location where you need it, in the format you need it, and in the shape you need it (think denormalization), immediately as it gets produced or updated.
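
As a toy illustration of the "shape you need it" point, denormalizing order events with customer data as they flow might look roughly like this (the event shapes and the sink are made up for this sketch, not any specific framework's API):

```python
def enrich_orders(tagged_events, sink):
    """Push-based denormalization: keep a local copy of customer state and
    emit a fully denormalized order document as each order event arrives."""
    customers = {}                            # customer_id -> latest customer record
    for kind, event in tagged_events:         # ("customer", {...}) or ("order", {...})
        if kind == "customer":
            customers[event["id"]] = event    # apply the change to local state
        elif kind == "order":
            customer = customers.get(event["customer_id"], {})
            # Write the denormalized shape straight to where it is needed,
            # e.g. a search index or cache, immediately on arrival.
            sink.append({**event, "customer_name": customer.get("name")})
```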

The price for this is a potentially higher complexity, for example when reasoning about streaming joins (and their state), or handling out-of-order data. But the streaming community is working continuously to improve things here, e.g. via disaggregated state backends, transactional stream processing, and much more. I’m really excited about all the innovation happening in this space right now.



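For example, a bare-bones way to deal with out-of-order data is to buffer events and only release them once an event-time watermark has passed them by a configured lateness bound; real engines layer state backends, checkpointing, and late-data policies on top of this idea. A rough sketch, assuming events arrive as (event_time, payload) pairs:

```python
import heapq
from itertools import count

def reorder_by_event_time(events, max_lateness):
    """Buffer out-of-order events in a min-heap keyed by event time and emit
    an event only once the watermark (highest event time seen so far) has
    advanced past it by max_lateness. Anything arriving later than that
    bound would be "late data" and needs a separate policy."""
    buffer, watermark, seq = [], float("-inf"), count()  # seq breaks ties so payloads are never compared
    for event_time, payload in events:
        watermark = max(watermark, event_time)
        heapq.heappush(buffer, (event_time, next(seq), payload))
        while buffer and buffer[0][0] <= watermark - max_lateness:
            ts, _, out = heapq.heappop(buffer)
            yield ts, out
    while buffer:                                         # flush the remainder when the input ends
        ts, _, out = heapq.heappop(buffer)
        yield ts, out
```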

Now, you might wonder: "Do I really need streaming (push), though? I’m fine with batch (pull)."

That’s a common and fair question. In my experience, it is best answered by giving it a try yourself. Again and again I have seen how folks who were skeptical at first very quickly wanted to get real-time streaming for more and more, if not all, of their use cases once they had seen it in action. If you’ve experienced a data freshness of a second or two in your data warehouse, you don’t ever want to miss this magic again.

All that being said, it’s actually not even about pull or push so much—the approaches complement each other. For instance, backfills often are done via batching, i.e. querying, in an otherwise streaming-based system. Also, if you want the completeness of streaming but don’t require a super low latency, you may decide to suspend your streaming pipelines (thus saving cost) in times of low data volume, resume when there’s new data to process, and halt again.
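
A hypothetical control loop for that suspend/resume pattern could look like the following; the pipeline and source objects and all of their methods are assumptions made for the sake of the sketch, not a real API:

```python
import time

def run_intermittently(pipeline, source, idle_timeout_s=300, check_interval_s=60):
    """Resume the streaming pipeline when the source has new data, let it
    drain the backlog, and suspend it again once it has gone idle, trading
    latency for compute cost while keeping the completeness of streaming."""
    while True:
        if source.has_new_data():                  # hypothetical check, e.g. consumer lag > 0
            pipeline.resume()
            # Let the pipeline catch up, then suspend once it has gone idle.
            while True:
                time.sleep(check_interval_s)
                if pipeline.seconds_since_last_record() >= idle_timeout_s:
                    break
            pipeline.suspend()                     # backlog drained: stop paying for compute
        time.sleep(check_interval_s)
```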

Batch streaming, if you will.
