Little H's Chat (8): Big Data Computing Patterns

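Abstract: When people mention big data, MapReduce naturally comes to mind, which shows how influential it is. In practice, big data processing problems are complex and varied, and no single computing model can meet every type of computing need.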

Share interests, spread happiness, broaden horizons, and leave something beautiful behind!

Dear reader, this is LearningYard Academy.

Today's article is

"Little H's Chat (8): Big Data Computing Patterns"

Thank you for visiting.

1. Mind Map

2. Intensive Reading

When people mention big data, MapReduce naturally comes to mind, which shows how influential it is. In practice, big data processing problems are complex and varied, and no single computing model can meet every type of computing need.

MapReduce is just one big data computing model; it represents batch processing technology for large-scale data. Overall, big data computing models include batch computing, stream computing, graph computing, and query analysis computing.

Batch Computing

Batch computing handles the batch processing of large-scale data, one of the most common data processing needs in everyday data analysis. MapReduce is the most representative and influential big data batch processing technology. It executes large-scale data processing tasks in parallel and is intended for parallel computation over large datasets (larger than 1 TB).
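To make the batch model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases applied to word counting. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions made for illustration; a real MapReduce job would distribute these phases across a cluster rather than run them in one Python process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence (hypothetical driver function)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Illustrative input only; not data from the article.
lines = ["big data batch processing", "batch processing of big data sets"]
counts = word_count(lines)
print(counts)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can in principle run in parallel on different machines, which is the property the batch model relies on.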

Spark is a low-latency distributed cluster computing system for very large datasets. Spark introduced in-memory distributed datasets, which support interactive queries and also speed up iterative workloads. In MapReduce, data flows from a stable source through a series of processing steps and out to a stable file system (such as HDFS). Spark instead keeps intermediate results in memory rather than on HDFS or local disk, which is why it is often much faster than MapReduce.
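The difference in data flow can be sketched with a toy example. This is not the Spark or Hadoop API: `step`, `run_with_disk`, and `run_in_memory` are hypothetical names, and a temporary JSON file merely stands in for a stable file system. The sketch only shows that an iterative job chained through storage pays one write and one read per stage, while the in-memory version pays none.

```python
import json
import os
import tempfile


def step(values):
    """One processing stage; the transformation itself is a placeholder."""
    return [v * 2 + 1 for v in values]


def run_with_disk(values, iterations):
    """Disk-chained flow: every stage writes its output to storage,
    and the next stage reads it back before continuing."""
    disk_round_trips = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage.json")
        for _ in range(iterations):
            values = step(values)
            with open(path, "w") as f:
                json.dump(values, f)
            with open(path) as f:
                values = json.load(f)
            disk_round_trips += 1
    return values, disk_round_trips


def run_in_memory(values, iterations):
    """In-memory flow: intermediate results never leave memory."""
    for _ in range(iterations):
        values = step(values)
    return values, 0


data = [1, 2, 3]
disk_result, disk_trips = run_with_disk(data, 5)
mem_result, mem_trips = run_in_memory(data, 5)
print(disk_result == mem_result, disk_trips, mem_trips)
```

Both versions compute the same result; the saving grows with the number of iterations, which is why iterative workloads benefit most from keeping intermediate data in memory.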

Stream Computing

Stream data is another important data type in big data analysis. Stream data (or a data stream) is a sequence of dynamic data that is unbounded in both time and volume. Because the value of the data declines as time passes, it must be processed in real time, with responses delivered within seconds.

Stream computing processes continuously arriving stream data from different data sources in real time and, through real-time analysis, produces valuable results. Many stream computing frameworks and platforms have emerged in the industry, including IBM InfoSphere Streams, TIBCO StreamBase, Twitter Storm, Yahoo! S4, Spark Streaming, and Flink.
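One common way to analyze an unbounded stream is to cut it into fixed-size time windows and emit a result as soon as each window closes. The sketch below assumes events arrive in timestamp order as `(timestamp_seconds, key)` pairs; the function name `tumbling_window_counts` and the sample events are illustrative and do not come from any of the frameworks listed above.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, counts) each time a window closes.

    events: iterable of (timestamp_seconds, key), assumed to be in
    timestamp order. Windows that contain no events are not emitted.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        start = timestamp - timestamp % window_seconds
        if current_start is None:
            current_start = start
        if start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[key] += 1
    if counts:
        # Flush the last open window when the (finite) test stream ends.
        yield current_start, dict(counts)


# Illustrative events only; a real stream would never end.
events = [(0, "click"), (1, "view"), (2, "click"),
          (5, "view"), (6, "view"), (11, "click")]
results = list(tumbling_window_counts(events, 5))
print(results)
```

The generator holds only the counts for the window currently open, so memory use stays bounded no matter how long the stream runs.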

Graph Computing

In the era of big data, much data takes the form of large-scale graphs or networks, such as social networks, transmission routes of infectious diseases, and the impact of traffic accidents on road networks. In addition, big data that is not graph-structured is often converted into a graph model before processing and analysis.

Computation on large graphs therefore calls for a graph computing model, and a number of graph computing products have appeared. For example, Google's Pregel is a framework for distributed graph computing, used mainly for PageRank computation, shortest paths, and graph traversal. Other representative graph computing products include GraphX in the Spark ecosystem and Gelly in the Flink ecosystem.
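Pregel-style systems are usually described as vertex-centric: in each superstep, every vertex that received messages updates its own value and sends messages along its outgoing edges, and the computation stops when no messages remain. The following single-machine sketch applies that idea to single-source shortest paths. The function name `pregel_shortest_paths`, the dictionary-based graph format, and the sample graph are assumptions for illustration; this is not Pregel's actual API.

```python
import math


def pregel_shortest_paths(graph, source):
    """Vertex-centric shortest paths in the style of Pregel supersteps.

    graph: {vertex: {neighbor: edge_weight}}; every vertex, including
    those with no outgoing edges, must appear as a key.
    Returns {vertex: shortest distance from source}.
    """
    distance = {vertex: math.inf for vertex in graph}
    inbox = {source: [0]}  # messages to be delivered in the next superstep
    while inbox:
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                # The vertex improved its value, so it notifies its neighbors.
                distance[vertex] = best
                for neighbor, weight in graph[vertex].items():
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox  # superstep barrier: messages are delivered together
    return distance


# Illustrative graph only.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 6},
    "C": {"D": 3},
    "D": {},
}
distance = pregel_shortest_paths(graph, "A")
print(distance)
```

Each vertex reads only its own messages and writes only its own value, so in a distributed setting the vertices of one superstep can be processed on different workers.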

Query Analysis Computing

Storage management and query analysis over very large-scale data must deliver real-time or near-real-time responses to serve business operations and management well. Dremel, developed by Google, is a scalable, interactive, real-time query system for analyzing read-only nested data. It can complete aggregate queries over tables with trillions of rows within seconds.
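One reason aggregate queries over nested data can be fast is columnar storage: the values of each field are stored together, so a query scans only the fields it needs. The sketch below is a heavily simplified illustration of that idea. It omits the repetition and definition levels that Dremel uses to reconstruct records, and the names `to_columns` and `records`, along with the sample data, are assumptions made for the example.

```python
from collections import defaultdict


def to_columns(records):
    """Flatten nested records into one value list per dotted field path."""
    columns = defaultdict(list)

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, path + (key,))
        elif isinstance(value, list):
            # Repeated field: every element contributes to the same column.
            for child in value:
                walk(child, path)
        else:
            columns[".".join(path)].append(value)

    for record in records:
        walk(record, ())
    return dict(columns)


# Illustrative nested records only.
records = [
    {"doc_id": 1,
     "links": {"forward": [20, 40]},
     "visits": [{"country": "us", "ms": 120}, {"country": "gb", "ms": 80}]},
    {"doc_id": 2,
     "links": {"forward": [80]},
     "visits": [{"country": "us", "ms": 200}]},
]
columns = to_columns(records)

# An aggregate over one field touches only that field's column.
total_ms = sum(columns["visits.ms"])
visit_count = len(columns["visits.ms"])
print(total_ms, visit_count)
```

Because the data is read-only, the columns can be built once and then scanned by many queries without any coordination for updates.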

In addition, Cloudera developed the real-time query engine Impala, modeled on Dremel. It provides Structured Query Language (SQL) semantics and can quickly query petabyte-scale data stored in Hadoop's HDFS and HBase.

That's all for today's sharing.

If you have thoughts of your own on today's article,

please leave us a message.

See you tomorrow.

Have a happy day!

Copywriting | Little H

Layout | Little H