多区域一致性:鱼与熊掌兼得!
https://www.synadia.com/blog/multi-cluster-consistency-models
摘要 Abstract
Using stretch clusters and virtual streams to improve latency and availability in global NATS deployments. Includes a walk-through and associated repository as a practical example.
本文介绍了如何利用扩展集群和虚拟流来提升全球 NATS 部署的延迟和可用性。文中包含详细的步骤说明和相关的代码库,以提供实际示例。
引言 Introduction
As field CTO at Synadia I have had the chance to work with some of the most interesting customer use cases for deploying NATS globally over multiple locations. Whether it’s the need for the data to be closer to the users to meet latency requirements, or the need to be resilient to a disaster such as a site or cloud provider regional outage, or even for regulatory requirements, many companies are looking to deploy their applications over multiple availability zones, sites, regions or cloud providers. And when you step into these kinds of geographically distributed deployments you need to worry about the distribution, replication and consistency of your data, both for ‘reads’ and ‘writes’.
作为 Synadia 的现场首席技术官,我有幸接触到一些非常有趣的客户用例,这些用例涉及在全球多个地点部署 NATS。无论是为了满足延迟要求,使数据更靠近用户;还是为了应对诸如站点或云提供商区域性中断等灾难,甚至是出于监管要求,许多公司都在寻求将其应用程序部署到多个可用区、站点、区域或云提供商。而当您进行这类地理分布式部署时,您需要关注数据的分布、复制和一致性,包括“读取”和“写入”操作。
In case you’re not familiar with consistency models in distributed data stores, just know that one of the ways distributed data stores can be classified is by their distributed consistency model: they can be either ‘eventually’ consistent or ‘immediately’ consistent. Both models have their advantages and inconveniences: immediately consistent systems can offer features such as distributed shared queuing for distribution of messages between consumers or ‘compare and set’ operations for concurrency access control that are not possible with eventually consistent systems. On the other hand, eventually consistent systems can offer lower latency and better availability.
如果您不熟悉分布式数据存储中的一致性模型,只需知道分布式数据存储的分类方式之一是根据其分布式一致性模型:它们可以是“最终一致性”的,也可以是“立即一致性”的。两种模型各有优缺点:立即一致性系统可以提供一些特性,例如用于在消费者之间分发消息的分布式共享队列,或用于并发访问控制的“比较并设置”操作,而这些特性在最终一致性系统中是无法实现的。另一方面,最终一致性系统可以提供更低的延迟和更高的可用性。
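As a concrete illustration of the kind of operation that requires immediate consistency, here is a minimal Go sketch (using the official nats.go client) of a compare-and-set update against a JetStream Key/Value bucket; the bucket name "config" and the key name are made up for the example.
为了具体说明哪类操作需要即时一致性,下面是一个使用官方 nats.go 客户端对 JetStream 键值存储桶执行"比较并设置"更新的最小 Go 示例草图;其中的存储桶名 "config" 和键名仅为示例假设。
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical bucket for the example; created here if it does not exist yet.
	kv, err := js.CreateKeyValue(&nats.KeyValueConfig{Bucket: "config", Replicas: 3})
	if err != nil {
		log.Fatal(err)
	}

	rev, _ := kv.Put("feature.flag", []byte("off"))

	// Compare-and-set: the update only succeeds if the key is still at revision `rev`.
	// This relies on there being a single, immediately consistent view of the bucket.
	if _, err := kv.Update("feature.flag", []byte("on"), rev); err != nil {
		fmt.Println("update rejected, the key was modified concurrently:", err)
	}
}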
This blog post will review the spectrum of options offered by NATS JetStream in terms of replication and consistency when deployed over multiple availability zones, sites, regions or cloud providers, and how you can have access to both eventual and immediate consistency at the same time.
本文将回顾 NATS JetStream 在跨多个可用区、站点、区域或云提供商部署时,在复制和一致性方面提供的各种选项,以及如何同时实现最终一致性和即时一致性。
What is described here applies just as well to a deployment over multiple data centers, multiple regions within the same cloud provider, or multiple cloud providers or any combination thereof, but for the sake of simplicity, we will use the term ‘regions’ (named ‘east’, ‘central’ and ‘west’ in the examples) in the rest of this blog post.
本文所述内容同样适用于跨多个数据中心、同一云提供商内的多个区域、多个云提供商或其任意组合的部署。为简便起见,本文后续部分将使用“区域”(示例中分别命名为“东部”、“中部”和“西部”)这一术语。
The concepts in this blog are purely about high availability and local data access and storage even in the face of disasters (entire regions going down or getting isolated from the other regions). If you need to extend the NATS JetStream service to the edge and to places where network connectivity is not always guaranteed, for example vehicles or mobile devices connecting over cellular networks, you should be looking at using Leaf Nodes instead.
本文主要探讨高可用性以及即使在灾难发生时(例如整个区域宕机或与其他区域隔离)也能实现本地数据访问和存储。如果您需要将 NATS JetStream 服务扩展到网络边缘以及网络连接无法始终得到保证的地区(例如通过蜂窝网络连接的车辆或移动设备),则应考虑使用叶节点。
集群式 JetStream Clustered JetStream
Within a cluster of NATS servers, JetStream offers immediate consistency using a RAFT voting protocol between the servers that replicate a stream. When a client application receives the publication acknowledgement it is assured that the message has been safely replicated to (and persisted by) the majority of the servers.
在 NATS 服务器集群中,JetStream 使用 RAFT 投票协议在复制流的服务器之间提供即时一致性。当客户端应用程序收到发布确认时,即可确保消息已安全复制到(并持久化)大多数服务器。
The replicas for a particular stream are picked from the set of JetStream enabled servers in the cluster. So for example, if you have a cluster of 9 servers only 3 of them will be involved in the message storing and RAFT voting of an R3 (3 replicas) stream. NATS’s location transparency means that the client application can be connected to any server in the cluster and still be able to publish and consume from the stream.
特定流的副本从集群中启用 JetStream 的服务器集合中选择。例如,如果您有一个包含 9 台服务器的集群,则只有其中 3 台服务器会参与 R3(3 个副本)流的消息存储和 RAFT 投票。NATS 的位置透明性意味着客户端应用程序可以连接到集群中的任何服务器,并且仍然能够发布和使用流。
JetStream allows you to control the placement of the stream replicas using placement tags. For example, you can enhance availability by placing your servers in different availability zones within the same region/data center. You can then ensure using stream placement tags that the stream doesn’t get placed on two servers in the same availability zone. You can also adjust the replication degree up and down at any time without interrupting the service to the stream, and even change the placement tags of the stream to move it to a different set of servers (also without interrupting the service).
JetStream 允许您使用放置标签来控制流副本的位置。例如,您可以通过将服务器放置在同一区域/数据中心的不同可用区来提高可用性。然后,您可以使用流放置标签来确保流不会放置在同一可用区中的两台服务器上。您还可以随时调整复制程度,而不会中断流的服务,甚至可以更改流的放置标签,将其移动到不同的服务器组(也不会中断服务)。
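As a sketch of what this looks like from a client, the following Go snippet (nats.go, assuming an established connection nc) creates an R3 stream constrained by a placement tag and then raises the replication factor with an in-place update; the stream name "EVENTS", its subjects, and the tag "cloud:aws" are assumptions for the example.
作为客户端视角的示例草图,下面的 Go 片段(nats.go,假设已有连接 nc)创建一个受放置标签约束的 R3 流,然后通过原地更新提升其副本数;流名 "EVENTS"、主题以及标签 "cloud:aws" 均为示例假设。
// Assumes nc *nats.Conn is an established connection.
js, _ := nc.JetStream()

// Create a 3-replica stream; the placement tag restricts which JetStream
// servers are eligible to host the replicas.
_, err := js.AddStream(&nats.StreamConfig{
	Name:      "EVENTS",
	Subjects:  []string{"events.>"},
	Replicas:  3,
	Storage:   nats.FileStorage,
	Placement: &nats.Placement{Tags: []string{"cloud:aws"}},
})
if err != nil {
	log.Fatal(err)
}

// The replication factor (and the placement) can later be changed in place,
// without interrupting the service to the stream.
_, err = js.UpdateStream(&nats.StreamConfig{
	Name:      "EVENTS",
	Subjects:  []string{"events.>"},
	Replicas:  5,
	Storage:   nats.FileStorage,
	Placement: &nats.Placement{Tags: []string{"cloud:aws"}},
})
if err != nil {
	log.Fatal(err)
}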

Drawing of a cluster of 3 servers spanning 3 availability zones in a cloud region.
图中展示了一个包含 3 台服务器、跨越云区域中 3 个可用区的集群。
Multi-Cluster JetStream 多集群 JetStream
When you want to extend the JetStream system to multiple cloud providers/regions/data centers, you can use the JetStream Gateway feature to create a Super-Cluster. This feature allows you to connect clusters together such that you would have one cluster per cloud provider/region/data center. The location transparency of NATS and JetStream still applies in Super-Clusters: a client application can be connected to any server in any cluster and still transparently be able to publish and consume from the streams regardless of where the stream’s replicating servers are located.
当您想要将 JetStream 系统扩展到多个云提供商/区域/数据中心时,可以使用 JetStream Gateway 功能创建超级集群。此功能允许您将多个集群连接在一起,从而为每个云提供商/区域/数据中心创建一个集群。NATS 和 JetStream 的位置透明性在超级集群中仍然适用:客户端应用程序可以连接到任何集群中的任何服务器,并且无论流的复制服务器位于何处,客户端应用程序仍然可以透明地发布和使用流。

Drawing of a Super-Cluster spanning 3 regions.
跨越三个区域的超级集群示意图。
超级集群中的操作延迟 Latency of operations in a Super-Cluster
This location transparency of NATS Super-Clusters is however still subject to the laws of physics and network latencies: operations on a stream located in a different cluster will have higher latency than operations on a stream located in the same cluster as the client application.
NATS 超级集群的这种位置透明性仍然受物理定律和网络延迟的影响:对位于不同集群中的数据流进行操作的延迟将高于对与客户端应用程序位于同一集群中的数据流进行操作的延迟。
读取操作 Read operations
JetStream also has built-in mirroring or sourcing between streams: a stream can either mirror all the messages (or a subset, using subject-based filtering) from a single stream (in which case the message sequence numbers are preserved), or it can source from one or more streams (in which case the message sequence numbers are not preserved), for example to aggregate between streams. This mirroring/sourcing is done in a reliable 'store and forward' manner, meaning that sourcing/mirroring streams (i.e. the nodes replicating the streams) can be shut down or disconnected from the source/mirror for a period of time and will automatically catch up on any messages they may have missed.
JetStream 还内置了数据流之间的镜像或源数据共享功能:一个数据流可以镜像来自单个数据流的所有消息(或使用基于主题的过滤来镜像子集)(在这种情况下,消息序列号将被保留),也可以从一个或多个数据流获取消息(在这种情况下,消息序列号不会被保留),例如用于在不同数据流之间进行聚合。这种镜像/源数据共享以可靠的“存储转发”方式完成,这意味着源数据流/镜像数据流(即复制数据流的节点)可以关闭或与源数据流/镜像数据流断开连接一段时间,并且会自动捕获它们可能错过的任何消息。
Beyond controlling the placement of stream replicas within a cluster, placement tags also allow you to control the placement of stream and replicas across clusters. You can specify which cluster a stream should be located in (e.g. a stream containing PII for European users can be set to be located in a cluster in the EU), and even change the placement tags of an existing stream to move it to a different cluster, without interrupting the service.
除了控制流副本在集群内的位置之外,放置标签还允许您控制流及其副本在不同集群之间的放置。您可以指定流应位于哪个集群(例如,可以将包含欧洲用户个人身份信息 (PII) 的流设置为位于欧盟的集群中),甚至可以更改现有流的放置标签,将其移动到不同的集群,而不会中断服务。
In this mode of deployment you can have a stream located in one regional cluster and create mirrors of this stream in other regional clusters, which is the classic way to scale and provide faster read access to the client applications by having them use the mirror stream of the clusters they are connected to (this happens automatically for KV get() operations), at the expense of a certain amount of 'incoherence' which is unavoidable any time any kind of 'cache' (in this case the mirror stream) is used. This 'eventual coherency' is due to the fact that it takes a non-zero (though typically very small, but could be longer in the case of network or hardware outages) amount of time for the mirrors to be updated with new message additions/deletions in the stream being mirrored. It is sometimes conflated with the term 'eventual consistency' but technically it is not the same thing: the 'writes' happen only on the (immediately consistent) origin stream, therefore they are serialized and there is only one view of the stream at any given time, and the mirrors are eventually coherent with the origin stream. This is different from an eventually consistent system where the 'writes' can happen at the same time in different regions and the system has to deal with the fact that there can be multiple views of the data (e.g. in a different order) at any given time.
在这种部署模式下,您可以将流放置在一个区域集群中,并在其他区域集群中创建该流的镜像。这是扩展和为客户端应用程序提供更快读取访问的经典方法,方法是让客户端应用程序使用其连接的集群的镜像流(对于 KV get() 操作,此过程会自动完成)。但代价是,当使用任何类型的"缓存"(在本例中为镜像流)时,都会不可避免地出现一定程度的"不连贯性"。这种"最终连贯性"源于这样一个事实:镜像需要一段非零的时间(通常很短,但在网络或硬件故障的情况下可能会更长)才能同步被镜像流中新增/删除的消息。它有时与"最终一致性"这一术语混为一谈,但严格来说两者并不相同:写入操作仅发生在(即时一致的)源流上,因此它们是串行的,在任何给定时间都只有一个流视图,镜像最终会与源流保持连贯。这与最终一致性系统不同,在最终一致性系统中,写入操作可以同时发生在不同的区域,系统必须处理在任何给定时间可能存在多个数据视图(例如,顺序不同)的情况。
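For example, here is a hedged Go sketch (nats.go, assuming an established JetStream context js as in the earlier snippet) of an origin stream placed in the east cluster and a mirror of it placed in the west cluster; the stream names and cluster names are assumptions matching this post's example regions.
例如,下面是一个 Go 草图(nats.go,假设已有前文片段中的 JetStream 上下文 js):源流放置在 east 集群,其镜像放置在 west 集群;流名和集群名均为与本文示例区域对应的假设。
// Origin stream, placed in the 'east' regional cluster.
_, err := js.AddStream(&nats.StreamConfig{
	Name:      "ORDERS",
	Subjects:  []string{"orders.>"},
	Replicas:  3,
	Placement: &nats.Placement{Cluster: "east"},
})
if err != nil {
	log.Fatal(err)
}

// Read-only mirror kept in the 'west' cluster. It preserves sequence numbers
// and catches up automatically ('store and forward') after an outage.
_, err = js.AddStream(&nats.StreamConfig{
	Name:      "ORDERS-MIRROR-WEST",
	Replicas:  3,
	Placement: &nats.Placement{Cluster: "west"},
	Mirror:    &nats.StreamSource{Name: "ORDERS"},
})
if err != nil {
	log.Fatal(err)
}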
写入操作 Write operations
Deploying mirrors helps scale and provides low latency for read operations. It does not, however, help scaling or provide high availability between the regions when it comes to 'writes': the origin stream is located on a cluster that is in a single region, and if that region goes down entirely, while other regions can still read from their mirrors of the stream, client applications will not be able to write to the stream until the region comes back up.
部署镜像有助于扩展并降低读取操作的延迟,但对于写入操作而言,它并不能帮助扩展或提供区域间的高可用性:源流位于单个区域的集群上,如果该区域完全宕机,而其他区域仍然可以从其镜像读取流,则客户端应用程序将无法写入流,直到该区域恢复运行。
即时一致的多区域扩展集群 Immediately Consistent Multi-region Stretch Clusters
When you need immediate consistency between regions, regardless of any particular region going down, you can still do that with JetStream thanks to its implementation of RAFT, which works even between regions. This is done by creating a 'stretch' cluster: a cluster is called 'stretched' when its nodes are all located in different regions. In order to be able to create a stretch cluster you need at least 3 regions. You then add this stretch cluster to your existing super-cluster (one cluster per region) and use stream placement tags to create streams that are stored in the stretch cluster. Those stretched streams will be immediately consistent between regions, at the expense of much higher latency for synchronous operations on them. They will also be highly available as long as only one of the regions goes down, and if you stretch to 5 regions with a stream replication factor of 5 then you can survive two regions going down. Note that much higher latency doesn't necessarily mean much lower throughput (assuming there's enough bandwidth), as long as your applications can leverage asynchronous publish operations.
当您需要在不同区域之间保持即时一致性时,即使某个特定区域发生故障,JetStream 也能凭借其 RAFT 实现实现这一点,RAFT 即使在区域间也能正常工作。这可以通过创建“扩展”集群来实现。当集群节点全部位于不同的区域时,集群就被称为扩展集群,因为它“跨越”了这些区域。要创建扩展集群,您至少需要 3 个区域。然后,您可以将此扩展集群添加到现有的超级集群(每个区域一个集群),并使用流放置标签创建存储在扩展集群中的流。这些扩展流将在不同区域之间保持即时一致性,但代价是同步操作的延迟会显著增加。只要只有一个区域发生故障,它们就能保持高可用性;如果您扩展到 5 个区域并进行 5 倍流复制,则可以承受两个区域的故障。请注意,更高的延迟并不一定意味着更低的吞吐量(假设带宽足够),只要您的应用程序能够利用异步发布操作即可。
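Since write latency to a stretched stream is dominated by the inter-region RTT, keeping throughput up relies on having many publishes in flight at once. Below is a minimal Go sketch of asynchronous publishing with nats.go (the subject and connection nc are assumptions for the example).
由于向扩展流写入的延迟主要由区域间 RTT 决定,保持吞吐量依赖于让大量发布操作同时在途。下面是使用 nats.go 进行异步发布的最小 Go 草图(其中的主题和连接 nc 为示例假设)。
// Assumes nc *nats.Conn is an established connection.
js, _ := nc.JetStream(nats.PublishAsyncMaxPending(512))

for i := 0; i < 10000; i++ {
	// Returns immediately; the acknowledgement arrives later, once the message
	// has been committed by a majority of the stretched stream's replicas.
	if _, err := js.PublishAsync("orders.created", []byte(fmt.Sprintf("order %d", i))); err != nil {
		log.Fatal(err)
	}
}

// Wait for all outstanding acknowledgements (or give up after a timeout).
select {
case <-js.PublishAsyncComplete():
case <-time.After(10 * time.Second):
	log.Fatal("did not receive all publish acks in time")
}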
You can combine this with the mirroring/sourcing feature to create mirrors of the streams on the stretch cluster into the regional clusters in order to have low read latency, but the latency on write operations will always be proportional to the latency between the regions.
您可以结合镜像/源功能,将扩展集群上的流镜像到区域集群,从而降低读取延迟。但写入操作的延迟始终与区域间的延迟成正比。
In practice, network conditions such as latency, packet loss, and bandwidth will dictate the limits of the applicability of a stretch cluster: if the RTT latencies between regions are high then operations will take much longer to complete and some client applications may start timing out waiting on synchronous calls. If the connectivity between the regions continuously changes, it could temporarily affect the stream's availability as well, since at least 2 out of the 3 nodes must be up and reachable for RAFT votes to succeed.
实际上,延迟、丢包率和带宽等网络状况将决定扩展集群的适用范围。如果区域间的往返时间 (RTT) 延迟较高,则操作完成时间会大大延长,某些客户端应用程序可能会因等待同步调用而超时。如果区域间的连接持续变化,可能会暂时影响流的可用性。此外,RAFT 投票必须至少有 2 个节点处于运行状态且可达,才能成功。
实际案例 Real-world example
With the proper choice of your ‘regions’ and provisioning of the network connection between them, you can still get pretty good latency of write and read operations on those stretched streams. If you are interested in the details of an actual production implementation of a stretch cluster spanning multiple cloud providers where the P99 write latency under load is < 20ms, you can view Derek Collison’s in-depth talk at the P99 conference.
通过合理选择"区域"并配置好区域间的网络连接,您仍然可以在这些扩展流上获得相当不错的读写延迟。如果您想了解一个跨多个云提供商的扩展集群的实际生产实现细节(其负载下的 P99 写入延迟小于 20 毫秒),可以观看 Derek Collison 在 P99 大会上的深入演讲。

Drawing of a Super-Cluster containing a stretch cluster spanning 3 regions.
包含跨越三个区域的扩展集群的超级集群示意图
最终一致的多区域“虚拟流” Eventually Consistent Multi-region ‘virtual stream’
While immediate consistency is the highest quality of service, there are many use cases where you know from the business or application logic that it is not needed. Basically, if you know it is not possible for the same ‘key’ to be modified from two different places (regions) at the same time (while there is an outage or network split) then you do not need immediate consistency.
虽然即时一致性是最高服务质量,但在许多用例中,根据业务或应用程序逻辑,您知道并不需要即时一致性。基本上,如果您知道在发生中断或网络分裂的情况下,同一个“键”不可能同时在两个不同的位置(区域)被修改,那么您就不需要即时一致性。
Thanks to new JetStream features introduced in version 2.10 you can now create a 'virtual stream' that is globally distributed (i.e. to all regions), meaning that client applications can transparently publish to and read from it with the low latency of interacting with a local (to the region) stream regardless of the region they are connected to, while still retaining eventual consistency between the regions, with the single caveat that global ordering of the messages on that virtual stream is not guaranteed.
得益于 JetStream 2.10 版本中引入的新功能,您现在可以创建一个全局分布(即覆盖所有区域)的“虚拟流”。这意味着客户端应用程序可以透明地发布和读取消息,其延迟与本地(区域)流的交互一样低,而无需考虑它们连接到哪个区域。同时,仍然保持区域间的最终一致性,但需要注意的是,无法保证该虚拟流上消息的全局顺序。
I say the stream is ‘virtual’ because unbeknownst to the client applications they are interacting with a number of streams (two per region) that source from each other.
我说这个流是"虚拟的",是因为客户端应用程序并不知道它们实际上是在与多个流(每个区域两个)交互,而这些流彼此之间相互获取数据(sourcing)。
工作原理 How it works
At a very high level, in each region there is a ‘write stream’ and a ‘read stream’, and the read streams source from the write streams.
从宏观层面来看,每个区域都有一个“写入流”和一个“读取流”,读取流源自写入流。
The client applications publish to the write stream and read from the read stream for the region they are connected to, this happens transparently for the application using the Core NATS subject mapping and transformation feature, which (as of 2.10) can also be cluster-scoped.
客户端应用程序向其所连接的区域的写入流发布消息,并从读取流读取消息。对于应用程序而言,这一切都是透明的,这得益于 Core NATS 的主题映射和转换功能(自 2.10 版本起,该功能还可以应用于集群范围)。
While this is conceptually very simple, the actual implementation is a little bit complicated by the fact that Core NATS messages flow freely between the clusters in a Super-Cluster (and between Leaf Nodes unless some kind of filtering is applied at the authorization level), combined with the fact that you can not have more than one stream listening on the same subject. Also, you can not create ‘loops’ in the sourcing between streams (i.e. stream A sources from stream B and stream B sources from stream A).
虽然概念上非常简单,但实际实现却略显复杂。这是因为 Core NATS 消息在超级集群中的各个集群之间(以及叶节点之间,除非在授权层应用某种过滤)可以自由流动,而且不能有多个流监听同一个主题。此外,流之间的源信息也不能形成“循环”(例如,流 A 源自流 B,而流 B 又源自流 A)。
So how is this possible? By using a number of the new features introduced in NATS version 2.10:
那么,这一切是如何实现的呢?通过使用 NATS 2.10 版本中引入的多项新功能:
- The introduction of subject mapping and transformation features at the stream level (i.e. as part of the stream definition level as opposed to the Core NATS account level).
  在流级别(即流定义级别,而非 NATS 核心账户级别)引入了主题映射和转换功能。
- The existing Core NATS subject mapping and transformation has been extended with the ability to define 'cluster-scoped' mappings and transformations.
  现有的 NATS 核心主题映射和转换功能已得到扩展,能够定义"集群范围"的映射和转换。
- The relaxation of some of the stream sourcing and subject mapping and transformation rules, including allowing the dropping of a wildcard subject token as part of the transformation (unless the mapping is part of a cross-account import/export).
  放宽了部分流源和主题映射及转换规则,包括允许在转换过程中丢弃通配符主题标记(除非映射是跨账户导入/导出的一部分)。
写入虚拟流 Writes to the virtual stream
For each region, there is a ‘write’ stream located in that regional cluster that captures the messages published on subjects prepended with a subject token designating the region this stream is servicing. The stream listens to subjects that contain a token identifying the region.
每个区域都有一个位于该区域集群中的“写入”流,用于捕获发布在主题上的消息,这些主题以指定该流所服务的区域的主题标记开头。该流监听包含标识该区域的标记的主题。
For a simple example for a virtual stream foo capturing messages published on subjects matching foo.> (i.e. any subject starting with the token foo), in the region west, the write stream could be called foo-write-west and listen on foo.west.> (you can change the order of the subject tokens and use wildcards to suit your needs).
例如,假设有一个名为 foo 的虚拟流,用于捕获发布在匹配 foo.> 的主题上的消息(即任何以标记 foo 开头的主题),在 west 区域中,写入流可以命名为 foo-write-west,并监听 foo.west.>(您可以根据需要更改主题标记的顺序并使用通配符)。
Once you have done that in all your regions you can do a JetStream publish (from anywhere) on a subject matching foo.west.> and it will be persisted in the write stream in region west. But that means the client application has to know which region it is connected to in order to know which subject to publish to. This can be remediated by setting up some Core NATS subject mappings (which are defined at the account level) and defining a cluster-scoped subject mapping per region, such that in our example there is a subject mapping from foo.> to foo.west.> that applies only for cluster west, which means that any application connected to the west cluster that publishes a message on a subject starting with foo will transparently have the same effect as if it had published it on a subject starting with foo.west.
在所有区域中完成此操作后,您可以从任何位置通过 JetStream 发布一条主题匹配 foo.west.> 的消息,该消息将被持久化到 west 区域的写入流中。但这意味着客户端应用程序必须知道它连接到哪个区域,才能知道要发布到哪个主题。这可以通过设置一些核心 NATS 主题映射(在帐户级别定义)并为每个区域定义集群范围的主题映射来解决:在我们的示例中,存在一个从 foo.> 到 foo.west.> 的主题映射,该映射仅适用于 west 集群,这意味着连接到 west 集群的任何应用程序发布以 foo 开头的主题的消息,其效果与发布以 foo.west 开头的主题完全相同。
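In client code this is just an ordinary JetStream publish on the un-prefixed subject; the cluster-scoped mapping adds the region token before the message reaches the regional 'write' stream. A minimal Go sketch (nats.go), assuming the mappings and streams from the walkthrough further below are in place:
在客户端代码中,这只是一次对不带前缀主题的普通 JetStream 发布;集群范围的映射会在消息到达区域"写入"流之前加上区域标记。下面是最小的 Go 草图(nats.go),假设已按下文演练配置好映射和流:
// Assumes nc *nats.Conn is connected to any server in any region.
js, _ := nc.JetStream()

// The application publishes on foo.test; on the west cluster the account-level
// mapping rewrites this to foo.west.test, so the PubAck should come back from
// the foo-write-west stream.
ack, err := js.Publish("foo.test", []byte("Hello world"))
if err != nil {
	log.Fatal(err)
}
fmt.Printf("stored in stream %s at sequence %d\n", ack.Stream, ack.Sequence)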
最终将写入操作复制到虚拟流 Eventually replicate the writes to the virtual stream
The second set of streams underlying the virtual stream are the ‘read’ streams, which source the ‘write’ streams, and strip the token indicating the region of origin from the subject.
虚拟流底层的第二组流是“读取”流,它们作为“写入”流的源,并从主题中移除指示来源区域的标记。
So using the same simple example on region ‘west’ there would be a stream foo-read-west that doesn’t listen to any subjects and sources from the stream foo-write-east, foo-write-central and foo-write-west and then strips the region name token by applying a subject transform from foo.*.> to foo.> (i.e. dropping the second token of the subject name). This means that the messages in the ‘read’ streams are under subjects starting with foo, the same subject the publishing application used (you can still tell which region the message was published in from a message header).
因此,以“west”区域为例,会存在一个名为 foo-read-west 的流,它不监听任何主题,并从 foo-write-east、foo-write-central 和 foo-write-west 流中获取消息,然后通过应用主题转换(从 foo.*.> 到 foo.>,即删除主题名称的第二个标记)来移除区域名称标记。这意味着“读取”流中的消息位于以 foo 开头的主题下,与发布应用程序使用的主题相同(您仍然可以从消息头中判断消息发布到哪个区域)。
Because of the reliable store-and-forward stream sourcing mechanism, you are ensured that all the ‘read’ streams will eventually contain all of the messages published on all of the ‘write’ streams, although not necessarily in the same order.
由于采用了可靠的存储转发流源机制,可以确保所有“读取”流最终都会包含所有“写入”流上发布的所有消息,尽管顺序不一定相同。
从虚拟流读取数据 Reading from the virtual stream
Except for streams where the 'direct get' option is enabled (e.g. KV buckets), where direct get operations are automatically directed to any of the nodes within the local cluster replicating a mirror of the stream, a client application that wants to interact with a locally mirrored or sourced stream needs to know the name of the local stream, which means that it needs to know which region it is connected to. Avoiding this constraint is just like transparently dealing with publications to the virtual stream and can be done by setting a few (cluster-scoped) subject mapping transformations for the account at the Core NATS level.
除了启用了“直接获取”选项的流(例如 KV 存储桶)之外,如果客户端应用程序想要与本地镜像或源流交互,则需要知道本地流的名称,这意味着它需要知道自己连接到哪个区域。避免此限制与透明地处理发布到虚拟流的操作类似,可以通过在 Core NATS 级别为帐户设置一些(集群范围的)主题映射转换来实现。
Besides the aforementioned direct get requests, the way client applications 'read' (or consume) messages from a stream is through creating JetStream consumers (shared or not), and that is implemented over a number of JetStream API subjects which (unless JS domains are used) start with $JS.API and also contain either a stream name or a consumer name as a token of that subject. Requests to create consumers on a stream foo can therefore be transparently transformed into requests to create consumers on the local foo-read-<region> stream instead.
除了上述直接获取请求之外,客户端应用程序从流中"读取"(或消费)消息的方式是创建 JetStream 消费者(共享或非共享),这是通过多个 JetStream API 主题实现的,这些主题(除非使用 JS 域)以 $JS.API 开头,并且包含流名称或消费者名称作为该主题的令牌。这样,在流 foo 上创建消费者的请求就可以被透明地转换为在本地 foo-read-<region> 流上创建消费者的请求。
So for example: define a cluster-scoped subject mapping from "$JS.API.CONSUMER.CREATE.foo.*" to "$JS.API.CONSUMER.CREATE.foo-read-west.{{wildcard(1)}}" on cluster west such that any application connected to that cluster and creating a consumer on stream foo will create a consumer on stream foo-read-west.
例如:在集群 west 上定义一个集群范围的主题映射,将 "$JS.API.CONSUMER.CREATE.foo.*" 映射到 "$JS.API.CONSUMER.CREATE.foo-read-west.{{wildcard(1)}}",这样,任何连接到该集群并在流 foo 上创建消费者的应用程序,都会在流 foo-read-west 上创建一个消费者。
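From the application's point of view this is then just a consumer on stream foo. Here is a hedged Go sketch (nats.go, assuming an established connection nc) of a durable pull consumer bound to the virtual stream name; note that, depending on the client version, additional $JS.API subjects (for example CONSUMER.INFO) may also need cluster-scoped mappings beyond the ones shown in the walkthrough below.
从应用程序的角度看,这之后就只是在流 foo 上创建一个消费者。下面是绑定到虚拟流名的持久拉取消费者的 Go 草图(nats.go,假设已有连接 nc);请注意,取决于客户端版本,除下文演练所示的映射之外,可能还需要为其他 $JS.API 主题(例如 CONSUMER.INFO)添加集群范围的映射。
// Assumes nc *nats.Conn; the consumer create request goes out on
// $JS.API.CONSUMER.*.foo.* and is rewritten by the cluster-scoped mappings
// to target the local foo-read-<region> stream instead.
js, _ := nc.JetStream()

sub, err := js.PullSubscribe("foo.>", "processor", nats.BindStream("foo"))
if err != nil {
	log.Fatal(err)
}

msgs, err := sub.Fetch(10, nats.MaxWait(2*time.Second))
if err != nil {
	log.Fatal(err)
}
for _, m := range msgs {
	fmt.Println(string(m.Data))
	m.Ack()
}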
虚拟流的局限性 What you can NOT do with a virtual stream
- The retention policy of the virtual stream can not be a 'working queue' or 'interest' (i.e. only 'limits').
  虚拟流的保留策略不能是"工作队列"或"兴趣"(即只能是"限制")。
- It does not work for KV buckets unless you know that you are not modifying the same key at the same time (or during a split brain) from two different regions.
  除非您确定不会同时(或在脑裂期间)从两个不同的区域修改同一个键,否则它不适用于键值存储桶。
- Stream consumers are 'per region', meaning you do not have a global named durable consumer on the virtual stream, but multiple regional ones.
  流消费者是"按区域"分配的,这意味着虚拟流上没有全局命名的持久消费者,而是有多个区域消费者。
- Deleting individual messages from the virtual stream is not possible: the delete operation will only apply to the local 'read' stream and is not propagated (and neither should it be, as the message sequence numbers are not homogeneous between regions).
  无法从虚拟流中删除单个消息:删除操作只会应用于本地"读取"流,不会传播(也不应该传播,因为消息序列号在不同区域之间不一致)。
- Compare-and-set operations are not possible on the virtual stream, as you would be reading from one stream and writing to another and the message sequence numbers are not preserved between them.
  无法在虚拟流上执行比较并设置操作,因为您会从一个流读取数据并写入另一个流,而消息序列号在它们之间不会保留。
实际演练 Walkthrough
In this example, we’re going to walk through setting up a local Super-Cluster and creating a virtual stream ‘foo’. Make sure to install (or upgrade to) the latest version of the NATS server and of the nats CLI tool on your local machine.
在这个示例中,我们将逐步介绍如何设置本地超级集群并创建虚拟流“foo”。请确保在本地计算机上安装(或升级到)最新版本的 NATS 服务器和 nats CLI 工具。
git clone https://github.com/synadia-labs/eventually-consistent-virtual-global-stream.git
Visit: the GitHub repository
访问:这个 GitHub 仓库
设置 The setup
This walkthrough will create and start locally a total of 9 nats-servers organized in 3 clusters (east, central and west) of 3 nodes each, interconnected as a Super-Cluster. Once those servers are started it will create all of the ‘read’ and ‘write’ streams for all 3 regions.
本教程将在本地创建并启动总共 9 台 NATS 服务器,这些服务器分为 3 个集群,分别位于东部、中部和西部,每个集群由 3 个节点组成,并互连成一个超级集群。服务器启动后,将为所有 3 个区域创建所有“读取”和“写入”流。
You will then be able to play with the virtual stream foo using nats by connecting to the different local clusters and publishing to or reading from the (virtual) stream foo as if it were a single globally replicated stream.
之后,您可以使用 NATS 连接到不同的本地集群,并使用(虚拟)流 foo 进行操作,例如发布或读取数据,就像使用单个全局复制流一样。
服务器配置 Server configurations
The individual server configuration files are straightforward. Each server establishes route connections to its 2 other peers in the cluster, and the clusters are connected via gateway connections. In this example, all of the individual server’s configuration files import a single mappings.cfg file containing all of the Core NATS account level subject mapping transforms, which in this case are all cluster-scoped. If you were running your servers in the ‘operator’ security mode, those mappings would be stored (in the account resolver) as part of the account(s) JWT(s) instead.
各个服务器的配置文件非常简单。每台服务器都与其集群中的其他 2 个对等节点建立路由连接,集群之间通过网关连接。在本例中,所有服务器的配置文件都导入一个包含所有核心 NATS 帐户级别主题映射转换的 mappings.cfg 文件,这些转换在本例中均为集群范围。如果您的服务器运行在“操作员”安全模式下,则这些映射将作为帐户 JWT 的一部分存储在帐户解析器中。
mappings = {
"foo.>":[
{destination:"foo.west.>", weight: 100%, cluster: "west"},
{destination:"foo.central.>", weight: 100%, cluster: "central"},
{destination:"foo.east.>", weight: 100%, cluster: "east"}
],
"$JS.API.STREAM.INFO.foo":[
{destination:"$JS.API.STREAM.INFO.foo-read-west", weight: 100%, cluster: "west"},
{destination:"$JS.API.STREAM.INFO.foo-read-central", weight: 100%, cluster: "central"},
{destination:"$JS.API.STREAM.INFO.foo-read-east", weight: 100%, cluster: "east"}
],
"$JS.API.CONSUMER.DURABLE.CREATE.foo.*":[
{destination:"$JS.API.CONSUMER.DURABLE.CREATE.foo-read-west.{{wildcard(1)}}", weight: 100%, cluster: "west"},
{destination:"$JS.API.CONSUMER.DURABLE.CREATE.foo-read-central.{{wildcard(1)}}", weight: 100%, cluster: "central"},
{destination:"$JS.API.CONSUMER.DURABLE.CREATE.foo-read-east.{{wildcard(1)}}", weight: 100%, cluster: "east"}
],
"$JS.API.CONSUMER.CREATE.foo.*":[
{destination:"$JS.API.CONSUMER.CREATE.foo-read-west.{{wildcard(1)}}", weight: 100%, cluster: "west"},
{destination:"$JS.API.CONSUMER.CREATE.foo-read-central.{{wildcard(1)}}", weight: 100%, cluster: "central"},
{destination:"$JS.API.CONSUMER.CREATE.foo-read-east.{{wildcard(1)}}", weight: 100%, cluster: "east"}
],
"$JS.API.STREAM.MSG.GET.foo":[
{destination:"$JS.API.STREAM.MSG.GET.foo-read-west", weight: 100%, cluster: "west"},
{destination:"$JS.API.STREAM.MSG.GET.foo-read-central", weight: 100%, cluster: "central"},
{destination:"$JS.API.STREAM.MSG.GET.foo-read-east", weight: 100%, cluster: "east"}
],
"$JS.API.STREAM.MSG.DIRECT.foo":[
{destination:"$JS.API.STREAM.DIRECT.GET.foo-read-west", weight: 100%, cluster: "west"},
{destination:"$JS.API.STREAM.DIRECT.GET.foo-read-central", weight: 100%, cluster: "central"},
{destination:"$JS.API.STREAM.DIRECT.GET.foo-read-east", weight: 100%, cluster: "east"}
],
"$JS.API.STREAM.MSG.DELETE.foo":[
{destination:"$JS.API.STREAM.MSG.DELETE.foo-read-west", weight: 100%, cluster: "west"},
{destination:"$JS.API.STREAM.MSG.DELETE.foo-read-central", weight: 100%, cluster: "central"},
{destination:"$JS.API.STREAM.MSG.DELETE.foo-read-east", weight: 100%, cluster: "east"}
],
"$JS.API.CONSUMER.MSG.NEXT.foo.*":[
{destination:"$JS.API.CONSUMER.MSG.NEXT.foo-read-west.{{wildcard(1)}}", weight: 100%, cluster: "west"},
{destination:"$JS.API.CONSUMER.MSG.NEXT.foo-read-central.{{wildcard(1)}}", weight: 100%, cluster: "central"},
{destination:"$JS.API.CONSUMER.MSG.NEXT.foo-read-east.{{wildcard(1)}}", weight: 100%, cluster: "east"}
],
"$JS.ACK.foo.>":[
{destination:"$JS.ACK.foo-read-west.>", weight: 100%, cluster: "west"},
{destination:"$JS.ACK.foo-read-central.>", weight: 100%, cluster: "central"},
{destination:"$JS.ACK.foo-read-east.>", weight: 100%, cluster: "east"}
]
}
启动服务器 Starting the servers
You can start the whole Super-Cluster locally using a simple provided script.
您可以使用提供的简单脚本启动整个超级集群。
source startservers
This script also defines 3 nats contexts to allow you to easily select which cluster you want to connect to: sc-west, sc-central and sc-east.
该脚本还定义了 3 个 nats 上下文,以便您可以轻松选择要连接的集群:sc-west、sc-central 和 sc-east。
定义本地流 Defining the local streams
After a few seconds the Super-Cluster should be up and running. You can then define, for the first time, all of the required local streams, which are configured using JSON files; a simple convenience script defines them all.
几秒钟后,超级集群应该启动并运行,然后首次定义所有必需的本地流,这些流使用 JSON 文件进行配置,并且有一个简单的便捷脚本可以定义所有这些流。
source definestreams
Taking the west cluster as an example below are the JSON stream definitions for both streams.
以西部集群为例,以下是两个流的 JSON 流定义。
The local ‘write’ stream is quite straightforward: it is named "foo-write-west" and all it needs to do is listen on the subjects "foo.west.>":
本地"写入"(write)流非常简单:它被命名为 "foo-write-west",只需要监听主题 "foo.west.>" 即可。
{
"name": "foo-write-west",
"subjects": [
"foo.west.>"
],
"retention": "limits",
"max_consumers": -1,
"max_msgs_per_subject": -1,
"max_msgs": -1,
"max_bytes": -1,
"max_age": 3600000000000,
"max_msg_size": -1,
"storage": "file",
"discard": "old",
"num_replicas": 3,
"duplicate_window": 120000000000,
"placement": {
"cluster": "west"
},
"sealed": false,
"deny_delete": false,
"deny_purge": false,
"allow_rollup_hdrs": false,
"allow_direct": false,
"mirror_direct": false
}
Note that in this example a max-age limit of 3600000000000 (1 hour) is set on the ‘write’ streams, meaning that the maximum length of a regional outage or split-brain that can be recovered from without any message write loss is 1 hour. You need a limit to ensure that the ‘write’ streams don’t just grow forever, as they only need to hold data for as long as the outage lasts; adjust this limit to fit your specific requirements.
请注意,本例中"写入"流的 max-age 限制为 3600000000000(1 小时),这意味着在不丢失任何写入消息的前提下,可以恢复的区域性中断或脑裂的最长持续时间为 1 小时。您需要设置此限制以确保"写入"流不会无限增长,因为它们只需在中断持续期间保存数据即可。您可以根据具体需求调整此限制。
The local ‘read’ stream doesn’t listen to any subjects; instead it sources all of the ‘write’ streams (see the sources array) and performs a simple subject transformation to drop the token in the subject name that contains the name of the region of origin (see the subject_transform stanza).
本地"读取"流不监听任何主题,而是从所有"写入"流获取数据(参见 sources 数组),并执行简单的主题转换,以删除主题名称中包含来源区域名称的标记(参见 subject_transform 部分)。
{
"name": "foo-read-west",
"retention": "limits",
"max_consumers": -1,
"max_msgs_per_subject": -1,
"max_msgs": -1,
"max_bytes": -1,
"max_age": 0,
"max_msg_size": -1,
"storage": "file",
"discard": "old",
"num_replicas": 3,
"duplicate_window": 120000000000,
"placement": {
"cluster": "west"
},
"subject_transform": {
"src":"foo.*.>",
"dest":"foo.>"
},
"sources": [
{
"name": "foo-write-west",
"filter_subject": "foo.west.>"
},
{
"name": "foo-write-east",
"filter_subject": "foo.east.>"
},
{
"name": "foo-write-central",
"filter_subject": "foo.central.>"
}
],
"sealed": false,
"deny_delete": false,
"deny_purge": false,
"allow_rollup_hdrs": false,
"allow_direct": false,
"mirror_direct": false
}
So, using the region ‘west’ as an example, a message published on foo.test by an application connected to the ‘west’ cluster will first be stored with the subject foo.west.test in the foo-write-west stream; the stream foo-read-west sources from foo-write-west and strips the second token of the subject, such that the message ends up being stored in that stream with the subject foo.test.
因此,以 west 区域为例,连接到 west 集群的应用程序在 foo.test 上发布的消息将首先以 foo.west.test 为主题存储在 foo-write-west 流中,而流 foo-read-west 从 foo-write-west 获取消息,并去除主题的第二个标记,因此该消息最终以 foo.test 为主题存储在该流中。

Drawing of the transformation of the subject of a message published on foo.test in region west as it makes its way from a publishing to a consuming client application.
示意图:在 west 区域发布到 foo.test 的消息,其主题从发布客户端应用程序到消费客户端应用程序的转换过程。
与全局虚拟流交互 Interacting with the global virtual stream
You can use nats --context to interact with the stream as would a client connecting to the different clusters.
您可以使用 nats --context 与流进行交互,就像客户端连接到不同的集群一样。
For example let’s connect to the ‘west’ cluster and publish a message on the subject foo.test:
例如,让我们连接到“west”集群,并发布一条主题为 foo.test 的消息:
nats --context sc-west req foo.test 'Hello world from the west region'
We use nats req rather than nats pub here in order to see the JetStream publish acknowledgement, just like a client application would when using the JetStream publish() call and checking that the PubAck does not contain an error.
这里使用 nats req 而不是 nats pub,是为了像客户端应用程序使用 JetStream publish() 调用时那样,查看 JetStream 发布确认信息,并检查 PubAck 是否包含错误。
We can then check that the message has indeed propagated to all the regions, in this example using the nats stream view command (which creates an ephemeral consumer on the stream and then iterates over it to get and display the messages).
然后,我们可以检查消息是否确实已传播到所有区域。在本例中,我们使用 nats stream view 命令(该命令会在流上创建一个临时消费者,然后遍历该消费者以获取并显示消息)。
nats --context sc-west stream view foo
You can see that the message stored in the global virtual ‘foo’ stream is indeed there with the subject foo.test which we used earlier to publish the message. Let’s check that the message has also made it to the other clusters:
您可以看到,存储在全局虚拟流“foo”中的消息确实存在,其主题为foo.test,我们之前就是用这个主题发布消息的。让我们检查一下该消息是否也已发送到其他集群:
nats --context sc-central stream view foo
和
nats --context sc-east stream view foo
You can also even do a nats stream info on the virtual stream (this will show you the info about your local ‘read’ stream), but note how nats stream ls doesn’t show the global virtual stream, but rather all of its (non-virtual) underlying local streams.
你甚至可以对虚拟流执行 nats stream info 命令(这将显示本地“读取”流的信息),但请注意,nats stream ls 命令显示的不是全局虚拟流,而是其所有(非虚拟)底层本地流。
模拟灾难 Simulating disasters
You can simulate whole regions going down by killing all of the nats-server processes for a region; there are some simple convenience scripts in the repository to kill or restart regions easily.
你可以通过终止某个区域的所有 nats-server 进程来模拟整个区域的宕机。仓库中提供了一些简单的便捷脚本,可以轻松地终止或重启区域。
终止单个区域 Killing one region
For example: let’s first kill the central region cluster
例如:我们先终止中央区域的集群。
source killcentral
Then publish a message from ‘east’
然后从 east 发布一条消息
nats --context sc-east req foo.test 'Hello world from the east region'
Check that the message made it to ‘west’
检查一下信息是否已送达 west。
nats --context sc-west stream view foo
Then restart ‘central’
然后重新启动 central
source startcentral
It may take up to a couple of seconds for the recovery to complete then check that the message is now there in ‘central’
恢复过程可能需要几秒钟才能完成,然后检查消息是否已出现在 central 位置。
nats --context sc-central stream view foo
同时杀死两个区域并模拟脑裂 Killing two regions at once and simulating a split brain
The two failure scenarios are similar and related: a split brain from the point of view of the region getting isolated is no different from both of the other two regions going down at the same time.
这两种故障场景相似且相关:从区域隔离的角度来看,脑裂与另外两个区域同时宕机并无本质区别。
The difference being that in the case of a split brain, the two other regions that can still see each other continue to operate normally (including processing new ‘writes’), and the isolated region ends up in the same ‘limited’ mode of operation as in the case when two regions go down at the same time.
不同之处在于,在脑裂的情况下,仍然可以相互通信的两个区域会继续正常运行(包括处理新的写入操作),而隔离的区域最终会进入与两个区域同时宕机时相同的“受限”运行模式。
As soon as the network partition gets resolved or as the missing regions come back up the two parts of the brain will replicate missed messages between themselves and eventually become consistent again (though not necessarily in the same order).
一旦网络分区得到解决或缺失的区域恢复运行,大脑的两个部分就会相互复制丢失的信息,最终再次保持一致(尽管顺序可能有所不同)。
In the case of two regions going down at the same time, or of being the smaller part of the split brain, the remaining region can still operate but in a ‘limited’ fashion: not all functionality will be available since the remaining nodes will be unable to elect a JetStream ‘meta leader’.
如果两个区域同时宕机,或者某个区域是脑裂中较小的部分,剩余区域仍能运行,但功能会受到限制,因为剩余节点无法选举 JetStream 的“元领导者”,导致部分功能不可用。
- Publications to the stream will still work; the only way publications to the stream in a region would stop working is if 2 of the 3 servers in that region (or 3 out of 5) go down at the same time.
  向流发布消息仍然有效;只有当该区域中 3 台服务器中的 2 台(或 5 台中的 3 台)同时宕机时,该区域内向流的发布才会停止工作。
- Get operations (e.g. what the KV ‘get’ operation uses) will still work.
  获取操作(例如,KV 的"get"操作所使用的)仍然有效。
- Getting messages from already existing consumers (at the time the second region goes down) on the stream will still work, and locally published messages will be seen in the ‘read’ stream right away.
  从流上已有的消费者(在第二个区域宕机时已存在)获取消息仍然有效,本地发布的消息会立即出现在"读取"流中。
- However, creating new consumers (or new streams) will not work.
  但是,创建新的消费者(或新的流)将无法正常工作。
First kill both ‘west’ and ‘east’
首先关掉 west 和 east
source killwest; source killeast
Publish a new message on ‘central’ (as if it were an isolated region)
在 central (如同一个孤立区域)发布一条新消息
nats --context sc-central req foo.test 'Hello world from the central region'
Then bring the ‘central’ region down and bring ‘east’ and ‘west’ back up
然后关闭 central 区域,并重新启动 east 和 west 区域
source killcentral; source startwest; source starteast
Wait up to a couple of seconds, then publish another message from one of those two regions
稍等几秒钟,然后从这两个区域之一发布另一条消息。
nats --context sc-east req foo.test 'Hello again from the east region'
Check you can create a new consumer and see that message from the other region
请检查您是否可以创建一个新的消费者,并查看来自其他区域的消息。
nats --context sc-west stream view foo
And finally resolve the split brain by restarting ‘central’
最后通过重启 central 来解决裂脑问题。
source startcentral
After a few seconds you can see that all the messages are now present in all the ‘read’ streams, though not necessarily in the same order, by comparing the output of
几秒钟后,你可以看到所有消息现在都出现在所有 read 流中,尽管顺序不一定相同,这可以通过比较输出结果来判断。
nats --context sc-west stream view foo
与
nats --context sc-central stream view foo
如果您仍然需要全局顺序怎么办? What if you still want global ordering?
If you want to retain the ability to handle client writes locally and yet still want global ordering of the messages, you can use a stretch cluster to home an ‘ordering’ stream and have the local read streams mirror that stream. The write streams remain the same and that ‘ordering’ stream sources from them. Compared to simply having the stream located in the stretch cluster and the read streams mirroring it, and having the client applications just publish directly to the stretched stream, this provides lower ‘write’ latency and higher availability but does take away the ‘compare-and-set’ functionality (e.g. the KV Update operation) that you still retain when writing directly to the stretched stream.
如果您希望保留在本地处理客户端写入操作的能力,同时又希望消息保持全局排序,可以使用扩展集群来托管一个"排序"(ordering)流,并让本地读取流镜像该流。写入流保持不变,而"排序"流从这些写入流中获取数据。与简单地将流放置在扩展集群中、让读取流镜像它、并让客户端应用程序直接发布到扩展流相比,这种方法可以降低写入延迟并提高可用性,但会失去直接写入扩展流时仍然保留的"比较并设置"功能(例如键值更新操作)。
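Here is a hedged Go sketch (nats.go, assuming an established JetStream context js) of what that could look like, reusing the walkthrough's stream names: an 'ordering' stream (the name foo-ordered and the placement tag "stretch" are assumptions) placed on the stretch cluster sources the three regional 'write' streams and strips the region token, and each regional 'read' stream then mirrors it instead of sourcing the write streams directly.
下面是一个 Go 草图(nats.go,假设已有 JetStream 上下文 js),沿用演练中的流名来说明这种做法:一个放置在扩展集群上的"排序"流(流名 foo-ordered 和放置标签 "stretch" 为假设)从三个区域"写入"流获取数据并去除区域标记,然后每个区域的"读取"流改为镜像它,而不是直接从写入流获取数据。
// Ordering stream placed on the stretch cluster: it sources the regional
// write streams and therefore establishes a single global order.
_, err := js.AddStream(&nats.StreamConfig{
	Name:      "foo-ordered",
	Replicas:  3,
	Placement: &nats.Placement{Tags: []string{"stretch"}},
	SubjectTransform: &nats.SubjectTransformConfig{
		Source:      "foo.*.>",
		Destination: "foo.>", // drop the region token, as in the read streams above
	},
	Sources: []*nats.StreamSource{
		{Name: "foo-write-west"},
		{Name: "foo-write-central"},
		{Name: "foo-write-east"},
	},
})
if err != nil {
	log.Fatal(err)
}

// Each regional read stream then mirrors the ordered stream (west shown here),
// preserving its sequence numbers and therefore its global order.
_, err = js.AddStream(&nats.StreamConfig{
	Name:      "foo-read-west",
	Replicas:  3,
	Placement: &nats.Placement{Cluster: "west"},
	Mirror:    &nats.StreamSource{Name: "foo-ordered"},
})
if err != nil {
	log.Fatal(err)
}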
结论 Conclusion
When it comes to multi-region/cloud/site/etc… active-active consistent ‘global’ deployments, NATS JetStream not only has all of the needed functionality built-in but also has extensive flexibility when it comes to replication, mirroring, sourcing, and generally creating local consistent copies of the data, including the ability to have (on a per-stream basis) the choice between immediate or eventual global consistency. And leveraging some of the new features of 2.10, you can make eventually consistent globally distributed streams in a manner that is completely transparent to the client applications, such that the client application doesn’t even need to know (e.g. need to be configured with a region name) which region it is deployed in, and yet still have both reading from, and writing to, the stream handled by the local regional NATS servers (thereby with low latency).
对于多区域/云/站点等"全局"双活(active-active)一致性部署,NATS JetStream 不仅内置了所有必要的功能,而且在复制、镜像、数据源以及创建本地一致性数据副本方面也具有极大的灵活性,包括能够(基于每个数据流)选择立即或最终的全局一致性。利用 2.10 版本的一些新特性,您可以创建对客户端应用程序完全透明的最终一致的全局分布式数据流,客户端应用程序甚至无需知道(例如,无需配置区域名称)其部署在哪个区域,并且对该流的读取和写入仍然由本地区域的 NATS 服务器处理(从而实现低延迟)。
When it comes to global distributed immediate or eventual data consistency with JetStream you can indeed have your cake and eat it too!
使用 JetStream 实现全局分布式立即或最终数据一致性,您确实可以两全其美!