(Translation of https://gafferongames.com/) Deterministic Lockstep

https://gafferongames.com/post/deterministic_lockstep/

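This article covers three main points:

1. Use UDP rather than TCP to implement input sync.

2. Implement a delay buffer on the receiver to avoid jitter caused by inputs arriving late.

3. Send inputs redundantly to handle packet loss in input sync.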

Deterministic lockstep is a method of networking a system from one computer to another by sending only the inputs that control that system, rather than the state of that system. In the context of networking a physics simulation, this means we send across a small amount of input, while avoiding sending state like position, orientation, linear velocity and angular velocity per-object.

The benefit is that bandwidth is proportional to the size of the input, not the number of objects in the simulation. Yes, with deterministic lockstep you can network a physics simulation of one million objects with the same bandwidth as just one.

While this sounds great in theory, in practice it’s difficult to implement deterministic lockstep because most physics simulations are not deterministic. Differences in floating point behavior between compilers, OS’s and even instruction sets make it almost impossible to guarantee determinism for floating point calculations.

Determinism

Determinism means that given the same initial condition and the same set of inputs your simulation gives exactly the same result. And I do mean exactly the same result.

Not close. Not near enough. Exactly the same. Exact down to the bit-level. So exact, you could take a checksum of your entire physics state at the end of each frame and it would be identical.
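A per-frame checksum like the one described above could be sketched as follows. This is only an illustration: `BodyState` and `checksumState` are hypothetical names, the state layout is assumed to be plain floats with no padding, and FNV-1a is just one convenient choice of hash.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-object physics state; a real engine has its own layout.
struct BodyState {
    float position[3];
    float orientation[4];
    float linearVelocity[3];
    float angularVelocity[3];
};

// FNV-1a over the raw bytes of every body, so the comparison is bit-exact
// rather than "close enough".
inline std::uint64_t checksumState(const std::vector<BodyState>& bodies) {
    std::uint64_t hash = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
    for (const BodyState& body : bodies) {
        unsigned char bytes[sizeof(BodyState)];
        std::memcpy(bytes, &body, sizeof(BodyState));
        for (std::size_t i = 0; i < sizeof(BodyState); ++i) {
            hash ^= bytes[i];
            hash *= 1099511628211ull;  // FNV-1a 64-bit prime
        }
    }
    return hash;
}
```

Comparing this value between the two simulations at the end of each frame gives a cheap way to notice the first frame on which they diverge.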

https://gafferongames.com/videos/deterministic_lockstep_desync.mp4

Above you can see a simulation that is almost deterministic. The simulation on the left is controlled by the player. The simulation on the right has exactly the same inputs applied with a two second delay starting from the same initial condition. Both simulations step forward with the same delta time (a necessary precondition to ensure exactly the same result) and both simulations apply the same inputs. Notice how after the smallest divergence the simulation gets further and further out of sync. This simulation is non-deterministic.

What’s going on is that the physics engine I’m using (Open Dynamics Engine) uses a random number generator inside its solver to randomize the order of constraint processing to improve stability. It’s open source. Take a look and see! Unfortunately this breaks determinism because the simulation on the left processes constraints in a different order to the simulation on the right, leading to slightly different results. Luckily all that is required to make ODE deterministic on the same machine, with the same compiled binary and on the same OS (is that enough qualifications?) is to set its internal random seed to the current frame number before running the simulation via dSetRandomSeed. Once this is done ODE gives exactly the same result and the left and right simulations stay in sync.
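To see why reseeding with the frame number restores determinism, here is a small stand-alone sketch. It does not use ODE at all: `constraintOrder` is a hypothetical function that mimics a solver shuffling its constraint order, with `std::mt19937` standing in for the engine's internal generator.

```cpp
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Returns the order in which a (hypothetical) solver would visit its
// constraints on the given frame. Because the generator is reseeded with the
// frame number, both simulations get the same order on the same frame.
inline std::vector<int> constraintOrder(std::uint32_t frameNumber, int constraintCount) {
    std::vector<int> order(constraintCount);
    std::iota(order.begin(), order.end(), 0);

    std::mt19937 rng(frameNumber);  // reseed every frame, never carry state over

    // Hand-written Fisher-Yates shuffle driven by the seeded generator.
    for (int i = constraintCount - 1; i > 0; --i) {
        int j = static_cast<int>(rng() % static_cast<std::uint32_t>(i + 1));
        std::swap(order[i], order[j]);
    }
    return order;
}
```

The order still looks random from frame to frame, which is what the solver wants for stability, but it no longer depends on how many random numbers were drawn before.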

And now a word of warning. Even though the simulation above is deterministic on the same machine, that does not necessarily mean it would also be deterministic across different compilers, a different OS or different machine architectures (eg. PowerPC vs. Intel). In fact, it’s probably not even deterministic between debug and release builds due to floating point optimizations.

Floating point determinism is a complicated subject and there’s no silver bullet.

For more information please refer to this article.

Networking Inputs

Now let’s get down to implementation.

Our example physics simulation is driven by keyboard input: arrow keys apply forces to make the player cube move, holding space lifts the cube up and blows other cubes around, and holding ‘z’ enables katamari mode.

How can we network these inputs? Must we send the entire state of the keyboard? No. It’s not necessary to send the entire keyboard state, only the state of the keys that affect the simulation. What about key press and release events then? No. This is also not a good strategy. We need to ensure that exactly the same input is applied on the right side, at exactly the same time, so we can’t just send ‘key pressed’, and ‘key released’ events over TCP.

What we do instead is represent the input with a struct and at the beginning of each simulation frame on the left side, sample this struct from the keyboard:

   struct Input
    {
        bool left;
        bool right;
        bool up;
        bool down;
        bool space;
        bool z;
    };
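Since the struct holds six booleans, it fits in six bits on the wire. A minimal sketch of packing and unpacking it, assuming an arbitrary bit order (`packInput` and `unpackInput` are hypothetical helper names, not part of the original article):

```cpp
#include <cstdint>

struct Input
{
    bool left;
    bool right;
    bool up;
    bool down;
    bool space;
    bool z;
};

// Pack the six flags into the low six bits of a byte (bit 0 = left ... bit 5 = z).
inline std::uint8_t packInput(const Input& in) {
    return static_cast<std::uint8_t>(
        (in.left  ? 1u << 0 : 0u) |
        (in.right ? 1u << 1 : 0u) |
        (in.up    ? 1u << 2 : 0u) |
        (in.down  ? 1u << 3 : 0u) |
        (in.space ? 1u << 4 : 0u) |
        (in.z     ? 1u << 5 : 0u));
}

// Inverse of packInput: rebuild the struct from the low six bits.
inline Input unpackInput(std::uint8_t bits) {
    Input in;
    in.left  = (bits & (1u << 0)) != 0;
    in.right = (bits & (1u << 1)) != 0;
    in.up    = (bits & (1u << 2)) != 0;
    in.down  = (bits & (1u << 3)) != 0;
    in.space = (bits & (1u << 4)) != 0;
    in.z     = (bits & (1u << 5)) != 0;
    return in;
}
```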

Next we send that input from the left simulation to the right simulation in a way that the simulation on the right side knows that the input belongs to frame n.

And here’s the key part: the simulation on the right can only simulate frame n when it has the input for that frame. If it doesn’t have the input, it has to wait.

For example, if you were sending across using TCP you could simply send the inputs and nothing else, and on the other side you could read the packets coming in, and each input received corresponds to one frame for the simulation to step forward. If no input arrives for a given render frame, the right side can’t advance forward, it has to wait for the next input to arrive.

So let’s move forward with TCP, you’ve disabled Nagle’s Algorithm, and you’re sending inputs from the left to the right simulation once per-frame (60 times per-second).

Here it gets a little complicated. Since we can’t simulate forward unless we have the input for the next frame, it’s not enough to just take whatever inputs arrive over the network and run the simulation on them as they arrive, because the result would be very jittery. Data sent across the network at 60Hz doesn’t typically arrive nicely spaced, 1/60th of a second between each packet.

If you want this sort of behavior, you have to implement it yourself.

Playout Delay Buffer

Such a device is called a playout delay buffer.

Unfortunately, the subject of playout delay buffers is a patent minefield. I would not advise searching for “playout delay buffer” or “adaptive playout delay” while at work. But in short, what you want to do is buffer packets for a short amount of time so they appear to be arriving at a steady rate even though in reality they arrive somewhat jittered.

What you’re doing here is similar to what Netflix does when you stream a video. You pause a little bit initially so you have a buffer in case some packets arrive late and then once the delay has elapsed video frames are presented spaced the correct time apart. If your buffer isn’t large enough then the video playback will be hitchy. With deterministic lockstep your simulation behaves exactly the same way: showing hitches when the buffer isn’t large enough to smooth out the jitter. Of course, the cost of increasing the buffer size is additional latency, so you can’t just buffer your way out of all problems. At some point the user says enough! That’s too much latency added. No sir, I will not play your game with 1 second of extra delay :)

My playout delay buffer implementation is really simple. You add inputs to it indexed by frame, and when the very first input is received, it stores the current local time on the receiver machine and from that point on delivers packets assuming they should play at that time + 100ms. You’ll likely need to do something more complex for a real world situation, perhaps something that handles clock drift and detects when the simulation should slightly speed up or slow down to maintain a nice amount of buffering safety (being “adaptive”) while minimizing overall latency, but this is reasonably complicated and probably worth an article in itself.
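A fixed-delay buffer along these lines might look like the sketch below. It is not the author's code: the class name, the method names and the use of a caller-supplied `now` in seconds are all assumptions, inputs are stored as already-packed bytes, and there is no handling of clock drift or adaptation.

```cpp
#include <cstdint>
#include <map>

class PlayoutDelayBuffer {
public:
    explicit PlayoutDelayBuffer(double delaySeconds = 0.1, double frameSeconds = 1.0 / 60.0)
        : delay_(delaySeconds), frameSeconds_(frameSeconds) {}

    // Store an input indexed by frame. The first input ever received fixes
    // the local time that all later playout times are measured from.
    void addInput(std::uint32_t frame, std::uint8_t input, double now) {
        if (!started_) {
            started_ = true;
            firstFrame_ = frame;
            startTime_ = now;
        }
        inputs_.emplace(frame, input);  // duplicates of a frame are ignored
    }

    // Returns true and fills `out` if the playout time for `frame` has been
    // reached and its input has arrived; otherwise the caller has to wait.
    bool getInput(std::uint32_t frame, double now, std::uint8_t& out) const {
        if (!started_) return false;
        double framesSinceFirst = static_cast<double>(
            static_cast<std::int64_t>(frame) - static_cast<std::int64_t>(firstFrame_));
        double playTime = startTime_ + delay_ + framesSinceFirst * frameSeconds_;
        if (now < playTime) return false;
        std::map<std::uint32_t, std::uint8_t>::const_iterator it = inputs_.find(frame);
        if (it == inputs_.end()) return false;
        out = it->second;
        return true;
    }

private:
    double delay_;
    double frameSeconds_;
    bool started_ = false;
    std::uint32_t firstFrame_ = 0;
    double startTime_ = 0.0;
    std::map<std::uint32_t, std::uint8_t> inputs_;
};
```

A real implementation would also erase inputs once they have been played; that is left out here to keep the sketch short.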

The goal is that under average conditions the playout delay buffer provides a steady stream of inputs for frame n, n+1, n+2 and so on, nicely spaced 1/60th of a second apart with no drama.

In the worst case, the time arrives for frame n and the input hasn’t arrived yet, so the buffer returns null and the simulation is forced to wait.

If packets get bunched up and delivered late, it’s possible to have multiple inputs ready to dequeue per-frame.

In this case I limit to 4 simulated frames per-render frame so the simulation has a chance to catch up, but doesn’t simulate for so long that it falls further behind, aka. the “spiral of death”.
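The catch-up rule can be sketched as a small loop. The names here are hypothetical, and `readyInputs` stands for whatever inputs the playout delay buffer has released for this render frame:

```cpp
#include <cstdint>
#include <deque>

constexpr int kMaxSimFramesPerRenderFrame = 4;

// Steps the simulation once per ready input, but never more than four times
// per render frame. Returns how many simulation frames were run; 0 means no
// input was ready and the simulation had to wait.
template <typename StepFn>
int simulateRenderFrame(std::deque<std::uint8_t>& readyInputs, StepFn step) {
    int simulated = 0;
    while (simulated < kMaxSimFramesPerRenderFrame && !readyInputs.empty()) {
        step(readyInputs.front());
        readyInputs.pop_front();
        ++simulated;
    }
    return simulated;
}
```

Anything left in the queue is simply picked up on the next render frame, so a burst of late packets is worked off over a few frames instead of stalling one frame for a long time.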

Is TCP good enough?

Using this playout buffer strategy and sending inputs across TCP we ensure that all inputs arrive reliably and in-order. This is convenient, and after all, TCP is designed for exactly this situation: reliable-ordered data.

In fact, it’s a common thing out there on the Internet for pundits to say stuff like:

But I’m here to tell you this kind of thinking is dead wrong.

https://gafferongames.com/videos/deterministic_lockstep_tcp_100ms_1pc.mp4

Above you can see the simulation networked using deterministic lockstep over TCP at 100ms latency and 1% packet loss. If you look closely on the right side you can see hitches every few seconds. What’s happening here is that each time a packet is lost, TCP has to wait RTT*2 while it is resent (actually it can be much worse, but I’m being generous…). The hitches happen because with deterministic lockstep the right simulation can’t simulate frame n without input n, so it has to pause to wait for input n to be resent!

That’s not all. It gets significantly worse as latency and packet loss increase. Here is the same simulation networked using deterministic lockstep over TCP at 250ms latency and 5% packet loss:

https://gafferongames.com/videos/deterministic_lockstep_tcp_250ms_5pc.mp4

Now I will concede that if you have no packet loss and/or a very small amount of latency then you very well may get acceptable results with TCP. But please be aware that if you use TCP it behaves terribly under bad network conditions.

Can we do better than TCP?

Can we beat TCP at its own game: reliable-ordered delivery?

The answer is an emphatic YES. But only if we change the rules of the game.

Here’s the trick. We need to ensure that all inputs arrive reliably and in order. But if we send inputs in UDP packets, some of those packets will be lost. What if, instead of detecting packet loss after the fact and resending lost packets, we redundantly include all inputs in each UDP packet until we know for sure the other side has received them?

Inputs are very small (6 bits). Let’s say we’re sending 60 inputs per-second (60fps simulation) and we know the round trip time is going to be somewhere in the 30-250ms range. Let’s say just for fun that it could be up to 2 seconds worst case and at this point we’ll time out the connection (screw that guy). This means that on average we only need to include between 2-15 frames of input and worst case we’ll need 120 inputs. Worst case is 120*6 = 720 bits. That’s only 90 bytes of input! That’s totally reasonable.
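The arithmetic above can be checked with a few compile-time helpers. The function names are made up for this sketch; the constants come straight from the paragraph.

```cpp
constexpr int kInputBits = 6;         // one Input struct packed into bits
constexpr int kInputsPerSecond = 60;  // one input per simulation frame

// Number of un-acked inputs in flight for a given round trip time, rounded up.
constexpr int inputsInFlight(int rttMilliseconds) {
    return (rttMilliseconds * kInputsPerSecond + 999) / 1000;
}

// Bytes needed to carry that many raw 6-bit inputs, rounded up.
constexpr int payloadBytes(int inputCount) {
    return (inputCount * kInputBits + 7) / 8;
}

static_assert(inputsInFlight(30) == 2, "30ms RTT needs about 2 inputs");
static_assert(inputsInFlight(250) == 15, "250ms RTT needs 15 inputs");
static_assert(inputsInFlight(2000) == 120, "2s worst case needs 120 inputs");
static_assert(payloadBytes(120) == 90, "120 inputs * 6 bits = 720 bits = 90 bytes");
```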

We can do even better. It’s not common for inputs to change every frame. What if, when we send our packet, we instead start with the sequence number of the most recent input, the 6 bits of the first (oldest) input, and the number of un-acked inputs? Then as we iterate across these inputs to write them to the packet, we write a single bit: 1 if the next input is different to the previous, and 0 if it is the same. So if the input is different from the previous frame we write 7 bits (rare). If the input is identical we write just one (common). Where inputs change infrequently this is a big win, and in the worst case this really isn’t that bad: 120 bits of extra data sent. Just 15 bytes overhead worst case.
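Here is one way the encoding could be written down. The field widths for the sequence number (16 bits) and the input count (8 bits) are assumptions made for this sketch, as are the function names; the article only fixes the 6-bit inputs and the 1-bit "changed" flag. Bits are collected in a `std::vector<bool>` for clarity rather than speed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append the low `bitCount` bits of `value`, most significant bit first.
inline void appendBits(std::vector<bool>& out, std::uint32_t value, int bitCount) {
    for (int i = bitCount - 1; i >= 0; --i) {
        out.push_back(((value >> i) & 1u) != 0);
    }
}

// `inputs` holds the un-acked 6-bit inputs, oldest first, and must not be
// empty or longer than 255 entries (the assumed 8-bit count field).
inline std::vector<bool> encodeInputs(std::uint16_t newestSequence,
                                      const std::vector<std::uint8_t>& inputs) {
    std::vector<bool> bits;
    appendBits(bits, newestSequence, 16);                              // most recent input's sequence
    appendBits(bits, static_cast<std::uint32_t>(inputs.size()), 8);    // number of un-acked inputs
    appendBits(bits, inputs[0], 6);                                    // oldest input in full
    for (std::size_t i = 1; i < inputs.size(); ++i) {
        if (inputs[i] != inputs[i - 1]) {
            appendBits(bits, 1, 1);          // changed: flag plus the new 6 bits
            appendBits(bits, inputs[i], 6);
        } else {
            appendBits(bits, 0, 1);          // unchanged: flag only
        }
    }
    return bits;
}
```

With these widths, 120 identical inputs cost 16 + 8 + 6 + 119 = 149 bits, while 120 inputs that change every frame cost 16 + 8 + 6 + 119 * 7 = 863 bits.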

Of course another packet is required from the right simulation to the left so the left side knows which inputs have been received. Each frame the right simulation reads input packets from the network before adding them to the playout delay buffer and keeps track of the most recent input it has received and sends this back to the left as an “ack” or acknowledgment for inputs.

When the left side receives this ack it discards any inputs older than the most recent received input. This way we have only a small number of inputs in flight proportional to the round trip time between the two simulations.
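On the sending side, the ack handling amounts to trimming a queue. A minimal sketch with hypothetical names, ignoring sequence number wrap-around:

```cpp
#include <cstdint>
#include <deque>

struct PendingInput {
    std::uint32_t frame;  // simulation frame this input belongs to
    std::uint8_t bits;    // packed 6-bit input
};

class InputSender {
public:
    // Called once per simulation frame on the left side.
    void queueInput(std::uint32_t frame, std::uint8_t bits) {
        PendingInput p = {frame, bits};
        pending_.push_back(p);
    }

    // The ack carries the most recent frame the right side has received.
    // Following the text, inputs older than that frame are discarded.
    void onAck(std::uint32_t mostRecentReceivedFrame) {
        while (!pending_.empty() && pending_.front().frame < mostRecentReceivedFrame) {
            pending_.pop_front();
        }
    }

    // Everything still here is written into every outgoing packet.
    const std::deque<PendingInput>& pending() const { return pending_; }

private:
    std::deque<PendingInput> pending_;
};
```

Because acks only ever remove from the front, a late or duplicated ack is harmless: it just finds nothing left to discard.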

Flawless Victory

We have beaten TCP by changing the rules of the game.

Instead of “implementing 95% of TCP on top of UDP” we have implemented something totally different and better suited to our requirements. A protocol that redundantly sends inputs because we know they are small, so we never have to wait for retransmission.

So exactly how much better is this approach than sending inputs over TCP?

Let’s take a look…

https://gafferongames.com/videos/deterministic_lockstep_udp_2sec_25pc.mp4

The video above shows deterministic lockstep synchronized over UDP using this technique with 2 seconds of latency and 25% packet loss. Imagine how awful TCP would look under these conditions. So in conclusion, even where TCP should have the most advantage, in the only networking model that relies on reliable-ordered data, we can still easily whip its ass with a simple protocol built on top of UDP.

posted @ 2025-04-01 15:21  sun_dust_shadow