Deterministic Lockstep

In the previous article, we discussed the properties of the physics simulation we’re going to network. In this article we’ll network that physics simulation using the deterministic lockstep technique.

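The previous article discussed which variables of the physics simulation need to be sent over the network. This article discusses the deterministic lockstep technique. “Lock step” means that everyone keeps the same pace and moves forward together: whoever gets ahead cannot proceed and has to wait, and whoever falls behind has to catch up, otherwise nobody else can move either. It is like having everyone tie their legs together and walk one step at a time. The army used to train this way, and some team-building exercises still do. See this article: http://ju.outofmemory.cn/entry/119375. Judging by how it behaves, World of Tanks appears to use this kind of synchronization. For example, every tank in a match has to move in sync, but in sync with whom?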

Deterministic lockstep is a method of synchronizing a system from one computer to another by sending only the inputs that control that simulation, rather than networking the state of the objects in the simulation itself. The idea is that given initial state S(n) we run the simulation using input I(n) to get S(n+1). We then take S(n+1) and input I(n+1) to get S(n+2), repeating this process for n+3, n+4 and so on. It’s sort of like a mathematical induction where we step forward with only the input and the previous simulation state – keeping the state perfectly in sync without ever actually sending it.

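“Deterministic lockstep” is awkward to translate, so I will keep the original term. With this kind of synchronization, the data that gets sent is the system’s input rather than the state itself. The system takes the game state from the previous turn plus the input and produces the next game state, which keeps every state consistent. It is a lot like mathematical induction.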
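The induction S(n+1) = f(S(n), I(n)) can be sketched in a few lines. This is a toy illustration rather than the article’s physics engine: `ToyState`, `ToyInput`, `step` and `simulate` are hypothetical names, and integer arithmetic is assumed so that the result is trivially bit-exact.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical toy state and input; a real simulation would hold every physics object.
struct ToyState { int32_t x = 0; int32_t vx = 0; };
struct ToyInput { bool left = false; bool right = false; };

// S(n+1) = step(S(n), I(n)): the next state depends only on the previous state and the input.
ToyState step(ToyState s, const ToyInput& in) {
    if (in.left)  s.vx -= 1;
    if (in.right) s.vx += 1;
    s.x += s.vx;
    return s;
}

// Replays a list of inputs from an initial state, one simulation frame per input.
ToyState simulate(ToyState s, const std::vector<ToyInput>& inputs) {
    for (const ToyInput& in : inputs) s = step(s, in);
    return s;
}
```

Two machines that start from the same `ToyState` and call `simulate` with the same inputs end up in the same state without the state ever being transmitted.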

The main benefit of this network model is that the bandwidth required to transmit the input is independent of the number of objects in the simulation. You can network a physics simulation of one million objects with exactly the same amount of bandwidth as a simulation with just one. It’s easy to see that with the state of physics objects typically consisting of a position, orientation, linear and angular velocity (52 bytes uncompressed, assuming a quaternion for orientation and vec3 for everything else) that this can be an attractive option when you have a large number of physics objects.

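The bandwidth this network model needs is independent of the number of objects whose state must be synchronized, because only the players’ inputs are sent and the size of those inputs is fixed. Simulating one million physics objects takes the same network bandwidth as simulating one. When there are very many objects to synchronize, this is a good fit.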
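The 52 byte figure can be checked with a small sketch. The struct layout and names below are assumptions for illustration (4-byte floats, no padding), and the 60 sends per second in the usage note is only an example rate.

```cpp
#include <cstddef>

// Hypothetical uncompressed rigid body state: a vec3 for everything except orientation.
struct Vec3 { float x, y, z; };
struct Quat { float x, y, z, w; };

struct RigidBodyState {
    Vec3 position;          // 12 bytes
    Quat orientation;       // 16 bytes
    Vec3 linear_velocity;   // 12 bytes
    Vec3 angular_velocity;  // 12 bytes
};

// Holds on platforms with 4-byte floats, which is assumed here.
static_assert(sizeof(RigidBodyState) == 52, "expected 52 bytes of uncompressed state");

// Bytes per second needed to send the full state of every object at a given rate.
constexpr std::size_t state_bytes_per_second(std::size_t objects, std::size_t sends_per_second) {
    return objects * sizeof(RigidBodyState) * sends_per_second;
}
```

For example, `state_bytes_per_second(1000, 60)` is 3,120,000 bytes per second for a thousand objects, while the 6 bits of input per frame stay the same size no matter how many objects are simulated.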

To network your physics simulation using deterministic lockstep you first need to ensure that your simulation is deterministic. Determinism in this context has very little to do with free will. It simply means that given the same initial condition and the same set of inputs your simulation gives exactly the same result. And I do mean exactly the same result. Not near enough within floating point tolerance. Exact down to the bit-level. So exact you could take a checksum of your entire physics state at the end of each frame and it would be identical.

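What is a deterministic simulation? It means that the same current state plus the same input yields the same result, identical down to the bit. So at the end of each frame a checksum can be used to confirm that the physics state is consistent.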
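One way to take such a checksum is to hash the raw bytes of the physics state. The sketch below uses the 64-bit FNV-1a hash as a stand-in; the article does not say which checksum it uses, and `Body` is a hypothetical padding-free struct. Hashing raw bytes only makes sense if the state has no padding and no uninitialized fields.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical physics body made only of floats, so it contains no padding bytes.
struct Body { float px, py, pz; float vx, vy, vz; };

// 64-bit FNV-1a over a byte range.
uint64_t fnv1a(const void* data, std::size_t size) {
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    uint64_t hash = 14695981039346656037ULL;   // FNV offset basis
    for (std::size_t i = 0; i < size; ++i) {
        hash ^= bytes[i];
        hash *= 1099511628211ULL;              // FNV prime
    }
    return hash;
}

// Checksum of the whole physics state at the end of a frame.
uint64_t checksum(const std::vector<Body>& bodies) {
    if (bodies.empty()) return fnv1a(nullptr, 0);
    return fnv1a(bodies.data(), bodies.size() * sizeof(Body));
}
```

Comparing the checksum of both simulations for the same frame number shows the first frame on which they diverge, even when the difference is a single bit in one float.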

Above you can see a simulation that is almost deterministic but not quite. The simulation on the left is controlled by the player. The simulation on the right has exactly the same inputs applied with a two second delay starting from the same initial condition. Both simulations step forward with the same delta time (a necessary precondition to ensure exactly the same result) and apply the same inputs before each frame. Notice how after the smallest divergence the simulation gets further and further out of sync. This simulation is non-deterministic.

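The author made a comparison video to illustrate this. Both simulations use the same frame time and apply the same input values every frame. The simulation above looks almost deterministic, but the two sides still do not match. The left side is the result of the player’s input, and the right side is the result of the same input applied with a 2 second delay. This is not a deterministic simulation.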

What’s going on above is that the physics engine I’m using (ODE) uses a random number generator inside its solver to randomize the order of constraint processing to improve stability. It’s open source. Take a look and see! Unfortunately this breaks determinism because the simulation on the left processes constraints in a different order to the simulation on the right, leading to slightly different results.

Luckily all that is required to make ODE deterministic on the same machine, with the same compiled binary and on the same OS (is that enough qualifications?) is to set its internal random seed to the current frame number before running the simulation via dSetRandomSeed. Once this is done ODE gives exactly the same result and the left and right simulations stay in sync.

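In the example above, the physics engine uses a random number generator to improve stability. ODE provides a way to set the same random seed, which gives the same simulation result.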
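The effect of seeding with the frame number can be shown without ODE. The sketch below does not call ODE at all: it uses `std::mt19937` as a stand-in for the solver’s random number generator, and `constraint_order` is a hypothetical helper that shuffles constraint indices.

```cpp
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Returns a shuffled processing order for `count` constraints that depends only on the frame number.
std::vector<int> constraint_order(uint32_t frame, int count) {
    std::mt19937 rng(frame);                     // reseed with the frame number every frame
    std::vector<int> order(count);
    std::iota(order.begin(), order.end(), 0);    // 0, 1, 2, ...
    // Hand-written Fisher-Yates shuffle, so the result does not depend on how
    // a particular standard library implements std::shuffle.
    for (int i = count - 1; i > 0; --i) {
        int j = static_cast<int>(rng() % static_cast<uint32_t>(i + 1));
        std::swap(order[i], order[j]);
    }
    return order;
}
```

Because the seed is a function of the frame number alone, the delayed simulation gets the same constraint order for frame n as the live one, no matter when it runs that frame.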

And now a word of warning. Even though the ODE simulation above is deterministic on the same machine, that does not necessarily mean it would also be deterministic across different compilers, a different OS or different machine architectures (e.g. PowerPC vs. Intel). In fact, it’s probably not even deterministic between your debug and release build due to floating point optimizations. Floating point determinism is a complicated subject and there is no silver bullet. For more information please refer to this article.

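Note that ODE’s determinism may well break across different compilers, operating systems and CPU architectures, and even the debug and release builds can produce different results because of floating point optimizations. Floating point consistency is a big topic that I will set aside for now. That is why implementations tend to use integers for synchronization.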

Now let’s talk about implementation.

You may wonder what the input in our example simulation is and how we should network it. Well, our example physics simulation is driven by keyboard input: arrow keys apply forces to make the player cube move, holding space lifts the cube up and blows other cubes around, and holding ‘z’ enables katamari mode.

Keyboard input controls: the arrow keys apply force to the cube to move it, and holding space lifts the cube.

But how can we network these inputs? Must we send the entire state of the keyboard? Do we send events when these keys are pressed and released? No. It’s not necessary to send the entire keyboard state, only the state of the keys that affect the simulation. What about key press and release events then? No. This is also not a good strategy. We need to ensure that exactly the same input is applied on the right side when simulating frame n, so we can’t just send ‘key pressed’ and ‘key released’ events across using TCP, as they may arrive earlier or later than frame n, causing the simulations to diverge.

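So keyboard input is what gets sent. There is no need to send the whole keyboard state, only the inputs that affect the simulation result. But what exactly should be sent: key press and release events? Suppose you send such a key event on frame n. On another machine, frame n may never see that event. It may arrive early, meaning the other machine is slower and has only reached frame n-5, or a faster machine may already be at frame n+5. So there is no way to truly synchronize the input event for frame n.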

What we do instead is represent the input with a struct and at the beginning of each simulation frame on the left side, sample this struct from the keyboard and stash it away in a sliding window so we can access the input later on indexed by frame number.

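struct Input
{
    bool left;     // left arrow held
    bool right;    // right arrow held
    bool up;       // up arrow held
    bool down;     // down arrow held
    bool space;    // space held: lift the cube and blow other cubes around
    bool z;        // 'z' held: katamari mode
};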

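So what is the real approach? Wrap the input in a struct, sample the current keyboard state at the start of each frame, and store it in a send window. That way you know exactly what the input for any given frame was.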
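A minimal sketch of such a sliding window, assuming a fixed-size ring buffer indexed by frame number. `InputWindow` and its methods are hypothetical names, and the `Input` struct is repeated from the article so the block stands alone.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct Input { bool left, right, up, down, space, z; };

// Ring buffer of the most recent `Size` inputs. The slot for a frame is frame % Size,
// and the frame number is stored alongside so stale entries can be detected.
template <std::size_t Size>
class InputWindow {
public:
    void store(uint32_t frame, const Input& input) {
        Entry& e = entries_[frame % Size];
        e.frame = frame;
        e.valid = true;
        e.input = input;
    }

    // Returns nullptr if the input for `frame` was never stored or has been overwritten.
    const Input* find(uint32_t frame) const {
        const Entry& e = entries_[frame % Size];
        return (e.valid && e.frame == frame) ? &e.input : nullptr;
    }

private:
    struct Entry { uint32_t frame = 0; bool valid = false; Input input{}; };
    std::array<Entry, Size> entries_{};
};
```

On the sending side the keyboard would be sampled once at the start of each simulation frame and passed to `store`. The window size needs to cover the longest time an input can stay un-acknowledged.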

Now we send that input from the left simulation to the right simulation in a way such that the simulation on the right side knows that the input belongs to frame n. For example, if you were sending across using TCP you could simply send the inputs and nothing else, and the order of the inputs implies n. On the other side you could read the packets coming in, and process the inputs and apply them to the simulation. I don’t recommend this approach but let’s start here and I’ll show you how it can be made better.

Now the player’s input is sent over the network to the other players, so their machines know what happened on frame n.

So let’s say you’re using TCP, you’ve disabled Nagle’s algorithm and you’re sending inputs from the left to the right simulation once per-frame (60 times per-second).

So, assume we use TCP and send the input every frame.

Here it gets a little complicated. It’s not enough to just take whatever inputs arrive over the network and then run the simulation on inputs as they arrive because the result would be very jittery. You can’t just send data across the network at a certain rate and expect it to arrive nicely spaced out at exactly the same rate on the other side (e.g. 1/60th of a second apart). The internet doesn’t work like that. It makes no such guarantee.

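But the result is very unpredictable. The internet cannot guarantee a uniform transfer rate between you and the other players, so the rate at which you send and the moments at which other players receive may not line up at all.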

If you want this you have to implement something called a playout delay buffer. Unfortunately, the subject of playout delay buffers is a patent minefield. I would not advise searching for “playout delay buffer” or “adaptive playout delay” while at work. But in short, what you want to do is buffer packets for a short amount of time so they appear to be arriving at a steady rate even though in reality they arrive somewhat jittered.

What you’re doing here is similar to what Netflix does when you stream a video. You pause a little bit initially so you have a buffer in case some packets arrive late and then once the delay has elapsed video frames are presented spaced the correct time apart. Of course if your buffer isn’t large enough then the video playback will be hitchy. With deterministic lockstep your simulation will behave exactly the same way. I recommend 100-250ms playout delay. In the examples below I use 100ms because I want to minimize latency added for responsiveness.

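The recommended approach here is much like buffering video: pause briefly so that packets can accumulate, and then the local simulation can run smoothly. The term “playout delay” deserves further study.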

My playout delay buffer implementation is really simple. You add inputs to it indexed by frame, and when the very first input is received, it stores the current local time on the receiver machine and from that point on delivers all packets assuming that frame 0 starts at that time + 100ms. You’ll likely need to do something more complex for a real world situation, perhaps something that handles clock drift, and detecting when the simulation should slightly speed up or slow down to maintain a nice amount of buffering safety (being “adaptive”) while minimizing overall latency, but this is reasonably complicated and probably worth an article in itself, and as mentioned a bit of a patent minefield so I’ll leave this up to you.

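The author’s approach is simpler: inputs are stored with their frame number. When the first input arrives from the network, the receiver records its local time and from then on assumes that frame 0 starts at that local time plus 100 milliseconds.

I did not fully understand this part either and need to read up on playout delay.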
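The simple, non-adaptive buffer described above might look roughly like this. It is a sketch under stated assumptions, not the author’s code: `PlayoutDelayBuffer`, `add` and `take` are hypothetical names, time is passed in as integer microseconds so the logic is easy to test, frames are numbered from 0, and there is no handling of clock drift.

```cpp
#include <cstdint>
#include <map>

struct Input { bool left, right, up, down, space, z; };

class PlayoutDelayBuffer {
public:
    explicit PlayoutDelayBuffer(int64_t delay_us = 100000, int64_t frames_per_second = 60)
        : delay_us_(delay_us), frames_per_second_(frames_per_second) {}

    // Called when an input arrives. The first arrival fixes the time at which frame 0 plays.
    void add(int64_t now_us, uint32_t frame, const Input& input) {
        if (!started_) {
            started_ = true;
            frame_zero_us_ = now_us + delay_us_;
        }
        inputs_[frame] = input;
    }

    // Returns true and fills `out` if `frame` is due and its input has arrived.
    // Returns false if it is too early, or if the input is missing and the simulation must wait.
    bool take(int64_t now_us, uint32_t frame, Input& out) {
        if (!started_) return false;
        const int64_t due_us = frame_zero_us_ + int64_t(frame) * 1000000 / frames_per_second_;
        if (now_us < due_us) return false;
        auto it = inputs_.find(frame);
        if (it == inputs_.end()) return false;
        out = it->second;
        inputs_.erase(it);
        return true;
    }

private:
    int64_t delay_us_;
    int64_t frames_per_second_;
    bool started_ = false;
    int64_t frame_zero_us_ = 0;
    std::map<uint32_t, Input> inputs_;
};
```

The receiver calls `take` for the next frame it wants to simulate. A `false` result covers both “not yet time” and “time, but the input is late”, and the caller simply tries again on the next render frame.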

In average conditions the playout delay buffer provides a steady stream of inputs for frame n, n+1, n+2 and so on, nicely spaced 1/60th of a second apart with no drama. In the worst case the time arrives for frame n and the input hasn’t arrived yet, so the buffer returns null and the simulation is forced to wait. If packets get bunched up and delivered late, it’s possible to have multiple inputs ready to dequeue per-frame. In this case I limit to 4 simulated frames per-render frame to give the simulation a chance to catch up. If you set it much higher you may induce further hitching as you take longer than 1/60th of a second to run those frames (this can create an unfortunate feedback effect). In general, it’s important to make sure that you are not CPU bound while using the deterministic lockstep technique, otherwise you’ll have trouble running extra simulation frames to catch up.

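In the worst case, the time for frame n arrives and the input for that frame has not, so the simulation has to wait. If many inputs arrive at once, several inputs must be dequeued in a single render frame. The author limits this to at most four simulation frames per render frame. If the game is CPU-heavy this needs careful thought, because multiple simulation steps will inevitably run within one frame.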
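The catch-up rule can be sketched as a small loop. `run_render_frame` and its callback parameters are hypothetical, and `try_get_input` stands in for whatever dequeues from the playout delay buffer.

```cpp
#include <cstdint>
#include <functional>

struct Input { bool left, right, up, down, space, z; };

// Runs as many simulation frames as have inputs ready, up to max_sim_frames
// per render frame. Returns how many frames were simulated.
int run_render_frame(uint32_t& next_frame,
                     const std::function<bool(uint32_t, Input&)>& try_get_input,
                     const std::function<void(const Input&)>& simulate,
                     int max_sim_frames = 4) {
    int simulated = 0;
    while (simulated < max_sim_frames) {
        Input input{};
        if (!try_get_input(next_frame, input)) break;  // input n is missing: wait
        simulate(input);
        ++next_frame;
        ++simulated;
    }
    return simulated;
}
```

A return value of 0 means the simulation hitched this render frame, and a value above 1 means it is catching up after inputs arrived bunched together.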

Using this playout buffer strategy and sending inputs across TCP we can easily ensure that all inputs arrive reliably and in-order. This is what TCP is designed to do after all. In fact, it’s a common thing out there on the Internet for pundits to say stuff like:

Above you can see the simulation networked using deterministic lockstep over TCP at 100ms latency and 1% packet loss. If you look closely on the right side you can see infrequent hitching every few seconds. I apologize if you have hitching on both sides; that means your computer is struggling to play the video. Maybe download it and watch offline if that is the case. Anyway, what is happening here is that when a packet is lost, TCP has to wait RTT*2 before resending it (actually it can be much worse, but I’m being generous…). The hitches happen because with deterministic lockstep the right simulation can’t simulate frame n without input n, so it has to pause to wait for input n to be resent!

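The hitching caused by deterministic lockstep comes from the fact that the other machine must have the input for frame n before it can simulate frame n, so until that data arrives it can only wait. Without the input for frame N, frame N cannot be simulated or rendered at all, which feels rather scary.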

That’s not all. It gets significantly worse as the amount of latency and packet loss increases. Here is the same simulation networked using deterministic lockstep over TCP at 250ms latency and 5% packet loss:

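As latency and packet loss increase, the hitching becomes very severe. The author demonstrates the behavior at 250 ms latency plus 5% packet loss. It is like watching a slideshow.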

Now I will concede that if you have no packet loss and/or a very small amount of latency then you very well may get acceptable results with TCP. But please be aware that if you use TCP to send time critical data it degrades terribly as packet loss and latency increase.

The author’s view is that TCP works if packet loss is very low and latency is small; otherwise TCP depends too heavily on low packet loss and low latency.

Can we do better? Can we beat TCP at its own game: reliable-ordered delivery?

The answer is an emphatic YES. But only if we change the rules of the game.

Here’s the trick. We need to ensure that all inputs arrive reliably and in order. But if we just send inputs in UDP packets, some of those packets will be lost. What if, instead of detecting packet loss after the fact and resending lost packets, we just redundantly send all inputs we have stored until we know for sure that the other side has received them?

Inputs are very small (6 bits). Let’s say we’re sending 60 inputs per-second (60fps simulation) and the round trip time we know is going to be somewhere in the 30-250ms range. Let’s say just for fun that it could be up to 2 seconds worst case and at this point we’ll time out the connection (screw that guy). This means that on average we only need to include between 2-15 frames of input and worst case we’ll need 120 inputs. Worst case is 120*6 = 720 bits. That’s only 90 bytes of input! That’s totally reasonable.

We can do even better. It’s not common for inputs to change every frame. What if, when we send our packet, we instead start with the sequence number of the most recent input, the 6 bits of the first (oldest) input, and the number of un-acked inputs? Then as we iterate across these inputs to write them to the packet we can write a single bit (1) if the next input is different to the previous, and (0) if the input is the same. So if the input is different from the previous frame we write 7 bits (rare). If the input is identical we write just one (common). Where inputs change infrequently this is a big win and in the worst case this really isn’t that bad. 120 bits of extra data sent. Just 15 bytes overhead worst case.
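The encoding described above could be sketched as follows. The field widths are assumptions made for illustration (a 16-bit sequence number and an 8-bit count), `std::vector<bool>` stands in for a real bit packer, and all function names are hypothetical. The input list must be non-empty, ordered oldest first, and hold at most 255 entries.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Input { bool left, right, up, down, space, z; };
using Bits = std::vector<bool>;

// Packs the six keys into the low 6 bits of a byte.
uint8_t pack(const Input& in) {
    return uint8_t(in.left | in.right << 1 | in.up << 2 | in.down << 3 | in.space << 4 | in.z << 5);
}

Input unpack(uint8_t v) {
    Input in;
    in.left = (v & 1) != 0;   in.right = (v & 2) != 0;   in.up = (v & 4) != 0;
    in.down = (v & 8) != 0;   in.space = (v & 16) != 0;  in.z = (v & 32) != 0;
    return in;
}

void write_bits(Bits& bits, uint32_t value, int count) {
    for (int i = 0; i < count; ++i) bits.push_back(((value >> i) & 1u) != 0);
}

uint32_t read_bits(const Bits& bits, std::size_t& pos, int count) {
    uint32_t value = 0;
    for (int i = 0; i < count; ++i) value |= uint32_t(bits[pos++]) << i;
    return value;
}

// Header: newest sequence number, input count, the oldest input in full.
// Then one bit per input: 0 = same as previous, 1 = changed, followed by 6 bits.
Bits encode(uint16_t newest_sequence, const std::vector<Input>& inputs) {
    Bits bits;
    write_bits(bits, newest_sequence, 16);
    write_bits(bits, uint32_t(inputs.size()), 8);
    write_bits(bits, pack(inputs[0]), 6);
    for (std::size_t i = 1; i < inputs.size(); ++i) {
        const uint8_t current = pack(inputs[i]);
        const bool changed = current != pack(inputs[i - 1]);
        bits.push_back(changed);
        if (changed) write_bits(bits, current, 6);
    }
    return bits;
}

std::vector<Input> decode(const Bits& bits, uint16_t& newest_sequence) {
    std::size_t pos = 0;
    newest_sequence = uint16_t(read_bits(bits, pos, 16));
    const std::size_t count = read_bits(bits, pos, 8);
    uint8_t current = uint8_t(read_bits(bits, pos, 6));
    std::vector<Input> inputs(1, unpack(current));
    for (std::size_t i = 1; i < count; ++i) {
        if (read_bits(bits, pos, 1) != 0) current = uint8_t(read_bits(bits, pos, 6));
        inputs.push_back(unpack(current));
    }
    return inputs;
}
```

With this layout, 120 identical inputs cost 6 bits plus 119 flag bits after the header, and 120 inputs that change every frame cost 6 + 119 * 7 bits, which matches the roughly 120 bits of worst-case overhead mentioned above.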

Of course another packet is required from the right simulation to the left so the left side knows which inputs have been received. Each frame the right simulation reads input packets from the network before adding them to the playout delay buffer and keeps track of the most recent input it has received by frame number, or if you want to get fancy, a sequence number of only 16 bits that handles wrapping. Then after all input packets are processed, if any input was received that frame the right simulation replies back to the left simulation telling it the most recent input sequence number it has received, i.e. an “ack” or acknowledgment.

When the left simulation receives this ack it takes its sliding window of inputs and discards any inputs older than the acked sequence number. There is no need to send these inputs to the right simulation anymore because we know it has already received them. This way we typically have only a small number of inputs in flight proportional to the round trip time between the two simulations.
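Both halves of the ack exchange can be sketched together, assuming 16-bit sequence numbers that wrap. The class and function names are hypothetical. One choice made here: the sender drops everything up to and including the acked sequence number, since the acked input itself has also been received.

```cpp
#include <cstdint>
#include <deque>

struct Input { bool left, right, up, down, space, z; };

// True if s1 is more recent than s2, treating differences of more than
// half the 16-bit range as wrap-around.
bool sequence_greater_than(uint16_t s1, uint16_t s2) {
    return ((s1 > s2) && (s1 - s2 <= 32768)) ||
           ((s1 < s2) && (s2 - s1 > 32768));
}

// Receiver side: remembers the most recent input sequence number seen, to send back as the ack.
class ReceivedTracker {
public:
    void on_input(uint16_t sequence) {
        if (!has_any_ || sequence_greater_than(sequence, most_recent_)) {
            most_recent_ = sequence;
            has_any_ = true;
        }
    }
    bool has_any() const { return has_any_; }
    uint16_t most_recent() const { return most_recent_; }

private:
    bool has_any_ = false;
    uint16_t most_recent_ = 0;
};

struct PendingInput { uint16_t sequence; Input input; };

// Sender side: inputs that are re-sent in every packet until they are acked.
class UnackedInputs {
public:
    void push(uint16_t sequence, const Input& input) { pending_.push_back({sequence, input}); }

    // Drops every input that is not newer than the acked sequence number.
    void on_ack(uint16_t acked) {
        while (!pending_.empty() && !sequence_greater_than(pending_.front().sequence, acked))
            pending_.pop_front();
    }

    const std::deque<PendingInput>& pending() const { return pending_; }

private:
    std::deque<PendingInput> pending_;
};
```

Each outgoing packet carries everything in `pending()`, so a lost packet costs nothing extra: the next packet already contains the same inputs. A stale or duplicate ack is harmless because it never removes inputs newer than itself.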

We have beaten TCP by changing the rules of the game. Instead of “implementing 95% of TCP on top of UDP” we have implemented something quite different and better suited to our requirements: time critical data. We developed a custom protocol that redundantly sends all un-acked inputs that can handle large amounts of latency and packet loss without degrading the quality of the synchronization or hitching.

So exactly how much better is this approach than sending the data over TCP?

Take a look.

The video above is deterministic lockstep synchronized over UDP using this technique with 2 seconds of latency and 25% packet loss. In fact, if I just increase the playout delay buffer from 100ms to 250ms I can get the code running smoothly at 50% packet loss. Imagine how awful TCP would look in these conditions!

So in conclusion: even where TCP should have the most advantage, in the only networking model I’ll present to you in this article series that relies on reliable-ordered data, we can easily whip its ass with a simple custom protocol sent over UDP.

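The author analyzes sending the data with a reliable scheme over UDP in order to reduce network latency.

For now, though, this approach does not suit my needs.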

posted @ 2015-11-03 17:36 DesignYourDream
