(Translated from gafferongames) Packet Fragmentation and Reassembly

https://gafferongames.com/post/packet_fragmentation_and_reassembly/

Now we are ready to start putting interesting things in our packets and sending them over the network, but immediately we run into an interesting question: how big should our packets be?

To answer this question properly we need a bit of background about how packets are actually sent over the Internet.

Background

Perhaps the most important thing to understand about the internet is that there’s no direct connection between the source and destination IP address. What actually happens is that packets hop from one computer to another to reach their destination.

Each computer along this route enforces a maximum packet size called the maximum transmission unit, or MTU. According to the IP standard, if any computer receives a packet larger than its MTU, it has the option of a) fragmenting that packet, or b) dropping the packet.

So here's how this usually goes down. People write a multiplayer game where the average packet size is quite small, let's say a few hundred bytes, but every now and then when a lot of stuff is happening in their game and a burst of packet loss occurs, packets get a lot larger than usual, going above MTU for the route, and suddenly all packets start getting dropped!

Just last year (2015) I was talking with Alex Austin at Indiecade about networking in his game Sub Rosa. He had this strange networking bug he couldn’t reproduce. For some reason, players would randomly get disconnected from the game, but only when a bunch of stuff was going on. It was extremely rare and he was unable to reproduce it. Alex told me looking at the logs it seemed like packets just stopped getting through.

This sounded exactly like an MTU issue to me, and sure enough, when Alex limited his maximum packet size to a reasonable value the bug went away.

MTU in the real world

So what’s a reasonable maximum packet size?

On the Internet today (2016, IPv4) the real-world MTU is 1500 bytes.

Give or take a few bytes for the UDP/IP packet headers (28 bytes for IPv4 + UDP) and you'll find that the typical number before packets start to get dropped or fragmented is somewhere around 1472.

Why 1500? That’s the default MTU for MacOS X. It’s also the default MTU on Windows. So now we have an upper bound for your packet size assuming you actually care about packets getting through to Windows and Mac boxes without IP level fragmentation or a chance of being dropped: 1472 bytes.

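As a sanity check on the arithmetic, here's a minimal sketch, assuming a standard 20-byte IPv4 header with no options and an 8-byte UDP header:

```cpp
#include <assert.h>

const int EthernetMTU     = 1500;   // default MTU on Windows and MacOS X
const int IPv4HeaderBytes = 20;     // minimum IPv4 header, no options
const int UDPHeaderBytes  = 8;

// Maximum UDP payload that fits in one MTU-sized IPv4 packet
// without IP level fragmentation: 1500 - 20 - 8 = 1472 bytes.
const int MaxPayloadBytes = EthernetMTU - IPv4HeaderBytes - UDPHeaderBytes;
```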
So what's the lower bound? Unfortunately, for the routers in between your computer and the destination, the IPv4 standard says 576. Does this mean we have to limit our packets to 400 bytes or less? In practice, not really.

MacOS X lets me set MTU values in range 1280 to 1500 so considering packet header overhead, my first guess for a conservative lower bound on the IPv4 Internet today would be 1200 bytes. Moving forward, in IPv6 this is also a good value, as any packet of 1280 bytes or less is guaranteed to get passed on without IP level fragmentation.

This lines up with numbers that I’ve seen throughout my career. In my experience games rarely try anything complicated like attempting to discover path MTU, they just assume a reasonably conservative MTU and roll with that, something like 1000 to 1200 bytes of payload data. If a packet larger than this needs to be sent, it’s split up into fragments by the game protocol and re-assembled on the other side.

And that’s exactly what I’m going to show you how to do in this article.

Fragment Packet Structure

Let’s get started with implementation.

The first thing we need to decide is how we’re going to represent fragment packets over the network so they are distinct from non-fragmented packets.

Ideally, we would like fragmented and non-fragmented packets to be compatible with the existing packet structure we’ve already built, with as little overhead as possible in the common case when we are sending packets smaller than MTU.

Here’s the packet structure from the previous article:

[protocol id] (64 bits) // not actually sent, but used to calc crc32 
[crc32] (32 bits) 
[packet type] (2 bits for 3 distinct packet types)
(variable length packet data according to packet type) 
[end of packet serialize check] (32 bits)

In our protocol we have three packet types: A, B and C.

Let’s make one of these packet types generate really large packets:

static const int MaxItems = 4096 * 4;

struct TestPacketB : public Packet
{
    int numItems;
    int items[MaxItems];

    TestPacketB() : Packet( TEST_PACKET_B )
    {
        numItems = random_int( 0, MaxItems );
        for ( int i = 0; i < numItems; ++i )
            items[i] = random_int( -100, +100 );
    }

    template <typename Stream> bool Serialize( Stream & stream )
    {
        serialize_int( stream, numItems, 0, MaxItems );
        for ( int i = 0; i < numItems; ++i )
        {
            serialize_int( stream, items[i], -100, +100 );
        }
        return true;
    }
};

This may seem somewhat contrived but these situations really do occur. For example, if you have a strategy where you send all un-acked events from server to client and you hit a burst of packet loss, you can easily end up with packets larger than MTU, even though your average packet size is quite small.

Another common case is delta encoded snapshots in a first person shooter. Here packet size is proportional to the amount of state changed between the baseline and current snapshots for each client. If there are a lot of differences between the snapshots the delta packet is large and there’s nothing you can do about it except break it up into fragments and re-assemble them on the other side.

Getting back to packet structure. It’s fairly common to add a sequence number at the header of each packet. This is just a packet number that increases with each packet sent. I like to use 16 bits for sequence numbers even though they wrap around in about 15 minutes @ 60 packets-per-second, because it’s extremely unlikely that a packet will be delivered 15 minutes late.

Sequence numbers are useful for a bunch of things like acks, reliability and detecting and discarding out of order packets. In our case, we’re going to use the sequence number to identify which packet a fragment belongs to:

[protocol id] (64 bits)   // not actually sent, but used to calc crc32
[crc32] (32 bits)
[sequence] (16 bits)
[packet type] (2 bits)
(variable length packet data according to packet type)
[end of packet serialize check] (32 bits)

Here’s the interesting part. Sure we could just add a bit is_fragment to the header, but then in the common case of non-fragmented packets you’re wasting one bit that is always set to zero.

What I do instead is add a special fragment packet type:

enum TestPacketTypes
{
    PACKET_FRAGMENT = 0,     // RESERVED 
    TEST_PACKET_A,
    TEST_PACKET_B,
    TEST_PACKET_C,
    TEST_PACKET_NUM_TYPES
};

And it just happens to be free because four packet types fit into 2 bits. Now when a packet is read, if the packet type is zero we know it’s a fragment packet, otherwise we run through the ordinary, non-fragmented read packet codepath.

Let's design what this fragment packet looks like. We'll allow a maximum of 256 fragments per-packet and have a fragment size of 1024 bytes. This gives a maximum packet size of 256k that we can send through this system, which should be enough for anybody, but please don't quote me on this.

With a small fixed size header, UDP header and IP header, a fragment packet is well under the conservative MTU value of 1200. Plus, with 256 max fragments per-packet we can represent a fragment id in the range [0,255] and the total number of fragments per-packet [1,256] with 8 bits.

[protocol id] (64 bits)   // not actually sent, but used to calc crc32
[crc32] (32 bits)
[sequence] (16 bits)
[packet type = 0] (2 bits)
[fragment id] (8 bits)
[num fragments] (8 bits)
[pad zero bits to nearest byte index]
<fragment data>

Notice that we pad bits up to the next byte before writing out the fragment data. Why do this? Two reasons: 1) it’s faster to copy fragment data into the packet via memcpy than bitpacking each byte, and 2) we can now save a small amount of bandwidth by inferring the fragment size by subtracting the start of the fragment data from the total size of the packet.

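Here's a sketch of how the receiver can infer the fragment size by subtraction. The header sizes are assumptions derived from the layout above, not a definitive implementation:

```cpp
#include <assert.h>

// Bits in the fragment header described above: crc32 (32) + sequence (16)
// + packet type (2) + fragment id (8) + num fragments (8), padded up to
// the next byte so the fragment data can be copied with memcpy.
const int FragmentHeaderBits  = 32 + 16 + 2 + 8 + 8;
const int FragmentHeaderBytes = ( FragmentHeaderBits + 7 ) / 8;     // 9 bytes

// No explicit size field is needed: since fragment data starts on a byte
// boundary, its size is just the total packet size minus the header size.
int InferFragmentSize( int packetBytes )
{
    assert( packetBytes > FragmentHeaderBytes );
    return packetBytes - FragmentHeaderBytes;
}
```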
Sending Packet Fragments

Sending packet fragments is easy. For any packet larger than conservative MTU, simply calculate how many 1024 byte fragments it needs to be split into, and send those fragment packets over the network. Fire and forget!

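The splitting step can be sketched as follows. The function and parameter names are my own, and writing the fragment header onto each fragment before sending is omitted:

```cpp
#include <assert.h>
#include <stdint.h>
#include <string.h>

const int FragmentSize = 1024;
const int MaxFragments = 256;

// Split a large packet into up to 256 fragments of 1024 bytes each.
// The final fragment may be smaller. Returns the number of fragments,
// or 0 if the packet is too large to fragment.
int SplitPacketIntoFragments( const uint8_t * packetData,
                              int packetBytes,
                              uint8_t fragmentData[MaxFragments][FragmentSize],
                              int fragmentBytes[MaxFragments] )
{
    assert( packetData );
    assert( packetBytes > 0 );
    if ( packetBytes > MaxFragments * FragmentSize )
        return 0;
    const int numFragments = ( packetBytes + FragmentSize - 1 ) / FragmentSize;
    for ( int i = 0; i < numFragments; ++i )
    {
        const int offset = i * FragmentSize;
        const int remaining = packetBytes - offset;
        const int bytes = remaining < FragmentSize ? remaining : FragmentSize;
        memcpy( fragmentData[i], packetData + offset, bytes );
        fragmentBytes[i] = bytes;
    }
    return numFragments;
}
```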
One consequence of this is that if any fragment of that packet is lost then the entire packet is lost. It follows that if you have packet loss then sending a 256k packet as 256 fragments is not a very good idea, because the probability of dropping a packet increases significantly as the number of fragments increases. Not quite linearly, but in an interesting way that you can read more about here.

In short, to calculate the probability of losing a packet, you must calculate the probability of all fragments being delivered successfully and subtract that from one, giving you the probability that at least one fragment was dropped.

1 - probability_of_fragment_being_delivered ^ num_fragments

For example, if we send a non-fragmented packet over the network with 1% packet loss, there is naturally a 1/100 chance the packet will be dropped.

As the number of fragments increase, packet loss is amplified:

  • Two fragments: 1 - (99/100) ^ 2 = 2%
  • Ten fragments: 1 - (99/100) ^ 10 = 9.5%
  • 100 fragments: 1 - (99/100) ^ 100 = 63.4%
  • 256 fragments: 1 - (99/100) ^ 256 = 92.4%
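This calculation can be sketched in a few lines, using a 1% fragment loss rate to match the numbers above:

```cpp
#include <assert.h>
#include <math.h>

// Probability that a fragmented packet is lost: the whole packet is lost
// if at least one of its fragments is dropped.
double PacketLossProbability( double fragmentLossRate, int numFragments )
{
    return 1.0 - pow( 1.0 - fragmentLossRate, numFragments );
}
```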

So I recommend you take it easy with the number of fragments. It’s best to use this strategy only for packets in the 2-4 fragment range, and only for time critical data that doesn’t matter too much if it gets dropped. It’s definitely not a good idea to fire down a bunch of reliable-ordered events in a huge packet and rely on packet fragmentation and reassembly to save your ass.

Another typical use case for large packets is when a client initially joins a game. Here you usually want to send a large block of data down reliably to that client, for example, representing the initial state of the world for late join. Whatever you do, don’t send that block of data down using the fragmentation and re-assembly technique in this article.

Instead, check out the technique in the next article, which handles packet loss by resending fragments until they are all received.

Receiving Packet Fragments

It's time to implement the code that receives and processes packet fragments. This is a bit tricky because we have to be particularly careful of somebody trying to attack us with malicious packets.

Here’s a list of all the ways I can think of to attack the protocol:

  • Try to send out-of-bounds fragment ids to trash memory. eg: send fragments [0,255] in a packet that has just two fragments.

  • Send packet n with some maximum fragment count of say 2, and then send more fragment packets belonging to the same packet n but with a maximum fragment count of 256, hoping that you don't notice I widened the maximum number of fragments after the first fragment you received, and you trash memory.

  • Send really large fragment packets with fragment data larger than 1k, hoping to get you to trash memory as you try to copy that fragment data into the data structure, or blow your memory budget trying to allocate fragments.

  • Continually send fragments of maximum size (256/256 fragments) in the hope that I could make you allocate a bunch of memory and crash you out. Let's say you have a sliding window of 256 packets, 256 fragments per-packet max, and each fragment is 1k. That's 64 mb per-client.

  • Can I fragment the heap with a bunch of funny sized fragment packets sent over and over? Perhaps the server shares a common allocator across clients and I can make allocations fail for other clients in the game because the heap becomes fragmented.

Aside from these concerns, implementation is reasonably straightforward: store received fragments somewhere and when all fragments arrive for a packet, reassemble them into the original packet and return that to the user.

Data Structure on Receiver Side

The first thing we need is some way to store fragments before they are reassembled. My favorite data structure is something I call a sequence buffer:

const int MaxEntries = 256;

struct SequenceBuffer
{
    uint32_t sequence[MaxEntries];   // sequence number per-entry, or 0xFFFFFFFF when empty
    Entry entries[MaxEntries];       // Entry holds whatever per-packet reassembly state you need
};

Indexing into the arrays is performed with modulo arithmetic, giving us a fast O(1) lookup of entries by sequence number:

const int index = sequence % MaxEntries;

A sentinel value of 0xFFFFFFFF is used to represent empty entries. This value cannot possibly occur with 16 bit sequence numbers, thus providing us with a fast test to see if an entry exists for a given sequence number, without an additional branch to test if that entry exists.

This data structure is used as follows. When the first fragment of a new packet comes in, the sequence number is mapped to an entry in the sequence buffer. If an entry doesn’t exist, it’s added and the fragment data is stored in there, along with information about the fragment, eg. how many fragments there are, how many fragments have been received so far, and so on.

Each time a new fragment arrives, it looks up the entry by the packet sequence number. When an entry already exists, the fragment data is stored and number of fragments received is incremented. Eventually, once the number of fragments received matches the number of fragments in the packet, the packet is reassembled and delivered to the user.

Since it’s possible for old entries to stick around (potentially with allocated blocks), great care must be taken to clean up any stale entries when inserting new entries in the sequence buffer. These stale entries correspond to packets that didn’t receive all fragments.

And that’s basically it at a high level. For further details on this approach please refer to the example source code for this article. 

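Putting the pieces above together, here's a sketch of lookup and insert with stale entry cleanup. The Entry fields and function names are hypothetical, just enough to show the mechanism; a real implementation would also store the fragment data:

```cpp
#include <assert.h>
#include <stdint.h>
#include <string.h>

const int MaxEntries = 256;
const uint32_t EmptyEntry = 0xFFFFFFFF;     // sentinel: can't occur with 16 bit sequence numbers

// Hypothetical per-packet reassembly state.
struct Entry
{
    int numFragments;
    int receivedFragments;
};

struct SequenceBuffer
{
    uint32_t sequence[MaxEntries];
    Entry entries[MaxEntries];

    SequenceBuffer() { memset( sequence, 0xFF, sizeof( sequence ) ); }

    // O(1) lookup by sequence number; NULL if no entry exists.
    Entry * Find( uint16_t packetSequence )
    {
        const int index = packetSequence % MaxEntries;
        return ( sequence[index] == packetSequence ) ? &entries[index] : NULL;
    }

    // Insert an entry for a new packet sequence. If a stale entry from an
    // older packet occupies the slot (that packet never received all its
    // fragments), it is cleaned up and the slot reused.
    Entry * Insert( uint16_t packetSequence )
    {
        const int index = packetSequence % MaxEntries;
        if ( sequence[index] != EmptyEntry && sequence[index] != packetSequence )
        {
            // stale entry: free any fragment data held by entries[index] here
        }
        sequence[index] = packetSequence;
        entries[index] = Entry();
        return &entries[index];
    }
};
```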
Test Driven Development

One thing I'd like to close this article out with.

Writing a custom UDP network protocol is hard. It's so hard that even though I've done this from scratch at least 10 times, each time I still manage to fuck it up in new and exciting ways. You'd think eventually I'd learn, but this stuff is complicated. You can't just write low-level netcode and expect it to just work.

You have to test it!

My strategy when testing low-level netcode is as follows:

  • Code defensively. Assert everywhere. These asserts will fire and they’ll be important clues you need when something goes wrong.

  • Add functional tests and make sure stuff is working as you are writing it. Put your code through its paces at a basic level as you write it and make sure it's working as you build it up. Think hard about the essential cases that need to be properly handled and add tests that cover them.

  • But just adding a bunch of functional tests is not enough. There are of course cases you didn't think of! Now you have to get really mean. I call this soak testing, and I've never, not even once, coded a network protocol that hasn't subsequently had problems found in it by soak testing.

  • When soak testing, just loop forever and do a mix of random stuff that puts your system through its paces, eg. random length packets in this case with a huge amount of packet loss, out of order and duplicates through a packet simulator. Your soak test passes when it runs overnight and doesn't hang or assert.

  • If you find anything wrong with soak testing, you may need to go back and add detailed logs to the soak test to work out how you got to the failure case. Once you know what's going on, stop. Don't fix it immediately and just run the soak test again.

  • Instead, add a unit test that reproduces the problem you are trying to fix, verify your test reproduces the problem, and that the problem goes away with your fix. Only after this, go back to the soak test and make sure it runs overnight. This way the unit tests document the correct behavior of your system and can quickly be run in future to make sure you don't break this thing moving forward when you make other changes.

  • Add a bunch of logs. High-level errors and info asserts showing an overview of what is going on, but also low-level warnings and debug logs that show what went wrong after the fact. You're going to need these logs to diagnose issues that don't occur on your machine. Make sure the log level can be adjusted dynamically.

  • Implement network simulators and make sure code handles the worst possible network conditions imaginable. 99% packet loss, 10 seconds of latency and +/- several seconds of jitter. Again, you’ll be surprised how much this uncovers. Testing is the time where you want to uncover and fix issues with bad network conditions, not the night before your open beta.

  • Implement fuzz tests where appropriate to make sure your protocol doesn’t crash when processing random packets. Leave fuzz tests running overnight to feel confident that your code is reasonably secure against malicious packets and doesn’t crash.

  • Surprisingly, I've consistently found issues that only show up when I loop the set of unit tests over and over. Perhaps these issues are caused by different random numbers in tests, especially with the network simulator being driven by random numbers. This is a great way to take a rare test that fails once every few days and make it fail every time. So before you congratulate yourself on your tests passing 100%, add a mode where your unit tests can be looped easily, to uncover such errors.

  • Test simultaneously on multiple platforms. I’ve never written a low-level library that worked first time on MacOS, Windows and Linux. There are always interesting compiler specific issues and crashes. Test on multiple platforms as you develop, otherwise it’s pretty painful fixing all these at the end.

  • Think about how people can attack the protocol. Implement code to defend against these attacks. Add functional tests that mimic these attacks and make sure that your code handles them correctly.

This is my process and it seems to work pretty well. If you are writing a low-level network protocol, the rest of your game depends on this code working correctly. You need to be absolutely sure it works before you build on it, otherwise it's basically a house of cards.

In my experience, game networking is hard enough without having suspicions that your low-level network protocol has bugs that only show up under extreme network conditions. That's exactly where you need to be able to trust your code works correctly. So test it!

posted @ 2025-04-04 23:11  sun_dust_shadow