Writing code that handles binary data in Erlang is a genuinely pleasant experience: the language is so expressive about binaries that you can almost forget its various inconveniences. Today let's talk about binary data handling in Erlang.

In Erlang, a bit string denotes a region of untyped memory and is written with the bit syntax. If the number of bits in a bit string is a multiple of 8, it is called a binary. Don't underestimate bit strings: being able to address data at arbitrary bit widths is enormously convenient when parsing protocols. Imagine what parsing a binary protocol would look like without such a facility. (Early Erlang apparently did not provide bit strings; fortunately, by the time we needed them, they were there.)

 

Let's start by writing a series of demos in the Erlang shell. Note: put a space on each side of the = sign, otherwise a sequence like `B=<<1>>` is tokenized with the `=<` (less-than-or-equal) operator and causes a syntax error:

1> Bin1 = <<1,17,42>>. % try <<M:8,P:8,Q:8>> = <<1,17,42>>.
<<1,17,42>>
2> Bin2 = <<"abc">>. % try <<"ABC">> == <<"A","B","C">>.
<<97,98,99>> % try <<"abc"/utf8>> == <<$a/utf8,$b/utf8,$c/utf8>>.
3> Bin3 = <<1,17,42:16>>.
<<1,17,0,42>>
4> <<A,B,C:16>> = <<1,17,42:16>>.
<<1,17,0,42>>
5> C.
42
6> <<D:16,E,F>> = <<1,17,42:16>>. % 256*1+17=273
<<1,17,0,42>>
7> D.
273
8> F.
42
9> <<G,H/binary>> = <<1,17,42:16>>.
<<1,17,0,42>>
10> H.
<<17,0,42>>
%% f() forgets all bound variables (needed here before rebinding G and H below).
11> <<G,H/bitstring>> = <<1,17,42:12>>.
<<1,17,2,10:4>>
12> H.
<<17,2,10:4>>
13> <<1024/utf8>>.
<<208,128>>

14> << P,Q/bitstring >> = <<1:1,12:7,3:3>>.
<<140,3:3>>

15> << 1:1,0:3>>.
<<8:4>>

16> <<B1/binary,B2/binary>> = << 8,16>>.
* 1: a binary field without size is only allowed at the end of a binary pattern

The exception here is raised because we didn't give B1 a size:

In matching, this default value is only valid for the very last element. All other bit string or binary elements in the matching must have a size specification.

Let's fix the code and try again:

27> <<B1:2/binary,B2/binary>> = << 8,16>>.
<<8,16>>
28> B1.
<<8,16>>
29> B2.
<<>>
30> <<B3:1/binary,B4/binary>> = << 8,16>>.
<<8,16>>
31> B3.
<<"\b">>
32> B4.
<<16>>

We can even try bit string comprehensions; by analogy with list comprehensions, they are not hard to understand:

33> << <<(X*2)>> || <<X>> <= <<1,2,3>> >>.
<<2,4,6>>
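Bit string comprehensions also work below the byte level. A sketch in the same shell style (prompt numbers illustrative): a generator of the form `<<X:4>> <= Bin` walks a binary 4 bits at a time, and the reverse direction packs small values back into a bit string.

```erlang
%% 16#AB is 1010 1011, 16#CD is 1100 1101: four 4-bit nibbles in total.
34> [X || <<X:4>> <= <<16#AB,16#CD>>].
[10,11,12,13]
%% Packing 4-bit values: 0001 0010 0011 0100 = bytes 18 and 52.
35> << <<X:4>> || X <- [1,2,3,4] >>.
<<18,52>>
```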

 

As mentioned above, bit-level fields are very common when parsing network protocols and file formats — for example in mochi's erl_img project: https://github.com/mochi/erl_img/blob/master/src/image_gif.erl

 

read(Fd,IMG,RowFun,St0) ->
    file:position(Fd, 6),
    case file:read(Fd, 7) of
        {ok, <<_Width:16/little, _Hight:16/little,
              Map:1, _Cr:3, Sort:1, Pix:3,
              Background:8,
              AspectRatio:8>>} ->
            Palette = read_palette(Fd, Map, Pix+1),
            ?dbg("sizeof(palette)=~p Map=~w, Cr=~w, Sort=~w, Pix=~w\n",
                 [length(Palette),Map,_Cr,Sort,Pix]),
            ?dbg("Background=~w, AspectRatio=~w\n",
                 [Background, AspectRatio]),
            As = [{'Background',Background},
                  {'AspectRatio',AspectRatio},
                  {'Sort',Sort} | IMG#erl_image.attributes],
            IMG1 = IMG#erl_image { palette = Palette, attributes = As},
            read_data(Fd, IMG1, RowFun, St0, []);
        Error ->
            Error
    end.
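Note how the packed byte of the GIF logical screen descriptor is pulled apart: Map:1, _Cr:3, Sort:1 and Pix:3 together consume exactly 8 bits. A minimal sketch of the same idea in a fresh shell (the byte value is chosen purely for illustration):

```erlang
%% The byte 2#10010111 packs four fields of 1 + 3 + 1 + 3 bits.
1> <<Map:1, Cr:3, Sort:1, Pix:3>> = <<2#10010111>>, {Map, Cr, Sort, Pix}.
{1,1,0,7}
```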

  

 


Bit Syntax Expressions


    Having written these demos, we now have a basic feel for Erlang's expressive power over binary data. Below is the specification of the bit syntax; a few fundamentals are worth spelling out:

<<>>
<<E1,...,En>>

Ei = Value |
Value:Size |
Value/TypeSpecifierList |
Value:Size/TypeSpecifierList



Type= integer | float | binary | bytes | bitstring | bits | utf8 | utf16 | utf32

Signedness= signed | unsigned (only meaningful for integers; the default is unsigned)

Endianness= big | little | native (the default is big)

Unit= unit:IntegerLiteral

  unit is the size in bits of each unit of the segment; the allowed range is 1..256
  Size multiplied by unit gives the number of bits the segment occupies, and the total must be evenly divisible by 8
  unit is usually used to ensure byte alignment.


 The default Type is integer. bytes is shorthand for binary; bits is shorthand for bitstring.

         The type says how the binary data is to be interpreted; how the data is used determines what it means.



Unit ranges over 1..256; the default is 1 for integer, float, and bitstring, and 8 for binary. The utf8, utf16, and utf32 types take no unit specification.

Size is the number of units in the segment. The default depends on the type: 8 for integers, 64 for floats.

Endianness defaults to big. Endianness only matters when Type is integer, utf16, utf32, or float; it determines how the binary data is read. There is also the native option, which is resolved at load time to big or little depending on the CPU the Erlang VM is running on.



How to compute the number of bits a segment occupies: Size * unit = number of bits

6> <<25:4/unit:8>>.
<<0,0,0,25>>
7> <<25:2/unit:16>>.
<<0,0,0,25>>
8> <<25:1/unit:32>>.
<<0,0,0,25>>

  



Items in a TypeSpecifierList are joined with a hyphen (-). Any type specifier that is omitted takes its default value.
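To make the hyphen-joined specifier list concrete, here is a small shell sketch combining type, signedness, and endianness (prompt numbers illustrative):

```erlang
%% little-endian, signed, 16-bit: the bytes <<255,255>> read as 0xFFFF = -1
1> <<X:16/little-signed-integer>> = <<255,255>>, X.
-1
%% the same bytes, big-endian and unsigned
2> <<Y:16/big-unsigned-integer>> = <<255,255>>, Y.
65535
%% constructing: the low-order byte comes first with little
3> <<1:16/little>>.
<<1,0>>
```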

 
The specification above raises two questions:

 [Question 1] Type includes utf8, utf16 and utf32. The documentation gives the explanation below; how should we understand it?

For the utf8, utf16, and utf32 types, Size must not be given. The size of the segment is implicitly determined by the type and value itself.

For utf8, Value will be encoded in 1 through 4 bytes. For utf16, Value will be encoded in 2 or 4 bytes. Finally, for utf32, Value will always be encoded in 4 bytes.
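A quick way to see these variable-length sizes is to build one-character binaries and check byte_size/1 (a shell sketch; the code points are chosen for illustration):

```erlang
1> byte_size(<<$a/utf8>>).        %% U+0061, in the ASCII range
1
2> byte_size(<<16#4E2D/utf8>>).   %% U+4E2D, a CJK character
3
3> byte_size(<<16#1F600/utf8>>).  %% U+1F600, outside the BMP
4
4> byte_size(<<16#1F600/utf16>>). %% a surrogate pair in UTF-16
4
```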

 [Question 2] What is endianness? What are big-endian and little-endian?
Native-endian means that the endianness will be resolved at load time to be either big-endian or little-endian,
depending on what is native for the CPU that the Erlang machine is run on. Endianness only matters when the Type is either integer, utf16, utf32, or float. The default is big.
 

 The first question:

  Let's work through it from the beginning; thanks to Wikipedia for its thorough write-ups:

   A byte has 8 bits and can express 256 states. The most familiar encoding, ASCII (American Standard Code for Information Interchange), is the most widely used single-byte encoding; it maps 128 characters to bit patterns. Since only the lower 7 bits of a byte are needed, the highest bit is always 0. In Erlang we can get a character's ASCII value with the $ prefix. Wikipedia on ASCII: http://zh.wikipedia.org/wiki/Ascii

      From the Wikipedia description we can see ASCII's limitation: its expressive power is restricted to modern English and is clearly insufficient for other languages. The obvious first idea is to use the unused high bit to express more symbols, so that 0-127 keep their meanings while 128-255 represent different symbols depending on the language. This simple extension scheme is called EASCII, and it can just about cover the Western European languages. The EASCII story is here: http://zh.wikipedia.org/wiki/EASCII

    EASCII naturally leads one to think about representing Chinese characters. There are a huge number of them, and a single byte is obviously not enough. The Chinese encoding we know best is GB2312, which uses two bytes to encode 6763 Chinese characters, covering 99.75% of the characters in everyday use in mainland China. Rare characters in personal names and classical Chinese, as well as traditional characters, are not covered, which gave rise to the GBK and GB18030 encodings. I remember a college classmate named 孟龑 — the character 龑 is not in GB2312's repertoire. The GB2312 backstory is here: http://zh.wikipedia.org/wiki/Gb2312

     The same binary data parsed under different encoding conventions yields different symbols, and decoding with the wrong encoding produces mojibake. The ideal solution is a single unified encoding standard. Unicode is an industry standard in computer science designed to represent and process most of the world's writing systems uniformly by assigning each character a code. Having read the material above, a question naturally arises: how many bytes does Unicode use per character? Unicode only assigns codes to symbols; it does not dictate how they are represented. A character's Unicode code point is fixed, but in actual transmission — because different platforms differ in design, and to save space — the implementations of Unicode differ. An implementation of Unicode is called a Unicode Transformation Format (UTF), and UTF-8 is one of them.

  UTF-8 stores Unicode using a variable number of bytes. If a file containing only basic 7-bit ASCII characters were transmitted with every character in the original 2-byte Unicode encoding, the first byte of each character would always be 0 — quite wasteful. UTF-8 addresses this: it is a variable-length encoding that keeps basic 7-bit ASCII characters in a single byte (with the high bit 0), while other Unicode characters are transformed by a fixed algorithm into one to four bytes, distinguished by the leading bits. This saves a great deal of space for Western documents dominated by 7-bit ASCII (see UTF-8 for the exact scheme). Unicode: http://zh.wikipedia.org/wiki/Unicode UTF-8: http://zh.wikipedia.org/wiki/UTF-8

   So the answer to the first question: since these are variable-length encodings, the type together with the value determines the number of bytes occupied. Incidentally, a BOM problem I once ran into when generating files:

   What is a BOM? The byte-order mark (BOM) is the name of the Unicode character at code point U+FEFF. When a string of UCS/Unicode characters is encoded in UTF-16 or UTF-32, this character is used to mark the byte order. It is also commonly used as a marker indicating that a file is encoded in UTF-8, UTF-16 or UTF-32. Here is an article discussing how to handle the BOM in C# (reading, writing, and stripping it): http://www.cnblogs.com/mgen/archive/2011/07/13/2105649.html
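In Erlang the BOM bytes for each encoding can be obtained from the stdlib unicode module (a shell sketch, assuming unicode:encoding_to_bom/1 from OTP's stdlib):

```erlang
1> unicode:encoding_to_bom(utf8).          %% EF BB BF
<<239,187,191>>
2> unicode:encoding_to_bom({utf16,big}).   %% FE FF
<<254,255>>
3> unicode:encoding_to_bom({utf16,little}). %% FF FE
<<255,254>>
```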

The second question:

   Wikipedia on endianness: http://zh.wikipedia.org/wiki/%E5%AD%97%E8%8A%82%E5%BA%8F Amusingly, the term comes from Gulliver's Travels: in the novel, the Lilliputians argue over whether a boiled egg should be opened from the big end (Big-End) or the little end (Little-End), the two camps being called Big-endians and Little-endians. In 1980, Danny Cohen borrowed the term in his famous paper "On Holy Wars and a Plea for Peace" to settle an argument about the order in which bytes should be transmitted.

Endianness, also called byte order, refers in computer science to the order of the bytes of a multi-byte value — typically how integers are laid out in memory and the order in which they are transmitted over a network. Endianness can sometimes also refer to bit order.

Generally speaking, the byte order indicates which byte of a UCS-2 character is stored at the lower address. If the least significant byte comes before the most significant byte, i.e. the LSB is at the lower address, the order is little-endian; otherwise it is big-endian. In network programming byte order must be taken into account, because different processor architectures may use different byte orders; in cross-platform code it can cause bugs that are hard to spot. Network transmission generally uses big-endian, also called network byte order; the IP protocol defines big-endian as the network byte order.

   Now the passage from the Erlang documentation makes sense: network byte order is big-endian, and the default byte order in Erlang's bit syntax is also big, so when implementing network protocols we don't need to specify the option explicitly. Native-endian means the byte order is resolved at load time.
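The difference is easy to see directly in the bit syntax, storing the integer 72 on 4 bytes (a shell sketch):

```erlang
1> <<72:32/big>>.      %% high-order byte first
<<0,0,0,72>>
2> <<72:32/little>>.   %% low-order byte first
<<72,0,0,0>>
3> <<72:32/native>>.   %% resolved at load time from the CPU
<<72,0,0,0>>           %% on a little-endian CPU such as x86
```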

  Byte order is a universal problem; see Lao Zhao's article: On Byte Order and Related Operations.

Finally, here is the answer to a question often asked in Erlang chat groups: how to parse a string out of binary data:

read_string(Bin) ->
    case Bin of
        <<Len:16, Bin1/binary>> ->
            case Bin1 of
                <<Str:Len/binary-unit:8, Rest/binary>> ->
                    {binary_to_list(Str), Rest};
                _R1 ->
                    {[], <<>>}
            end;
        _R1 ->
            {[], <<>>}
    end.
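The reverse direction, writing a length-prefixed string, is symmetric (write_string/1 is a hypothetical helper name, sketched here for illustration):

```erlang
%% Prefix a string with its byte length as a 16-bit big-endian integer,
%% i.e. exactly the framing that read_string/1 above expects.
write_string(Str) ->
    Bin = list_to_binary(Str),
    <<(byte_size(Bin)):16, Bin/binary>>.
```

With this helper, read_string(write_string("abc")) round-trips to {"abc", <<>>}.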

The internal implementation of binaries

  • binaries and bitstrings share the same internal implementation
  • internally Erlang has four binary types: two containers and two reference types
  • the containers are refc binaries and heap binaries
  • a refc binary consists of two parts: a ProcBin stored on the process heap — metadata describing the binary, holding its location and taking part in reference counting — and the binary object itself, stored outside all process heaps
  • the binary object living outside the process heaps can be referenced by any number of ProcBins from any number of processes; it carries a reference counter and can be removed once the counter drops to zero
  • all ProcBin objects in a process are part of a linked list, so the garbage collector can track them and decrement the binary's reference counter when a ProcBin disappears
  • heap binaries are small binaries, at most 64 bytes, stored directly on the process heap; they are copied on garbage collection and when sent as a message, and need no special handling from the garbage collector
  • there are two reference types: sub binaries and match contexts
  • a sub binary is created by split_binary/2 and when a binary is matched out in a pattern; it is a reference into part of another binary (a refc or heap binary), so since no data is copied, matching out a binary is quite cheap
  • a match context is similar to a sub binary but optimized for binary matching; for example, it holds a direct pointer to the binary data, and as each field is matched out of the binary the position pointer is simply advanced.
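The match-context optimization is what makes single-pass parsers like the following cheap — the compiler can keep one match context alive across the recursive calls instead of creating a sub binary per step (a sketch; see the efficiency guide linked below for the exact conditions):

```erlang
%% Sum all bytes of a binary in one pass. Because Rest is only passed
%% straight into the recursive call and matched again there, the compiler
%% can reuse the match context rather than build a sub binary per byte.
sum_bytes(Bin) -> sum_bytes(Bin, 0).

sum_bytes(<<X, Rest/binary>>, Acc) -> sum_bytes(Rest, Acc + X);
sum_bytes(<<>>, Acc) -> Acc.
```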

Official documentation: http://www.erlang.org/doc/efficiency_guide/binaryhandling.html

Internally, binaries and bitstrings are implemented in the same way.

There are four types of binary objects internally. Two of them are containers for binary data and two of them are merely references to a part of a binary.

 
The binary containers are called refc binaries (short for reference-counted binaries) and heap binaries.
Refc binaries consist of two parts: an object stored on the process heap, called a ProcBin, and the binary object itself stored outside all process heaps. The binary object can be referenced by any number of ProcBins from any number of processes; the object contains a reference counter to keep track of the number of references, so that it can be removed when the last reference disappears.

All ProcBin objects in a process are part of a linked list, so that the garbage collector can keep track of them and decrement the reference counters in the binary when a ProcBin disappears.

Heap binaries are small binaries, up to 64 bytes, that are stored directly on the process heap. They will be copied when the process is garbage collected and when they are sent as a message. They don't require any special handling by the garbage collector.

There are two types of reference objects that can reference part of a refc binary or heap binary. They are called sub binaries and match contexts.

A sub binary is created by split_binary/2 and when a binary is matched out in a binary pattern. A sub binary is a reference into a part of another binary (refc or heap binary, never into another sub binary). Therefore, matching out a binary is relatively cheap because the actual binary data is never copied.
A match context is similar to a sub binary, but is optimized for binary matching; for instance, it contains a direct pointer to the binary data. For each field that is matched out of a binary, the position in the match context will be incremented.

 

Endianness
Possible values: big | little | native
Endianness only matters when the Type is either integer, utf16, utf32, or float. This has to do with how the system reads binary data. As an example, the BMP image header format holds the size of its file as an integer stored on 4 bytes. For a file that has a size of 72 bytes, a little-endian system would represent this as <<72,0,0,0>> and a big-endian one as <<0,0,0,72>>. One will be read as '72' while the other will be read as '1207959552', so make sure you use the right endianness. There is also the option to use 'native', which will choose at run-time if the CPU uses little-endianness or big-endianness natively. By default, endianness is set to 'big'.
Unit
written unit:Integer
This is the size of each segment, in bits. The allowed range is 1..256 and is set by default to 1 for integers, floats and bit strings and to 8 for binary. The utf8, utf16 and utf32 types require no unit to be defined. The multiplication of Size by Unit is equal to the number of bits the segment will take and must be evenly divisible by 8. The unit size is usually used to ensure byte-alignment.
The TypeSpecifierList is built by separating attributes by a '-'.

More on UTF-8: http://www.zehnet.de/2005/02/12/unicode-utf-8-tutorial/

 

 

An Essay on Endian Order

http://people.cs.umass.edu/~verts/cs32/endian.html
Copyright (C) Dr. William T. Verts, April 19, 1996

Depending on which computing system you use, you will have to consider the byte order in which multibyte numbers are stored, particularly when you are writing those numbers to a file. The two orders are called "Little Endian" and "Big Endian".

The Basics

"Little Endian" means that the low-order byte of the number is stored in memory at the lowest address, and the high-order byte at the highest address. (The little end comes first.) For example, a 4 byte LongInt

    Byte3 Byte2 Byte1 Byte0

will be arranged in memory as follows:

    Base Address+0   Byte0
    Base Address+1   Byte1
    Base Address+2   Byte2
    Base Address+3   Byte3

Intel processors (those used in PC's) use "Little Endian" byte order.

"Big Endian" means that the high-order byte of the number is stored in memory at the lowest address, and the low-order byte at the highest address. (The big end comes first.) Our LongInt, would then be stored as:

    Base Address+0   Byte3
    Base Address+1   Byte2
    Base Address+2   Byte1
    Base Address+3   Byte0

Motorola processors (those used in Mac's) use "Big Endian" byte order.

Which is Better?

You may see a lot of discussion about the relative merits of the two formats, mostly religious arguments based on the relative merits of the PC versus the Mac. Both formats have their advantages and disadvantages.

In "Little Endian" form, assembly language instructions for picking up a 1, 2, 4, or longer byte number proceed in exactly the same way for all formats: first pick up the lowest order byte at offset 0. Also, because of the 1:1 relationship between address offset and byte number (offset 0 is byte 0), multiple precision math routines are correspondingly easy to write.

In "Big Endian" form, by having the high-order byte come first, you can always test whether the number is positive or negative by looking at the byte at offset zero. You don't have to know how long the number is, nor do you have to skip over any bytes to find the byte containing the sign information. The numbers are also stored in the order in which they are printed out, so binary to decimal routines are particularly efficient.

What does that Mean for Us?

What endian order means is that any time numbers are written to a file, you have to know how the file is supposed to be constructed. If you write out a graphics file (such as a .BMP file) on a machine with "Big Endian" integers, you must first reverse the byte order, or a "standard" program to read your file won't work.

The Windows .BMP format, since it was developed on a "Little Endian" architecture, insists on the "Little Endian" format. You must write your Save_BMP code this way, regardless of the platform you are using.

Common file formats and their endian order are as follows:

  • Adobe Photoshop -- Big Endian
  • BMP (Windows and OS/2 Bitmaps) -- Little Endian
  • DXF (AutoCad) -- Variable
  • GIF -- Little Endian
  • IMG (GEM Raster) -- Big Endian
  • JPEG -- Big Endian
  • FLI (Autodesk Animator) -- Little Endian
  • MacPaint -- Big Endian
  • PCX (PC Paintbrush) -- Little Endian
  • PostScript -- Not Applicable (text!)
  • POV (Persistence of Vision ray-tracer) -- Not Applicable (text!)
  • QTM (Quicktime Movies) -- Little Endian (on a Mac!)
  • Microsoft RIFF (.WAV & .AVI) -- Both
  • Microsoft RTF (Rich Text Format) -- Little Endian
  • SGI (Silicon Graphics) -- Big Endian
  • Sun Raster -- Big Endian
  • TGA (Targa) -- Little Endian
  • TIFF -- Both, Endian identifier encoded into file
  • WPG (WordPerfect Graphics Metafile) -- Big Endian (on a PC!)
  • XWD (X Window Dump) -- Both, Endian identifier encoded into file

Correcting for the Non-Native Order

It is pretty easy to reverse a multibyte integer if you find you need the other format. A single function can be used to switch from one to the other, in either direction. A simple and not very efficient version might look as follows:

    Function Reverse (N:LongInt) : LongInt ;
    Var B0, B1, B2, B3 : Byte ;
    Begin
        B0 := N Mod 256 ;  N := N Div 256 ;
        B1 := N Mod 256 ;  N := N Div 256 ;
        B2 := N Mod 256 ;  N := N Div 256 ;
        B3 := N Mod 256 ;
        Reverse := (((B0 * 256 + B1) * 256 + B2) * 256 + B3) ;
    End ;

A more efficient version that depends on the presence of hexadecimal numbers, bit masking operators AND, OR, and NOT, and shift operators SHL and SHR might look as follows:

    Function Reverse (N:LongInt) : LongInt ;
    Var B0, B1, B2, B3 : Byte ;
    Begin
        B0 := (N AND $000000FF) SHR 0 ;
        B1 := (N AND $0000FF00) SHR 8 ;
        B2 := (N AND $00FF0000) SHR 16 ;
        B3 := (N AND $FF000000) SHR 24 ;
        Reverse := (B0 SHL 24) OR (B1 SHL 16) OR (B2 SHL 8) OR (B3 SHL 0) ;
    End ;

There are certainly more efficient methods, some of which are quite machine and platform dependent. Use what works best.

 


I've re-typeset this post a bit. Good night!