上面两篇文章都提到了二进制数据相关的内容,今天阅读了一下官方文档中关于二进制实现的内容,整理笔记于此.先看一张图:

 

 

思维导图解释

  • binary和bitstring内部实现机制相同
  • Erlang内部有四种二进制类型,两种容器,两种引用
  • 容器有refc binaries (引用计数二进制)和 heap binaries
  • refc binaries又可以分成两部分存放在进程堆(process heap)的ProcBin和进程堆以外的二进制对象
  • ProcBin包含一个二进制数据的元数据信息,包含了二进制数据的位置和引用计数
  • 游离在进程堆之外的二进制对象可以被任意数量的进程和任意数量的ProcBin引用,该对象包含了引用计数器,一旦计数器归零就可以移除掉
  • 所有的ProcBin对象都是链表(Linked List)的一部分,所以GC跟踪它们并在ProcBin消失的时候将应用计数减一
  • heap binaries 都是小块二进制数据,最大64字节,直接存放在进程堆(process heap),垃圾回收和发送消息都是通过拷贝实现,不需要垃圾回收器做特殊处理
  • 有两种可以引用refc binary和heap binary部分数据的对象:sub binaries , match contexts
  • sub binary是split_binary的时候产生的,sub binary是另外一个二进制数据的部分应用(refc 或者 heap binary),由于并没有数据拷贝所以binary的模式匹配成本相当低
  • match context类似sub binary,但是针对二进制匹配做了优化;例如它包含一个直接指向二进制数据的指针.从二进制匹配出来字段值之后移动指针位置即可.

 

2014-8-28 16:28:46更新  erts_debug获取二进制数据的信息

10> erts_debug:set_internal_state(available_internal_state, true). 

=ERROR REPORT==== 28-Aug-2014::16:23:08 ===
Process <0.49.0> enabled access to the emulator internal state.
NOTE: This is an erts internal test feature and should *only* be used by OTP test-suites.

false

16> erts_debug:get_internal_state({binary_info,<<"sasdasd">>}).
{refc_binary,7,{binary,256},3}
17> erts_debug:get_internal_state({binary_info,<<"sasdasdsdfsd">>}).
{refc_binary,12,{binary,256},3}
18> 

  

 

2014-2-27 20:56:30补充更新

match context VS sub binary


my_binary_to_list(<<H,T/binary>>) ->
    [H|my_binary_to_list(T)];
my_binary_to_list(<<>>) -> [].

  

   第一次调用my_binary_to_list/1会创建match context,指向该二进制数据的第一个字节,一个字节match完成之后,指针移动到第二个字节;在R12B之前的版本会创建sub binary,在R12B编译器发现没有必要创建sub binary因为之后马上进行方法调用sub binary被马上丢弃掉,所以R12B my_binary_to_list/1选择了只创建match context,后续匹配操作发现传递过来的是match context就不再创建什么sub binary.匹配到二进制结束,执行到了第二个逻辑分支,就会抛弃掉match context(引用计数为0,下一次GC的时候移除).总结一下,在R12B只需要创建一个match context,不需要创建sub binary.在之前的版本,如果该二进制数据有N个字节,需要创建N+1个match context和N个sub binary.这样在之前的版本里面需要掌握的编程技巧就是避免创建sub binary,怎么避免?在函数调用的时候避免使用二进制match,然后手工模拟指针移动即可,这个方法很巧妙的避免了sub binary的创建,但是没有解决match context的创建(还是N+1个):


my_complicated_binary_to_list(Bin) ->
    my_complicated_binary_to_list(Bin, 0).

my_complicated_binary_to_list(Bin, Skip) ->
    case Bin of
        <<_:Skip/binary,Byte,_/binary>> ->
            [Byte|my_complicated_binary_to_list(Bin, Skip+1)];
        <<_:Skip/binary>> ->
            []
    end.

  


上面提到如果match遍历到数据结束就抛弃match context,如果是中途停止迭代,比如下面这个,编译器会会在第一个分支创建sub binary,移除第二个分支的sub binary
 
after_zero(<<0,T/binary>>) ->
    T;
after_zero(<<_,T/binary>>) ->
    after_zero(T);
after_zero(<<>>) ->
    <<>>.

  

 
 
编译器能做的是如果它能确定binary不会被共享就延迟创建sub binary.
 
是否经过优化了,不需要人肉判断,通过bin_opt_info选项就可以输出相关信息:

erlc +bin_opt_info Mod.erl
或者
export ERL_COMPILER_OPTIONS=bin_opt_info
 
 
 

什么时候会触发二进制数据的强制拷贝?

 

   

在Erlang中,下面的代码在R12B之前是有问题的:

my_list_to_binary(List) ->
    my_list_to_binary(List, <<>>).

my_list_to_binary([H|T], Acc) ->
    my_list_to_binary(T, <<Acc/binary,H>>);
my_list_to_binary([], Acc) ->
    Acc.

 在R12B之前,ACC会在每次迭代中复制,在R12B及其之后的版本中只是在第一次被复制,然后分配额外的空间,每一次都在其后面追加.后续迭代的时候,H就会写入额外的空间.额外空间用尽之后,会再次触发分配额外的空间.分配的策略是binary data的两倍大小,或者是256.
目前最快,最自然的方法是这样:

my_binary_to_list(<<H,T/binary>>) ->
    [H|my_binary_to_list(T)];
my_binary_to_list(<<>>) -> [].

 

   要求Single ProcBin或者Single Reference的情况,答案也很简单:二进制数据追加只需要继续分配空间移动指针即可,如果有多个指向ProcBin的指针,这种关系就很难维护了;有些情况下会触发强制拷贝,大多数情况下,二进制数据对象会同时收缩来回收之前预分配的空间;

看下面的例子:

Bin = <<Bin0,....>>

上面这个操作继续在Bin上面进行追加是低成本的,如果继续追加Bin0会导致强制拷贝一份Bin0的值;如果二进制数据对象作为消息发送到其它进程或者port,binary也会进行收缩处理,后续如果有追加操作也会复制新副本.


Bin1 = <<Bin0,...>>,
PortOrPid ! Bin1,
Bin = <<Bin1,...>>  %% Bin1 will be COPIED
这里Bin1就会使用复制的副本;同样的情况,插入二进制数据到ETS表或者通过erlang:port_command/2发送出去也会触发复制;

 

二进制数据进行match的时候也会触发收缩,下一个追加操作会触发数据拷贝:


Bin1 = <<Bin0,...>>,
<<X,Y,Z,T/binary>> = Bin1,
Bin = <<Bin1,...>>  %% Bin1 will be COPIED

这样处理的原因是match context里面包含了指向二进制数据对象的指针.

 

如果一个进程一直持有二进制数据比如放在loop data或者在进程字典,GC可能最终会选择收缩这些数据.如果只有一份这种数据,就不会触发收缩.如果进程之后在已经收缩过的数据上进行追加,二进制对象会重新分配空间用于数据追加;

 
 
 
检查进程的Bin相关信息:
 
 
{backtrace, Bin}
The binary Bin contains the same information as the output from erlang:process_display(Pid, backtrace). Use binary_to_list/1 to obtain the string of characters from the binary.

{binary, BinInfo}
BinInfo is a list containing miscellaneous information about binaries currently being referred to by this process. This InfoTuple may be changed or removed without prior notice.

{min_bin_vheap_size, MinBinVHeapSize}
MinBinVHeapSize is the minimum binary virtual heap size for the process.
 
 

官方文档链接:http://www.erlang.org/doc/efficiency_guide/binaryhandling.html

Internally, binaries and bitstrings are implemented in the same way.

There are four types of binary objects internally. Two of them are containers for binary data and two of them are merely references to a part of a binary.

 
The binary containers are called refc binaries (short for reference-counted binaries) and heap binaries.
Refc binaries consist of two parts: an object stored on the process heap, called a ProcBin, and the binary object itself stored outside all process heaps.The binary object can be referenced by any number of ProcBins from any number of processes; the object contains a reference counter to keep track of the number of references, so that it can be removed when the last reference disappears.
All ProcBin objects in a process are part of a linked list, so that the garbage collector can keep track of them and decrement the reference counters in the binary when a ProcBin disappears.

Heap binaries are small binaries, up to 64 bytes, that are stored directly on the process heap. They will be copied when the process is garbage collected and when they are sent as a message. They don't require any special handling by the garbage collector.

There are two types of reference objects that can reference part of a refc binary or heap binary. They are called sub binaries and match contexts.

A sub binary is created by split_binary/2 and when a binary is matched out in a binary pattern. A sub binary is a reference into a part of another binary (refc or heap binary, never into a another sub binary). Therefore, matching out a binary is relatively cheap because the actual binary data is never copied.
A match context is similar to a sub binary, but is optimized for binary matching; for instance, it contains a direct pointer to the binary data. For each field that is matched out of a binary, the position in the match context will be incremented.

 

相关资料:

[1] http://erlang.2086793.n4.nabble.com/Memory-Usage-td2117552.html

[2] Tuple memory size http://erlang.2086793.n4.nabble.com/Tuple-memory-size-td2112801.html