# 1. Introduction

LZ77 is a dictionary-based data compression algorithm. It was proposed by two Israeli researchers, Jacob Ziv and Abraham Lempel, in their 1977 paper "A Universal Algorithm for Sequential Data Compression".

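In the example, the fragments 取之以, 仁义, 者, 守之以, 也 and 诈力 (along with the punctuation marks) each appear more than once, so every repeated occurrence can be represented simply by pointing to the position where the fragment appeared earlier. To specify that position we define a relative offset, as shown in the figure (not reproduced here).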

# 2. Principle

## Sliding window

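Let $S$ be the input string of length $N$, let $S_{a,b}$ denote the substring from position $a$ to position $b$, and let the cursor sit just after position $j$, so that $S_{1,j}$ has already been encoded. For a candidate start position $i$, the length of the longest match with the text following the cursor is

$L_i^j = \max \{ l \mid S_{i,i+l-1} = S_{j+1,j+l} \} \quad \text{subject to} \quad l \le N-j$

and the start position of the best match is

$p^j = \mathop {\arg \max }\limits_{i} \{ L_i^j \} \quad \text{subject to} \quad 1 \le i \le j$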

• $L_1^j = 1$, because $S_{j+1,j+1} = S_{1,1}$ but $S_{j+1,j+2} \ne S_{1,2}$;
• $L_2^j = 4$, because $S_{j+1,j+1} = S_{2,2}$, $S_{j+1,j+2} = S_{2,3}$, $S_{j+1,j+3} = S_{2,4}$ and $S_{j+1,j+4} = S_{2,5}$, but $S_{j+1,j+5} \ne S_{2,6}$;
• $L_3^j = 0$, because $S_{j+1,j+1} \ne S_{3,3}$.
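The two definitions above can be evaluated directly. The following is only a minimal sketch: the function names `match_length` and `best_position` and the sample string are assumptions made for illustration, and positions are 1-indexed to follow the formulas rather than Python's own indexing.

```python
def match_length(s, i, j):
    """L_i^j: the largest l with S[i..i+l-1] == S[j+1..j+l] (1-indexed), l <= N - j."""
    n = len(s)
    l = 0
    # s[k - 1] is the 1-indexed character S_k; the match may run past position j
    while l < n - j and s[i - 1 + l] == s[j + l]:
        l += 1
    return l


def best_position(s, j):
    """p^j: the start position i in 1..j with the largest L_i^j (first one on ties)."""
    return max(range(1, j + 1), key=lambda i: match_length(s, i, j))


if __name__ == '__main__':
    s = 'aacaacabcabaaac'  # hypothetical sample string
    j = 3                  # 'aac' has already been encoded
    print([match_length(s, i, j) for i in range(1, j + 1)])  # [4, 1, 0]
    print(best_position(s, j))                               # 1
```

Note that this sketch does not cap $l$ at the lookahead buffer size; a practical encoder would add that limit.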

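## Compression

The triple $(p, l, c)$ describes the longest match found for the string in the lookahead buffer, where

• $p$ is the position in the dictionary at which the longest match starts, measured relative to the cursor;
• $l$ is the length of the longest match;
• $c$ is the character in the lookahead buffer that immediately follows the longest match.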

    Repeat:
        Output (p, l, c)
        Cursor <- Cursor + l + 1
    Until the end of the string
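The loop above can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not a reference encoder: the name `lz77_encode`, the default sizes, and the convention that `cursor` is the index of the next unencoded character are all choices made here, and the search is a plain brute-force scan of the window.

```python
def lz77_encode(text, window_size=6, lookahead_size=4):
    """Return a list of (p, l, c) triples for text."""
    cursor = 0  # index of the next character to encode (assumed convention)
    triples = []
    while cursor < len(text):
        best_p, best_l = 0, 0
        # keep one character in reserve so that c always exists
        max_l = min(lookahead_size, len(text) - cursor - 1)
        for start in range(max(0, cursor - window_size), cursor):
            l = 0
            # the comparison may run past the cursor (overlapping match)
            while l < max_l and text[start + l] == text[cursor + l]:
                l += 1
            if l > best_l:
                best_p, best_l = cursor - start, l
        triples.append((best_p, best_l, text[cursor + best_l]))
        cursor += best_l + 1
    return triples


if __name__ == '__main__':
    print(lz77_encode('aacaacabcabaaac'))
    # [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]
```

Reserving the last character for $c$ means the final triple never needs a special end-of-input marker, at the cost of occasionally emitting a slightly shorter match.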


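## Decompression

Each triple $(p, l, c)$ is decoded according to one of three cases; in every case $c$ is appended after the copied characters:

• $p = 0$ and $l = 0$: no match (as at the very start), so $c$ is output directly;
• $p \ge l$: the match lies entirely inside the dictionary, so copy the $l$ characters that begin $p$ positions before the cursor;
• $p < l$: the match runs past the cursor into the text being decoded (a cyclic match), so the characters must be copied one at a time from left to right. In pseudocode, with cursor the index of the next character to write:

    for (k = 0; k < l; k++)
        out[cursor + k] = out[cursor - p + k]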
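When characters are copied one at a time, the three decoding cases collapse into a single loop, because each copied character becomes available as a source for the next. The sketch below assumes that simplification; the name `lz77_decode` and the sample triples are illustrative only.

```python
def lz77_decode(triples):
    """Rebuild the text from (p, l, c) triples, copying one character at a time."""
    out = []
    for p, l, c in triples:
        start = len(out) - p  # p positions before the cursor
        for k in range(l):
            # out grows as we copy, so start + k is valid even when p < l
            out.append(out[start + k])
        out.append(c)
    return ''.join(out)


if __name__ == '__main__':
    triples = [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]
    print(lz77_decode(triples))  # aacaacabcabaaac
```

The triple `(3, 4, 'b')` is the overlapping case: only three characters exist when decoding starts, and the fourth copied character is one that the same triple has just produced.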


# 3. Implementation

For a bitarray-based implementation, see A Python LZ77-Compressor; a simple Python implementation is given below.

    # coding=utf-8


    class LZ77:
        """
        A simplified implementation of the LZ77 algorithm
        """

        def __init__(self, window_size):
            self.window_size = window_size  # size of the dictionary (search window)
            self.buffer_size = 4            # size of the lookahead buffer

        def longest_match(self, data, cursor):
            """
            find the longest match between the dictionary and the lookahead buffer;
            cursor is the index of the last encoded character (-1 at the start)
            """
            # stop one character early so that c = data[j + 1] always exists
            end_buffer = min(cursor + self.buffer_size, len(data) - 2)
            start_index = max(0, cursor - self.window_size + 1)

            p = -1
            l = -1
            c = ''

            for j in range(cursor + 1, end_buffer + 1):
                substring = data[cursor + 1:j + 1]

                for i in range(start_index, cursor + 1):
                    # a match may run past the cursor, so repeat data[i:cursor + 1]
                    repetition = len(substring) // (cursor - i + 1)
                    last = len(substring) % (cursor - i + 1)
                    matchedstring = data[i:cursor + 1] * repetition + data[i:i + last]

                    if matchedstring == substring and len(substring) > l:
                        p = cursor - i + 1
                        l = len(substring)
                        c = data[j + 1]

            # no match: emit the next character as a literal
            if p == -1 and l == -1:
                return 0, 0, data[cursor + 1]
            return p, l, c

        def compress(self, message):
            """
            compress message
            :return: tuples (p, l, c)
            """
            i = -1
            out = []

            # the cursor moves until it reaches the end of message
            while i < len(message) - 1:
                (p, l, c) = self.longest_match(message, i)
                out.append((p, l, c))
                i += (l + 1)
            return out

        def decompress(self, compressed):
            """
            decompress the compressed message
            :param compressed: tuples (p, l, c)
            :return: decompressed message
            """
            cursor = -1
            out = ''

            for (p, l, c) in compressed:
                # no match: the tuple carries only the literal c
                if p == 0 and l == 0:
                    out += c
                # the match lies entirely inside the dictionary: copy l characters
                elif p >= l:
                    start = cursor - p + 1
                    out += (out[start:start + l] + c)
                # the match overlaps the cursor: repeat the last p characters
                elif p < l:
                    repetition = l // p
                    last = l % p
                    start = cursor - p + 1
                    out += (out[start:cursor + 1] * repetition + out[start:start + last] + c)
                cursor += (l + 1)

            return out


    if __name__ == '__main__':
        compressor = LZ77(6)
        origin = 'aacaacabcabaaac'
        pack = compressor.compress(origin)
        unpack = compressor.decompress(pack)
        print(pack)
        print(unpack)
        print(unpack == origin)


# 4. References

[1] Ziv, Jacob, and Abraham Lempel. "A universal algorithm for sequential data compression." IEEE Transactions on Information Theory 23.3 (1977): 337-343.
[2] guyb, 15-853: Algorithms in the Real World.

posted @ 2015-11-24 21:12 Treant