# 1. Overview of Compression Coding

## Expected Code Length

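$L = \sum\limits_{i = 1}^n {p_i*l_i}$

where $p_i$ is the probability of symbol $i$ and $l_i$ is the length of its codeword. In an optimal code, a more probable symbol never receives a longer codeword:

$\forall i,j \ \ \ p_i > p_j \Rightarrow l_i \le l_j$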
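
As a small numerical illustration of the two formulas above, the sketch below computes $L$ for a four-symbol source and shows that handing the short codeword to a less probable symbol increases $L$. The probabilities and lengths are made-up values chosen for the example, and `expected_length` is a hypothetical helper name.

```python
def expected_length(probs, lengths):
    """Return L = sum(p_i * l_i) for matching probability/length lists."""
    return sum(p * l for p, l in zip(probs, lengths))


# Hypothetical source, sorted from most to least probable symbol.
probs = [0.5, 0.25, 0.125, 0.125]
good_lengths = [1, 2, 3, 3]  # likelier symbols get shorter codewords
bad_lengths = [3, 2, 1, 3]   # lengths of the first and third symbol swapped

L_good = expected_length(probs, good_lengths)
L_bad = expected_length(probs, bad_lengths)
print(L_good, L_bad)
```

Swapping the two lengths leaves the set of codeword lengths unchanged, so only the assignment of lengths to probabilities accounts for the difference.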

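## Optimal Code

The codeword lengths of any binary prefix code must satisfy the Kraft inequality:

$\sum\limits_{i = 1}^n {{2^{ - l_i}}} \le 1$

The optimal code minimizes the expected length subject to this constraint: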
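
The inequality above (labelled `eq:kraft` in the source) can be checked directly for a candidate set of codeword lengths. The following is a minimal sketch; `kraft_sum` and `satisfies_kraft` are hypothetical helper names, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Return sum(2**-l) over the codeword lengths as an exact fraction."""
    return sum(Fraction(1, 2 ** l) for l in lengths)


def satisfies_kraft(lengths):
    """True if a binary prefix code with these lengths can exist."""
    return kraft_sum(lengths) <= 1


# Lengths of the prefix code 0, 10, 110, 111: the sum is exactly 1.
print(kraft_sum([1, 2, 3, 3]), satisfies_kraft([1, 2, 3, 3]))
# Two 1-bit codewords plus a 2-bit codeword: the sum is 5/4, so no prefix code exists.
print(kraft_sum([1, 1, 2]), satisfies_kraft([1, 1, 2]))
```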

\begin{align}
& \min \sum\limits_{i = 1}^n {{p_i}*{l_i}} \cr
& s.t. \ \sum {{2^{ - {l_i}}}} \le 1 \label{eq:kraft}
\end{align}

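Form the Lagrangian with multiplier $\lambda$, treating the constraint as an equality and relaxing the $l_i$ to real values:

$J = \sum\limits_{i} {p_i*l_i} + \lambda \left( \sum\limits_{i} {2^{ - l_i}} - 1 \right)$

Take the partial derivative with respect to $l_i$ and set it to zero:

${{\partial J} \over {\partial l_i}} = p_i - \lambda {2^{ - l_i}}\ln 2 = 0$

${2^{ - l_i}} = {{p_i} \over {\lambda \ln 2}}$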

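Substituting ${2^{ - l_i}} = {{p_i} \over {\lambda \ln 2}}$ into the constraint $\sum\limits_{i} {2^{ - l_i}} = 1$ and using $\sum\limits_{i} {p_i} = 1$ gives $\lambda = 1 / \ln 2$, hence

$l_i = - \log _{2}p_i$

The minimum expected length is therefore the entropy of the source:

$\sum\limits_{i} {p_i l_i} = - \sum\limits_{i} {p_i \log _{2} p_i} = H(A)$

The redundancy of a code is the gap between its expected length and the entropy:

$\rho = L - H(A)$

Because real codeword lengths must be integers, an optimal code satisfies

$H(A) \le L < H(A) + 1$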
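
The bound above can be checked numerically. One standard way to reach it, assumed here because the source does not spell out the argument, is to round the ideal lengths $-\log_2 p_i$ up to integers. The sketch below does this for a made-up source; `entropy` and `shannon_lengths` are hypothetical helper names.

```python
from math import ceil, log2


def entropy(probs):
    """Entropy H(A) in bits; zero-probability symbols contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def shannon_lengths(probs):
    """Integer lengths l_i = ceil(-log2 p_i), which satisfy the Kraft inequality."""
    return [ceil(-log2(p)) for p in probs]


probs = [0.4, 0.3, 0.2, 0.1]  # hypothetical source
lengths = shannon_lengths(probs)
H = entropy(probs)
L = sum(p * l for p, l in zip(probs, lengths))
rho = L - H  # redundancy of this particular code

print(lengths, round(H, 3), round(L, 3), round(rho, 3))
```

These rounded lengths are generally not optimal. They only show that some code lands within one bit of the entropy, so the optimal code must do so as well.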

# 2. Huffman Coding

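Huffman coding uses a min-heap to speed up construction of the binary code tree, and it guarantees that the codeword of a less probable symbol is never shorter than that of a more probable one. The encoding procedure is:

1. Build a min-heap over the symbols of the alphabet, keyed on their probabilities;
2. Pop the top of the heap twice and merge the two nodes into a binary tree whose root key is the sum of their probabilities; then insert the new tree back into the min-heap as a single node;
3. Repeat step 2 until only one node remains in the heap; that node is the code tree.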
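
The three steps can be traced on a tiny source. The sketch below is an illustration only: it tracks just the depth of each symbol in the tree (its codeword length), not the codewords themselves. The function name `huffman_code_lengths`, the tie-breaking counter, and the probabilities are assumptions made for this example.

```python
import heapq
from itertools import count


def huffman_code_lengths(probs):
    """Return {symbol: codeword length} by following the three steps above."""
    tiebreak = count()  # unique counter so the heap never compares the dicts
    # Step 1: min-heap keyed on probability; each entry carries {symbol: depth}.
    heap = [(p, next(tiebreak), {sym: 0}) for sym, p in probs.items()]
    heapq.heapify(heap)
    # Steps 2 and 3: merge the two least probable nodes until one tree remains.
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        # Every symbol under the new root moves one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]


probs = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}  # hypothetical source
lengths = huffman_code_lengths(probs)
print(lengths)
```

For this source every probability is a power of two, so the resulting lengths equal $-\log_2 p_i$ and the expected length matches the entropy exactly.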

The following Python 3.6 implementation of Huffman coding is adapted from Rosetta Code.

```python
# -*- coding: utf-8 -*-
# @Time    : 2017/1/21
# @Author  : rain
from collections import Counter
from heapq import heapify, heappop, heappush


def huffman_coding(message):
    """Return a dict mapping each character of `message` to its Huffman code."""
    freq = Counter(message)  # occurrence count of every character
    # Each heap entry is [count, [char, code], [char, code], ...]
    min_heap = [[cnt, [ch, '']] for ch, cnt in freq.items()]
    heapify(min_heap)
    while len(min_heap) > 1:
        low1 = heappop(min_heap)  # least frequent subtree
        low2 = heappop(min_heap)  # second least frequent subtree
        # Prepend (not append) the new bit: the tree is built bottom-up, so
        # bits added later sit closer to the root and must come first.
        for pair in low1[1:]:
            pair[1] = '0' + pair[1]
        for pair in low2[1:]:
            pair[1] = '1' + pair[1]
        # push [low1_cnt + low2_cnt, low1's leaves, low2's leaves]
        heappush(min_heap, [low1[0] + low2[0]] + low1[1:] + low2[1:])
    vocabulary = {pair[0]: pair[1] for pair in min_heap[0][1:]}  # char -> code
    return vocabulary


sentence = 'this is an example for huffman encoding'
print(huffman_coding(sentence))
```
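
The dictionary returned by `huffman_coding` maps each character to a bit string. As a sketch of how such a table might be used, the example below encodes and decodes a short message. To stay self-contained it uses a small hand-written prefix-code table rather than actual output of the function above; `encode` and `decode` are hypothetical helper names.

```python
def encode(message, vocabulary):
    """Concatenate the codeword of every character in the message."""
    return ''.join(vocabulary[ch] for ch in message)


def decode(bits, vocabulary):
    """Decode greedily; this works because no codeword is a prefix of another."""
    reverse = {code: ch for ch, code in vocabulary.items()}
    decoded, current = [], ''
    for bit in bits:
        current += bit
        if current in reverse:  # a complete codeword has been read
            decoded.append(reverse[current])
            current = ''
    return ''.join(decoded)


# Hypothetical prefix-code table, written by hand for illustration.
vocabulary = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
bits = encode('abacad', vocabulary)
print(bits)
print(decode(bits, vocabulary))
```

Greedy decoding relies on the prefix property, which is why the order in which bits are attached to the codewords during tree construction matters.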



Posted 2015-11-20 11:26 by Treant