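10-2 Huffman Coding

Huffman Coding

Huffman coding is a variable-length prefix code built from character frequencies. Its core idea: characters that occur often get short codes and characters that occur rarely get long codes, which minimizes the total encoded length.

Huffman coding is a greedy algorithm: at each step it merges the two nodes with the lowest frequency. The resulting code has the prefix property: no character's code is a prefix of another character's code, so decoding is unambiguous.

Huffman coding is widely used in file compression (ZIP, gzip), image compression (JPEG), and multimedia coding.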


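Prefix Code

A fixed-length code assigns every character the same number of bits. For example, 4 characters need 2 bits each:

Char   Fixed-length code
  A      00
  B      01
  C      10
  D      11

If A occurs far more often than D in a text, a fixed-length code wastes space. Huffman coding solves this with variable-length codes:

Char   Freq   Huffman code
  A     5       0
  B     2      10
  C     1     110
  D     1     111

The prefix property guarantees that a 0 at the start of a codeword can only be A and a 10 can only be B, so there is no ambiguity.

Code length comparison (using "AABACABAD", where A=5, B=2, C=1, D=1):

Fixed-length code: 9 characters × 2 bits = 18 bits
Huffman code: 5×1 + 2×2 + 1×3 + 1×3 = 15 bits
Savings: 3 bits (16.7%)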
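
The comparison can be reproduced with a few lines of Python. This is only a sketch: the code table is copied by hand from the table above, and the variable names are illustrative.

```python
from collections import Counter
from math import ceil, log2

text = "AABACABAD"
# Code table copied from the example above (not computed here)
codes = {"A": "0", "B": "10", "C": "110", "D": "111"}

freq = Counter(text)

# A fixed-length code needs ceil(log2(k)) bits for k distinct characters
fixed_width = ceil(log2(len(freq)))
fixed_bits = len(text) * fixed_width

# A variable-length code costs frequency × code length per character
huffman_bits = sum(freq[ch] * len(codes[ch]) for ch in freq)

saving = (fixed_bits - huffman_bits) / fixed_bits
print(fixed_bits, huffman_bits, f"{saving:.1%}")  # prints: 18 15 16.7%
```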

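Building the Huffman Tree

Huffman codes are generated by building a Huffman tree. The construction goes as follows:

  1. Make each character a leaf node whose weight is its frequency
  2. Take the two nodes with the lowest frequency and merge them into a new node (frequency = sum of the two children's frequencies)
  3. Put the new node back into the set
  4. Repeat steps 2-3 until only one node, the root, remains

Taking the character frequencies A=5, B=2, C=1, D=1 as an example:

Initial nodes (sorted by frequency):
  C(1)  D(1)  B(2)  A(5)

Step 1: merge C(1) and D(1) → N1(2)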
  B(2)  N1(2)  A(5)

      N1(2)
      /    \
    C(1)  D(1)

Step 2: merge B(2) and N1(2) → N2(4)
  N2(4)  A(5)

        N2(4)
       /    \
     B(2)  N1(2)
           /    \
         C(1)  D(1)

Step 3: merge N2(4) and A(5) → Root(9)

          Root(9)
          /    \
       A(5)   N2(4)
              /    \
            B(2)  N1(2)
                  /    \
                C(1)  D(1)

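Starting from the root, label each left branch 0 and each right branch 1 to read off every character's code (which child goes on the left is an arbitrary choice; this walkthrough puts A on the left):

A: 0       (left)
B: 10      (right→left)
C: 110     (right→right→left)
D: 111     (right→right→right)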
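
The merge steps can be traced with a small heap-based sketch. The function name `huffman_codes` and the insertion-order tie-breaker are assumptions made for this illustration, and the sketch assumes at least two distinct characters. Because it places the first-popped (lighter) subtree on the left, it yields the mirror image of the tree drawn above: the code lengths are the same, but A receives 1 instead of 0.

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Return (merge_log, codes) for a {symbol: frequency} mapping."""
    order = count()  # tie-breaker so the heap never compares subtrees
    # Heap entry: (frequency, insertion order, subtree).
    # A subtree is either a symbol (leaf) or a (left, right) tuple.
    heap = [(f, next(order), sym) for sym, f in freq.items()]
    heapq.heapify(heap)

    merges = []
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)   # lowest frequency
        f2, _, b = heapq.heappop(heap)   # second lowest
        merges.append((f1, f2, f1 + f2))
        heapq.heappush(heap, (f1 + f2, next(order), (a, b)))

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf
            codes[node] = prefix
    walk(heap[0][2], "")
    return merges, codes

merges, codes = huffman_codes({"A": 5, "B": 2, "C": 1, "D": 1})
print(merges)  # [(1, 1, 2), (2, 2, 4), (4, 5, 9)]
print({ch: len(code) for ch, code in sorted(codes.items())})
```

The three entries in `merges` correspond to Step 1, Step 2 and Step 3 above.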

Encoding

With the code table in hand, convert the text to binary one character at a time:

Text: A A B A C A B A D

A → 0
A → 0
B → 10
A → 0
C → 110
A → 0
B → 10
A → 0
D → 111

Encoded: 0 0 10 0 110 0 10 0 111
         = 001001100100111 (15 bits)
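
The result above is a string of '0' and '1' characters, which is convenient for reading but takes one byte per bit. A real compressor would pack the bits into bytes. The helpers below are a hypothetical sketch of that step, not part of the original listings; they zero-pad the last byte and return the padding count so the decoder knows how many trailing bits to ignore.

```python
def pack_bits(bits):
    """Pack a string of '0'/'1' into bytes, zero-padding the last byte."""
    padding = (-len(bits)) % 8
    padded = bits + "0" * padding
    data = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return data, padding

def unpack_bits(data, padding):
    """Inverse of pack_bits: rebuild the bit string and drop the padding."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return bits[:len(bits) - padding] if padding else bits

packed, padding = pack_bits("001001100100111")
print(packed.hex(), padding)  # prints: 264e 1
```

The 15-bit example fits in 2 bytes, compared with 9 bytes for the original ASCII text.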

C++ implementation

#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
using namespace std;

struct HuffmanNode {
    char ch;
    int freq;
    HuffmanNode* left;
    HuffmanNode* right;

    HuffmanNode(char c, int f) : ch(c), freq(f), left(nullptr), right(nullptr) {}
};

struct CompareNode {
    bool operator()(HuffmanNode* a, HuffmanNode* b) {
        return a->freq > b->freq;  // Min-heap: smallest freq on top
    }
};

// Build Huffman tree from frequency table
HuffmanNode* buildHuffmanTree(const unordered_map<char, int>& freq) {
    priority_queue<HuffmanNode*, vector<HuffmanNode*>, CompareNode> pq;

    for (auto& [ch, f] : freq)
        pq.push(new HuffmanNode(ch, f));

    while (pq.size() > 1) {
        HuffmanNode* left = pq.top(); pq.pop();
        HuffmanNode* right = pq.top(); pq.pop();

        HuffmanNode* parent = new HuffmanNode('\0', left->freq + right->freq);
        parent->left = left;
        parent->right = right;
        pq.push(parent);
    }

    return pq.top();
}

// Generate code table by traversing the tree
void generateCodes(HuffmanNode* node, string code,
                   unordered_map<char, string>& codes) {
    if (!node) return;
    if (!node->left && !node->right) {  // Leaf node
        codes[node->ch] = code;
        return;
    }
    generateCodes(node->left, code + "0", codes);
    generateCodes(node->right, code + "1", codes);
}

// Encode text using Huffman codes
string encode(const string& text, unordered_map<char, string>& codes) {
    string result;
    for (char c : text)
        result += codes[c];
    return result;
}

int main() {
    string text = "AABACABAD";

    // Count frequencies
    unordered_map<char, int> freq;
    for (char c : text) freq[c]++;

    // Build tree and generate codes
    HuffmanNode* root = buildHuffmanTree(freq);
    unordered_map<char, string> codes;
    generateCodes(root, "", codes);

    // Print code table
    cout << "Huffman Codes:" << endl;
    for (auto& [ch, code] : codes)
        cout << "  " << ch << " (freq=" << freq[ch] << "): " << code << endl;

    // Encode
    string encoded = encode(text, codes);
    cout << "\nOriginal text: " << text << endl;
    cout << "Encoded:       " << encoded << " (" << encoded.length() << " bits)" << endl;
    cout << "Fixed-length:  " << text.length() * 2 << " bits" << endl;

    return 0;
}

C implementation

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct HuffmanNode {
    char ch;
    int freq;
    struct HuffmanNode* left;
    struct HuffmanNode* right;
} HuffmanNode;

// Min-heap for Huffman nodes
typedef struct {
    HuffmanNode** data;
    int size;
    int capacity;
} MinHeap;

MinHeap* heap_create(int capacity) {
    MinHeap* h = (MinHeap*)malloc(sizeof(MinHeap));
    h->data = (HuffmanNode**)malloc(capacity * sizeof(HuffmanNode*));
    h->size = 0;
    h->capacity = capacity;
    return h;
}

void heap_swap(HuffmanNode** a, HuffmanNode** b) {
    HuffmanNode* tmp = *a; *a = *b; *b = tmp;
}

void heap_push(MinHeap* h, HuffmanNode* node) {
    int i = h->size++;
    h->data[i] = node;
    while (i > 0) {
        int parent = (i - 1) / 2;
        if (h->data[i]->freq >= h->data[parent]->freq) break;
        heap_swap(&h->data[i], &h->data[parent]);
        i = parent;
    }
}

HuffmanNode* heap_pop(MinHeap* h) {
    HuffmanNode* top = h->data[0];
    h->data[0] = h->data[--h->size];
    int i = 0;
    while (1) {
        int smallest = i;
        int left = 2 * i + 1, right = 2 * i + 2;
        if (left < h->size && h->data[left]->freq < h->data[smallest]->freq)
            smallest = left;
        if (right < h->size && h->data[right]->freq < h->data[smallest]->freq)
            smallest = right;
        if (smallest == i) break;
        heap_swap(&h->data[i], &h->data[smallest]);
        i = smallest;
    }
    return top;
}

HuffmanNode* node_create(char ch, int freq) {
    HuffmanNode* n = (HuffmanNode*)malloc(sizeof(HuffmanNode));
    n->ch = ch;
    n->freq = freq;
    n->left = n->right = NULL;
    return n;
}

// Code table storage
char code_table[256][256];
int code_lengths[256];

void generate_codes(HuffmanNode* node, char* buffer, int depth) {
    if (!node) return;
    if (!node->left && !node->right) {
        buffer[depth] = '\0';
        strcpy(code_table[(unsigned char)node->ch], buffer);
        code_lengths[(unsigned char)node->ch] = depth;
        return;
    }
    buffer[depth] = '0';
    generate_codes(node->left, buffer, depth + 1);
    buffer[depth] = '1';
    generate_codes(node->right, buffer, depth + 1);
}

int main() {
    const char* text = "AABACABAD";
    int len = strlen(text);

    // Count frequencies
    int freq[256] = {0};
    for (int i = 0; i < len; i++)
        freq[(unsigned char)text[i]]++;

    // Build Huffman tree
    MinHeap* heap = heap_create(256);
    for (int i = 0; i < 256; i++)
        if (freq[i] > 0)
            heap_push(heap, node_create((char)i, freq[i]));

    while (heap->size > 1) {
        HuffmanNode* left = heap_pop(heap);
        HuffmanNode* right = heap_pop(heap);
        HuffmanNode* parent = node_create('\0', left->freq + right->freq);
        parent->left = left;
        parent->right = right;
        heap_push(heap, parent);
    }

    HuffmanNode* root = heap_pop(heap);
    free(heap->data);  // The heap is no longer needed once the tree is built
    free(heap);

    // Generate codes
    memset(code_table, 0, sizeof(code_table));
    char buffer[256];
    generate_codes(root, buffer, 0);

    // Print code table
    printf("Huffman Codes:\n");
    for (int i = 0; i < 256; i++) {
        if (freq[i] > 0)
            printf("  %c (freq=%d): %s\n", (char)i, freq[i], code_table[i]);
    }

    // Encode
    printf("\nOriginal text: %s\n", text);
    printf("Encoded:       ");
    int total_bits = 0;
    for (int i = 0; i < len; i++) {
        printf("%s", code_table[(unsigned char)text[i]]);
        total_bits += code_lengths[(unsigned char)text[i]];
    }
    printf(" (%d bits)\n", total_bits);
    printf("Fixed-length:  %d bits\n", len * 2);

    return 0;
}

Python implementation

import heapq
from collections import defaultdict

class HuffmanNode:
    def __init__(self, ch=None, freq=0, left=None, right=None):
        self.ch = ch
        self.freq = freq
        self.left = left
        self.right = right

    def __lt__(self, other):
        return self.freq < other.freq


def build_huffman_tree(freq):
    heap = [HuffmanNode(ch=ch, freq=f) for ch, f in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        parent = HuffmanNode(freq=left.freq + right.freq, left=left, right=right)
        heapq.heappush(heap, parent)

    return heap[0]


def generate_codes(node, code="", codes=None):
    if codes is None:
        codes = {}
    if node is None:
        return codes
    if node.ch is not None:  # Leaf node
        codes[node.ch] = code
        return codes
    generate_codes(node.left, code + "0", codes)
    generate_codes(node.right, code + "1", codes)
    return codes


def encode(text, codes):
    return "".join(codes[c] for c in text)


text = "AABACABAD"

# Count frequencies
freq = defaultdict(int)
for c in text:
    freq[c] += 1

# Build tree and generate codes
root = build_huffman_tree(freq)
codes = generate_codes(root)

print("Huffman Codes:")
for ch in sorted(codes):
    print(f"  {ch} (freq={freq[ch]}): {codes[ch]}")

encoded = encode(text, codes)
print(f"\nOriginal text: {text}")
print(f"Encoded:       {encoded} ({len(encoded)} bits)")
print(f"Fixed-length:  {len(text) * 2} bits")

Go implementation

package main

import (
	"container/heap"
	"fmt"
)

type HuffmanNode struct {
	ch    byte
	freq  int
	left  *HuffmanNode
	right *HuffmanNode
}

type MinHeap []*HuffmanNode

func (h MinHeap) Len() int           { return len(h) }
func (h MinHeap) Less(i, j int) bool { return h[i].freq < h[j].freq }
func (h MinHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *MinHeap) Push(x interface{}) {
	*h = append(*h, x.(*HuffmanNode))
}

func (h *MinHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

func buildHuffmanTree(freq map[byte]int) *HuffmanNode {
	h := &MinHeap{}
	heap.Init(h)

	for ch, f := range freq {
		heap.Push(h, &HuffmanNode{ch: ch, freq: f})
	}

	for h.Len() > 1 {
		left := heap.Pop(h).(*HuffmanNode)
		right := heap.Pop(h).(*HuffmanNode)
		parent := &HuffmanNode{freq: left.freq + right.freq, left: left, right: right}
		heap.Push(h, parent)
	}

	return heap.Pop(h).(*HuffmanNode)
}

func generateCodes(node *HuffmanNode, code string, codes map[byte]string) {
	if node == nil {
		return
	}
	if node.left == nil && node.right == nil {
		codes[node.ch] = code
		return
	}
	generateCodes(node.left, code+"0", codes)
	generateCodes(node.right, code+"1", codes)
}

func encode(text string, codes map[byte]string) string {
	result := ""
	for i := 0; i < len(text); i++ {
		result += codes[text[i]]
	}
	return result
}

func main() {
	text := "AABACABAD"

	// Count frequencies
	freq := make(map[byte]int)
	for i := 0; i < len(text); i++ {
		freq[text[i]]++
	}

	// Build tree and generate codes
	root := buildHuffmanTree(freq)
	codes := make(map[byte]string)
	generateCodes(root, "", codes)

	fmt.Println("Huffman Codes:")
	for ch, code := range codes {
		fmt.Printf("  %c (freq=%d): %s\n", ch, freq[ch], code)
	}

	encoded := encode(text, codes)
	fmt.Printf("\nOriginal text: %s\n", text)
	fmt.Printf("Encoded:       %s (%d bits)\n", encoded, len(encoded))
	fmt.Printf("Fixed-length:  %d bits\n", len(text)*2)
}

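Running any of these programs prints output like the following. The exact bit patterns depend on how ties between equal frequencies are broken. Because the code makes the first-popped (lighter) node the left child, A ends up as the right child of the root and receives the code 1 rather than the 0 shown in the hand-built tree. The code lengths and the 15-bit total are the same either way. The C++ and Go versions iterate over hash maps, so their table rows may appear in a different order. One possible output:

Huffman Codes:
  A (freq=5): 1
  B (freq=2): 00
  C (freq=1): 010
  D (freq=1): 011

Original text: AABACABAD
Encoded:       110010101001011 (15 bits)
Fixed-length:  18 bits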

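Decoding

Decoding walks the Huffman tree from the root, reading the encoded bits one at a time:

  • On 0, go to the left child
  • On 1, go to the right child
  • On reaching a leaf, output its character and return to the root

Taking the encoding 001001100100111 as an example:

0 → left → leaf A                                output: A
0 → left → leaf A                                output: A
1 → right, 0 → left → leaf B                     output: B
0 → left → leaf A                                output: A
1 → right, 1 → right, 0 → left → leaf C          output: C
0 → left → leaf A                                output: A
1 → right, 0 → left → leaf B                     output: B
0 → left → leaf A                                output: A
1 → right, 1 → right, 1 → right → leaf D         output: D

Decoded: AABACABAD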
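
The same walk can be done without a tree, using only the code table: accumulate bits until they match a codeword, emit the character, and start over. The prefix property is what makes the first match the right one. The function `decode_with_table` below is a hypothetical sketch for illustration; the implementations that follow walk the tree instead.

```python
def decode_with_table(bits, codes):
    """Decode a '0'/'1' string using a prefix-free code table."""
    lookup = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:        # a complete codeword has been read
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(out)

# Code table from the worked example
codes = {"A": "0", "B": "10", "C": "110", "D": "111"}
decoded = decode_with_table("001001100100111", codes)
print(decoded)  # prints: AABACABAD
```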

C++ implementation

#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
using namespace std;

struct HuffmanNode {
    char ch;
    int freq;
    HuffmanNode* left;
    HuffmanNode* right;
    HuffmanNode(char c, int f) : ch(c), freq(f), left(nullptr), right(nullptr) {}
};

struct CompareNode {
    bool operator()(HuffmanNode* a, HuffmanNode* b) {
        return a->freq > b->freq;
    }
};

HuffmanNode* buildTree(const unordered_map<char, int>& freq) {
    priority_queue<HuffmanNode*, vector<HuffmanNode*>, CompareNode> pq;
    for (auto& [ch, f] : freq)
        pq.push(new HuffmanNode(ch, f));
    while (pq.size() > 1) {
        HuffmanNode* left = pq.top(); pq.pop();
        HuffmanNode* right = pq.top(); pq.pop();
        HuffmanNode* parent = new HuffmanNode('\0', left->freq + right->freq);
        parent->left = left;
        parent->right = right;
        pq.push(parent);
    }
    return pq.top();
}

void generateCodes(HuffmanNode* node, string code,
                   unordered_map<char, string>& codes) {
    if (!node) return;
    if (!node->left && !node->right) { codes[node->ch] = code; return; }
    generateCodes(node->left, code + "0", codes);
    generateCodes(node->right, code + "1", codes);
}

// Decode by traversing the tree bit by bit
string decode(HuffmanNode* root, const string& encoded) {
    string result;
    HuffmanNode* current = root;

    for (char bit : encoded) {
        if (bit == '0') current = current->left;
        else            current = current->right;

        if (!current->left && !current->right) {
            result += current->ch;
            current = root;  // Reset to root for next character
        }
    }

    return result;
}

int main() {
    string text = "AABACABAD";

    unordered_map<char, int> freq;
    for (char c : text) freq[c]++;

    HuffmanNode* root = buildTree(freq);

    unordered_map<char, string> codes;
    generateCodes(root, "", codes);

    // Encode
    string encoded;
    for (char c : text) encoded += codes[c];
    cout << "Encoded: " << encoded << endl;

    // Decode
    string decoded = decode(root, encoded);
    cout << "Decoded: " << decoded << endl;

    cout << "Match:   " << (text == decoded ? "yes" : "no") << endl;

    return 0;
}

C implementation

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct HuffmanNode {
    char ch;
    int freq;
    struct HuffmanNode* left;
    struct HuffmanNode* right;
} HuffmanNode;

typedef struct {
    HuffmanNode** data;
    int size;
    int capacity;
} MinHeap;

MinHeap* heap_create(int cap) {
    MinHeap* h = malloc(sizeof(MinHeap));
    h->data = malloc(cap * sizeof(HuffmanNode*));
    h->size = 0; h->capacity = cap;
    return h;
}

void heap_push(MinHeap* h, HuffmanNode* node) {
    int i = h->size++;
    h->data[i] = node;
    while (i > 0) {
        int p = (i - 1) / 2;
        if (h->data[i]->freq >= h->data[p]->freq) break;
        HuffmanNode* tmp = h->data[i]; h->data[i] = h->data[p]; h->data[p] = tmp;
        i = p;
    }
}

HuffmanNode* heap_pop(MinHeap* h) {
    HuffmanNode* top = h->data[0];
    h->data[0] = h->data[--h->size];
    int i = 0;
    while (1) {
        int s = i, l = 2*i+1, r = 2*i+2;
        if (l < h->size && h->data[l]->freq < h->data[s]->freq) s = l;
        if (r < h->size && h->data[r]->freq < h->data[s]->freq) s = r;
        if (s == i) break;
        HuffmanNode* tmp = h->data[i]; h->data[i] = h->data[s]; h->data[s] = tmp;
        i = s;
    }
    return top;
}

HuffmanNode* build_tree(int* freq) {
    MinHeap* h = heap_create(256);
    for (int i = 0; i < 256; i++)
        if (freq[i] > 0) {
            HuffmanNode* n = malloc(sizeof(HuffmanNode));
            n->ch = (char)i; n->freq = freq[i]; n->left = n->right = NULL;
            heap_push(h, n);
        }
    while (h->size > 1) {
        HuffmanNode* left = heap_pop(h);
        HuffmanNode* right = heap_pop(h);
        HuffmanNode* parent = malloc(sizeof(HuffmanNode));
        parent->ch = '\0'; parent->freq = left->freq + right->freq;
        parent->left = left; parent->right = right;
        heap_push(h, parent);
    }
    HuffmanNode* root = heap_pop(h);
    free(h->data); free(h);
    return root;
}

char code_table[256][256];
int code_lens[256];

void gen_codes(HuffmanNode* node, char* buf, int depth) {
    if (!node) return;
    if (!node->left && !node->right) {
        buf[depth] = '\0';
        strcpy(code_table[(unsigned char)node->ch], buf);
        code_lens[(unsigned char)node->ch] = depth;
        return;
    }
    buf[depth] = '0'; gen_codes(node->left, buf, depth + 1);
    buf[depth] = '1'; gen_codes(node->right, buf, depth + 1);
}

// Decode by traversing tree
void decode(HuffmanNode* root, const char* encoded, char* output) {
    int out_idx = 0;
    HuffmanNode* cur = root;
    for (int i = 0; encoded[i]; i++) {
        cur = (encoded[i] == '0') ? cur->left : cur->right;
        if (!cur->left && !cur->right) {
            output[out_idx++] = cur->ch;
            cur = root;
        }
    }
    output[out_idx] = '\0';
}

int main() {
    const char* text = "AABACABAD";
    int len = strlen(text);

    int freq[256] = {0};
    for (int i = 0; i < len; i++) freq[(unsigned char)text[i]]++;

    HuffmanNode* root = build_tree(freq);

    memset(code_table, 0, sizeof(code_table));
    memset(code_lens, 0, sizeof(code_lens));
    char buf[256];
    gen_codes(root, buf, 0);

    // Encode
    char encoded[1024] = "";
    for (int i = 0; i < len; i++)
        strcat(encoded, code_table[(unsigned char)text[i]]);
    printf("Encoded: %s\n", encoded);

    // Decode
    char decoded[1024];
    decode(root, encoded, decoded);
    printf("Decoded: %s\n", decoded);

    printf("Match:   %s\n", strcmp(text, decoded) == 0 ? "yes" : "no");

    return 0;
}

Python implementation

import heapq
from collections import defaultdict

class HuffmanNode:
    def __init__(self, ch=None, freq=0, left=None, right=None):
        self.ch = ch
        self.freq = freq
        self.left = left
        self.right = right
    def __lt__(self, other):
        return self.freq < other.freq

def build_tree(freq):
    h = [HuffmanNode(ch=ch, freq=f) for ch, f in freq.items()]
    heapq.heapify(h)
    while len(h) > 1:
        left, right = heapq.heappop(h), heapq.heappop(h)
        heapq.heappush(h, HuffmanNode(freq=left.freq + right.freq,
                                      left=left, right=right))
    return h[0]

def gen_codes(node, code="", codes=None):
    if codes is None:
        codes = {}
    if node is None:
        return codes
    if node.ch is not None:
        codes[node.ch] = code
        return codes
    gen_codes(node.left, code + "0", codes)
    gen_codes(node.right, code + "1", codes)
    return codes

def decode(root, encoded):
    result = []
    current = root
    for bit in encoded:
        current = current.left if bit == "0" else current.right
        if current.ch is not None:  # Leaf
            result.append(current.ch)
            current = root
    return "".join(result)


text = "AABACABAD"
freq = defaultdict(int)
for c in text:
    freq[c] += 1

root = build_tree(freq)
codes = gen_codes(root)

encoded = "".join(codes[c] for c in text)
print(f"Encoded: {encoded}")

decoded = decode(root, encoded)
print(f"Decoded: {decoded}")
print(f"Match:   {'yes' if text == decoded else 'no'}")

Go implementation

package main

import (
	"container/heap"
	"fmt"
)

type HuffmanNode struct {
	ch    byte
	freq  int
	left  *HuffmanNode
	right *HuffmanNode
}

type MinHeap []*HuffmanNode

func (h MinHeap) Len() int           { return len(h) }
func (h MinHeap) Less(i, j int) bool { return h[i].freq < h[j].freq }
func (h MinHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *MinHeap) Push(x interface{}) { *h = append(*h, x.(*HuffmanNode)) }
func (h *MinHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

func buildTree(freq map[byte]int) *HuffmanNode {
	h := &MinHeap{}
	heap.Init(h)
	for ch, f := range freq {
		heap.Push(h, &HuffmanNode{ch: ch, freq: f})
	}
	for h.Len() > 1 {
		left := heap.Pop(h).(*HuffmanNode)
		right := heap.Pop(h).(*HuffmanNode)
		heap.Push(h, &HuffmanNode{freq: left.freq + right.freq, left: left, right: right})
	}
	return heap.Pop(h).(*HuffmanNode)
}

func genCodes(node *HuffmanNode, code string, codes map[byte]string) {
	if node == nil {
		return
	}
	if node.left == nil && node.right == nil {
		codes[node.ch] = code
		return
	}
	genCodes(node.left, code+"0", codes)
	genCodes(node.right, code+"1", codes)
}

func decode(root *HuffmanNode, encoded string) string {
	var result []byte
	current := root
	for i := 0; i < len(encoded); i++ {
		if encoded[i] == '0' {
			current = current.left
		} else {
			current = current.right
		}
		if current.left == nil && current.right == nil {
			result = append(result, current.ch)
			current = root
		}
	}
	return string(result)
}

func main() {
	text := "AABACABAD"

	freq := make(map[byte]int)
	for i := 0; i < len(text); i++ {
		freq[text[i]]++
	}

	root := buildTree(freq)
	codes := make(map[byte]string)
	genCodes(root, "", codes)

	encoded := ""
	for i := 0; i < len(text); i++ {
		encoded += codes[text[i]]
	}
	fmt.Printf("Encoded: %s\n", encoded)

	decoded := decode(root, encoded)
	fmt.Printf("Decoded: %s\n", decoded)

	match := "no"
	if text == decoded {
		match = "yes"
	}
	fmt.Printf("Match:   %s\n", match)
}

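Running any of these programs prints output like the following. The exact bit string depends on how ties between equal frequencies are broken and on which child is placed on the left; the decoded text always matches the original. One possible output:

Encoded: 110010101001011
Decoded: AABACABAD
Match:   yes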

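Computing the Encoded Length

For a given frequency distribution, the total number of bits in the Huffman encoding can be computed directly from the tree:

Total bits = Σ (character frequency × code length)
           = Σ (leaf depth × leaf frequency)

For A=5, B=2, C=1, D=1:

Total bits = 5 × 1 (A: 0)
           + 2 × 2 (B: 10)
           + 1 × 3 (C: 110)
           + 1 × 3 (D: 111)
           = 5 + 4 + 3 + 3
           = 15 bits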

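The same number falls out of the merge steps. Every merge creates an internal node whose frequency is the sum of the leaf frequencies in its subtree, and the total encoded length equals the sum of the frequencies of all internal nodes, the root included:

N1(2)   = C + D         = 2 → contributes 2 bits
N2(4)   = B + C + D     = 4 → contributes 4 bits
Root(9) = A + B + C + D = 9 → contributes 9 bits

Total bits = 2 + 4 + 9 = 15 bits

This works because a leaf at depth d lies below exactly d internal nodes, so its frequency is counted d times in the sum. Both calculations give the weighted path length (WPL) of the tree: the sum of frequency × depth over all leaves.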
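
A quick numeric check of the weighted path length for the example. The sketch also sums the weights of the internal nodes N1, N2 and Root, which comes to the same 15 once the root is counted. The depths and subtree contents are read off the example tree by hand; the variable names are illustrative.

```python
freq = {"A": 5, "B": 2, "C": 1, "D": 1}
depth = {"A": 1, "B": 2, "C": 3, "D": 3}   # code lengths from the example

# Weighted path length: frequency × depth, summed over the leaves
wpl = sum(freq[ch] * depth[ch] for ch in freq)

# Internal nodes and the leaves below each one, read off the example tree
internal = {"N1": "CD", "N2": "BCD", "Root": "ABCD"}
node_weight = {name: sum(freq[ch] for ch in leaves)
               for name, leaves in internal.items()}
merge_sum = sum(node_weight.values())

print(wpl, node_weight, merge_sum)
# prints: 15 {'N1': 2, 'N2': 4, 'Root': 9} 15
```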


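Proof of Optimality

Huffman coding yields an optimal prefix code: among all prefix codes for the given frequencies, none has a shorter total encoded length.

The argument rests on the greedy-choice property:

  1. Some optimal code gives the two lowest-frequency characters the same code length, the longest in the code (if a rarer character had a shorter code than a more frequent one, swapping the two codes would not increase the total length)
  2. Some optimal tree has these two characters as siblings at the maximum depth
  3. Merging the two into a single node with their combined frequency leaves a smaller problem of the same kind, and an optimal tree for it extends to an optimal tree for the original problem

Therefore the greedy strategy of Huffman's algorithm, merging the two lowest-frequency nodes at every step, produces a globally optimal solution.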
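
For a small alphabet the claim can be checked by brute force. The sketch below is not a proof. It relies on the Kraft inequality, which says a prefix code with codeword lengths l1..lk exists exactly when Σ 2^(-li) ≤ 1, enumerates every feasible length assignment for the example frequencies, and keeps the cheapest. Variable names are illustrative.

```python
from itertools import product

freq = [5, 2, 1, 1]        # A, B, C, D
max_len = len(freq)        # an optimal code never needs longer codewords

best_cost, best_lengths = None, None
for lengths in product(range(1, max_len + 1), repeat=len(freq)):
    # Kraft inequality sum(2**-l) <= 1, scaled by 2**max_len
    # so the comparison stays in exact integers
    if sum(2 ** (max_len - l) for l in lengths) > 2 ** max_len:
        continue           # no prefix code has these lengths
    cost = sum(f * l for f, l in zip(freq, lengths))
    if best_cost is None or cost < best_cost:
        best_cost, best_lengths = cost, lengths

print(best_cost, best_lengths)  # prints: 15 (1, 2, 3, 3)
```

The minimum matches the 15 bits and the code lengths 1, 2, 3, 3 that the Huffman tree produced.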


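Properties of Huffman Coding

Compression performance

Metric                 Formula                             Notes
Total encoded length   Σ fi × li                           fi = frequency of character i, li = its code length
Average code length    Σ pi × li                           pi = probability of character i
Compression ratio      encoded length / (n × ⌈log₂ k⌉)     n = number of characters, k = alphabet size
Theoretical bound      H = -Σ pi × log₂(pi)                Shannon entropy

The average code length of a Huffman code satisfies:

H ≤ average code length < H + 1

where H is the Shannon entropy of the source. In other words, Huffman coding spends less than 1 extra bit per character compared with the entropy bound.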
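
The bound can be checked on the running example. The sketch below treats the relative frequencies in "AABACABAD" as the source probabilities, which is an assumption made for illustration; the code lengths are those of the example tree.

```python
from math import log2

freq = {"A": 5, "B": 2, "C": 1, "D": 1}
code_len = {"A": 1, "B": 2, "C": 3, "D": 3}

total = sum(freq.values())
prob = {ch: f / total for ch, f in freq.items()}

# Shannon entropy in bits per character
entropy = -sum(p * log2(p) for p in prob.values())

# Average Huffman code length in bits per character
avg_len = sum(prob[ch] * code_len[ch] for ch in prob)

print(f"H = {entropy:.3f}, average = {avg_len:.3f}")
print(entropy <= avg_len < entropy + 1)  # prints: True
```

Here H is about 1.658 bits and the average code length is 15/9, about 1.667 bits, so the Huffman code sits very close to the entropy bound for this input.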

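Prefix property

A Huffman code is always a prefix code: no character's code is a prefix of another character's code. The tree structure shows why: every character sits at a leaf, and the path from the root to a leaf never passes through another leaf.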
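
The property is easy to test for any code table. The helper `is_prefix_free` below is a hypothetical sketch; it relies on the fact that after sorting, a codeword that is a prefix of some other codeword is also a prefix of its immediate successor, so only adjacent pairs need checking.

```python
def is_prefix_free(codes):
    """Return True if no codeword is a prefix of another codeword."""
    words = sorted(codes.values())
    # Only neighbours in sorted order need to be compared
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))

huffman = {"A": "0", "B": "10", "C": "110", "D": "111"}
broken = {"A": "0", "B": "01", "C": "11"}   # "0" is a prefix of "01"

print(is_prefix_free(huffman))  # prints: True
print(is_prefix_free(broken))   # prints: False
```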

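Tree structure

Property         Description
Binary tree      Every internal node has exactly two children
Fullness         With at least two characters of positive frequency, the tree is full (no node has a single child)
Leaf nodes       Every character corresponds to a leaf
Internal nodes   Structural only; they correspond to no character

Uniqueness of the code

For a given frequency distribution the Huffman tree need not be unique:

  • When two nodes are merged, which one becomes the left child and which the right does not affect optimality
  • When frequencies tie, the merge order may differ
  • All resulting Huffman trees nevertheless have the same total encoded length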

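Applications

Application               Notes
DEFLATE compression       Core of ZIP / gzip, combining Huffman coding with LZ77
JPEG image compression    Huffman coding of the quantized DCT coefficients
MP3 audio compression     Huffman coding of the quantized spectral data
PNG images                Compressed with DEFLATE, which includes Huffman coding
Communication protocols   Huffman coding reduces the amount of data transmitted
Text compression          General-purpose compression based on character frequency

Comparison with other codes

Code              Type                       Optimality                                  Cost
Fixed-length      Fixed length               No compression                              O(1) per character
Huffman coding    Variable-length prefix     Optimal among per-character prefix codes    O(k log k) to build the code for k distinct characters
Arithmetic coding Variable-length stream     Approaches the entropy limit                O(n) for a message of n characters
Shannon-Fano      Variable-length prefix     Close to optimal, but not always optimal    O(k log k) to build the code

Huffman coding is the optimal prefix code when each character is coded separately, but it compresses less tightly than arithmetic coding: every Huffman codeword takes a whole number of bits, whereas arithmetic coding encodes many characters together as a single number and so gets closer to the theoretical entropy limit.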

posted @ 2026-04-18 00:34  游翔