Packaging and compression in Python (.tar.bz2, .tar.xz)

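I. Compression performance comparison

1. Mainstream compression algorithms compared

Algorithm	Compression ratio	Speed	Memory usage	Best suited for
DEFLATE (gzip)	Medium	★★★★	Low	General-purpose data, network transfer
bzip2	★★★	★★	Medium	Long-term storage of text data
LZMA	★★★★	★	High	Archiving cold data
Zstandard	★★	★★★★	Medium	Real-time data processing

As the table shows, bz2 strikes a good balance between compression ratio and speed, which makes it a good fit for storing text data when a high compression ratio matters.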
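The ratings in the table are qualitative, so it can help to measure on your own data. The following is a minimal sketch, assuming a synthetic log-like sample; `compare_codecs` and the sample line are illustrative names, not part of the original post. Zstandard is left out because it is not available in the standard library of many Python versions still in use.

```python
import bz2
import lzma
import time
import zlib

def compare_codecs(payload: bytes) -> dict:
    """Return {name: (compressed_size, seconds)} for three stdlib codecs."""
    codecs = {
        "zlib (DEFLATE)": zlib.compress,
        "bz2": bz2.compress,
        "lzma (xz)": lzma.compress,
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        size = len(compress(payload))
        results[name] = (size, time.perf_counter() - start)
    return results

# Synthetic, highly repetitive sample; real data will behave differently.
sample = b"2025-08-01 12:00:00 INFO request handled in 12 ms\n" * 20000
results = compare_codecs(sample)
for name, (size, seconds) in results.items():
    print(f"{name:15s} {size:8d} bytes  {size / len(sample):7.2%}  {seconds:.3f} s")
```

Each codec runs with its default settings here; absolute numbers depend heavily on the input and the machine, so treat the output as a relative comparison only.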

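II. bz2 compression in practice

Python's bz2 module is the standard-library interface to the bzip2 algorithm. It needs no extra installation and can be imported directly:

import bz2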

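1. Compressing and decompressing files

File-level operations are the most commonly used feature of the bz2 module; bz2.open() works much like Python's built-in open(). The example first round-trips a string in memory, then writes and reads a compressed file:

import bz2

# Compress a string (it must be encoded to bytes first)
data = "This is a large amount of text to compress. " * 10000
raw = data.encode('utf-8')
compressed = bz2.compress(raw)
print(f"Original size: {len(raw)} bytes")
print(f"Compressed size: {len(compressed)} bytes")

# Decompress
decompressed = bz2.decompress(compressed).decode('utf-8')
print("Round trip OK:", decompressed == data)

# Write a compressed file
with bz2.open('data.txt.bz2', 'wb', compresslevel=9) as f:
    f.write(b"Large content needs compressing...")

# Read the compressed file
with bz2.open('data.txt.bz2', 'rb') as f:
    content = f.read()
    print(content.decode())

The compresslevel parameter (1-9) controls compression strength: level 1 is the fastest but compresses the least, level 9 is the slowest but compresses the most. The default is 9.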
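To see what compresslevel does in practice, the sketch below compresses the same input at three levels. In the underlying bzip2 format the level selects the block size (roughly 100 kB per level), so differences only appear once the input is larger than about 100 kB. The sample data is a hypothetical stand-in built from seeded pseudo-random words.

```python
import bz2
import random

# Hypothetical sample: a couple of MB of pseudo-random words, seeded for repeatability.
random.seed(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta"]
sample = b" ".join(random.choice(words) for _ in range(400_000))

sizes = {}
for level in (1, 5, 9):
    sizes[level] = len(bz2.compress(sample, compresslevel=level))
    print(f"level {level}: {sizes[level]} bytes ({sizes[level] / len(sample):.2%})")
```

On inputs much smaller than 100 kB, all levels tend to produce the same output, so lowering the level buys nothing there.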

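2. Compressing and decompressing data in memory

For smaller data, compression and decompression can be done directly in memory:

import bz2

# Repeat the text: very short inputs can grow after compression because of header overhead
original_data = b"Repeated patterns compress well with bzip2... " * 1000

# Compress
compressed = bz2.compress(original_data, compresslevel=7)
print(f"Compressed size: {len(compressed)} bytes, ratio: {len(compressed)/len(original_data):.1%}")

# Decompress
restored = bz2.decompress(compressed)
assert original_data == restored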

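3. Processing large files incrementally

For very large files that should not be loaded into memory in one go, use incremental compression and decompression:

import bz2

# Incremental compression
compressor = bz2.BZ2Compressor(5)  # compression level 5, passed positionally (works on every Python 3 version)
chunks = [b'Chunk one', b'Chunk two', b'Chunk three']
compressed_chunks = []

for chunk in chunks:
    compressed_chunks.append(compressor.compress(chunk))
compressed_chunks.append(compressor.flush())  # finish the stream

# Incremental decompression
decompressor = bz2.BZ2Decompressor()
result = b""
for chunk in compressed_chunks:
    result += decompressor.decompress(chunk)

assert result == b"".join(chunks)

This approach is well suited to network streams or data generated on the fly.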
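The same idea can be wrapped in generators so that data is compressed as it arrives and never held in full. This is a minimal sketch; `compress_stream` and `decompress_stream` are hypothetical helper names, and the decompressor handles a single bzip2 stream only.

```python
import bz2
from typing import Iterable, Iterator

def compress_stream(chunks: Iterable[bytes], level: int = 5) -> Iterator[bytes]:
    """Yield compressed pieces as input chunks arrive."""
    compressor = bz2.BZ2Compressor(level)
    for chunk in chunks:
        piece = compressor.compress(chunk)
        if piece:             # the compressor buffers, so many calls return b""
            yield piece
    yield compressor.flush()  # emits whatever is still buffered

def decompress_stream(pieces: Iterable[bytes]) -> Iterator[bytes]:
    """Yield decompressed data for a single bzip2 stream."""
    decompressor = bz2.BZ2Decompressor()
    for piece in pieces:
        if decompressor.eof:  # ignore anything after the end of the stream
            break
        data = decompressor.decompress(piece)
        if data:
            yield data

incoming = (f"event {i}\n".encode() for i in range(1000))
restored = b"".join(decompress_stream(compress_stream(incoming)))
print(len(restored), "bytes restored")
```

Because both helpers are lazy, they can sit between a socket reader and a file writer without either side seeing the whole payload.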

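4. Data archiving and long-term storage

In data analysis, raw data often contains a lot of repeated information. Compressing it with bz2 can noticeably reduce storage costs:

import pandas as pd
import bz2

# Save a compressed CSV
df = pd.read_csv('large_dataset.csv')
with bz2.open('large_dataset.csv.bz2', 'wt', encoding='utf-8') as f:
    df.to_csv(f, index=False)

# Read back from the compressed file
with bz2.open('large_dataset.csv.bz2', 'rt', encoding='utf-8') as f:
    restored_df = pd.read_csv(f)

As a rough guide, the author reports that for text-format datasets bz2 usually saves 15-25% more space than gzip; the actual gain depends on the data.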

5. Handling log files

Server logs are usually large but compress very well, which makes them a good candidate for bz2:

import bz2

def compress_logs(log_path, output_path):
    try:
        with open(log_path, 'rb') as f_in:
            with bz2.open(output_path, 'wb') as f_out:
                while True:
                    chunk = f_in.read(1024 * 1024)  # read 1 MB at a time
                    if not chunk:
                        break
                    f_out.write(chunk)
        print(f"Log compressed and saved to: {output_path}")
    except FileNotFoundError:
        print(f"Error: file not found - {log_path}")
    except PermissionError:
        print(f"Error: insufficient permissions to read {log_path} or write {output_path}")
    except Exception as e:
        print(f"Error during compression: {e}")
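The read/write loop can also be delegated to shutil.copyfileobj. The variant below is a sketch under a few assumptions: `compress_log` is a hypothetical name, the output defaults to the input path plus ".bz2", and errors are left to propagate to the caller instead of being printed.

```python
import bz2
import shutil
from pathlib import Path

def compress_log(log_path, output_path=None, level=9):
    """Compress log_path to output_path (default: log_path + '.bz2')."""
    log_path = Path(log_path)
    if output_path is None:
        output_path = log_path.with_name(log_path.name + ".bz2")
    output_path = Path(output_path)
    with open(log_path, "rb") as f_in, bz2.open(output_path, "wb", compresslevel=level) as f_out:
        shutil.copyfileobj(f_in, f_out, length=1024 * 1024)  # copy in 1 MB chunks
    return output_path
```

Returning the output path and raising on failure makes the function easier to use from a scheduler or a test, where a printed message would be lost.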

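6. Optimizing network data transfer

In a microservice architecture, compressing API responses can cut transfer time considerably for large payloads, at the cost of CPU time on the server:

from flask import Flask, Response
import bz2
import json

app = Flask(__name__)

@app.route('/large-data')
def get_large_data():
    data = generate_large_json()  # build a large payload
    compressed = bz2.compress(json.dumps(data).encode())
    # Browsers do not decode 'bzip2'; the client must call bz2.decompress() itself
    return Response(compressed, headers={'Content-Encoding': 'bzip2'})

Line-by-line explanation:

@app.route('/large-data')
Defines a route: when a user requests /large-data, get_large_data() is called.
data = generate_large_json()
Calls a function, assumed to exist, that builds a large Python data structure (such as a list or dict) to return to the client.
⚠️ Note: this function is not defined in the code; it is only a placeholder.
compressed = bz2.compress(json.dumps(data).encode())
- json.dumps(data): converts data to a JSON string.
- .encode(): encodes the string to bytes, because bz2.compress requires bytes input.
- bz2.compress(...): compresses those bytes with the bzip2 algorithm.
- The result is compressed binary data (type bytes).
return Response(compressed, headers={'Content-Encoding': 'bzip2'})
- Returns a Response object containing the compressed data.
- Sets the response header Content-Encoding: bzip2 to tell the client that the body is bzip2-compressed. Unlike gzip, this coding is not one that browsers and most HTTP clients decode automatically, so it only works with a client written to decompress the body with bz2.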
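Because the `bzip2` token is only understood by clients written to handle it, a service usually wants to fall back to gzip or to an uncompressed body. The following framework-independent sketch shows one way to negotiate; `encode_body` and `decode_body` are hypothetical helpers (not Flask APIs), a custom client advertising `bzip2` in Accept-Encoding is an assumption, and q-values are ignored for simplicity.

```python
import bz2
import gzip
import json

def encode_body(payload, accept_encoding=""):
    """Return (body_bytes, content_encoding or None) for a JSON payload."""
    raw = json.dumps(payload).encode("utf-8")
    # Tokens the client listed, ignoring q-values for simplicity.
    accepted = {part.split(";")[0].strip().lower() for part in accept_encoding.split(",")}
    if "gzip" in accepted:
        return gzip.compress(raw), "gzip"
    if "bzip2" in accepted:  # assumption: a custom client advertises this token
        return bz2.compress(raw), "bzip2"
    return raw, None

def decode_body(body, content_encoding):
    """Client-side counterpart: undo whichever encoding the server chose."""
    if content_encoding == "gzip":
        return json.loads(gzip.decompress(body))
    if content_encoding == "bzip2":
        return json.loads(bz2.decompress(body))
    return json.loads(body)

body, encoding = encode_body({"items": list(range(1000))}, "gzip, deflate, br")
print(encoding, len(body))
```

In a Flask view, the first return value would become the response body and the second, when not None, the Content-Encoding header.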

7. Processing large files efficiently (streaming compression, low memory)

Suitable for large files such as logs, JSON, and CSV; avoids loading the whole file into memory.

import bz2

def compress_large_file(input_path, output_path, chunk_size=1024*1024):  # 1 MB per chunk
    """
    Stream-compress a large file without exhausting memory
    """
    with open(input_path, 'rb') as f_in:
        with bz2.open(output_path, 'wb') as f_out:
            while True:
                chunk = f_in.read(chunk_size)
                if not chunk:
                    break
                f_out.write(chunk)
    print(f"✅ Compressed: {input_path} -> {output_path}")

def decompress_large_file(compressed_path, output_path):
    """
    Stream-decompress a bzip2 file
    """
    with bz2.open(compressed_path, 'rb') as f_in:
        with open(output_path, 'wb') as f_out:
            while True:
                chunk = f_in.read(1024 * 1024)
                if not chunk:
                    break
                f_out.write(chunk)
    print(f"✅ Decompressed: {compressed_path} -> {output_path}")

# Usage example
compress_large_file('large_log.txt', 'large_log.txt.bz2')
decompress_large_file('large_log.txt.bz2', 'large_log_decompressed.txt')

8. Compressing multiple files in parallel with multiple processes (faster batch processing)

Use concurrent.futures.ProcessPoolExecutor to compress several files in parallel.

import bz2
from concurrent.futures import ProcessPoolExecutor, as_completed
import os

def compress_file(args):
    """
    Compress a single file; called from worker processes
    args: (input_path, output_path)
    """
    input_path, output_path = args
    try:
        with open(input_path, 'rb') as f_in:
            with bz2.open(output_path, 'wb', compresslevel=6) as f_out:
                while chunk := f_in.read(1024*1024):  # walrus operator, requires Python 3.8+
                    f_out.write(chunk)
        print(f"📦 Compressed: {input_path} -> {output_path}")
        return input_path, True
    except Exception as e:
        print(f"❌ Compression failed {input_path}: {e}")
        return input_path, False

def parallel_compress(files, max_workers=None):
    """
    Compress multiple files in parallel
    files: list of (input_path, output_path) tuples
    max_workers: maximum number of processes, defaults to the CPU core count
    """
    if max_workers is None:
        max_workers = os.cpu_count()

    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(compress_file, file) for file in files]

        results = {}
        for future in as_completed(futures):
            filename, success = future.result()
            results[filename] = success

    return results

# Usage example: batch-compress several log files
if __name__ == '__main__':
    files_to_compress = [
        ('log1.txt', 'log1.txt.bz2'),
        ('log2.txt', 'log2.txt.bz2'),
        ('log3.txt', 'log3.txt.bz2'),
    ]

    # Make sure the input files exist (for the example)
    for i in range(1, 4):
        with open(f'log{i}.txt', 'w', encoding='utf-8') as f:
            f.write(f"Simulated content of log file {i}.\n" * 10000)

    # Start the parallel compression
    result = parallel_compress(files_to_compress)
    print("Final result:", result)

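III. LZMA compression in practice

lzma is a Python standard-library module for reading and writing data compressed with the LZMA (Lempel-Ziv-Markov chain Algorithm) algorithm. LZMA is the core algorithm of the 7-Zip compression tool and is known for its very high compression ratio, which makes it especially suitable for large text files, logs, and backups.

It supports the .xz and .lzma formats (.xz is the more modern one and is recommended).

1. Python 3.3+ ships the `lzma` module; commonly used classes and functions:

Compress/decompress bytes	`lzma.compress()`, `lzma.decompress()`
Read/write compressed files	`lzma.open()` (similar to `gzip.open()`)
Create compressor/decompressor objects	`lzma.LZMACompressor()`, `lzma.LZMADecompressor()`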
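The two container formats are selected with the format argument. The short sketch below, using made-up sample data, writes both and shows that decompression detects the container automatically. The .xz container also carries an integrity check, which the legacy .lzma container does not support.

```python
import lzma

data = b"archive me " * 5000

xz_bytes = lzma.compress(data, format=lzma.FORMAT_XZ)        # .xz container (the default)
alone_bytes = lzma.compress(data, format=lzma.FORMAT_ALONE)  # legacy .lzma container

# FORMAT_AUTO (the default for decompress) detects either container.
assert lzma.decompress(xz_bytes) == data
assert lzma.decompress(alone_bytes) == data

# .xz streams start with a fixed 6-byte magic number.
print(xz_bytes[:6])
print(len(xz_bytes), len(alone_bytes))
```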

2. Compressing and decompressing bytes

import lzma

# Original data (bytes)
data = b"Hello World! " * 1000  # repeated text compresses well

# Compress
compressed = lzma.compress(data)
print(f"Original size: {len(data)} bytes")
print(f"Compressed size: {len(compressed)} bytes")
print(f"Compression ratio: {len(compressed) / len(data):.2%}")

# Decompress
decompressed = lzma.decompress(compressed)
assert data == decompressed  # verify the round trip
print("✅ Decompression succeeded, data matches")

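3. Reading and writing `.xz` files (recommended format)

import lzma

# Write a compressed file
text = "这是用于测试的中文文本。" * 1000  # non-ASCII sample text ("Chinese text for testing")
with lzma.open("data.txt.xz", "wt", encoding="utf-8") as f:
    f.write(text)

print("✅ File compressed and saved as data.txt.xz")

# Read the compressed file
with lzma.open("data.txt.xz", "rt", encoding="utf-8") as f:
    content = f.read()

assert content == text
print("✅ File decompressed and read back, content matches")

lzma.open() is used much like open() and supports text mode "t" as well as binary mode "b".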
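Text mode also allows appending and iterating line by line, which avoids reading the whole file at once. The sketch below uses a hypothetical file name in the system temp directory. Appending writes a second compressed stream to the same file, and readers see the streams as one continuous file.

```python
import lzma
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "demo_lines.txt.xz")  # hypothetical file name

with lzma.open(path, "wt", encoding="utf-8") as f:
    for i in range(5):
        f.write(f"line {i}\n")

# Append mode adds another xz stream to the end of the file.
with lzma.open(path, "at", encoding="utf-8") as f:
    f.write("line 5\n")

with lzma.open(path, "rt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]  # iterates line by line

print(lines)
os.remove(path)  # clean up the demo file
```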

4. Compressing large files (chunked processing)

When handling large files, avoid loading everything into memory at once.

import lzma

def compress_file(input_path, output_path):
    with open(input_path, 'rb') as f_in:
        with lzma.open(output_path, 'wb') as f_out:
            # Read in chunks to keep memory use bounded
            for chunk in iter(lambda: f_in.read(1024 * 1024), b''):  # 1 MB per read
                f_out.write(chunk)
    print(f"✅ Compressed: {input_path} -> {output_path}")

def decompress_file(input_path, output_path):
    with lzma.open(input_path, 'rb') as f_in:
        with open(output_path, 'wb') as f_out:
            for chunk in iter(lambda: f_in.read(1024 * 1024), b''):
                f_out.write(chunk)
    print(f"✅ Decompressed: {input_path} -> {output_path}")

# Example: create a test file
with open("test_large.txt", "w", encoding="utf-8") as f:
    f.write("test data\n" * 50000)

# Compress
compress_file("test_large.txt", "test_large.txt.xz")

# Decompress
decompress_file("test_large.txt.xz", "test_large_restored.txt")

5. Adjusting the compression level (presets 0-9)

import lzma

data = b"Hello World! " * 1000

# Try every preset: valid values are 0-9, and the default is 6
for level in range(10):
    compressed = lzma.compress(data, format=lzma.FORMAT_XZ, preset=level)
    ratio = len(compressed) / len(data)
    print(f"Preset {level}: {len(compressed)} bytes, ratio {ratio:.2%}")
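Beyond the numeric levels, a preset can be combined with lzma.PRESET_EXTREME, or replaced by an explicit filter chain. The sketch below is illustrative only: the sample data and the 1 MiB dictionary size are arbitrary choices, and whether the extreme flag actually helps depends on the input.

```python
import lzma

data = b"Hello World! " * 100_000

default = lzma.compress(data)                                  # default preset
extreme = lzma.compress(data, preset=6 | lzma.PRESET_EXTREME)  # slower, may be smaller

# A custom filter chain instead of a preset: LZMA2 with a 1 MiB dictionary.
custom_filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 20}]
custom = lzma.compress(data, format=lzma.FORMAT_XZ, filters=custom_filters)

for label, blob in (("default", default), ("6|EXTREME", extreme), ("custom", custom)):
    assert lzma.decompress(blob) == data
    print(f"{label:10s} {len(blob)} bytes")
```

Higher presets and larger dictionaries raise memory use during compression, which matters more than the time cost on small machines.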

6. Streaming compression (advanced usage)

Suitable for network transfer or compressing data as it is produced.

import lzma

compressor = lzma.LZMACompressor(preset=6)

chunks = [b"Hello ", b"World ", b"from ", b"Python!"]
compressed_data = b""

for chunk in chunks:
    compressed_data += compressor.compress(chunk)

# Finish the stream
compressed_data += compressor.flush()

print(f"Streaming compression done, size: {len(compressed_data)} bytes")

# Decompress
decompressor = lzma.LZMADecompressor()
decompressed = decompressor.decompress(compressed_data)
print(f"Decompressed: {decompressed.decode('utf-8')}")
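Decompression can be incremental as well. The sketch below feeds the decompressor small pieces, as if they arrived from a socket, and uses max_length to cap how much output each call returns. The 64-byte piece size and 4096-byte output cap are arbitrary values chosen for illustration.

```python
import lzma

compressor = lzma.LZMACompressor()
original = b"Hello World from Python! " * 10_000
compressed = compressor.compress(original) + compressor.flush()

decompressor = lzma.LZMADecompressor()
output = bytearray()
CHUNK = 64  # deliberately tiny to mimic data arriving from a socket

for start in range(0, len(compressed), CHUNK):
    piece = compressed[start:start + CHUNK]
    # max_length caps how much is returned per call, bounding memory use.
    output += decompressor.decompress(piece, max_length=4096)
    # Drain output that is already available before feeding more input.
    while not decompressor.needs_input and not decompressor.eof:
        output += decompressor.decompress(b"", max_length=4096)

print(len(output), decompressor.eof)
```

In a real pipeline each drained block would be written out immediately instead of being collected in a bytearray.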

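7. tar packaging + LZMA high compression

1. The function

import io
import logging
import lzma
import os
import tarfile

# EMR_BUCKET, ensure_dir() and parse_object_path() are not shown here;
# they are assumed to be defined elsewhere in the project.

def compress_day(minio_client, date, docs, archive_base_dir):
    archive_path = os.path.join(archive_base_dir, date[:4], f"{date}.tar.xz")
    ensure_dir(os.path.dirname(archive_path))
    tar_stream = io.BytesIO()
    with tarfile.open(fileobj=tar_stream, mode='w') as tar:
        for doc in docs:
            object_path = '<unparsed>'  # label used in the warning if parsing fails
            resp = None
            try:
                object_path = parse_object_path(doc['DOC_PATH'])
                resp = minio_client.get_object(EMR_BUCKET, object_path)
                content = resp.read()
                tarinfo = tarfile.TarInfo(name=object_path)
                tarinfo.size = len(content)
                tar.addfile(tarinfo, io.BytesIO(content))
            except Exception as e:
                logging.warning(f"[Skipped] {object_path}: {e}")
            finally:
                # Release the connection even when reading or packing fails
                if resp is not None:
                    resp.close()
                    resp.release_conn()
    with lzma.open(archive_path, 'wb') as f:
        f.write(tar_stream.getvalue())
    return archive_path

2. Walkthrough

compress_day is a utility function that archives all documents for one day into a single .tar.xz file. It chains three steps: reading from MinIO object storage, tar packaging, and high-ratio LZMA compression.

Packing the tar into an in-memory stream

tar_stream = io.BytesIO()
with tarfile.open(fileobj=tar_stream, mode='w') as tar:
    ...

io.BytesIO() creates an in-memory byte stream that serves as the "file object" for tarfile.
mode='w' writes an uncompressed tar stream (it is compressed with LZMA afterwards).

⚠️ Note: what gets packed here is a plain, uncompressed tar; the real compression happens in the next step with xz. Since the whole tar is built in memory, one day's documents must fit in RAM.

Iterating over the documents, downloading each from MinIO, and adding it to the tar

for doc in docs:
    object_path = '<unparsed>'
    resp = None
    try:
        object_path = parse_object_path(doc['DOC_PATH'])  # parse the object path
        resp = minio_client.get_object(EMR_BUCKET, object_path)
        content = resp.read()
        tarinfo = tarfile.TarInfo(name=object_path)
        tarinfo.size = len(content)
        tar.addfile(tarinfo, io.BytesIO(content))
    except Exception as e:
        logging.warning(f"[Skipped] {object_path}: {e}")
    finally:
        if resp is not None:
            resp.close()
            resp.release_conn()

parse_object_path(): parses DOC_PATH into the object path to download.
minio_client.get_object(): downloads the file content from MinIO into memory.
A TarInfo is constructed and its size is set.
tar.addfile() writes the content into the tar stream.
The response is closed and its connection released in a finally block, so a failed read does not leak the connection. object_path is given a placeholder value before the try so that the warning never refers to an undefined name or to the previous document's path.

with lzma.open(archive_path, 'wb') as f:
    f.write(tar_stream.getvalue())

This compresses the complete in-memory tar data (tar_stream.getvalue()) with lzma into a .tar.xz file.
.tar.xz = tar packaging + xz (LZMA algorithm) compression, which gives a very high compression ratio.

It uses the standard-library lzma module, so no extra dependency is needed.
The .xz format is well suited to long-term archiving and saves a lot of disk space.

return archive_path

Returns the path of the generated archive for later steps (such as uploading, notification, or logging).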
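When the inputs are ordinary files on disk, tarfile can write the compressed archive directly, without building the tar in memory first. The sketch below produces both formats named in the title, .tar.xz and .tar.bz2. `make_archive`, `list_archive`, and the sample directory are hypothetical names used only for illustration.

```python
import os
import tarfile
import tempfile

def make_archive(src_dir, archive_path):
    """Pack src_dir into archive_path; the suffix picks the compressor."""
    mode = "w:xz" if archive_path.endswith(".tar.xz") else "w:bz2"
    with tarfile.open(archive_path, mode) as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    return archive_path

def list_archive(archive_path):
    """Return member names; 'r:*' lets tarfile detect the compression."""
    with tarfile.open(archive_path, "r:*") as tar:
        return tar.getnames()

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "docs")  # hypothetical sample directory
    os.makedirs(src)
    for i in range(3):
        with open(os.path.join(src, f"doc{i}.txt"), "w", encoding="utf-8") as f:
            f.write(f"document {i}\n" * 1000)

    for suffix in (".tar.xz", ".tar.bz2"):
        archive = make_archive(src, os.path.join(tmp, "docs" + suffix))
        names = list_archive(archive)
        print(suffix, sorted(names))
```

With "w:xz" or "w:bz2" the data is compressed as it is written, so memory use stays small even for large directories.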

posted @ 2025-08-01 17:25 凡人半睁眼
