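1. Requirements Analysis

1.1 Requirement Details

Our current tool parses the text output produced by the mysqlbinlog utility; the new requirement is to parse the binary binlog directly.
We can use the python-mysql-replication library for this. Note that its BinLogStreamReader does not open a local file: it connects to a MySQL server over the replication protocol and requests the named binlog from it, so the file must still exist on the server that produced it.
Install: pip install mysql-replication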
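
The pip package is called mysql-replication, but the module it installs is imported as pymysqlreplication. A minimal sketch of a dependency check follows; the function name replication_library_available is ours and not part of the tool.

```python
import importlib.util


def replication_library_available() -> bool:
    """Return True if the pymysqlreplication module can be imported."""
    # find_spec only looks the module up; it does not import it.
    return importlib.util.find_spec("pymysqlreplication") is not None


if __name__ == "__main__":
    if replication_library_available():
        print("pymysqlreplication is installed")
    else:
        print("missing dependency, run: pip install mysql-replication")
```

A check like this lets the program decide up front whether to fall back to the mysqlbinlog text path.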

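We will rewrite the program to parse the binlog with mysql-replication and produce the same statistics report.

Steps:

Read the binary binlog with the mysql-replication library.
Iterate over the binlog events, identifying transactions (BEGIN and COMMIT) and DML events (INSERT, UPDATE, DELETE).
Collect the same data as before: transaction size (computed from event positions), transaction duration, event count, DML operation type, table name, and so on.
Generate the same statistics report.

Note: in a binary binlog, a transaction consists of several events, including GtidEvent, QueryEvent (BEGIN), TableMapEvent, WriteRowsEvent/UpdateRowsEvent/DeleteRowsEvent, and XidEvent (COMMIT).

We assume the binlog is in ROW format. In STATEMENT or MIXED format, DML may be logged as SQL text rather than row events, so it cannot be parsed reliably this way.

Parsing the binary binlog directly is fairly involved, so we will build the solution step by step.

We will create a new class, BinaryBinlogAnalyzer, built on the mysql-replication library (in the final code in section 2 this class is named BinlogParser).

Note: mysql-replication depends on other packages, such as pymysql.

We start with a rough skeleton and fill it in gradually.

Because the code is long, we implement it in steps:

Import the required libraries
Create the BinaryBinlogAnalyzer class
Implement the parse_file method, which reads the binlog with BinLogStreamReader from mysql-replication
Iterate over the events, identify where transactions start and end, and collect event information
Compute each transaction's size (from the difference between event positions) and duration (from timestamps)
Collect DML operations and table information
Generate the same statistics report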
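
The size and duration step above can be sketched as follows. EventMark is a hypothetical stand-in for the two fields we read from a real binlog event; binlog timestamps are whole seconds, so durations always come out as multiples of 1000 ms.

```python
from dataclasses import dataclass


@dataclass
class EventMark:
    """Hypothetical stand-in for the fields we need from a binlog event."""
    log_pos: int    # byte offset in the binlog file
    timestamp: int  # Unix time in whole seconds


def transaction_metrics(begin: EventMark, commit: EventMark) -> dict:
    """Derive size (bytes) and duration (ms) from the BEGIN and COMMIT marks."""
    size = commit.log_pos - begin.log_pos
    duration_ms = (commit.timestamp - begin.timestamp) * 1000
    # Guard against clock adjustments producing a negative duration.
    return {"size": size, "duration_ms": max(duration_ms, 0)}


if __name__ == "__main__":
    begin = EventMark(log_pos=1000, timestamp=1700000000)
    commit = EventMark(log_pos=1850, timestamp=1700000002)
    print(transaction_metrics(begin, commit))
```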

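Because we already have a function that generates the statistics report, we can reuse it as long as the data structures we collect stay the same.

Note: a binary binlog may contain other events; we only care about transactional events, DML in particular.

We assume the binlog file is complete and contains all the events we need.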

1.2 Program Call Order

binlog_analyzer_main.py (main controller)
        ↓
binlog_parser.py (core data parser)
        ↓
report_generator.py (report generator)

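1.3 Data Flow

Binary log file (.000020)
        ↓
BinLogStreamReader (event stream)
        ↓
Raw event data (QueryEvent, WriteRowsEvent, etc.)
        ↓
Transaction objects (event list, duration, size, etc.)
        ↓
Statistics structure (aggregates all transactions and operations)
        ↓
Formatted report (console + HTML)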
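
As a rough illustration of the "transaction objects to statistics structure" stage, here is a minimal sketch. The dictionary keys mirror the ones used in the parser in section 2, but the aggregate function itself is a simplified assumption and not the tool's real code.

```python
from collections import defaultdict


def aggregate(transactions):
    """Fold a list of transaction dicts into one statistics dict."""
    totals = {"total_transactions": 0, "total_dml_rows": 0, "total_size": 0}
    table_operations = defaultdict(lambda: defaultdict(int))
    for txn in transactions:
        totals["total_transactions"] += 1
        totals["total_size"] += txn["size"]
        for event in txn["events"]:
            totals["total_dml_rows"] += event["row_count"]
            # Per-table counters are keyed by table, then by operation type.
            table_operations[event["table"]][event["type"]] += event["row_count"]
    # Convert to plain dicts so the result is easy to serialize.
    totals["table_operations"] = {t: dict(ops) for t, ops in table_operations.items()}
    return totals


if __name__ == "__main__":
    sample = [
        {"size": 300, "events": [{"type": "INSERT", "table": "shop.orders", "row_count": 2}]},
        {"size": 200, "events": [{"type": "UPDATE", "table": "shop.orders", "row_count": 1}]},
    ]
    print(aggregate(sample))
```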

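1.4 File Details

1. binlog_analyzer_main.py - main controller

Responsibility: program entry point; coordinates the whole analysis flow

Main logic:

Parse and handle command-line arguments
User interaction (password prompt, etc.)
Call the parser and the report generator
Error handling and program flow control

Key execution steps: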

# 1. Parse the command-line arguments
args = parse_arguments()

# 2. Create the parser and parse the binlog file
#    (mysql_config is the connection dict built from --host/--port/--user and the password)
parser = BinlogParser(debug=args.debug)
parser.parse_binary_file(args.filename, mysql_config)

# 3. Fetch the statistics
stats_data = parser.generate_statistics()

# 4. Generate the reports
report_generator = ReportGenerator(stats_data)
report_generator.print_console_report()
report_generator.generate_html_report()

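2. binlog_parser.py - core data parser

Responsibility: parses the MySQL binary log and extracts the statistics

Main logic:

Parse the binlog through a MySQL connection or with the mysqlbinlog utility
Identify transaction boundaries (BEGIN/COMMIT)
Parse DML events (INSERT/UPDATE/DELETE)
Count events, affected rows, transaction duration, and other metrics
Aggregate the data by table and by operation type

Key execution steps: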

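# 1. Create the binlog stream reader (server_id is a required argument)
stream = BinLogStreamReader(connection_settings, server_id=server_id, log_file=binlog_file)

# 2. Iterate over all binlog events
for binlogevent in stream:
    # Only QueryEvent carries statement text
    query = getattr(binlogevent, 'query', None)

    # Detect transaction start
    if isinstance(binlogevent, QueryEvent) and query == 'BEGIN':
        start_new_transaction()

    # Detect DML operations
    elif isinstance(binlogevent, (WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent)):
        process_dml_event(binlogevent)

    # Detect transaction commit
    elif isinstance(binlogevent, XidEvent) or (isinstance(binlogevent, QueryEvent) and query == 'COMMIT'):
        finalize_transaction()

# 3. Build the statistics summary
generate_statistics()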
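
The loop above is a small state machine. The sketch below shows the same boundary logic without a MySQL server, using plain (kind, table, rows) tuples as a hypothetical stand-in for real binlog events; group_transactions is our name, not part of the tool.

```python
def group_transactions(events):
    """Group a flat event sequence into per-transaction lists of DML records.

    events: iterable of (kind, table, rows) tuples, where kind is one of
    'BEGIN', 'INSERT', 'UPDATE', 'DELETE' or 'COMMIT'.
    """
    transactions = []
    current = None  # None means we are outside any transaction
    for kind, table, rows in events:
        if kind == "BEGIN":
            current = []
        elif kind in ("INSERT", "UPDATE", "DELETE"):
            # DML seen outside BEGIN/COMMIT is ignored, as in the parser.
            if current is not None:
                current.append({"type": kind, "table": table, "row_count": rows})
        elif kind == "COMMIT":
            # Transactions without any DML are dropped.
            if current:
                transactions.append(current)
            current = None
    return transactions


if __name__ == "__main__":
    stream = [
        ("BEGIN", None, 0),
        ("INSERT", "shop.orders", 2),
        ("DELETE", "shop.orders", 1),
        ("COMMIT", None, 0),
    ]
    print(group_transactions(stream))
```

A transaction that never reaches COMMIT is simply not emitted, which matches the assumption that the binlog is complete.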

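3. report_generator.py - report generator

Responsibility: turns the statistics into a readable report

Main logic:

Format the console text output
Generate the HTML visual report
Compute percentages and rankings
Validate the data and check its completeness

Key execution steps: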

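class ReportGenerator:
    # 1. Receive the statistics
    def __init__(self, stats_data):
        self.stats = stats_data

    # 2. Generate the console report
    def print_console_report(self):
        # Print the transaction TOP10 rankings
        # Print the overall statistics
        # Print the DML operation statistics
        # Print the table operation TOP10
        ...

    # 3. Generate the HTML report
    def generate_html_report(self):
        # Create the HTML template
        # Fill in the statistics
        # Generate the charts and tables
        ...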
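
The "percentages and rankings" step can be sketched with two small helpers. Both function names are hypothetical; the real report generator may compute these differently.

```python
def top_n(counts, n=10):
    """Return the n largest (key, count) pairs, largest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


def percentages(counts):
    """Return each key's share of the total, in percent with one decimal."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero when nothing was counted.
        return {key: 0.0 for key in counts}
    return {key: round(value * 100 / total, 1) for key, value in counts.items()}


if __name__ == "__main__":
    rows_by_operation = {"INSERT": 6, "UPDATE": 3, "DELETE": 1}
    print(top_n(rows_by_operation, n=2))
    print(percentages(rows_by_operation))
```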

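2. Implementation

2.1 binlog_analyzer_main.py main entry point

#!/usr/bin/env python3
"""
MySQL binary log analyzer - main program (adjusted to the new definitions)
New definitions:
1. Total events processed = number of SQL executions
2. Total DML operations = number of rows affected by DML
3. Transactions found = number of transactions
"""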

import sys
import argparse
from pathlib import Path
import getpass

# Import local modules
from binlog_parser import BinlogParser
from report_generator import ReportGenerator


def get_password(prompt="Enter MySQL password: "):
    """Read the password without echoing it"""
    return getpass.getpass(prompt)


def main():
    parser = argparse.ArgumentParser(
        description='MySQL binary log analyzer - main program',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Basic usage
  python3 %(prog)s mysql-bin.000001 --host 127.0.0.1 --user root --port 3306

  # Enable debug mode
  python3 %(prog)s mysql-bin.000001 --host 127.0.0.1 --user root --debug

  # Parse with the mysqlbinlog utility instead of connecting to MySQL
  python3 %(prog)s mysql-bin.000001 --no-mysql-connection

  # Load statistics from a file and generate a report
  python3 %(prog)s --load-stats stats.json --html report.html
        """
    )
    
    # Data source options
    source_group = parser.add_argument_group('Data source options')
    source_exclusive = source_group.add_mutually_exclusive_group(required=True)
    source_exclusive.add_argument('filename', nargs='?', help='path of the binary binlog file to analyze')
    source_exclusive.add_argument('--load-stats', help='load statistics from a JSON file')
    
    # MySQL connection parameters
    mysql_group = parser.add_argument_group('MySQL connection parameters')
    mysql_group.add_argument('--host', help='MySQL server address (the instance that produced the binlog)')
    mysql_group.add_argument('--port', type=int, default=3306, help='MySQL server port (default: 3306)')
    mysql_group.add_argument('--user', help='MySQL user name')
    mysql_group.add_argument('--password', help='MySQL password (passing it on the command line is not recommended)')
    mysql_group.add_argument('--password-file', help='path of a file containing the MySQL password')
    mysql_group.add_argument('--server-id', type=int, default=1, help='server ID (default: 1)')
    mysql_group.add_argument('--no-mysql-connection', action='store_true', 
                           help='do not connect to MySQL; parse with the mysqlbinlog utility (fallback)')
    
    # Debug and output options
    output_group = parser.add_argument_group('Output options')
    output_group.add_argument('--html', help='path of the HTML report to generate')
    output_group.add_argument('--save-stats', help='save the statistics to a JSON file')
    output_group.add_argument('--start-pos', type=int, default=4, help='start position (default: 4)')
    output_group.add_argument('--no-console', action='store_true', help='do not print the report to the console')
    output_group.add_argument('--debug', action='store_true', help='enable debug mode with detailed parsing output')
    
    args = parser.parse_args()
    
    stats_data = None
    
    # Resolve the data source
    if args.load_stats:
        stats_data = BinlogParser.load_statistics(args.load_stats)
        if not stats_data:
            sys.exit(1)
    else:
        if not Path(args.filename).exists():
            print(f"Error: file {args.filename} does not exist")
            sys.exit(1)
        
        # Create the binlog parser; a separate name avoids shadowing the argparse parser
        binlog_parser = BinlogParser(debug=args.debug)
        
        if args.no_mysql_connection:
            print("⚠️  Parsing with the mysqlbinlog utility (no MySQL connection)")
            # Pass --start-pos through so it is not silently dropped in this mode
            if not binlog_parser.parse_binary_file(args.filename, start_position=args.start_pos):
                sys.exit(1)
        else:
            if not all([args.host, args.user]):
                print("Error: parsing over a MySQL connection requires --host and --user")
                print("Alternatively, use --no-mysql-connection to parse with the mysqlbinlog utility")
                sys.exit(1)
            
            password = None
            if args.password:
                password = args.password
            elif args.password_file:
                try:
                    with open(args.password_file, 'r') as f:
                        password = f.read().strip()
                except Exception as e:
                    print(f"Failed to read the password file: {e}")
                    sys.exit(1)
            else:
                password = get_password()
            
            if not password:
                print("Error: a MySQL password is required")
                sys.exit(1)
            
            mysql_config = {
                'host': args.host,
                'port': args.port,
                'user': args.user,
                'passwd': password
            }
            
            if not binlog_parser.parse_binary_file(args.filename, mysql_config, args.start_pos, args.server_id):
                sys.exit(1)
        
        stats_data = binlog_parser.generate_statistics()
        
        if args.save_stats:
            binlog_parser.save_statistics(args.save_stats)
    
    # Generate the reports
    if stats_data:
        report_generator = ReportGenerator(stats_data)
        
        if 'parser_mode' in stats_data:
            print(f"\n📊 Parser mode: {stats_data['parser_mode']}")
            if stats_data['parser_mode'] == 'text':
                print("💡 Note: parsed with the mysqlbinlog utility; some statistics may be less accurate than with a MySQL connection")
        
        if not args.no_console:
            report_generator.print_console_report()
        
        if args.html:
            report_generator.generate_html_report(args.html)
        elif not args.load_stats:
            report_generator.generate_html_report()
    else:
        print("Error: no statistics available")
        sys.exit(1)


if __name__ == "__main__":
    main()


2.2 binlog_parser.py parsing the binary log file

#!/usr/bin/env python3
"""
MySQL binary log parser - data collection module (uses the transaction start position as the transaction ID)
"""

import datetime
import os
import re
from collections import defaultdict
import getpass
import subprocess
import tempfile
import json

try:
    from pymysqlreplication import BinLogStreamReader
    from pymysqlreplication.event import *
    from pymysqlreplication.row_event import *
except ImportError:
    print("Warning: the mysql-replication library is not installed; some features will be unavailable")
    print("Run: pip install mysql-replication pymysql")


class BinlogParser:
    def __init__(self, debug=False):
        self.transactions = []
        self.current_transaction = None
        self.stats = {
            'total_transactions': 0,
            'total_events': 0,           # total number of DML events
            'total_dml_rows': 0,         # total rows affected by DML
            'total_size': 0,
            'total_duration_ms': 0,
            'dml_operations': defaultdict(lambda: {
                'events_count': 0,       # number of DML events
                'rows_affected': 0,      # rows affected
                'transactions_count': 0  # transactions involved
            }),
            'table_operations': defaultdict(lambda: defaultdict(int)),
            'transaction_timestamps': []
        }
        self.parser_mode = "unknown"
        self.debug = debug
        self.event_count = 0
        self.last_timestamp = None
        self.last_position = 0
        self.start_position = 4  # default start position
        
    def parse_binary_file(self, file_path, mysql_config=None, start_position=4, server_id=1, only_events=None):
        """
        Parse a binary binlog file
        """
        self.start_position = start_position
        
        if mysql_config:
            return self._parse_with_mysql_connection(file_path, mysql_config, start_position, server_id, only_events)
        else:
            # Forward start_position so the mysqlbinlog path honors it too
            return self._parse_with_mysqlbinlog(file_path, start_position)
    
    def _parse_with_mysql_connection(self, file_path, mysql_config, start_position=4, server_id=1, only_events=None):
        """Parse the binlog by streaming it from the MySQL server"""
        print(f"Parsing binary binlog file over a MySQL connection: {file_path}")
        print(f"MySQL connection: {mysql_config['host']}:{mysql_config['port']} (user: {mysql_config['user']})")
        print(f"Start position: {start_position}")
        
        try:
            if not os.path.exists(file_path):
                print(f"Error: binlog file does not exist: {file_path}")
                return False
            
            # Reset the statistics
            self._reset_statistics()
            
            # Create the binlog stream reader. Only the file name is sent to the
            # server, which streams its own copy of that binlog.
            stream = BinLogStreamReader(
                connection_settings=mysql_config,
                server_id=server_id,
                log_file=os.path.basename(file_path),
                log_pos=start_position,
                resume_stream=False,
                blocking=False,
                only_events=only_events or [
                    QueryEvent,
                    WriteRowsEvent,
                    UpdateRowsEvent,
                    DeleteRowsEvent,
                    XidEvent,
                    GtidEvent,
                    FormatDescriptionEvent,
                    RotateEvent
                ],
                freeze_schema=True
            )
            
            transaction_start_time = None
            transaction_start_pos = 0
            current_events = []
            current_dml_ops = defaultdict(int)
            current_dml_rows = defaultdict(int)  # rows affected per operation type
            current_tables = set()
            self.event_count = 0
            skip_events_before_start_pos = True  # True while still skipping events before the start position
            
            print("Parsing binlog events...")
            
            for binlogevent in stream:
                self.event_count += 1
                event_time = binlogevent.timestamp
                # log_pos is the offset at which this event ends (where the next one starts)
                event_position = binlogevent.packet.log_pos
                
                # Skip events located before the start position
                if skip_events_before_start_pos and event_position < start_position:
                    if self.debug:
                        print(f"Skipping event at position {event_position} (before start position {start_position})")
                    continue
                else:
                    skip_events_before_start_pos = False
                
                if self.debug:
                    print(f"Event #{self.event_count}: {type(binlogevent).__name__} at position {event_position}")
                
                # Detect transaction start (BEGIN)
                if isinstance(binlogevent, QueryEvent):
                    query = binlogevent.query
                    if isinstance(query, bytes):
                        query = query.decode('utf-8')
                    
                    if self.debug:
                        print(f"  Query: {query}")
                    
                    if query.upper() == 'BEGIN':
                        transaction_start_time = datetime.datetime.fromtimestamp(event_time)
                        transaction_start_pos = event_position
                        current_events = []
                        current_dml_ops = defaultdict(int)
                        current_dml_rows = defaultdict(int)
                        current_tables = set()
                        if self.debug:
                            print(f"  Detected transaction start (BEGIN) at position {event_position}")
                        continue
                
                # Detect DML row events and map them to an operation type
                dml_event = None
                table_name = None
                row_count = 0
                
                if isinstance(binlogevent, WriteRowsEvent):
                    dml_event = 'INSERT'
                elif isinstance(binlogevent, UpdateRowsEvent):
                    dml_event = 'UPDATE'
                elif isinstance(binlogevent, DeleteRowsEvent):
                    dml_event = 'DELETE'
                
                if dml_event:
                    # All three row-event types expose schema, table and rows the same way
                    schema = binlogevent.schema
                    table = binlogevent.table
                    if isinstance(schema, bytes):
                        schema = schema.decode('utf-8')
                    if isinstance(table, bytes):
                        table = table.decode('utf-8')
                    table_name = f"{schema}.{table}"
                    row_count = len(binlogevent.rows) if hasattr(binlogevent, 'rows') else 1
                    if self.debug:
                        print(f"  Detected {dml_event}: {table_name}, rows: {row_count}")
                
                if dml_event and table_name and transaction_start_time is not None:
                    current_events.append({
                        'type': dml_event,
                        'table': table_name,
                        'row_count': row_count
                    })
                    current_dml_ops[dml_event] += 1  # one more event
                    current_dml_rows[dml_event] += row_count  # accumulate affected rows
                    current_tables.add(table_name)
                    
                    # Update the per-table operation statistics
                    self.stats['table_operations'][table_name][dml_event] += row_count
                
                # Detect transaction commit (XID event or COMMIT query)
                transaction_committed = False
                
                if isinstance(binlogevent, XidEvent):
                    if self.debug:
                        print("  Detected XID event (transaction commit)")
                    transaction_committed = True
                    
                elif isinstance(binlogevent, QueryEvent):
                    query = binlogevent.query
                    if isinstance(query, bytes):
                        query = query.decode('utf-8')
                    if query.upper() == 'COMMIT':
                        if self.debug:
                            print("  Detected COMMIT query")
                        transaction_committed = True
                
                if transaction_committed:
                    if transaction_start_time and current_events:
                        transaction_end_time = datetime.datetime.fromtimestamp(event_time)
                        transaction_size = event_position - transaction_start_pos
                        transaction_duration = (transaction_end_time - transaction_start_time).total_seconds() * 1000
                        
                        # Use the transaction start position as the transaction ID instead of a counter
                        transaction_data = {
                            'transaction_id': transaction_start_pos,  # start position doubles as the ID
                            'start_time': transaction_start_time,
                            'end_time': transaction_end_time,
                            'start_pos': transaction_start_pos,
                            'end_pos': event_position,
                            'size': transaction_size,
                            'duration_ms': transaction_duration,
                            'events': current_events.copy(),
                            'event_count': len(current_events),
                            'dml_operations': current_dml_ops.copy(),
                            'dml_rows': current_dml_rows.copy(),  # rows affected
                            'tables_affected': current_tables.copy()
                        }
                        
                        self.transactions.append(transaction_data)
                        
                        # Update the statistics
                        self._update_statistics(transaction_data)
                        
                        if self.debug:
                            print(f"  Finished transaction {transaction_start_pos}: {len(current_events)} events, size: {transaction_size} bytes")
                    
                    # Reset the transaction state
                    transaction_start_time = None
                    current_events = []
                    current_dml_ops = defaultdict(int)
                    current_dml_rows = defaultdict(int)
                    current_tables = set()
            
            stream.close()
            
            # Print the parsing summary
            print("Parsing summary:")
            print(f"  - Total events processed: {self.event_count}")
            print(f"  - Transactions found: {len(self.transactions)}")
            print(f"  - Total DML operations: {self.stats['total_dml_rows']}")  # counted as affected rows
            
            if len(self.transactions) == 0 and self.event_count > 0:
                print("Warning: events were processed but no complete transaction was found")
                print("Possible causes:")
                print("  1. The binlog format is not ROW")
                print("  2. Transaction boundaries were not recognized")
                print("  3. The selected binlog file contains no relevant operations")
                print("  4. Transactions may be autocommitted instead of using BEGIN/COMMIT")
            
            self.parser_mode = "binary"
            return len(self.transactions) > 0
            
        except Exception as e:
            print(f"❌ Parsing over the MySQL connection failed: {e}")
            import traceback
            traceback.print_exc()
            print("⚠️  Falling back to the mysqlbinlog utility...")
            return self._parse_with_mysqlbinlog(file_path, start_position)
    
    def _reset_statistics(self):
        """Reset the statistics"""
        self.transactions = []
        self.stats = {
            'total_transactions': 0,
            'total_events': 0,
            'total_dml_rows': 0,
            'total_size': 0,
            'total_duration_ms': 0,
            'dml_operations': defaultdict(lambda: {
                'events_count': 0,
                'rows_affected': 0,
                'transactions_count': 0
            }),
            'table_operations': defaultdict(lambda: defaultdict(int)),
            'transaction_timestamps': []
        }
        self.event_count = 0
    
    def _parse_with_mysqlbinlog(self, file_path, start_position=4):
        """Parse the binlog file with the mysqlbinlog utility"""
        print(f"Parsing file with the mysqlbinlog utility: {file_path}")
        print(f"Start position: {start_position}")
        
        try:
            subprocess.run(['mysqlbinlog', '--version'], capture_output=True, check=True)
        except (subprocess.CalledProcessError, FileNotFoundError):
            print("❌ mysqlbinlog not found; make sure the MySQL client tools are installed")
            return False
        
        try:
            # Reset the statistics
            self._reset_statistics()
            
            # Decode row events into commented pseudo-SQL, starting at the given position
            cmd = ['mysqlbinlog', '--base64-output=DECODE-ROWS', '-v', 
                   '--start-position', str(start_position), file_path]
            
            result = subprocess.run(cmd, capture_output=True, text=True, encoding='utf-8')
            
            if result.returncode != 0:
                print(f"❌ mysqlbinlog failed: {result.stderr}")
                return False
                
            lines = result.stdout.split('\n')
            success = self._parse_mysqlbinlog_output(lines)
            
            if success:
                self.parser_mode = "text"
                print(f"✅ mysqlbinlog parsing finished, found {len(self.transactions)} transactions")
            else:
                print("❌ mysqlbinlog parsing failed")
                
            return success
            
        except Exception as e:
            print(f"❌ mysqlbinlog parsing error: {e}")
            return False
    
    def _parse_mysqlbinlog_output(self, lines):
        """Parse the output of mysqlbinlog"""
        print(f"Parsing mysqlbinlog output, {len(lines)} lines in total")
        
        skip_events_before_start_pos = True  # True while still skipping events before the start position
        
        for i, line in enumerate(lines):
            line = line.strip()
            
            # Track the current position and timestamp
            self._parse_position_and_timestamp(line)
            
            # Skip events located before the start position
            if skip_events_before_start_pos and self.last_position is not None:
                if self.last_position < self.start_position:
                    continue
                else:
                    skip_events_before_start_pos = False
            
            # Detect transaction start
            if line == 'BEGIN' and self.current_transaction is None:
                self.current_transaction = {
                    'start_pos': self.last_position,
                    'start_time': self.last_timestamp,
                    'events': [],
                    'dml_operations': defaultdict(int),
                    'dml_rows': defaultdict(int),
                    'tables_affected': set(),
                    'transaction_id': self.last_position  # start position doubles as the ID
                }
                if self.debug:
                    print(f"Detected transaction start at position {self.last_position}")
                
            # Detect transaction commit
            elif line.startswith('COMMIT') and self.current_transaction:
                self.current_transaction['end_pos'] = self.last_position
                self.current_transaction['end_time'] = self.last_timestamp
                # Read the ID first: _finalize_transaction() resets current_transaction to None
                transaction_id = self.current_transaction['transaction_id']
                self._finalize_transaction()
                if self.debug:
                    print(f"Detected commit, finished transaction {transaction_id}")
                
            # Detect DML operations
            elif line.startswith('###') and self.current_transaction:
                self._parse_dml_operation_from_text(line)
        
        print(f"Parsing finished: processed {len(lines)} lines, found {len(self.transactions)} transactions")
        return len(self.transactions) > 0
    
    def _parse_position_and_timestamp(self, line):
        """Extract the position and timestamp from a line, if present"""
        # Position marker, e.g. "# at 1234"
        if line.startswith('# at '):
            try:
                self.last_position = int(line.split()[2])
            except (IndexError, ValueError):
                pass
                
        # Timestamp marker
        timestamp_match = re.search(r'#(\d{6}\s+\d+:\d+:\d+)', line)
        if timestamp_match:
            timestamp_str = timestamp_match.group(1)
            try:
                # Format: "251028 6:00:00" becomes "2025-10-28 06:00:00"
                date_part = timestamp_str.split()[0]
                time_part = timestamp_str.split()[1]
                
                # Add the century prefix
                year_prefix = "20" if date_part[:2] <= "50" else "19"
                full_date = f"{year_prefix}{date_part[:2]}-{date_part[2:4]}-{date_part[4:6]}"
                
                # Zero-pad the hour so strptime accepts it
                if ':' in time_part and time_part.count(':') == 2:
                    time_parts = time_part.split(':')
                    time_part = f"{time_parts[0].zfill(2)}:{time_parts[1]}:{time_parts[2]}"
                
                datetime_str = f"{full_date} {time_part}"
                self.last_timestamp = datetime.datetime.strptime(datetime_str, "%Y-%m-%d %H:%M:%S")
            except ValueError:
                try:
                    self.last_timestamp = datetime.datetime.strptime(timestamp_str, "%y%m%d %H:%M:%S")
                except ValueError:
                    pass
    
    def _parse_dml_operation_from_text(self, line):
        """Parse a DML operation from one line of mysqlbinlog -v output"""
        if not self.current_transaction:
            return
            
        # mysqlbinlog -v prints one pseudo-statement per changed row:
        #   ### INSERT INTO `db`.`table`
        #   ### UPDATE `db`.`table`
        #   ### DELETE FROM `db`.`table`
        # Matching from the start of the line keeps column values that merely
        # contain these keywords from being counted as operations.
        match = re.match(r'###\s+(INSERT INTO|UPDATE|DELETE FROM)\s+`([^`]+)`\.`([^`]+)`', line)
        if not match:
            return
        
        op_type = match.group(1).split()[0]  # 'INSERT', 'UPDATE' or 'DELETE'
        full_table_name = f"{match.group(2)}.{match.group(3)}"
        
        # Each matched line stands for one row, so row_count is 1 and the event
        # counters in text mode are per row rather than per binlog event.
        self.current_transaction['events'].append({
            'type': op_type,
            'table': full_table_name,
            'row_count': 1
        })
        
        self.current_transaction['dml_operations'][op_type] += 1
        self.current_transaction['dml_rows'][op_type] += 1
        self.current_transaction['tables_affected'].add(full_table_name)
        
        # Update the per-table operation statistics
        self.stats['table_operations'][full_table_name][op_type] += 1
    
    def _finalize_transaction(self):
        """Finish the current transaction and record it"""
        if (self.current_transaction and 
            self.current_transaction.get('start_time') and 
            self.current_transaction.get('end_time') and
            self.current_transaction.get('start_pos') is not None and
            self.current_transaction.get('end_pos') is not None):
            
            # Compute the transaction size and duration
            start_pos = self.current_transaction['start_pos']
            end_pos = self.current_transaction['end_pos']
            
            # Make sure the size is non-negative
            self.current_transaction['size'] = abs(end_pos - start_pos)
            
            start_time = self.current_transaction['start_time']
            end_time = self.current_transaction['end_time']
            
            # Duration in milliseconds
            duration_ms = (end_time - start_time).total_seconds() * 1000
            self.current_transaction['duration_ms'] = max(duration_ms, 0)
            
            self.current_transaction['event_count'] = len(self.current_transaction['events'])
            
            self.transactions.append(self.current_transaction)
            
            # Update the statistics
            self._update_statistics(self.current_transaction)
        
        self.current_transaction = None
    
    def _update_statistics(self, transaction):
        """Fold one finished transaction into the running statistics"""
        self.stats['total_transactions'] += 1
        self.stats['total_events'] += transaction['event_count']
        self.stats['total_size'] += transaction['size']
        self.stats['total_duration_ms'] += transaction['duration_ms']
        self.stats['transaction_timestamps'].append(transaction['start_time'])
        
        # Update the DML operation statistics (event counts and transactions involved)
        for op_type, event_count in transaction['dml_operations'].items():
            self.stats['dml_operations'][op_type]['events_count'] += event_count
            self.stats['dml_operations'][op_type]['transactions_count'] += 1
        
        # Update the affected-row statistics
        for op_type, row_count in transaction.get('dml_rows', {}).items():
            self.stats['dml_operations'][op_type]['rows_affected'] += row_count
            self.stats['total_dml_rows'] += row_count
    
    def generate_statistics(self):
        """Build the data for the statistics report"""
        if not self.transactions:
            return {'error': 'No transaction data found', 'total_transactions': 0}
        
        # Recompute the totals directly from the transactions so the figures are accurate
        stats = {
            'total_transactions': len(self.transactions),
            'total_events': sum(t['event_count'] for t in self.transactions),
            'total_dml_rows': sum(sum(t.get('dml_rows', {}).values()) for t in self.transactions),
            'total_size': sum(t['size'] for t in self.transactions),
            'total_duration_ms': sum(t['duration_ms'] for t in self.transactions),
            'dml_operations': defaultdict(lambda: {'events_count': 0, 'rows_affected': 0, 'transactions_count': 0}),
            'table_operations': dict(self.stats['table_operations']),  # reuse the per-table counts collected during parsing
            'transaction_timestamps': [t['start_time'] for t in self.transactions],
            'parser_mode': self.parser_mode,
            'start_position': self.start_position  # record the start position
        }
        
        # Recompute the DML operation statistics
        for transaction in self.transactions:
            for op_type, event_count in transaction['dml_operations'].items():
                stats['dml_operations'][op_type]['events_count'] += event_count
                stats['dml_operations'][op_type]['transactions_count'] += 1
            
            for op_type, row_count in transaction.get('dml_rows', {}).items():
                stats['dml_operations'][op_type]['rows_affected'] += row_count
        
        # Compute the averages
        if stats['total_transactions'] > 0:
            stats['avg_duration_ms'] = stats['total_duration_ms'] / stats['total_transactions']
            stats['avg_size'] = stats['total_size'] / stats['total_transactions']
            stats['avg_events'] = stats['total_events'] / stats['total_transactions']
            stats['avg_dml_rows'] = stats['total_dml_rows'] / stats['total_transactions']
        else:
            stats['avg_duration_ms'] = 0
            stats['avg_size'] = 0
            stats['avg_events'] = 0
            stats['avg_dml_rows'] = 0
        
        # Build the TOP10 rankings
        stats['top10_transaction_sizes'] = sorted(
            [(t['size'], t['transaction_id'], t['duration_ms'], t['event_count'], 
              sum(t.get('dml_rows', {}).values())) for t in self.transactions],
            key=lambda x: x[0], reverse=True
        )[:10]
        
        stats['top10_transaction_durations'] = sorted(
            [(t['duration_ms'], t['transaction_id'], t['size'], t['event_count'],
              sum(t.get('dml_rows', {}).values())) for t in self.transactions],
            key=lambda x: x[0], reverse=True
        )[:10]
        
        stats['top10_transaction_events'] = sorted(
            [(t['event_count'], t['transaction_id'], t['size'], t['duration_ms'],
              sum(t.get('dml_rows', {}).values())) for t in self.transactions],
            key=lambda x: x[0], reverse=True
        )[:10]
        
        # TOP10 transactions by affected rows
        stats['top10_transaction_rows'] = sorted(
            [(sum(t.get('dml_rows', {}).values()), t['transaction_id'], t['size'], 
              t['duration_ms'], t['event_count']) for t in self.transactions],
            key=lambda x: x[0], reverse=True
        )[:10]
        
        # Per-table TOP10 for each operation type, taken directly from the counts collected while parsing
        insert_tables = []
        update_tables = []
        delete_tables = []
        
        for table, ops in stats['table_operations'].items():
            if 'INSERT' in ops and ops['INSERT'] > 0:
                insert_tables.append((table, ops['INSERT']))
            if 'UPDATE' in ops and ops['UPDATE'] > 0:
                update_tables.append((table, ops['UPDATE']))
            if 'DELETE' in ops and ops['DELETE'] > 0:
                delete_tables.append((table, ops['DELETE']))
        
        stats['top10_insert_tables'] = sorted(insert_tables, key=lambda x: x[1], reverse=True)[:10]
        stats['top10_update_tables'] = sorted(update_tables, key=lambda x: x[1], reverse=True)[:10]
        stats['top10_delete_tables'] = sorted(delete_tables, key=lambda x: x[1], reverse=True)[:10]
        
        # Time range (based on transaction start times)
        if stats['transaction_timestamps']:
            stats['time_range'] = {
                'start': min(stats['transaction_timestamps']),
                'end': max(stats['transaction_timestamps'])
            }
            time_span = stats['time_range']['end'] - stats['time_range']['start']
            stats['time_span_seconds'] = time_span.total_seconds()
            stats['time_span_hours'] = time_span.total_seconds() / 3600
        else:
            stats['time_range'] = {'start': None, 'end': None}
            stats['time_span_seconds'] = 0
            stats['time_span_hours'] = 0
        
        return stats

    def save_statistics(self, filename):
        """Save the statistics to a JSON file."""
        stats = self.generate_statistics()
        
        # json cannot encode datetime objects, so write them as ISO 8601 strings
        def datetime_serializer(obj):
            if isinstance(obj, datetime.datetime):
                return obj.isoformat()
            raise TypeError(f"Object of type {type(obj)} is not JSON serializable")
        
        try:
            with open(filename, 'w', encoding='utf-8') as f:
                json.dump(stats, f, indent=2, default=datetime_serializer, ensure_ascii=False)
            print(f"✅ 统计数据已保存到: {filename}")
            return True
        except Exception as e:
            print(f"❌ 保存统计数据失败: {e}")
            return False

    @staticmethod
    def load_statistics(filename):
        """Load statistics from a JSON file written by save_statistics."""
        try:
            with open(filename, 'r', encoding='utf-8') as f:
                stats = json.load(f)
            
            # Convert the ISO 8601 strings back into datetime objects
            if 'time_range' in stats and stats['time_range']['start']:
                stats['time_range']['start'] = datetime.datetime.fromisoformat(stats['time_range']['start'])
                stats['time_range']['end'] = datetime.datetime.fromisoformat(stats['time_range']['end'])
            # The per-transaction timestamps are serialized as strings too
            stats['transaction_timestamps'] = [
                datetime.datetime.fromisoformat(ts) for ts in stats.get('transaction_timestamps', [])
            ]
            
            print(f"✅ 统计数据已从文件加载: {filename}")
            return stats
        except Exception as e:
            print(f"❌ 加载统计数据失败: {e}")
            return None
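
The save/load pair above relies on converting datetime values to ISO 8601 strings and back. Below is a minimal, self-contained sketch of that round trip on a simplified stats dictionary. The helper names `dumps_stats` and `loads_stats` and the sample values are illustrative assumptions, not part of the original module.

```python
import datetime
import json


def dumps_stats(stats):
    """Serialize a stats dict, writing datetime values as ISO 8601 strings."""
    def encode(obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        raise TypeError(f"Object of type {type(obj)} is not JSON serializable")
    return json.dumps(stats, default=encode, ensure_ascii=False)


def loads_stats(text):
    """Parse the JSON text and restore the known datetime fields."""
    stats = json.loads(text)
    time_range = stats.get('time_range') or {}
    for key in ('start', 'end'):
        if time_range.get(key):
            time_range[key] = datetime.datetime.fromisoformat(time_range[key])
    stats['transaction_timestamps'] = [
        datetime.datetime.fromisoformat(ts)
        for ts in stats.get('transaction_timestamps', [])
    ]
    return stats


# Illustrative sample data (hypothetical, shaped like the real stats dict)
sample = {
    'total_transactions': 2,
    'time_range': {
        'start': datetime.datetime(2025, 10, 28, 17, 40, 54),
        'end': datetime.datetime(2025, 10, 28, 17, 45, 46),
    },
    'transaction_timestamps': [
        datetime.datetime(2025, 10, 28, 17, 40, 54),
        datetime.datetime(2025, 10, 28, 17, 45, 46),
    ],
    'top10_transaction_sizes': [(809184, 17588346, 0.0, 99, 98304)],
}

restored = loads_stats(dumps_stats(sample))
```

Note that tuples come back as lists after the round trip; they still unpack the same way in the report loops.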


2.3 report_generator.py: report generation

#!/usr/bin/env python3
"""
MySQL binary log report generator - presentation module.
Output format adjusted to the new metric definitions.
"""

import datetime


class ReportGenerator:
    def __init__(self, stats_data):
        self.stats = stats_data
    
    def print_console_report(self):
        """Print the statistics report to the console."""
        if self.stats.get('error'):
            print(f"\n❌ {self.stats['error']}")
            return
        
        print("\n" + "=" * 80)
        print("🎯 MySQL二进制日志分析报告 (直接解析)")
        print("=" * 80)
        
        # Time range
        if 'time_range' in self.stats and self.stats['time_range']['start']:
            print(f"📅 时间范围: {self.stats['time_range']['start']} 到 {self.stats['time_range']['end']}")
            print(f"⏳ 时间跨度: {self.stats['time_span_hours']:.2f} 小时 ({self.stats['time_span_seconds']:.0f} 秒)")
        
        # 1. TOP10 transactions by size
        print(f"\n🏆 事务大小 TOP10 (字节):")
        print("─" * 90)
        if self.stats['top10_transaction_sizes']:
            for i, (size, trans_id, duration_ms, event_count, row_count) in enumerate(self.stats['top10_transaction_sizes'], 1):
                print(f"   #{i:2d} 事务 {trans_id:3d}: {size:>12,} bytes | 时长: {duration_ms:>8.3f} ms | 事件数: {event_count:>4} | 影响行数: {row_count:>8}")
        else:
            print("   📝 无数据")
        
        # 2. TOP10 transactions by duration
        print(f"\n⏱️  事务时长 TOP10 (毫秒):")
        print("─" * 90)
        if self.stats['top10_transaction_durations']:
            for i, (duration_ms, trans_id, size, event_count, row_count) in enumerate(self.stats['top10_transaction_durations'], 1):
                print(f"   #{i:2d} 事务 {trans_id:3d}: {duration_ms:>12.3f} ms | 大小: {size:>10,} bytes | 事件数: {event_count:>4} | 影响行数: {row_count:>8}")
        else:
            print("   📝 无数据")
        
        # 3. TOP10 transactions by event count
        print(f"\n📊 事务事件数 TOP10:")
        print("─" * 90)
        if self.stats['top10_transaction_events']:
            for i, (event_count, trans_id, size, duration_ms, row_count) in enumerate(self.stats['top10_transaction_events'], 1):
                print(f"   #{i:2d} 事务 {trans_id:3d}: {event_count:>8} 个事件 | 大小: {size:>10,} bytes | 时长: {duration_ms:>8.3f} ms | 影响行数: {row_count:>8}")
        else:
            print("   📝 无数据")
        
        # 4. TOP10 transactions by affected rows
        print(f"\n📈 事务影响行数 TOP10:")
        print("─" * 90)
        if self.stats['top10_transaction_rows']:
            for i, (row_count, trans_id, size, duration_ms, event_count) in enumerate(self.stats['top10_transaction_rows'], 1):
                print(f"   #{i:2d} 事务 {trans_id:3d}: {row_count:>8,} 行 | 大小: {size:>10,} bytes | 时长: {duration_ms:>8.3f} ms | 事件数: {event_count:>4}")
        else:
            print("   📝 无数据")
        
        # 5. Overall transaction statistics
        print(f"\n📈 总体事务统计:")
        print("─" * 50)
        print(f"   📦 事务总数:       {self.stats['total_transactions']:>8}")
        print(f"   📊 总事件数:       {self.stats['total_events']:>8}")
        print(f"   📈 总DML影响行数:  {self.stats.get('total_dml_rows', 0):>8}")
        print(f"   ⏱️  总时长:        {self.stats['total_duration_ms']:>12.3f} ms")
        print(f"   💾 总大小:         {self.stats['total_size']:>12,} bytes")
        print(f"   📊 平均时长:       {self.stats['avg_duration_ms']:>12.3f} ms")
        print(f"   📏 平均大小:       {self.stats['avg_size']:>12,.0f} bytes")
        print(f"   🔢 平均事件数:     {self.stats['avg_events']:>10.1f}")
        print(f"   📈 平均影响行数:   {self.stats.get('avg_dml_rows', 0):>10.0f}")
        
        # 6. DML operation statistics
        print(f"\n🛠️  DML操作统计:")
        print("─" * 50)
        if self.stats['dml_operations']:
            dml_total_events = sum(op_stats['events_count'] for op_stats in self.stats['dml_operations'].values())
            dml_total_rows = sum(op_stats['rows_affected'] for op_stats in self.stats['dml_operations'].values())
            
            print(f"   🔍 事件数验证: DML事件总数 = {dml_total_events}, 总事件数 = {self.stats['total_events']}")
            print(f"   🔍 行数验证: DML影响行数 = {dml_total_rows}, 总DML影响行数 = {self.stats.get('total_dml_rows', 0)}")
            
            for op_type, op_stats in self.stats['dml_operations'].items():
                if op_stats['events_count'] > 0 or op_stats['rows_affected'] > 0:
                    event_percent = (op_stats['events_count'] / dml_total_events * 100) if dml_total_events > 0 else 0
                    row_percent = (op_stats['rows_affected'] / dml_total_rows * 100) if dml_total_rows > 0 else 0
                    trans_percent = (op_stats['transactions_count'] / self.stats['total_transactions'] * 100) if self.stats['total_transactions'] > 0 else 0
                    
                    print(f"   {op_type:>6}:")
                    print(f"        DML事件数:     {op_stats['events_count']:>8} ({event_percent:>5.1f}%)")
                    print(f"        DML影响行数:   {op_stats['rows_affected']:>8} ({row_percent:>5.1f}%)")
                    print(f"        涉及事务数:    {op_stats['transactions_count']:>8} ({trans_percent:>5.1f}%)")
        else:
            print("   📝 无DML操作")
        
        # 7. TOP10 tables per DML operation
        print(f"\n📥 INSERT操作 TOP10 表:")
        print("─" * 50)
        if self.stats.get('top10_insert_tables'):
            for i, (table, row_count) in enumerate(self.stats['top10_insert_tables'], 1):
                print(f"   #{i:2d} {table:<40} {row_count:>8} 行操作")
        else:
            print("   📝 无INSERT操作")
        
        print(f"\n📤 DELETE操作 TOP10 表:")
        print("─" * 50)
        if self.stats.get('top10_delete_tables'):
            for i, (table, row_count) in enumerate(self.stats['top10_delete_tables'], 1):
                print(f"   #{i:2d} {table:<40} {row_count:>8} 行操作")
        else:
            print("   📝 无DELETE操作")
        
        print(f"\n🔄 UPDATE操作 TOP10 表:")
        print("─" * 50)
        if self.stats.get('top10_update_tables'):
            for i, (table, row_count) in enumerate(self.stats['top10_update_tables'], 1):
                print(f"   #{i:2d} {table:<40} {row_count:>8} 行操作")
        else:
            print("   📝 无UPDATE操作")
        
        print("\n" + "=" * 80)
        print("✅ 二进制日志分析完成")
    
    def generate_html_report(self, output_file=None):
        """Generate the report in HTML format."""
        if not output_file:
            output_file = f"binlog_report_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.html"
        
        html_content = f"""
        <!DOCTYPE html>
        <html lang="zh-CN">
        <head>
            <meta charset="UTF-8">
            <meta name="viewport" content="width=device-width, initial-scale=1.0">
            <title>MySQL二进制日志分析报告</title>
            <style>
                body {{ font-family: Arial, sans-serif; margin: 20px; background-color: #f5f5f5; }}
                .container {{ max-width: 1200px; margin: 0 auto; background: white; padding: 20px; border-radius: 10px; box-shadow: 0 2px 10px rgba(0,0,0,0.1); }}
                .header {{ text-align: center; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 8px; margin-bottom: 30px; }}
                .section {{ margin-bottom: 30px; padding: 20px; border: 1px solid #e0e0e0; border-radius: 8px; background: #fafafa; }}
                .section-title {{ color: #333; border-bottom: 2px solid #667eea; padding-bottom: 10px; margin-bottom: 15px; }}
                table {{ width: 100%; border-collapse: collapse; margin-top: 10px; }}
                th, td {{ padding: 12px; text-align: left; border-bottom: 1px solid #ddd; }}
                th {{ background-color: #667eea; color: white; }}
                tr:nth-child(even) {{ background-color: #f8f9fa; }}
                tr:hover {{ background-color: #e9ecef; }}
                .metric {{ display: inline-block; margin: 10px; padding: 15px; background: white; border-radius: 8px; box-shadow: 0 2px 5px rgba(0,0,0,0.1); min-width: 150px; text-align: center; }}
                .metric-value {{ font-size: 24px; font-weight: bold; color: #667eea; }}
                .metric-label {{ font-size: 14px; color: #666; }}
                .top10-table {{ font-size: 14px; }}
                .timestamp {{ color: #666; font-size: 12px; text-align: right; }}
                .note {{ background: #fff3cd; border: 1px solid #ffeaa7; border-radius: 4px; padding: 10px; margin: 10px 0; }}
                .validation {{ background: #d1ecf1; border: 1px solid #bee5eb; border-radius: 4px; padding: 10px; margin: 10px 0; }}
            </style>
        </head>
        <body>
            <div class="container">
                <div class="header">
                    <h1>📊 MySQL二进制日志分析报告</h1>
                    <p>直接解析模式 | 生成时间: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</p>
                </div>
        """
        
        if self.stats.get('error'):
            html_content += f"""
                <div class="section">
                    <h2 class="section-title">❌ 错误</h2>
                    <p>{self.stats['error']}</p>
                </div>
            """
        else:
            # Overall metrics
            html_content += """
                <div class="section">
                    <h2 class="section-title">📈 总体指标</h2>
                    <div style="text-align: center;">
            """
            
            metrics = [
                (self.stats['total_transactions'], '事务总数'),
                (self.stats['total_events'], '总事件数'),
                (self.stats.get('total_dml_rows', 0), '总DML影响行数'),
                (f"{self.stats['total_duration_ms']:.2f}ms", '总时长'),
                (f"{self.stats['total_size']:,}", '总大小(bytes)')
            ]
            
            for value, label in metrics:
                html_content += f"""
                    <div class="metric">
                        <div class="metric-value">{value}</div>
                        <div class="metric-label">{label}</div>
                    </div>
                """
            
            html_content += """
                    </div>
                </div>
            """
            
            # Time range
            if 'time_range' in self.stats and self.stats['time_range']['start']:
                html_content += f"""
                <div class="section">
                    <h2 class="section-title">📅 时间范围</h2>
                    <p><strong>开始:</strong> {self.stats['time_range']['start']}</p>
                    <p><strong>结束:</strong> {self.stats['time_range']['end']}</p>
                    <p><strong>时间跨度:</strong> {self.stats['time_span_hours']:.2f} 小时 ({self.stats['time_span_seconds']:.0f} 秒)</p>
                </div>
                """
            
            # TOP10 tables
            def create_top10_table(title, data, primary_field, secondary_fields):
                if not data:
                    return "<p>无数据</p>"
                
                def format_cell(value, field):
                    # Durations keep three decimals; sizes and counts get thousands separators
                    if "ms" in field.lower():
                        return f"{value:.3f}"
                    return f"{value:,}"
                
                table_html = f"""
                <table class="top10-table">
                    <tr>
                        <th>排名</th>
                        <th>事务ID</th>
                        <th>{primary_field}</th>
                """
                
                for field in secondary_fields:
                    table_html += f"<th>{field}</th>"
                
                table_html += "</tr>"
                
                # Each item is (primary value, transaction ID, secondary values...)
                for i, item in enumerate(data, 1):
                    table_html += f"""
                    <tr>
                        <td>#{i}</td>
                        <td>事务 {item[1]}</td>
                        <td>{format_cell(item[0], primary_field)}</td>
                    """
                    
                    for value, field in zip(item[2:], secondary_fields):
                        table_html += f"<td>{format_cell(value, field)}</td>"
                    
                    table_html += "</tr>"
                
                table_html += "</table>"
                return table_html
            
            html_content += f"""
            <div class="section">
                <h2 class="section-title">🏆 TOP10 排名</h2>
                <div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(400px, 1fr)); gap: 20px;">
                    <div>
                        <h3>💾 事务大小</h3>
                        {create_top10_table("事务大小", self.stats['top10_transaction_sizes'], "大小(bytes)", ["时长(ms)", "事件数", "影响行数"])}
                    </div>
                    <div>
                        <h3>⏱️ 事务时长</h3>
                        {create_top10_table("事务时长", self.stats['top10_transaction_durations'], "时长(ms)", ["大小(bytes)", "事件数", "影响行数"])}
                    </div>
                    <div>
                        <h3>📊 事务事件数</h3>
                        {create_top10_table("事务事件数", self.stats['top10_transaction_events'], "事件数", ["大小(bytes)", "时长(ms)", "影响行数"])}
                    </div>
                    <div>
                        <h3>📈 事务影响行数</h3>
                        {create_top10_table("事务影响行数", self.stats['top10_transaction_rows'], "影响行数", ["大小(bytes)", "时长(ms)", "事件数"])}
                    </div>
                </div>
            </div>
            """
            
            # DML operation statistics
            if self.stats['dml_operations']:
                dml_total_events = sum(op_stats['events_count'] for op_stats in self.stats['dml_operations'].values())
                dml_total_rows = sum(op_stats['rows_affected'] for op_stats in self.stats['dml_operations'].values())
                html_content += f"""
                <div class="section">
                    <h2 class="section-title">🛠️ DML操作统计</h2>
                    <div class="validation">
                        <strong>数据验证:</strong><br>
                        DML事件总数 = {dml_total_events}, 总事件数 = {self.stats['total_events']}<br>
                        DML影响行数 = {dml_total_rows}, 总DML影响行数 = {self.stats.get('total_dml_rows', 0)}
                    </div>
                    <table>
                        <tr>
                            <th>操作类型</th>
                            <th>DML事件数</th>
                            <th>事件占比</th>
                            <th>DML影响行数</th>
                            <th>行数占比</th>
                            <th>涉及事务数</th>
                            <th>事务占比</th>
                        </tr>
                """
                for op_type, op_stats in self.stats['dml_operations'].items():
                    if op_stats['events_count'] > 0 or op_stats['rows_affected'] > 0:
                        event_percent = (op_stats['events_count'] / dml_total_events * 100) if dml_total_events > 0 else 0
                        row_percent = (op_stats['rows_affected'] / dml_total_rows * 100) if dml_total_rows > 0 else 0
                        trans_percent = (op_stats['transactions_count'] / self.stats['total_transactions'] * 100) if self.stats['total_transactions'] > 0 else 0
                        html_content += f"""
                        <tr>
                            <td>{op_type}</td>
                            <td>{op_stats['events_count']}</td>
                            <td>{event_percent:.1f}%</td>
                            <td>{op_stats['rows_affected']}</td>
                            <td>{row_percent:.1f}%</td>
                            <td>{op_stats['transactions_count']}</td>
                            <td>{trans_percent:.1f}%</td>
                        </tr>
                        """
                html_content += "</table></div>"
            
            # TOP10 tables per operation
            def create_table_top10(title, data):
                if not data:
                    return "<p>无数据</p>"
                
                table_html = f"""
                <table class="top10-table">
                    <tr>
                        <th>排名</th>
                        <th>表名</th>
                        <th>影响行数</th>
                    </tr>
                """
                for i, (table, row_count) in enumerate(data, 1):
                    table_html += f"""
                    <tr>
                        <td>#{i}</td>
                        <td>{table}</td>
                        <td>{row_count:,}</td>
                    </tr>
                    """
                table_html += "</table>"
                return table_html
            
            html_content += f"""
            <div class="section">
                <h2 class="section-title">📋 表操作统计</h2>
                <div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(300px, 1fr)); gap: 20px;">
                    <div>
                        <h3>📥 INSERT操作TOP10</h3>
                        {create_table_top10("INSERT", self.stats.get('top10_insert_tables', []))}
                    </div>
                    <div>
                        <h3>📤 DELETE操作TOP10</h3>
                        {create_table_top10("DELETE", self.stats.get('top10_delete_tables', []))}
                    </div>
                    <div>
                        <h3>🔄 UPDATE操作TOP10</h3>
                        {create_table_top10("UPDATE", self.stats.get('top10_update_tables', []))}
                    </div>
                </div>
            </div>
            """
            
            # Explanatory notes
            html_content += """
            <div class="section">
                <h2 class="section-title">📝 说明</h2>
                <div class="note">
                    <p><strong>统计指标定义:</strong></p>
                    <ul>
                        <li><strong>总事件数:</strong> binlog中记录的事件总数</li>
                        <li><strong>总DML影响行数:</strong> DML操作影响的数据行数总和</li>
                        <li><strong>DML事件数:</strong> DML操作的事件数量</li>
                        <li><strong>DML影响行数:</strong> DML操作影响的数据行数</li>
                    </ul>
                </div>
            </div>
            """
        
        html_content += """
                <div class="timestamp">
                    <p>报告生成工具: MySQL Binary Log Analyzer (直接解析模式)</p>
                </div>
            </div>
        </body>
        </html>
        """
        
        try:
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(html_content)
            print(f"✅ HTML报告已生成: {output_file}")
            return True
        except Exception as e:
            print(f"❌ 生成HTML报告失败: {e}")
            return False
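
The HTML report interpolates table names and error messages straight into the markup. Below is a minimal sketch of building the per-table rows with escaping from the standard library `html` module. The function name `render_table_rows` and the sample data are hypothetical and only illustrate the idea; they are not part of the original `ReportGenerator`.

```python
import html


def render_table_rows(data):
    """Build <tr> rows for (table_name, row_count) pairs, escaping the names."""
    rows = []
    for rank, (table, row_count) in enumerate(data, 1):
        rows.append(
            "<tr>"
            f"<td>#{rank}</td>"
            f"<td>{html.escape(table)}</td>"
            f"<td>{row_count:,}</td>"
            "</tr>"
        )
    return "\n".join(rows)


# Illustrative input; the second name contains characters that need escaping
sample = [("test999.test993", 98304), ("db.<odd&name>", 4000)]
markup = render_table_rows(sample)
print(markup)
```

The same call could wrap the error message before it is placed inside the `<p>` element.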


3. Sample run

[root@zb-yunweitest-mysql-204-200 BinlogAny]# python3 binlog_analyzer_main.py 3306-bin.000020 --host 127.0.0.1 --user root --port 3306  --start-pos 16691054
请输入MySQL密码: 
正在使用MySQL连接解析二进制binlog文件: 3306-bin.000020
MySQL连接信息: 127.0.0.1:3306 (用户: root)
起始解析位置: 16691054
开始解析binlog事件...
WARNING:root:
                    Before using MARIADB 10.5.0 and MYSQL 8.0.14 versions,
                    use python-mysql-replication version Before 1.0 version 
解析统计:
  - 处理的事件总数: 2392
  - 找到的事务数: 6
  - 总DML操作数: 200608

📊 解析模式: binary

================================================================================
🎯 MySQL二进制日志分析报告 (直接解析)
================================================================================
📅 时间范围: 2025-10-28 17:40:54 到 2025-10-28 17:45:46
⏳ 时间跨度: 0.08 小时 (292 秒)

🏆 事务大小 TOP10 (字节):
──────────────────────────────────────────────────────────────────────────────────────────
   # 1 事务 17588346:      809,184 bytes | 时长:    0.000 ms | 事件数:   99 | 影响行数:    98304
   # 2 事务 16691362:      805,344 bytes | 时长:    0.000 ms | 事件数:   99 | 影响行数:    98304
   # 3 事务 17539676:       24,195 bytes | 时长:    0.000 ms | 事件数:    3 | 影响行数:     1000
   # 4 事务 17564011:       24,195 bytes | 时长:    0.000 ms | 事件数:    3 | 影响行数:     1000
   # 5 事务 17516341:       23,195 bytes | 时长:    0.000 ms | 事件数:    3 | 影响行数:     1000
   # 6 事务 17496846:       19,355 bytes | 时长:    0.000 ms | 事件数:    3 | 影响行数:     1000

⏱️  事务时长 TOP10 (毫秒):
──────────────────────────────────────────────────────────────────────────────────────────
   # 1 事务 16691362:        0.000 ms | 大小:    805,344 bytes | 事件数:   99 | 影响行数:    98304
   # 2 事务 17496846:        0.000 ms | 大小:     19,355 bytes | 事件数:    3 | 影响行数:     1000
   # 3 事务 17516341:        0.000 ms | 大小:     23,195 bytes | 事件数:    3 | 影响行数:     1000
   # 4 事务 17539676:        0.000 ms | 大小:     24,195 bytes | 事件数:    3 | 影响行数:     1000
   # 5 事务 17564011:        0.000 ms | 大小:     24,195 bytes | 事件数:    3 | 影响行数:     1000
   # 6 事务 17588346:        0.000 ms | 大小:    809,184 bytes | 事件数:   99 | 影响行数:    98304

📊 事务事件数 TOP10:
──────────────────────────────────────────────────────────────────────────────────────────
   # 1 事务 16691362:       99 个事件 | 大小:    805,344 bytes | 时长:    0.000 ms | 影响行数:    98304
   # 2 事务 17588346:       99 个事件 | 大小:    809,184 bytes | 时长:    0.000 ms | 影响行数:    98304
   # 3 事务 17496846:        3 个事件 | 大小:     19,355 bytes | 时长:    0.000 ms | 影响行数:     1000
   # 4 事务 17516341:        3 个事件 | 大小:     23,195 bytes | 时长:    0.000 ms | 影响行数:     1000
   # 5 事务 17539676:        3 个事件 | 大小:     24,195 bytes | 时长:    0.000 ms | 影响行数:     1000
   # 6 事务 17564011:        3 个事件 | 大小:     24,195 bytes | 时长:    0.000 ms | 影响行数:     1000

📈 事务影响行数 TOP10:
──────────────────────────────────────────────────────────────────────────────────────────
   # 1 事务 16691362:   98,304 行 | 大小:    805,344 bytes | 时长:    0.000 ms | 事件数:   99
   # 2 事务 17588346:   98,304 行 | 大小:    809,184 bytes | 时长:    0.000 ms | 事件数:   99
   # 3 事务 17496846:    1,000 行 | 大小:     19,355 bytes | 时长:    0.000 ms | 事件数:    3
   # 4 事务 17516341:    1,000 行 | 大小:     23,195 bytes | 时长:    0.000 ms | 事件数:    3
   # 5 事务 17539676:    1,000 行 | 大小:     24,195 bytes | 时长:    0.000 ms | 事件数:    3
   # 6 事务 17564011:    1,000 行 | 大小:     24,195 bytes | 时长:    0.000 ms | 事件数:    3

📈 总体事务统计:
──────────────────────────────────────────────────
   📦 事务总数:              6
   📊 总事件数:            210
   📈 总DML影响行数:    200608
   ⏱️  总时长:               0.000 ms
   💾 总大小:            1,705,468 bytes
   📊 平均时长:              0.000 ms
   📏 平均大小:            284,245 bytes
   🔢 平均事件数:           35.0
   📈 平均影响行数:        33435

🛠️  DML操作统计:
──────────────────────────────────────────────────
   🔍 事件数验证: DML事件总数 = 210, 总事件数 = 210
   🔍 行数验证: DML影响行数 = 200608, 总DML影响行数 = 200608
   INSERT:
        DML事件数:           99 ( 47.1%)
        DML影响行数:      98304 ( 49.0%)
        涉及事务数:           1 ( 16.7%)
   UPDATE:
        DML事件数:           12 (  5.7%)
        DML影响行数:       4000 (  2.0%)
        涉及事务数:           4 ( 66.7%)
   DELETE:
        DML事件数:           99 ( 47.1%)
        DML影响行数:      98304 ( 49.0%)
        涉及事务数:           1 ( 16.7%)

📥 INSERT操作 TOP10 表:
──────────────────────────────────────────────────
   # 1 test999.test993                             98304 行操作

📤 DELETE操作 TOP10 表:
──────────────────────────────────────────────────
   # 1 test999.test993                             98304 行操作

🔄 UPDATE操作 TOP10 表:
──────────────────────────────────────────────────
   # 1 test999.test993                              4000 行操作

================================================================================
✅ 二进制日志分析完成
✅ HTML报告已生成: binlog_report_20251028_201139.html
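
The totals and averages in the output above can be cross-checked from the six listed transactions. The short sketch below redoes the arithmetic, with the figures copied from the run above.

```python
# (transaction_id, size_bytes, event_count, affected_rows), copied from the sample run
transactions = [
    (17588346, 809184, 99, 98304),
    (16691362, 805344, 99, 98304),
    (17539676, 24195, 3, 1000),
    (17564011, 24195, 3, 1000),
    (17516341, 23195, 3, 1000),
    (17496846, 19355, 3, 1000),
]

total_size = sum(size for _, size, _, _ in transactions)
total_events = sum(events for _, _, events, _ in transactions)
total_rows = sum(rows for _, _, _, rows in transactions)
count = len(transactions)

avg_size = total_size / count
avg_events = total_events / count
avg_rows = total_rows / count

# Share of INSERT events among all DML events (99 INSERT events in the run above)
insert_event_percent = 99 / total_events * 100

print(f"total size:     {total_size:,} bytes")
print(f"total events:   {total_events}")
print(f"total rows:     {total_rows}")
print(f"average size:   {avg_size:,.0f} bytes")
print(f"average events: {avg_events:.1f}")
print(f"average rows:   {avg_rows:.0f}")
print(f"INSERT events:  {insert_event_percent:.1f}%")
```

The results agree with the report: 1,705,468 bytes, 210 events and 200608 rows in total, with averages of 284,245 bytes, 35.0 events and 33435 rows per transaction.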


4. HTML output

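5. Main advantages and improvements

5.1. Direct binary parsing

No conversion step with the mysqlbinlog tool
Binlog events are streamed through a MySQL connection with mysql-replication instead of being parsed from text
No intermediate text file has to be written and re-read, so parsing is more efficient

5.2. More accurate data

Transaction boundaries are identified precisely from BEGIN and XID events
Row-level changes are counted accurately
ROW-format binlogs are parsed precisely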
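
How BEGIN and XID events delimit a transaction can be sketched without a live server. The example below uses a hypothetical `Event` record whose fields (`kind`, `log_pos`, `timestamp`, `rows`) are stand-ins chosen for illustration, not the mysql-replication event classes. It computes each transaction's size as the difference between log positions, which is an approximation that excludes the BEGIN event itself.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Simplified stand-in for a binlog event (hypothetical, for illustration)."""
    kind: str        # 'BEGIN', 'INSERT', 'UPDATE', 'DELETE' or 'XID'
    log_pos: int     # end position of the event in the binlog
    timestamp: int   # event time in seconds since the epoch
    rows: int = 0    # number of rows carried by a row event


def group_transactions(events):
    """Group a flat event sequence into transactions delimited by BEGIN and XID."""
    transactions = []
    current = None
    for event in events:
        if event.kind == 'BEGIN':
            current = {'start_pos': event.log_pos, 'start_time': event.timestamp,
                       'event_count': 0, 'dml_rows': {}}
        elif current is None:
            continue  # ignore events outside a transaction
        elif event.kind == 'XID':
            current['size'] = event.log_pos - current['start_pos']
            current['duration_ms'] = (event.timestamp - current['start_time']) * 1000
            transactions.append(current)
            current = None
        else:
            current['event_count'] += 1
            rows_so_far = current['dml_rows'].get(event.kind, 0)
            current['dml_rows'][event.kind] = rows_so_far + event.rows
    return transactions


# Illustrative events (positions and timestamps are made up)
events = [
    Event('BEGIN', 1000, 1761644454),
    Event('INSERT', 1400, 1761644454, rows=3),
    Event('INSERT', 1900, 1761644455, rows=2),
    Event('XID', 1931, 1761644455),
    Event('UPDATE', 2100, 1761644456, rows=1),  # outside any transaction, skipped
]
result = group_transactions(events)
print(result)
```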

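5.3. Enhanced features

A start position and a server ID can be specified
Timestamps are read from the event headers rather than parsed from text
Table and database names are available directly on the row events

5.4. Performance optimizations

Events are read as a stream, which keeps memory usage lower
Statistics are accumulated while parsing, with no preprocessing pass
Large binlog files can be processed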
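
The streaming idea in 5.4 can be illustrated with a generator pipeline that keeps only running counters in memory. This is a sketch over plain tuples, not the original parser. `row_events` and `aggregate` are hypothetical names, and the sample rows are made up; in the real tool the tuples would be derived from the events yielded by the binlog stream.

```python
from collections import defaultdict


def row_events():
    """Yield (table, operation, row_count) tuples one at a time (sample data)."""
    yield ('test999.test993', 'INSERT', 1000)
    yield ('test999.test993', 'UPDATE', 250)
    yield ('test999.test993', 'INSERT', 24)
    yield ('test999.other', 'DELETE', 10)


def aggregate(events):
    """Accumulate per-table, per-operation row counts in a single pass."""
    table_operations = defaultdict(lambda: defaultdict(int))
    for table, operation, row_count in events:
        table_operations[table][operation] += row_count
    # Convert to plain dicts so the result serializes and compares cleanly
    return {table: dict(ops) for table, ops in table_operations.items()}


table_stats = aggregate(row_events())

# Same ranking rule as the report: sort by row count, keep the first ten
top_insert = sorted(
    ((table, ops['INSERT']) for table, ops in table_stats.items() if ops.get('INSERT', 0) > 0),
    key=lambda item: item[1],
    reverse=True,
)[:10]
print(top_insert)
```

Because the generator is consumed one tuple at a time, memory use depends on the number of distinct tables, not on the number of events.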

6. Usage

# Basic usage (prints the report to the console and also writes an HTML report)
python3 binlog_analyzer_main.py 3306-bin.000020 --host 127.0.0.1 --user root --port 3306

# Write the HTML report to a given file
python3 binlog_analyzer_main.py 3306-bin.000020 --host 127.0.0.1 --user root --port 3306 --html report.html

# Specify a start position (not yet supported)
python3 binlog_analyzer_main.py 3306-bin.000020 --host 127.0.0.1 --user root --port 3306 --start-pos 16691054

# Only generate the HTML report, without console output
python3 binlog_analyzer_main.py 3306-bin.000020 --host 127.0.0.1 --user root --port 3306 --no-console --html report.html
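
A command line with the options used above could be declared with `argparse` roughly as follows. This is a sketch, not the original `parse_arguments`; the defaults and help strings are assumptions.

```python
import argparse


def build_parser():
    """Declare the options that appear in the usage examples."""
    parser = argparse.ArgumentParser(description="Analyze a MySQL binary log")
    parser.add_argument('filename', help="binlog file name, e.g. 3306-bin.000020")
    parser.add_argument('--host', default='127.0.0.1', help="MySQL host (assumed default)")
    parser.add_argument('--user', default='root', help="MySQL user (assumed default)")
    parser.add_argument('--port', type=int, default=3306, help="MySQL port (assumed default)")
    parser.add_argument('--start-pos', type=int, default=None, help="binlog position to start from")
    parser.add_argument('--html', metavar='FILE', default=None, help="path of the HTML report")
    parser.add_argument('--no-console', action='store_true', help="skip the console report")
    return parser


# Parse the last usage example from the list above
args = build_parser().parse_args([
    '3306-bin.000020', '--host', '127.0.0.1', '--user', 'root',
    '--port', '3306', '--no-console', '--html', 'report.html',
])
print(args)
```

`argparse` maps `--start-pos` and `--no-console` to the attributes `args.start_pos` and `args.no_console`.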

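7. Notes

Dependencies: make sure mysql-replication and pymysql are installed
File permissions: read permission on the binary log file is required
MySQL version: the binary log format of MySQL 5.6+ is supported
Binlog format: ROW format is recommended for the best parsing results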
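
The dependency note can be checked up front, before any parsing starts. Below is a small sketch using `importlib.util.find_spec`. The helper name `missing_modules` is hypothetical; `pymysqlreplication` is the import name of the mysql-replication package.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ['pymysqlreplication', 'pymysql']
missing = missing_modules(required)
if missing:
    print("Missing dependencies: " + ", ".join(missing))
    print("Install with: pip install mysql-replication pymysql")
else:
    print("All dependencies are available")
```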

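Compared with the text-parsing version, this direct-parsing version is more accurate and more efficient, which makes it well suited to production use.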

posted on 2025-10-28 13:53 xibuhaohao
