gRPC Client Optimization
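📚 Before You Read
- This post is for learning and reference only.
- Understand the approach first, then implement it yourself.
- Questions and discussion are welcome.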
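The Story Begins: The Firmware Upgrade Flow
Our task was to implement a firmware upgrade flow controlled over gRPC. Logically, it is very straightforward:
- The client uploads the firmware file to the server through a gRPC client-streaming RPC.
- Once the whole file has arrived, the server runs a script (for example upgrade-rootfs.sh) to apply the update.
- At the end, the script reboots the server so that the new firmware is loaded.
- After the upload, the client enters a polling loop and keeps querying the server's status until the server responds successfully, confirming that the new firmware is running.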
A sample run of the C++ client:
ezhuzix@CNNJ002931:/mnt/c/Users/ezhuzix/repo_o/nib_application/grpc$ ./../build/binaries/grpc_client fwup 192.168.2.61:50050 ./TIB3_rpi_software_package.tar "CXP9024418/1" R1A
Firmware Update Client
Server: 192.168.2.61:50050
File: ./TIB3_rpi_software_package.tar
Product: CXP9024418/1
RState: R1A
------------------------
Firmware client connected to: 192.168.2.61:50050
Testing firmware connection...
[OK] Firmware connection test succeeded
Starting complete firmware upgrade process...
Step 1: Uploading firmware...
Testing upload: ./TIB3_rpi_software_package.tar
File size: 137553920 bytes
[OK] Software item sent
Progress: 7% (10485760/137553920 bytes)
Progress: 15% (20971520/137553920 bytes)
Progress: 22% (31457280/137553920 bytes)
Progress: 30% (41943040/137553920 bytes)
Progress: 38% (52428800/137553920 bytes)
Progress: 45% (62914560/137553920 bytes)
Progress: 53% (73400320/137553920 bytes)
Progress: 60% (83886080/137553920 bytes)
Progress: 68% (94371840/137553920 bytes)
Progress: 76% (104857600/137553920 bytes)
Progress: 83% (115343360/137553920 bytes)
Progress: 91% (125829120/137553920 bytes)
Progress: 99% (136314880/137553920 bytes)
Progress: 100% (137553920/137553920 bytes)
[OK] All data sent (132 chunks)
[OK] Firmware upload succeeded
Server response: (code: 0)
[OK] Firmware upload completed
Step 2: Waiting for device to restart and come back online...
Waiting for device to start upgrade (2 minutes)...
Still waiting for upgrade to start... (30/120 seconds)
Still waiting for upgrade to start... (60/120 seconds)
Still waiting for upgrade to start... (90/120 seconds)
Still waiting for upgrade to start... (120/120 seconds)
Starting to poll GetSWInfos every 5 seconds (max 10 minutes)...
[OK] Device is back online! New firmware is running.
Current software information:
- Name: TIB3
- Product Number: CXC1744343/2-R1E-0-g2aefebe
- R-State: R1E
- Comment: Build time: 2025-05-27 05:48:17+00:00
[OK] Firmware upgrade completed successfully!
[SUCCESS] Operation completed successfully!
Below, each part of the terminal output is mapped to the code that produced it, in execution order:
1. Initial output
Terminal output:
Firmware Update Client
Server: 192.168.2.61:50050
File: ./TIB3_rpi_software_package.tar
Product: CXP9024418/1
RState: R1A
------------------------
Corresponding code (grpc_client_main.cc):
std::cout << "Firmware Update Client" << std::endl;
std::cout << "Server: " << server_address << std::endl;
std::cout << "File: " << file_path << std::endl;
std::cout << "Product: "
          << (product_number.empty() ? "(empty)" : product_number)
          << std::endl;
std::cout << "RState: " << (rstate.empty() ? "(empty)" : rstate)
          << std::endl;
std::cout << "------------------------" << std::endl;
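Analysis:
- Plain console output that echoes the command-line arguments.
- Uses std::cout for formatted output. An empty product number or R-state is printed as "(empty)".
2. Client connection
Terminal output:
Firmware client connected to: 192.168.2.61:50050
Corresponding code (simple_firmware_client.cc):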
SimpleFirmwareClient::SimpleFirmwareClient(const std::string &server_address)
    : BaseGrpcClient(server_address) {
  auto channel = CreateChannel();
  firmware_stub_ = proto::raptor2::ms::test_interface::
      TestInterfaceFirmwareUpdateService::NewStub(channel);
  config_stub_ = proto::raptor2::ms::test_interface::
      TestInterfaceConfigurationService::NewStub(channel);
  std::cout << "Firmware client connected to: " << server_address << std::endl;
}
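Analysis:
- The constructor creates the gRPC channel and the stubs.
- CreateChannel(): inherited from BaseGrpcClient. It creates the gRPC communication channel.
- NewStub(channel): creates a client stub used to invoke the remote methods. The firmware stub and the configuration stub share the same channel.
- Unless CreateChannel() explicitly waits for the connection, a gRPC channel connects lazily. In that case this message only means that the channel and stubs were created, and the real reachability check is the connection test in step 3.
3. Connection test
Terminal output:
Testing firmware connection...
[OK] Firmware connection test succeeded
Corresponding code (simple_firmware_client.cc):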
bool SimpleFirmwareClient::TestConnection() {
  std::cout << "Testing firmware connection..." << std::endl;
  ClientContext context;
  ProductionStatus response;
  FirmwareUpdateRequest request;
  request.set_dut_position(0);
  std::unique_ptr<ClientWriter<FirmwareUpdateRequest>> writer(
      firmware_stub_->FirmwareUpdate(&context, &response));
  bool write_ok = writer->Write(request);
  writer->WritesDone();
  Status status = writer->Finish();
  return CheckStatus(status, "Firmware connection test") && write_ok;
}
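gRPC analysis:
- Create a ClientContext: it holds per-call metadata, the deadline, and similar settings.
- Create the stream writer: FirmwareUpdate(&context, &response) is a client-streaming RPC and returns a std::unique_ptr<ClientWriter<FirmwareUpdateRequest>>.
- Write the request: writer->Write(request) sends the test message.
- Close the write side: writer->WritesDone() tells the server that the stream is complete.
- Get the status: writer->Finish() waits for the server's response and returns the final Status.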
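For comparison, Python's grpcio exposes the same client-streaming lifecycle through an iterator. The stub method takes a request iterator and returns the single response once the server has answered. Each yield plays the role of Write(), the generator finishing plays the role of WritesDone(), and the call returning plays the role of Finish(). The sketch below is a minimal, gRPC-free illustration under that mapping. fake_firmware_update, Request and Reply are hypothetical stand-ins for the real stub method and messages.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Request:
    """Hypothetical stand-in for FirmwareUpdateRequest."""
    dut_position: int = 0


@dataclass
class Reply:
    """Hypothetical stand-in for the ProductionStatus reply."""
    code: int
    message: str


def fake_firmware_update(requests: Iterable[Request]) -> Reply:
    """Mimics a client-streaming handler: drain the whole stream, then answer once."""
    count = sum(1 for _ in requests)  # completes only when the client stops sending
    return Reply(code=0, message=f"received {count} message(s)")


def connection_probe() -> Iterator[Request]:
    # One yield ~ writer->Write(request); returning ~ writer->WritesDone()
    yield Request(dut_position=0)


reply = fake_firmware_update(connection_probe())  # returning ~ writer->Finish()
print(reply.code, reply.message)  # 0 received 1 message(s)
```

In real grpcio code, the call would look like stub.FirmwareUpdate(connection_probe(), timeout=10), which is what the Python client later in this post does.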
4. Starting the firmware upgrade
Terminal output:
Starting complete firmware upgrade process...
Step 1: Uploading firmware...
Testing upload: ./TIB3_rpi_software_package.tar
File size: 137553920 bytes
Corresponding code (simple_firmware_client.cc):
bool SimpleFirmwareClient::FirmwareUpgrade(const std::string &file_path,
                                           const std::string &product_number,
                                           const std::string &rstate) {
  std::cout << "Starting complete firmware upgrade process..." << std::endl;
  std::cout << "Step 1: Uploading firmware..." << std::endl;
  // Calls UploadWithManualInfo(...) (details elided)
}
bool SimpleFirmwareClient::UploadWithManualInfo(...) {
  std::cout << "Testing upload: " << file_path << std::endl;
  // std::ios::ate opens the file positioned at the end, so tellg() yields the size
  std::ifstream file(file_path, std::ios::binary | std::ios::ate);
  size_t file_size = file.tellg();
  std::cout << "File size: " << file_size << " bytes" << std::endl;
  file.seekg(0, std::ios::beg);  // rewind before the chunked reads in step 6
  // ... (remaining upload logic elided)
}
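Opening the file with std::ios::ate leaves the read position at the end. tellg() then returns the size, but the stream must be rewound (seekg(0)) before the chunk loop in step 6 can read any data. The Python sketch below shows the same size-then-rewind idea on any seekable binary stream. The helper name size_and_rewind is hypothetical.

```python
import io
import os
from typing import BinaryIO


def size_and_rewind(f: BinaryIO) -> int:
    """Return the stream size, then rewind so that chunked reads start at byte 0."""
    f.seek(0, os.SEEK_END)  # comparable to opening with std::ios::ate
    size = f.tell()         # comparable to file.tellg()
    f.seek(0)               # comparable to file.seekg(0); without it read() returns b""
    return size


stream = io.BytesIO(b"\x00" * 1000)
print(size_and_rewind(stream), len(stream.read(10)))  # 1000 10
```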
5. Creating and sending the software item
Terminal output:
[OK] Software item sent
Corresponding code (simple_firmware_client.cc):
SoftwareItem software_item =
    software_repository_test::SoftwareFileManager::CreateFromFile(
        file_path, product_number, rstate, "Firmware update");
FirmwareUpdateRequest item_request;
item_request.set_dut_position(0);
*item_request.mutable_item() = software_item;
if (!writer->Write(item_request)) {
  // error handling (elided)
}
std::cout << client_common::SUCCESS_SYMBOL << " Software item sent" << std::endl;
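gRPC analysis:
- A Protobuf SoftwareItem message describes the firmware (product number, R-state, and so on).
- mutable_item(): returns a mutable pointer to the request's item field, which is then assigned through the dereference.
- writer->Write(item_request): sends this description as the first message on the stream.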
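The order matters here. The first message on the stream describes the firmware (item), and every later message carries data (content). The .proto file is not shown. The sketch below therefore assumes that item and content are alternative payloads of FirmwareUpdateRequest (for example a oneof) and models each message as a hypothetical (kind, payload) tuple. check_order is a hypothetical server-side validation, not the device's actual logic.

```python
from typing import Iterable, Iterator, List, Tuple

Message = Tuple[str, bytes]  # ("item", metadata) or ("content", chunk): hypothetical model


def upload_messages(metadata: bytes, chunks: Iterable[bytes]) -> Iterator[Message]:
    """Yield the item description first, then one content message per chunk."""
    yield ("item", metadata)
    for chunk in chunks:
        yield ("content", chunk)


def check_order(messages: List[Message]) -> bool:
    """Assumed server-side rule: exactly one leading item, followed only by content."""
    kinds = [kind for kind, _ in messages]
    return bool(kinds) and kinds[0] == "item" and all(k == "content" for k in kinds[1:])


msgs = list(upload_messages(b"meta", [b"a", b"b"]))
print([k for k, _ in msgs], check_order(msgs))  # ['item', 'content', 'content'] True
```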
6. Sending the data in chunks
Terminal output:
Progress: 7% (10485760/137553920 bytes)
...
Progress: 100% (137553920/137553920 bytes)
[OK] All data sent (132 chunks)
Corresponding code (simple_firmware_client.cc):
const size_t CHUNK_SIZE = client_common::FIRMWARE_CHUNK_SIZE;
std::vector<char> buffer(CHUNK_SIZE);
size_t total_sent = 0;
size_t chunk_count = 0;
// The loop also runs for a final partial read (read() fails but gcount() > 0)
while (file.read(buffer.data(), CHUNK_SIZE) || file.gcount() > 0) {
  size_t bytes_read = file.gcount();
  FirmwareUpdateRequest content_request;
  content_request.set_dut_position(0);
  SoftwareItemContent *content = content_request.mutable_content();
  content->set_swtype(software_item.sw_type());
  content->set_data(buffer.data(), bytes_read);
  if (!writer->Write(content_request)) {
    // error handling (elided)
  }
  total_sent += bytes_read;  // the counters must advance for the progress output
  ++chunk_count;
  // Progress display
  if (chunk_count % client_common::PROGRESS_UPDATE_INTERVAL == 0 ||
      total_sent == file_size) {
    int progress = (file_size > 0) ? (total_sent * 100) / file_size : 0;
    std::cout << "Progress: " << progress << "% (" << total_sent << "/"
              << file_size << " bytes)" << std::endl;
  }
}
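gRPC analysis:
- Streaming: many messages are sent one after another over the same gRPC stream.
- Chunked transfer: the large file is split into 1 MiB chunks. The progress lines advance by 10485760 bytes every 10 chunks.
- Data messages: the SoftwareItemContent message type carries the binary data.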
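The chunk counts in both terminal logs follow directly from the file size. With 137,553,920 bytes, 1 MiB chunks give 132 messages: 131 full chunks plus a 190,464-byte tail. The 64 KiB chunks used by the Python client later in this post give 2,099 messages. The progress figure uses integer division, which is why the log shows 99% at 136,314,880 bytes. A small Python check of that arithmetic:

```python
FILE_SIZE = 137_553_920  # bytes, taken from the terminal logs


def chunk_count(file_size: int, chunk_size: int) -> int:
    """Number of messages needed, counting a shorter final chunk."""
    return (file_size + chunk_size - 1) // chunk_size  # ceiling division


def progress_percent(sent: int, total: int) -> int:
    """Truncating percentage, matching (total_sent * 100) / file_size in the C++ code."""
    return (sent * 100) // total if total > 0 else 0


print(chunk_count(FILE_SIZE, 1024 * 1024))            # 132  (C++ run)
print(chunk_count(FILE_SIZE, 64 * 1024))              # 2099 (Python run)
print(progress_percent(10 * 1024 * 1024, FILE_SIZE))  # 7
print(progress_percent(136_314_880, FILE_SIZE))       # 99
```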
7. Finishing the transfer
Terminal output:
[OK] Firmware upload succeeded
Server response: (code: 0)
Corresponding code (simple_firmware_client.cc):
if (!writer->WritesDone()) {
  std::cout << client_common::FAILURE_SYMBOL << " Failed to complete write stream" << std::endl;
  return false;
}
Status status = writer->Finish();
if (!CheckStatus(status, "Firmware upload")) {
  return false;
}
std::cout << "Server response: " << response.message()
          << " (code: " << response.code() << ")" << std::endl;
return response.code() == 0;  // 0 means OK
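gRPC analysis:
- WritesDone(): tells the server that the client has written all of its messages.
- Finish(): waits for the server to finish processing and returns the final status.
- Response handling: even when the gRPC status is OK, the code field in the server's ProductionStatus reply must also be 0 for the upload to count as successful.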
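The success check therefore has two layers. The first is the transport-level gRPC Status returned by Finish(). The second is the application-level code inside the reply. Below is a minimal sketch of that decision, with a hypothetical ReplyStatus dataclass standing in for ProductionStatus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplyStatus:
    """Hypothetical stand-in for the ProductionStatus reply."""
    code: int
    message: str = ""


def upload_succeeded(rpc_ok: bool, reply: Optional[ReplyStatus]) -> bool:
    # Layer 1: the gRPC Status (network errors, deadlines, UNAVAILABLE, ...)
    if not rpc_ok or reply is None:
        return False
    # Layer 2: the server's own result code; 0 means OK in this protocol
    return reply.code == 0


print(upload_succeeded(True, ReplyStatus(code=0)))  # True
print(upload_succeeded(True, ReplyStatus(code=3)))  # False: RPC fine, upgrade rejected
```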
8. Waiting for the device to restart
Terminal output:
[OK] Firmware upload completed
Step 2: Waiting for device to restart and come back online...
Waiting for device to start upgrade (2 minutes)...
Still waiting for upgrade to start... (30/120 seconds)
...
Corresponding code (simple_firmware_client.cc):
bool SimpleFirmwareClient::WaitForDeviceAndGetSWInfo(int max_retries) {
  std::cout << "Waiting for device to start upgrade (2 minutes)..."
            << std::endl;
  // Wait for device to start upgrade
  for (int i = 0; i < client_common::DEVICE_RESTART_WAIT_SECONDS; ++i) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    if (i % 30 == 29) {  // report every 30 seconds
      std::cout << "Still waiting for upgrade to start... (" << (i + 1) << "/"
                << client_common::DEVICE_RESTART_WAIT_SECONDS << " seconds)"
                << std::endl;
    }
  }
  // ... (presumably followed by the GetSWInfos polling loop shown in step 9; elided here)
}
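The fixed wait is a plain countdown that reports every 30 seconds. The sketch below passes the sleep function in as a parameter. This is a hypothetical refactoring, not how the C++ code is written, but it makes the loop testable without actually waiting two minutes.

```python
from typing import Callable, List


def wait_with_progress(total_seconds: int, report_every: int,
                       sleep: Callable[[float], None]) -> List[str]:
    """Sleep one second at a time and collect a message every report_every seconds."""
    messages = []
    for i in range(total_seconds):
        sleep(1)
        if i % report_every == report_every - 1:  # same test as i % 30 == 29
            messages.append(f"Still waiting for upgrade to start... "
                            f"({i + 1}/{total_seconds} seconds)")
    return messages


slept: List[float] = []
msgs = wait_with_progress(120, 30, sleep=slept.append)  # records instead of sleeping
print(len(slept), len(msgs))  # 120 4
print(msgs[-1])               # Still waiting for upgrade to start... (120/120 seconds)
```

In production you would pass time.sleep instead of slept.append.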
9. Polling the device status
Terminal output:
Starting to poll GetSWInfos every 5 seconds (max 10 minutes)...
[OK] Device is back online! New firmware is running.
Current software information:
- Name: TIB3
- Product Number: CXC1744343/2-R1E-0-g2aefebe
- R-State: R1E
- Comment: Build time: 2025-05-27 05:48:17+00:00
Corresponding code (simple_firmware_client.cc):
for (int attempt = 1; attempt <= client_common::MAX_POLL_ATTEMPTS; ++attempt) {
  try {
    grpc::ClientContext context;
    // Bound each probe (4 s, shorter than the 5 s poll interval, as in the Python client)
    context.set_deadline(std::chrono::system_clock::now() +
                         std::chrono::seconds(4));
    proto::raptor2::test_interface::GetSWInfoRequest request;
    proto::raptor2::test_interface::SWInfos response;
    grpc::Status status =
        config_stub_->GetSWInfos(&context, request, &response);
    if (status.ok() && response.sw_info_size() > 0) {
      std::cout << client_common::SUCCESS_SYMBOL
                << " Device is back online! New firmware is running."
                << std::endl;
      std::cout << "Current software information:" << std::endl;
      for (const auto &sw_info : response.sw_info()) {
        std::cout << " - Name: " << sw_info.sw_name() << std::endl;
        std::cout << " - Product Number: " << sw_info.product_number()
                  << std::endl;
        std::cout << " - R-State: " << sw_info.product_rstate() << std::endl;
        std::cout << " - Comment: " << sw_info.comment() << std::endl;
      }
      return true;
    }
  } catch (const std::exception &e) {
    // gRPC reports RPC failures via Status, not exceptions;
    // this only guards against other errors. Keep polling.
  }
  // Pause between attempts (every 5 seconds, as announced in the log)
  std::this_thread::sleep_for(std::chrono::seconds(5));
}
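gRPC analysis:
- Unary RPC: GetSWInfos follows the classic request-response pattern.
- Uses config_stub_: polling switches to the configuration service stub, which shares the channel created in the constructor.
- Error handling: gRPC C++ does not throw on RPC failures. A failed call (for example UNAVAILABLE while the device reboots) returns a non-OK Status, and the loop moves on to the next attempt. The try-catch only guards against other exceptions.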
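The polling loop boils down to a bounded retry: probe, stop on success, and otherwise sleep and try again. The sketch below captures that shape generically. poll and its parameters are hypothetical names. probe stands for a GetSWInfos call with a short deadline that returns None on failure.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def poll(probe: Callable[[], Optional[T]], max_attempts: int, interval_s: float,
         sleep: Callable[[float], None]) -> Optional[T]:
    """Call probe() until it returns a result or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        result = probe()
        if result is not None:
            return result
        if attempt < max_attempts:
            sleep(interval_s)  # fixed pause between attempts
    return None


calls = {"n": 0}


def fake_probe() -> Optional[str]:
    """Pretends the device answers on the third attempt."""
    calls["n"] += 1
    return "online" if calls["n"] >= 3 else None


pauses: List[float] = []
print(poll(fake_probe, max_attempts=120, interval_s=5, sleep=pauses.append), pauses)
# online [5, 5]
```

With 120 attempts and a 5-second pause, the pauses alone add up to about 10 minutes, which matches the "max 10 minutes" in the log. Each failing probe may also run until its deadline (for example 4 seconds). In that case the worst case grows to roughly 120 × (5 + 4) s = 18 minutes, so the per-call deadline is part of the time budget too.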
10. Final success output
Terminal output:
[OK] Firmware upgrade completed successfully!
[SUCCESS] Operation completed successfully!
Corresponding code (simple_firmware_client.cc):
std::cout << client_common::SUCCESS_SYMBOL
          << " Firmware upgrade completed successfully!" << std::endl;
Corresponding code (grpc_client_main.cc):
if (success) {
  std::cout << std::endl
            << "[SUCCESS] Operation completed successfully!" << std::endl;
  return 0;
}
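Key gRPC takeaways:
- Mixed RPC styles:
  - The firmware upload uses a client-streaming RPC.
  - The status queries use a unary RPC.
- Streaming characteristics:
  - Many messages are sent over a single stream.
  - Large files can be sent in chunks.
  - An explicit end-of-stream signal is required (WritesDone()).
- Error handling:
  - Check the gRPC status code.
  - Check the server's business-level error code.
  - Retry gracefully (polling with a fixed interval and a maximum number of attempts).
- Resource management:
  - Stub lifetimes are managed with smart pointers.
  - Streams are closed properly (WritesDone() followed by Finish()).
  - Timeouts and retry limits are configured sensibly.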
gRPC Python Client
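In this post, we dig into a typical problem we hit while implementing the gRPC firmware upgrade flow: after the server is forcibly rebooted, why can't the client get back in touch with it by polling? We look at the gRPC connection-management behavior behind the problem. We also show how restructuring the code turns a complex, bloated function into clear, robust, maintainable modules.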
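Scenario: Firmware Upgrade
Our goal is a Python client that upgrades the firmware of an embedded device over gRPC. The core flow has two phases:
- File upload: the client uses a gRPC client-streaming RPC to transfer the firmware file efficiently to the server. The file is about 131 MiB (137,553,920 bytes) in the run shown later.
- Status verification: after receiving the file, the server reboots to apply the update. The client must enter a polling loop and call a unary RPC repeatedly to query the device status until it confirms that the new firmware is running.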
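Core gRPC Components
Before diving into the problem, here is a quick recap of the gRPC components involved in this scenario:
- Channel (grpc.insecure_channel): a virtual connection between the client and a server endpoint. It encapsulates the underlying TCP connection, the reconnection policy, and all the other networking details. One Channel can be shared by several Stubs, and it is relatively expensive to create and destroy.
- Stub (YourService_pb2_grpc.YourServiceStub): a client-side proxy generated from the .proto file. Calling a method on the Stub issues an RPC to the remote service. A Stub is lightweight and only works on top of a Channel.
- Client-streaming RPC: lets the client send a sequence of messages to the server. In our case the client splits the firmware file into chunks and sends them to the server as a continuous stream. In Python this is done by passing an iterator (typically a generator) to the stub method.
- Unary RPC: the simplest RPC type. The client sends one request and the server returns one response, which makes it a good fit for status queries.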
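grpcio pulls messages from the request iterator as it sends them. A generator therefore lets the client stream a large file without first loading it into memory. The sketch below demonstrates that laziness with an in-memory stream. read_chunks is a hypothetical helper, not part of grpcio.

```python
import io
from typing import BinaryIO, Iterator


def read_chunks(f: BinaryIO, chunk_size: int) -> Iterator[bytes]:
    """Yield the stream piece by piece; nothing is read until the consumer asks."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return
        yield chunk


data = io.BytesIO(bytes(200_000))
gen = read_chunks(data, 64 * 1024)
print(data.tell())              # 0: creating the generator reads nothing
first = next(gen)
print(len(first), data.tell())  # 65536 65536
print(sum(1 for _ in gen))      # 3 (the last chunk is 3392 bytes)
```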
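The Problem: A "Zombie Connection" That Would Not Wake Up
Our first implementation followed seemingly sound logic: create a long-lived Channel and Stub when the client starts, and reuse them throughout the upgrade.
However, the flow always failed in the second phase. Even after the server had finished rebooting, the client's polling requests never got a response and eventually timed out. A standalone test script that created a new connection each time did succeed, which pointed our suspicion at connection reuse.
The technical root cause, as we understand it:
When the server runs reboot, the underlying TCP connection is cut abruptly, and the client may never receive a proper FIN that closes the connection gracefully. From the point of view of the client's gRPC Channel, the network connection it holds simply vanishes.
The Channel then ends up in an unreliable "zombie" state. Its internal state machine may still consider the connection alive, it may be retrying a stale socket, or it may be waiting out a reconnection backoff. When our polling logic immediately starts reusing this "zombie Channel", the RPCs keep failing because they cannot reach the server over the broken "bridge".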
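Reconnection backoff is one concrete mechanism that can make a reused channel look dead. gRPC's connection-backoff design document specifies these defaults: a 1-second initial backoff, a 1.6× multiplier, a 120-second cap, and ±20% jitter. The sketch below ignores jitter and shows approximately how quickly the gaps between reconnect attempts grow. After about three minutes of failed attempts, the next gap is already close to two minutes, even if the device is back by then. Whether this is exactly what happened in our setup is an assumption. grpcio also accepts channel options such as grpc.max_reconnect_backoff_ms to lower the cap, which could be an alternative worth testing.

```python
from typing import List


def backoff_schedule(attempts: int, initial: float = 1.0, multiplier: float = 1.6,
                     maximum: float = 120.0) -> List[float]:
    """Un-jittered delays between reconnect attempts (gRPC connection-backoff defaults)."""
    delays: List[float] = []
    current = initial
    for _ in range(attempts):
        delays.append(round(current, 2))
        current = min(current * multiplier, maximum)
    return delays


delays = backoff_schedule(12)
print(delays[:4])                  # [1.0, 1.6, 2.56, 4.1]
print(delays[-3:])                 # [68.72, 109.95, 120.0]
print(round(sum(delays[:10]), 1))  # 181.6 seconds of waiting before the 11th gap
```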
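The Fix: Create a "Temporary" Connection for Each Poll
Since the long-lived connection can no longer be trusted after the server reboots, the most reliable strategy is to discard it. Every time we need to check the server's status, we create a brand-new, short-lived connection.
This is the core change we made to the polling function _wait_for_device_and_get_sw_info: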
# Optimized polling function
def _wait_for_device_and_get_sw_info(self) -> bool:
    # ... (other logic) ...
    for attempt in range(self.MAX_POLL_ATTEMPTS):
        try:
            # --- Core fix: create a brand-new temporary Channel and Stub on every iteration ---
            with grpc.insecure_channel(self.server_address) as channel:
                temp_stub = configuration_service_pb2_grpc.TestInterfaceConfigurationServiceStub(channel)
                request = empty_pb2.Empty()
                # Issue the request over this fresh, clean connection
                response = temp_stub.GetSWInfos(request, timeout=self.DEFAULT_TIMEOUT)
                self.logger.info("✓ Device is online and software info confirmed.")
                return True
        except grpc.RpcError:
            # Connection failures are expected at this stage; wait quietly for the next attempt
            pass
        time.sleep(self.POLL_INTERVAL_SECONDS)
    # ... (timeout logic) ...
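The with grpc.insecure_channel(...) statement gives each polling attempt its own Channel, and therefore its own new TCP connection. When the block exits, the channel is closed immediately and cleanly. In our setup, this "use once and discard" pattern solved the "zombie connection" problem.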
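In grpcio, grpc.Channel objects support the with statement and close(), so the per-attempt pattern is cheap to write. The gRPC-free sketch below mirrors the structure of the fix. It uses a hypothetical FakeChannel that counts how many connections were opened and closed, so the "use once and discard" behavior can be checked directly. The sleep between attempts is omitted to keep the sketch short.

```python
from typing import Callable, Optional


class FakeChannel:
    """Hypothetical stand-in for grpc.insecure_channel(...); counts opens and closes."""
    opened = 0
    closed = 0

    def __init__(self, device_up: bool) -> None:
        self.device_up = device_up

    def __enter__(self) -> "FakeChannel":
        FakeChannel.opened += 1
        return self

    def __exit__(self, *exc) -> bool:
        FakeChannel.closed += 1  # runs on return and on exceptions alike
        return False             # do not swallow exceptions

    def get_sw_infos(self) -> str:
        if not self.device_up:
            raise ConnectionError("UNAVAILABLE")
        return "TIB3 R1E"


def poll_fresh(make_channel: Callable[[], FakeChannel], attempts: int) -> Optional[str]:
    for _ in range(attempts):
        try:
            with make_channel() as channel:  # a new connection for every attempt
                return channel.get_sw_infos()
        except ConnectionError:
            pass  # expected while the device reboots
    return None


states = iter([False, False, True])  # the device answers on the third attempt
result = poll_fresh(lambda: FakeChannel(next(states)), attempts=5)
print(result, FakeChannel.opened, FakeChannel.closed)  # TIB3 R1E 3 3
```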
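From Chaos to Clarity: Refactoring the Code Structure
After fixing the core bug, we reviewed the code structure and found that firmware_upgrade was bloated and violated the Single Responsibility Principle. A good function should do one thing and do it well.
Before: a "Big Ball of Mud"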
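def firmware_upgrade(self, file_path: str) -> bool:
    # 1. Open the file and get its size
    # 2. Create a tqdm progress bar
    # 3. Define a nested request_iterator function for the gRPC stream
    # 4. Call the gRPC streaming upload method
    # 5. Handle the upload response
    # 6. If the upload succeeded, call the polling function
    # 7. Handle the polling result
    # 8. Return the final status
    # 9. Lots of exception handling mixed in throughout
    ...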
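This structure is hard to read, test, and maintain.
After: a "Manager" and Its "Specialists"
We split it into three functions with clear responsibilities, like a team with a clear division of labor:
- _send_firmware_stream (the stream-upload specialist)
  - Single responsibility: handles every detail of the gRPC client-streaming RPC.
  - Implementation: contains the request_iterator generator, which reads the file chunks and updates the tqdm progress bar. It only cares about sending the file efficiently, and it returns a simple success or failure flag.
- _wait_for_device_and_get_sw_info (the polling specialist)
  - Single responsibility: after the server reboots, polls over temporary connections until the device is confirmed online or the wait times out.
  - Implementation: contains the core logic that fixes the "zombie connection" problem.
- firmware_upgrade (the project manager)
  - Single responsibility: acts as the high-level coordinator that orchestrates the whole upgrade.
  - Implementation: its code now reads like a clear plan:
    def firmware_upgrade(self, file_path: str) -> bool:
        self.logger.info(f"Starting firmware upgrade with file: {file_path}")
        try:
            # Step 1: delegate to the "upload specialist" to send the file
            upload_success = self._send_firmware_stream(file_path)
            if not upload_success:
                return False
            # Step 2: delegate to the "polling specialist" to confirm the device status
            self.logger.info("Upload complete. Waiting for device...")
            device_online = self._wait_for_device_and_get_sw_info()
            # Report the final result
            if device_online:
                self.logger.info("Firmware upgrade completed successfully!")
            else:
                self.logger.error("Device did not come back online.")
            return device_online
        except Exception:
            # The "manager" only handles top-level exceptions in the flow
            self.logger.exception("An unexpected error occurred during the upgrade process.")
            return False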
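One practical payoff of the split is that the coordinator can be tested without a device. The post keeps the specialists as private methods of the same class. The sketch below instead uses constructor injection, which is a hypothetical design choice and not the post's code, so that fakes can stand in for the upload and polling steps.

```python
from typing import Callable, List


class FirmwareUpgrader:
    """Coordinator ("manager"); the two specialists are injected callables."""

    def __init__(self, send_stream: Callable[[str], bool],
                 wait_online: Callable[[], bool]) -> None:
        self._send_stream = send_stream
        self._wait_online = wait_online

    def firmware_upgrade(self, file_path: str) -> bool:
        if not self._send_stream(file_path):
            return False  # never poll after a failed upload
        return self._wait_online()


calls: List[str] = []


def fake_send(path: str) -> bool:
    calls.append(f"send {path}")
    return False  # simulate a failed upload


def fake_wait() -> bool:
    calls.append("wait")
    return True


print(FirmwareUpgrader(fake_send, fake_wait).firmware_upgrade("fw.tar"), calls)
# False ['send fw.tar']
```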
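The refactoring made the code substantially easier to read, test, and maintain.
Conclusion
This exercise taught us two core lessons:
- gRPC connection management: when the service may be forcibly rebooted, be wary of the state of long-lived connections held by the client. The most robust strategy we found is to drop the old connection as soon as its state can no longer be trusted, and to create new temporary connections when needed.
- Code structure: refactoring according to the Single Responsibility Principle, by splitting a complex flow into functions that each do one thing, is an essential step toward better code quality. It makes the code easier to understand, and it also makes tricky problems like the "zombie connection" easier to pin down and fix.
Starting from a thorny bug, we ended up with more than a working feature. We also gained a cleaner, higher-quality codebase and a deeper understanding of gRPC client design.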
ezhuzix@CNNJ002931:/mnt/c/Users/ezhuzix/repo_o/nib_application/grpc$ python3 simple_firmware_client.py upgrade 192.168.2.61:50050 TIB3_rpi_software_package.tar "CXC1744343/2" R1F
[SUCCESS] 导入 protobuf 模块成功
准备执行完整固件升级:
文件: TIB3_rpi_software_package.tar
大小: 137,553,920 字节
产品: CXC1744343/2
状态: R1F
位置: 0
最大轮询: 120次 (约10.0分钟)
确认执行完整升级流程? (y/N): y
[DEBUG] 连接到 192.168.2.61:50050
[OK] Firmware client connected to: 192.168.2.61:50050
[DEBUG] 测试固件连接...
[OK] Firmware connection test succeeded
Starting complete firmware upgrade process...
============================================================
Step 1: Uploading firmware...
============================================================
Testing upload: TIB3_rpi_software_package.tar
File size: 137553920 bytes
[OK] Software item sent
Progress: 0% (655360/137553920 bytes)
Progress: 0% (1310720/137553920 bytes)
Progress: 1% (1966080/137553920 bytes)
Progress: 1% (2621440/137553920 bytes)
Progress: 2% (3276800/137553920 bytes)
Progress: 2% (3932160/137553920 bytes)
Progress: 3% (4587520/137553920 bytes)
Progress: 3% (5242880/137553920 bytes)
Progress: 4% (5898240/137553920 bytes)
Progress: 4% (6553600/137553920 bytes)
Progress: 5% (7208960/137553920 bytes)
Progress: 5% (7864320/137553920 bytes)
Progress: 6% (8519680/137553920 bytes)
Progress: 6% (9175040/137553920 bytes)
Progress: 7% (9830400/137553920 bytes)
Progress: 7% (10485760/137553920 bytes)
Progress: 8% (11141120/137553920 bytes)
Progress: 8% (11796480/137553920 bytes)
Progress: 9% (12451840/137553920 bytes)
Progress: 9% (13107200/137553920 bytes)
Progress: 10% (13762560/137553920 bytes)
Progress: 10% (14417920/137553920 bytes)
Progress: 10% (15073280/137553920 bytes)
Progress: 11% (15728640/137553920 bytes)
Progress: 11% (16384000/137553920 bytes)
Progress: 12% (17039360/137553920 bytes)
Progress: 12% (17694720/137553920 bytes)
Progress: 13% (18350080/137553920 bytes)
Progress: 13% (19005440/137553920 bytes)
Progress: 14% (19660800/137553920 bytes)
Progress: 14% (20316160/137553920 bytes)
Progress: 15% (20971520/137553920 bytes)
Progress: 15% (21626880/137553920 bytes)
Progress: 16% (22282240/137553920 bytes)
Progress: 16% (22937600/137553920 bytes)
Progress: 17% (23592960/137553920 bytes)
Progress: 17% (24248320/137553920 bytes)
Progress: 18% (24903680/137553920 bytes)
Progress: 18% (25559040/137553920 bytes)
Progress: 19% (26214400/137553920 bytes)
Progress: 19% (26869760/137553920 bytes)
Progress: 20% (27525120/137553920 bytes)
Progress: 20% (28180480/137553920 bytes)
Progress: 20% (28835840/137553920 bytes)
Progress: 21% (29491200/137553920 bytes)
Progress: 21% (30146560/137553920 bytes)
Progress: 22% (30801920/137553920 bytes)
Progress: 22% (31457280/137553920 bytes)
Progress: 23% (32112640/137553920 bytes)
Progress: 23% (32768000/137553920 bytes)
Progress: 24% (33423360/137553920 bytes)
Progress: 24% (34078720/137553920 bytes)
Progress: 25% (34734080/137553920 bytes)
Progress: 25% (35389440/137553920 bytes)
Progress: 26% (36044800/137553920 bytes)
Progress: 26% (36700160/137553920 bytes)
Progress: 27% (37355520/137553920 bytes)
Progress: 27% (38010880/137553920 bytes)
Progress: 28% (38666240/137553920 bytes)
Progress: 28% (39321600/137553920 bytes)
Progress: 29% (39976960/137553920 bytes)
Progress: 29% (40632320/137553920 bytes)
Progress: 30% (41287680/137553920 bytes)
Progress: 30% (41943040/137553920 bytes)
Progress: 30% (42598400/137553920 bytes)
Progress: 31% (43253760/137553920 bytes)
Progress: 31% (43909120/137553920 bytes)
Progress: 32% (44564480/137553920 bytes)
Progress: 32% (45219840/137553920 bytes)
Progress: 33% (45875200/137553920 bytes)
Progress: 33% (46530560/137553920 bytes)
Progress: 34% (47185920/137553920 bytes)
Progress: 34% (47841280/137553920 bytes)
Progress: 35% (48496640/137553920 bytes)
Progress: 35% (49152000/137553920 bytes)
Progress: 36% (49807360/137553920 bytes)
Progress: 36% (50462720/137553920 bytes)
Progress: 37% (51118080/137553920 bytes)
Progress: 37% (51773440/137553920 bytes)
Progress: 38% (52428800/137553920 bytes)
Progress: 38% (53084160/137553920 bytes)
Progress: 39% (53739520/137553920 bytes)
Progress: 39% (54394880/137553920 bytes)
Progress: 40% (55050240/137553920 bytes)
Progress: 40% (55705600/137553920 bytes)
Progress: 40% (56360960/137553920 bytes)
Progress: 41% (57016320/137553920 bytes)
Progress: 41% (57671680/137553920 bytes)
Progress: 42% (58327040/137553920 bytes)
Progress: 42% (58982400/137553920 bytes)
Progress: 43% (59637760/137553920 bytes)
Progress: 43% (60293120/137553920 bytes)
Progress: 44% (60948480/137553920 bytes)
Progress: 44% (61603840/137553920 bytes)
Progress: 45% (62259200/137553920 bytes)
Progress: 45% (62914560/137553920 bytes)
Progress: 46% (63569920/137553920 bytes)
Progress: 46% (64225280/137553920 bytes)
Progress: 47% (64880640/137553920 bytes)
Progress: 47% (65536000/137553920 bytes)
Progress: 48% (66191360/137553920 bytes)
Progress: 48% (66846720/137553920 bytes)
Progress: 49% (67502080/137553920 bytes)
Progress: 49% (68157440/137553920 bytes)
Progress: 50% (68812800/137553920 bytes)
Progress: 50% (69468160/137553920 bytes)
Progress: 50% (70123520/137553920 bytes)
Progress: 51% (70778880/137553920 bytes)
Progress: 51% (71434240/137553920 bytes)
Progress: 52% (72089600/137553920 bytes)
Progress: 52% (72744960/137553920 bytes)
Progress: 53% (73400320/137553920 bytes)
Progress: 53% (74055680/137553920 bytes)
Progress: 54% (74711040/137553920 bytes)
Progress: 54% (75366400/137553920 bytes)
Progress: 55% (76021760/137553920 bytes)
Progress: 55% (76677120/137553920 bytes)
Progress: 56% (77332480/137553920 bytes)
Progress: 56% (77987840/137553920 bytes)
Progress: 57% (78643200/137553920 bytes)
Progress: 57% (79298560/137553920 bytes)
Progress: 58% (79953920/137553920 bytes)
Progress: 58% (80609280/137553920 bytes)
Progress: 59% (81264640/137553920 bytes)
Progress: 59% (81920000/137553920 bytes)
Progress: 60% (82575360/137553920 bytes)
Progress: 60% (83230720/137553920 bytes)
Progress: 60% (83886080/137553920 bytes)
Progress: 61% (84541440/137553920 bytes)
Progress: 61% (85196800/137553920 bytes)
Progress: 62% (85852160/137553920 bytes)
Progress: 62% (86507520/137553920 bytes)
Progress: 63% (87162880/137553920 bytes)
Progress: 63% (87818240/137553920 bytes)
Progress: 64% (88473600/137553920 bytes)
Progress: 64% (89128960/137553920 bytes)
Progress: 65% (89784320/137553920 bytes)
Progress: 65% (90439680/137553920 bytes)
Progress: 66% (91095040/137553920 bytes)
Progress: 66% (91750400/137553920 bytes)
Progress: 67% (92405760/137553920 bytes)
Progress: 67% (93061120/137553920 bytes)
Progress: 68% (93716480/137553920 bytes)
Progress: 68% (94371840/137553920 bytes)
Progress: 69% (95027200/137553920 bytes)
Progress: 69% (95682560/137553920 bytes)
Progress: 70% (96337920/137553920 bytes)
Progress: 70% (96993280/137553920 bytes)
Progress: 70% (97648640/137553920 bytes)
Progress: 71% (98304000/137553920 bytes)
Progress: 71% (98959360/137553920 bytes)
Progress: 72% (99614720/137553920 bytes)
Progress: 72% (100270080/137553920 bytes)
Progress: 73% (100925440/137553920 bytes)
Progress: 73% (101580800/137553920 bytes)
Progress: 74% (102236160/137553920 bytes)
Progress: 74% (102891520/137553920 bytes)
Progress: 75% (103546880/137553920 bytes)
Progress: 75% (104202240/137553920 bytes)
Progress: 76% (104857600/137553920 bytes)
Progress: 76% (105512960/137553920 bytes)
Progress: 77% (106168320/137553920 bytes)
Progress: 77% (106823680/137553920 bytes)
Progress: 78% (107479040/137553920 bytes)
Progress: 78% (108134400/137553920 bytes)
Progress: 79% (108789760/137553920 bytes)
Progress: 79% (109445120/137553920 bytes)
Progress: 80% (110100480/137553920 bytes)
Progress: 80% (110755840/137553920 bytes)
Progress: 80% (111411200/137553920 bytes)
Progress: 81% (112066560/137553920 bytes)
Progress: 81% (112721920/137553920 bytes)
Progress: 82% (113377280/137553920 bytes)
Progress: 82% (114032640/137553920 bytes)
Progress: 83% (114688000/137553920 bytes)
Progress: 83% (115343360/137553920 bytes)
Progress: 84% (115998720/137553920 bytes)
Progress: 84% (116654080/137553920 bytes)
Progress: 85% (117309440/137553920 bytes)
Progress: 85% (117964800/137553920 bytes)
Progress: 86% (118620160/137553920 bytes)
Progress: 86% (119275520/137553920 bytes)
Progress: 87% (119930880/137553920 bytes)
Progress: 87% (120586240/137553920 bytes)
Progress: 88% (121241600/137553920 bytes)
Progress: 88% (121896960/137553920 bytes)
Progress: 89% (122552320/137553920 bytes)
Progress: 89% (123207680/137553920 bytes)
Progress: 90% (123863040/137553920 bytes)
Progress: 90% (124518400/137553920 bytes)
Progress: 90% (125173760/137553920 bytes)
Progress: 91% (125829120/137553920 bytes)
Progress: 91% (126484480/137553920 bytes)
Progress: 92% (127139840/137553920 bytes)
Progress: 92% (127795200/137553920 bytes)
Progress: 93% (128450560/137553920 bytes)
Progress: 93% (129105920/137553920 bytes)
Progress: 94% (129761280/137553920 bytes)
Progress: 94% (130416640/137553920 bytes)
Progress: 95% (131072000/137553920 bytes)
Progress: 95% (131727360/137553920 bytes)
Progress: 96% (132382720/137553920 bytes)
Progress: 96% (133038080/137553920 bytes)
Progress: 97% (133693440/137553920 bytes)
Progress: 97% (134348800/137553920 bytes)
Progress: 98% (135004160/137553920 bytes)
Progress: 98% (135659520/137553920 bytes)
Progress: 99% (136314880/137553920 bytes)
Progress: 99% (136970240/137553920 bytes)
Progress: 100% (137553920/137553920 bytes)
[OK] All data sent (2099 chunks)
Server response: (code: 0)
[OK] Firmware upload completed in 7.34 seconds
============================================================
Step 2: Waiting for device to restart and come back online...
============================================================
Waiting for device to start upgrade (120 seconds)...
Still waiting for upgrade to start... (30/120 seconds)
Still waiting for upgrade to start... (60/120 seconds)
Still waiting for upgrade to start... (90/120 seconds)
Still waiting for upgrade to start... (120/120 seconds)
Starting to poll GetSWInfos every 5 seconds (max 120 attempts)...
Still waiting for device response... (attempt 12/120)
[OK] Device is back online! New firmware is running.
Current software information:
- Name: TIB3
Product Number: CXC1744343/2-R1E-0-g2aefebe
R-State: R1E
Comment: Build time: 2025-05-27 05:48:17+00:00
============================================================
[OK] Firmware upgrade completed successfully!
Total time: 281.93 seconds
============================================================
[DEBUG] 连接已关闭,总运行时间: 281.93 秒
simple_firmware_client.py
#!/usr/bin/env python3
"""
Firmware Update Client - fixed complete-flow version
Performs a real firmware upgrade and correctly retrieves the software info after the upgrade
"""
import argparse
import sys
import os
import time
import hashlib
import traceback
from typing import Generator, List, Optional, Dict, Any
from datetime import datetime
try:
    import grpc
except ImportError:
    print("[ERROR] 需要安装 grpc: pip3 install grpcio grpcio-tools")
    sys.exit(1)
# Add the proto_python directory to the Python path
current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, os.path.join(current_dir, 'proto_python'))
try:
    import test_interface_pb2 as pb2
    import service_ms_test_interface_pb2_grpc as pb2_grpc
    import common_enums_pb2
    import type_pb2
    from google.protobuf.timestamp_pb2 import Timestamp
    from google.protobuf import empty_pb2
    print("[SUCCESS] 导入 protobuf 模块成功")
except ImportError as e:
    print(f"[ERROR] 导入失败: {e}")
    sys.exit(1)
class CompleteFirmwareClient:
    """Complete firmware upgrade client - fixed version"""
    # Constants (mostly aligned with the C++ client; the chunk size differs:
    # 64 KiB here versus 1 MiB in the C++ run shown earlier)
    FIRMWARE_CHUNK_SIZE = 64 * 1024  # 64KB
    PROGRESS_UPDATE_INTERVAL = 10  # update the progress every 10 chunks
    DEVICE_RESTART_WAIT_SECONDS = 120  # fix: wait 2 minutes, same as the C++ client
    MAX_POLL_ATTEMPTS = 120  # maximum number of polls (10 minutes at 5 s each)
    POLLING_INTERVAL_SECONDS = 5  # polling interval (seconds)
    PROGRESS_LOG_INTERVAL = 12  # log once every 12 polls (i.e. once a minute)
    # Status symbols (same as the C++ client)
    SUCCESS_SYMBOL = "[OK]"
    FAILURE_SYMBOL = "[FAIL]"
    DEBUG_SYMBOL = "[DEBUG]"

    def __init__(self, server_addr: str, verbose: bool = False):
        self.server_addr = server_addr
        self.verbose = verbose
        self.channel = None
        self.firmware_stub = None
        self.config_stub = None
        self.start_time = None
        self.connected = False

    def connect(self) -> bool:
        """Connect to the server and initialize both stubs"""
        print(f"{self.DEBUG_SYMBOL} 连接到 {self.server_addr}")
        # Retry the connection
        max_retries = 3
        for retry in range(max_retries):
            try:
                self.channel = grpc.insecure_channel(
                    self.server_addr,
                    options=[
                        ('grpc.max_send_message_length', 100 * 1024 * 1024),
                        ('grpc.max_receive_message_length', 100 * 1024 * 1024),
                        ('grpc.keepalive_time_ms', 10000),
                        ('grpc.keepalive_timeout_ms', 5000),
                        ('grpc.keepalive_permit_without_calls', 1),
                        ('grpc.http2.max_pings_without_data', 0),
                    ]
                )
                # Initialize both stubs on the shared channel
                self.firmware_stub = pb2_grpc.TestInterfaceFirmwareUpdateServiceStub(self.channel)
                self.config_stub = pb2_grpc.TestInterfaceConfigurationServiceStub(self.channel)
                # Wait (up to 10 s) until the channel is ready
                grpc.channel_ready_future(self.channel).result(timeout=10)
                print(f"{self.SUCCESS_SYMBOL} Firmware client connected to: {self.server_addr}")
                self.connected = True
                # Test the firmware service connection
                if self.test_connection():
                    return True
                else:
                    print(f"{self.DEBUG_SYMBOL} 连接测试失败,重试 {retry+1}/{max_retries}")
            except Exception as e:
                print(f"{self.DEBUG_SYMBOL} 连接尝试 {retry+1}/{max_retries} 失败: {e}")
            time.sleep(2)
        print(f"{self.FAILURE_SYMBOL} 连接失败,尝试 {max_retries} 次后仍无法连接")
        return False

    def test_connection(self) -> bool:
        """Test the firmware service connection"""
        print(f"{self.DEBUG_SYMBOL} 测试固件连接...")
        try:
            # Generator that yields a single test request
            def generate_test_request():
                request = pb2.FirmwareUpdateRequest()
                request.dut_position = 0
                yield request
            # Probe with the client-streaming RPC
            response = self.firmware_stub.FirmwareUpdate(
                generate_test_request(),
                timeout=10
            )
            print(f"{self.SUCCESS_SYMBOL} Firmware connection test succeeded")
            return True
        except grpc.RpcError as e:
            print(f"{self.FAILURE_SYMBOL} Firmware connection test failed: {e.details()}")
            return False
        except Exception as e:
            print(f"{self.FAILURE_SYMBOL} Firmware connection test error: {e}")
            return False

    def create_session(self) -> type_pb2.SessionId:
        """Create a session"""
        session = type_pb2.SessionId()
        session.device_id = "python_complete_client"
        now = Timestamp()
        now.GetCurrentTime()
        session.start_time.CopyFrom(now)
        return session

    def calculate_md5_hash(self, filepath: str) -> str:
        """Compute the MD5 hash of a file"""
        md5 = hashlib.md5()
        with open(filepath, 'rb') as f:
            while chunk := f.read(8192):  # requires Python 3.8+
                md5.update(chunk)
        return md5.hexdigest().lower()

    def upload_with_manual_info(self, filepath: str, product_num: str,
                                rstate: str, dut_position: int = 0) -> bool:
        """Upload the firmware"""
        print(f"Testing upload: {filepath}")
        # Check the file
        if not os.path.exists(filepath):
            print(f"{self.FAILURE_SYMBOL} Cannot open file: {filepath}")
            return False
        filesize = os.path.getsize(filepath)
        print(f"File size: {filesize} bytes")
        try:
            filename = os.path.basename(filepath)
            filehash = self.calculate_md5_hash(filepath)
            shared_session = self.create_session()

            # Build the request generator
            def generate_upload_requests():
                # 1. Send the software item metadata
                request1 = pb2.FirmwareUpdateRequest()
                request1.session.CopyFrom(shared_session)
                request1.dut_position = dut_position
                # Build the SoftwareItem
                item = pb2.SoftwareItem()
                item.description = "Firmware update"
                item.product_number = product_num
                item.rstate = rstate
                item.sw_type = common_enums_pb2.SOFTWARE_TYPE_INITIAL_FLASH_IMAGE
                item.filename = filename
                item.hash = filehash
                item.total_size = filesize
                request1.item.CopyFrom(item)
                yield request1
                print(f"{self.SUCCESS_SYMBOL} Software item sent")
                # 2. Send the file data
                chunk_count = 0
                total_sent = 0
                with open(filepath, 'rb') as f:
                    while True:
                        chunk = f.read(self.FIRMWARE_CHUNK_SIZE)
                        if not chunk:
                            break
                        bytes_read = len(chunk)
                        request = pb2.FirmwareUpdateRequest()
                        request.session.CopyFrom(shared_session)
                        request.dut_position = dut_position
                        # Build the SoftwareItemContent
                        content = pb2.SoftwareItemContent()
                        content.swType = item.sw_type
                        content.data = chunk
                        request.content.CopyFrom(content)
                        total_sent += bytes_read
                        chunk_count += 1
                        # Show progress
                        if (chunk_count % self.PROGRESS_UPDATE_INTERVAL == 0 or
                                total_sent == filesize):
                            if filesize > 0:
                                progress = (total_sent * 100) // filesize
                            else:
                                progress = 0
                            print(f"Progress: {progress}% ({total_sent}/{filesize} bytes)")
                        yield request
                print(f"{self.SUCCESS_SYMBOL} All data sent ({chunk_count} chunks)")

            # Call the client-streaming RPC with a long timeout
            response = self.firmware_stub.FirmwareUpdate(
                generate_upload_requests(),
                timeout=600  # 10-minute timeout
            )
            # Check the response
            if hasattr(response, 'code'):
                print(f"Server response: {response.message} (code: {response.code})")
                return response.code == 0
            else:
                print(f"{self.DEBUG_SYMBOL} Unexpected server response: {response}")
                return True  # treat as success even if the response format differs
        except grpc.RpcError as e:
            print(f"{self.FAILURE_SYMBOL} Upload failed: {e.details()}")
            if self.verbose:
                print(f"Error code: {e.code()}")
            return False
        except Exception as e:
            print(f"{self.FAILURE_SYMBOL} Upload error: {e}")
            if self.verbose:
                traceback.print_exc()
            return False

    def get_sw_infos(self) -> Optional[object]:
        """Get the device's software info. During polling this may return None because the device is unavailable."""
        try:
            request = empty_pb2.Empty()
            # Call the known method directly; the timeout must be shorter than the polling interval
            response = self.config_stub.GetSWInfos(request, timeout=self.POLLING_INTERVAL_SECONDS - 1)
            return response
        except grpc.RpcError as e:
            if e.code() == grpc.StatusCode.UNAVAILABLE:
                # The device may be offline; this is normal during polling
                if self.verbose:
                    print(f"{self.DEBUG_SYMBOL} Device unavailable (expected during polling).")
            else:
                if self.verbose:
                    print(f"{self.DEBUG_SYMBOL} RPC error getting SW info: {e.details()}")
            return None
        except Exception as e:
            if self.verbose:
                print(f"{self.DEBUG_SYMBOL} Unexpected error getting SW info: {e}")
                traceback.print_exc()
            return None
    def parse_sw_info_response(self, response) -> List[Dict[str, Any]]:
        """Parse a software-information response."""
        sw_infos = []
        try:
            # Try the different response layouts
            if hasattr(response, 'sw_info'):
                # Layout 1: response.sw_info (a list)
                for sw_info in response.sw_info:
                    info = {}
                    if hasattr(sw_info, 'sw_name'):
                        info['name'] = sw_info.sw_name
                    elif hasattr(sw_info, 'swName'):
                        info['name'] = sw_info.swName
                    if hasattr(sw_info, 'product_number'):
                        info['product_number'] = sw_info.product_number
                    elif hasattr(sw_info, 'productNumber'):
                        info['product_number'] = sw_info.productNumber
                    if hasattr(sw_info, 'product_rstate'):
                        info['rstate'] = sw_info.product_rstate
                    elif hasattr(sw_info, 'product_r_state'):
                        info['rstate'] = sw_info.product_r_state
                    elif hasattr(sw_info, 'rstate'):
                        info['rstate'] = sw_info.rstate
                    if hasattr(sw_info, 'comment'):
                        info['comment'] = sw_info.comment
                    sw_infos.append(info)
            elif hasattr(response, 'software_info'):
                # Layout 2: response.software_info (a list)
                for sw_info in response.software_info:
                    info = {}
                    if hasattr(sw_info, 'name'):
                        info['name'] = sw_info.name
                    if hasattr(sw_info, 'product'):
                        info['product_number'] = sw_info.product
                    if hasattr(sw_info, 'state'):
                        info['rstate'] = sw_info.state
                    if hasattr(sw_info, 'comment'):
                        info['comment'] = sw_info.comment
                    sw_infos.append(info)
            elif hasattr(response, 'infos'):
                # Layout 3: response.infos (a list)
                for sw_info in response.infos:
                    info = {}
                    # Try the common field names
                    for attr in ['sw_name', 'name', 'software_name']:
                        if hasattr(sw_info, attr):
                            info['name'] = getattr(sw_info, attr)
                            break
                    for attr in ['product_number', 'product', 'product_id']:
                        if hasattr(sw_info, attr):
                            info['product_number'] = getattr(sw_info, attr)
                            break
                    for attr in ['product_rstate', 'rstate', 'state', 'revision']:
                        if hasattr(sw_info, attr):
                            info['rstate'] = getattr(sw_info, attr)
                            break
                    if hasattr(sw_info, 'comment'):
                        info['comment'] = sw_info.comment
                    sw_infos.append(info)
            # A single object rather than a list
            elif hasattr(response, 'sw_name') or hasattr(response, 'product_number'):
                info = {}
                if hasattr(response, 'sw_name'):
                    info['name'] = response.sw_name
                elif hasattr(response, 'name'):
                    info['name'] = response.name
                if hasattr(response, 'product_number'):
                    info['product_number'] = response.product_number
                elif hasattr(response, 'product'):
                    info['product_number'] = response.product
                if hasattr(response, 'product_rstate'):
                    info['rstate'] = response.product_rstate
                elif hasattr(response, 'rstate'):
                    info['rstate'] = response.rstate
                if hasattr(response, 'comment'):
                    info['comment'] = response.comment
                sw_infos.append(info)
        except Exception as e:
            print(f"{self.DEBUG_SYMBOL} Error while parsing software info: {e}")
            if self.verbose:
                traceback.print_exc()
        return sw_infos
    def wait_for_device_and_get_sw_info(self, max_retries: Optional[int] = None) -> bool:
        """Wait for the device to restart, then fetch its software info (mirrors the C++ version)."""
        if max_retries is None:
            max_retries = self.MAX_POLL_ATTEMPTS
        print(f"Waiting for device to start upgrade ({self.DEVICE_RESTART_WAIT_SECONDS} seconds)...")
        # Wait for the device to start the upgrade (2 minutes)
        for i in range(self.DEVICE_RESTART_WAIT_SECONDS):
            time.sleep(1)
            if (i + 1) % 30 == 0:
                print(f"Still waiting for upgrade to start... ({i+1}/{self.DEVICE_RESTART_WAIT_SECONDS} seconds)")
        print(f"Starting to poll GetSWInfos every {self.POLLING_INTERVAL_SECONDS} seconds (max {max_retries} attempts)...")
        # Poll the device status
        for attempt in range(1, max_retries + 1):
            response = None
            try:
                # --- Core fix: create a brand-new temporary connection on every iteration ---
                # This avoids failures caused by a stale connection after the server restarts
                with grpc.insecure_channel(self.server_addr) as channel:
                    # Wait for the channel to become ready; the timeout must be shorter than the polling interval
                    grpc.channel_ready_future(channel).result(timeout=self.POLLING_INTERVAL_SECONDS - 1)
                    temp_stub = pb2_grpc.TestInterfaceConfigurationServiceStub(channel)
                    request = empty_pb2.Empty()
                    # Use a short RPC timeout
                    response = temp_stub.GetSWInfos(request, timeout=2)
            except grpc.RpcError as e:
                # UNAVAILABLE and DEADLINE_EXCEEDED are normal while polling
                if e.code() not in (grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.DEADLINE_EXCEEDED):
                    if self.verbose:
                        print(f"{self.DEBUG_SYMBOL} Unexpected RPC error while polling (attempt {attempt}): {e.details()}")
            except KeyboardInterrupt:
                print("\nPolling interrupted by user")
                return False
            except Exception as e:
                # Catch other exceptions, e.g. a channel_ready_future timeout (grpc.FutureTimeoutError)
                if self.verbose:
                    print(f"{self.DEBUG_SYMBOL} Polling error (attempt {attempt}): {e}")
            # Check whether a response was received
            if response is not None:
                print(f"{self.SUCCESS_SYMBOL} Device is back online! New firmware is running.")
                # Parse the software information
                sw_infos = self.parse_sw_info_response(response)
                if sw_infos:
                    print("Current software information:")
                    for info in sw_infos:
                        info_str = f" - Name: {info.get('name', 'Unknown')}"
                        if 'product_number' in info:
                            info_str += f"\n Product Number: {info['product_number']}"
                        if 'rstate' in info:
                            info_str += f"\n R-State: {info['rstate']}"
                        if 'comment' in info and info['comment']:
                            info_str += f"\n Comment: {info['comment']}"
                        print(info_str)
                else:
                    print(f"{self.DEBUG_SYMBOL} Received a response but found no software info; raw response: {response}")
                return True
            # The device is not ready yet; keep polling
            if attempt % self.PROGRESS_LOG_INTERVAL == 0:
                print(f"Still waiting for device response... (attempt {attempt}/{max_retries})")
            time.sleep(self.POLLING_INTERVAL_SECONDS)
        print(f"{self.FAILURE_SYMBOL} Device did not respond after {max_retries * self.POLLING_INTERVAL_SECONDS} seconds of polling")
        return False
    def firmware_upgrade(self, filepath: str, product_num: str,
                         rstate: str, dut_position: int = 0,
                         max_poll_attempts: Optional[int] = None) -> bool:
        """Complete firmware-upgrade flow."""
        self.start_time = time.time()
        print("Starting complete firmware upgrade process...")
        # Use the requested number of polling attempts
        if max_poll_attempts is not None:
            self.MAX_POLL_ATTEMPTS = max_poll_attempts
        # 1. Upload the firmware
        print("\n" + "="*60)
        print("Step 1: Uploading firmware...")
        print("="*60)
        if not self.upload_with_manual_info(filepath, product_num, rstate, dut_position):
            print(f"{self.FAILURE_SYMBOL} Firmware upload failed")
            return False
        upload_time = time.time() - self.start_time
        print(f"{self.SUCCESS_SYMBOL} Firmware upload completed in {upload_time:.2f} seconds")
        # 2. Wait for the device to restart and come back online
        print("\n" + "="*60)
        print("Step 2: Waiting for device to restart and come back online...")
        print("="*60)
        if not self.wait_for_device_and_get_sw_info():
            print(f"{self.FAILURE_SYMBOL} Device did not come back online or failed to get software info")
            return False
        total_time = time.time() - self.start_time
        print("\n" + "="*60)
        print(f"{self.SUCCESS_SYMBOL} Firmware upgrade completed successfully!")
        print(f"Total time: {total_time:.2f} seconds")
        print("="*60)
        return True
    def verify_file(self, filepath: str) -> bool:
        """Verify that the file exists and print its MD5 hash."""
        if not os.path.exists(filepath):
            print(f"{self.FAILURE_SYMBOL} File not found: {filepath}")
            return False
        filename = os.path.basename(filepath)
        filesize = os.path.getsize(filepath)
        # Compute the MD5 hash
        md5_hash = self.calculate_md5_hash(filepath)
        print(f"\n{'='*60}")
        print("File verification report")
        print('='*60)
        print(f"File name: {filename}")
        print(f"File size: {filesize:,} bytes ({filesize/1024/1024:.2f} MB)")
        print(f"Modified: {datetime.fromtimestamp(os.path.getmtime(filepath))}")
        print('='*60)
        print("Hashes:")
        print(f"  MD5 (lowercase): {md5_hash}")
        print('='*60)
        print(f"\n{self.DEBUG_SYMBOL} This server expects: MD5 (lowercase)")
        return True
    def close(self):
        """Close the connection."""
        if self.channel:
            self.channel.close()
        if self.start_time:
            elapsed_time = time.time() - self.start_time
            print(f"{self.DEBUG_SYMBOL} Connection closed, total run time: {elapsed_time:.2f} seconds")
        else:
            print(f"{self.DEBUG_SYMBOL} Connection closed")
        self.connected = False
def main():
    parser = argparse.ArgumentParser(
        description='Complete firmware upgrade client (fixed version)',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog='''
Examples:
  %(prog)s test localhost:50050
  %(prog)s upload localhost:50050 firmware.bin "CXP9024418/1" R1A
  %(prog)s upgrade localhost:50050 firmware.bin "CXP9024418/1" R1A
  %(prog)s verify firmware.bin
Note: a full upgrade takes a long time; consider increasing the number of polling attempts
'''
    )
    subparsers = parser.add_subparsers(dest='command', help='Command', required=True)
    # test command
    test_parser = subparsers.add_parser('test', help='Test the server connection')
    test_parser.add_argument('server', help='Server address (host:port)')
    test_parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')
    # upload command
    upload_parser = subparsers.add_parser('upload', help='Upload firmware (without waiting for the restart)')
    upload_parser.add_argument('server', help='Server address')
    upload_parser.add_argument('file', help='Firmware file path')
    upload_parser.add_argument('product', help='Product number')
    upload_parser.add_argument('rstate', help='R-state code')
    upload_parser.add_argument('--position', '-p', type=int, default=0, help='DUT position')
    upload_parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')
    # upgrade command
    upgrade_parser = subparsers.add_parser('upgrade', help='Full firmware upgrade (upload + wait for restart)')
    upgrade_parser.add_argument('server', help='Server address')
    upgrade_parser.add_argument('file', help='Firmware file path')
    upgrade_parser.add_argument('product', help='Product number')
    upgrade_parser.add_argument('rstate', help='R-state code')
    upgrade_parser.add_argument('--position', '-p', type=int, default=0, help='DUT position')
    upgrade_parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')
    upgrade_parser.add_argument('--max-poll', type=int, default=120,
                                help='Maximum polling attempts (default 120, about 10 minutes at 5 s per poll)')
    # verify command
    verify_parser = subparsers.add_parser('verify', help='Verify the file hash')
    verify_parser.add_argument('file', help='File path')
    verify_parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')
    args = parser.parse_args()
    client = None
    try:
        if args.command == 'test':
            client = CompleteFirmwareClient(args.server, args.verbose)
            if client.connect():
                return 0
            return 1
        elif args.command == 'upload':
            if not os.path.exists(args.file):
                print(f"{CompleteFirmwareClient.FAILURE_SYMBOL} Error: File not found: {args.file}")
                return 1
            client = CompleteFirmwareClient(args.server, args.verbose)
            if not client.connect():
                return 1
            success = client.upload_with_manual_info(
                args.file, args.product, args.rstate, args.position
            )
            return 0 if success else 1
        elif args.command == 'upgrade':
            if not os.path.exists(args.file):
                print(f"{CompleteFirmwareClient.FAILURE_SYMBOL} Error: File not found: {args.file}")
                return 1
            print("\nAbout to run the complete firmware upgrade:")
            print(f"  File: {os.path.basename(args.file)}")
            print(f"  Size: {os.path.getsize(args.file):,} bytes")
            print(f"  Product: {args.product}")
            print(f"  R-state: {args.rstate}")
            print(f"  Position: {args.position}")
            print(f"  Max polls: {args.max_poll} (about {args.max_poll * 5 / 60:.1f} minutes)")
            confirm = input("\nRun the complete upgrade flow? (y/N): ").strip().lower()
            # '是' ("yes" in Chinese) is accepted as well
            if confirm not in ['y', 'yes', '是']:
                print(f"{CompleteFirmwareClient.DEBUG_SYMBOL} Operation cancelled")
                return 0
            client = CompleteFirmwareClient(args.server, args.verbose)
            if not client.connect():
                return 1
            success = client.firmware_upgrade(
                args.file, args.product, args.rstate, args.position, args.max_poll
            )
            return 0 if success else 1
        elif args.command == 'verify':
            if not os.path.exists(args.file):
                print(f"{CompleteFirmwareClient.FAILURE_SYMBOL} Error: File not found: {args.file}")
                return 1
            # verify only reads the local file; the server address is a placeholder
            client = CompleteFirmwareClient("localhost:50050", args.verbose)
            success = client.verify_file(args.file)
            return 0 if success else 1
    except KeyboardInterrupt:
        print(f"\n{CompleteFirmwareClient.DEBUG_SYMBOL} Operation interrupted by user")
        return 130
    except Exception as e:
        print(f"\n{CompleteFirmwareClient.FAILURE_SYMBOL} Unexpected error: {e}")
        traceback.print_exc()
        return 1
    finally:
        if client:
            client.close()
if __name__ == '__main__':
    sys.exit(main())
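The hasattr chains in parse_sw_info_response above repeat the same "first matching field name wins" lookup many times. A table-driven variant keeps that probing in one place. This is a sketch, not the author's code: FIELD_ALIASES, first_attr, normalize_sw_info and parse_sw_infos are hypothetical names, and the per-layout alias lists are merged into one table, so it is slightly more permissive than the original. One caveat: on a real protobuf message, hasattr returns True for every field declared in the schema even when the field is unset, so this kind of probing tells schemas apart, not set from unset fields.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Tuple

# Candidate attribute names per output key, collected from the hasattr chains above.
FIELD_ALIASES: Dict[str, Tuple[str, ...]] = {
    "name": ("sw_name", "swName", "name", "software_name"),
    "product_number": ("product_number", "productNumber", "product", "product_id"),
    "rstate": ("product_rstate", "product_r_state", "rstate", "state", "revision"),
    "comment": ("comment",),
}
# Container fields holding a list of software entries, in the order they are tried.
LIST_FIELDS: Tuple[str, ...] = ("sw_info", "software_info", "infos")

_MISSING = object()  # sentinel meaning "none of the candidate attributes exists"


def first_attr(obj: Any, candidates: Tuple[str, ...]) -> Any:
    """Return the first attribute of obj named in candidates, or _MISSING."""
    for attr in candidates:
        if hasattr(obj, attr):
            return getattr(obj, attr)
    return _MISSING


def normalize_sw_info(item: Any) -> Dict[str, Any]:
    """Map one software entry onto the keys name/product_number/rstate/comment."""
    info: Dict[str, Any] = {}
    for key, candidates in FIELD_ALIASES.items():
        value = first_attr(item, candidates)
        if value is not _MISSING:
            info[key] = value
    return info


def parse_sw_infos(response: Any) -> List[Dict[str, Any]]:
    """Try the list-shaped layouts first, then fall back to a single-object response."""
    for field in LIST_FIELDS:
        if hasattr(response, field):
            return [normalize_sw_info(item) for item in getattr(response, field)]
    if hasattr(response, "sw_name") or hasattr(response, "product_number"):
        return [normalize_sw_info(response)]
    return []


if __name__ == "__main__":
    # SimpleNamespace stands in for a protobuf message here.
    demo = SimpleNamespace(sw_info=[SimpleNamespace(sw_name="app", productNumber="CXP9024418/1",
                                                    product_rstate="R1A", comment="")])
    print(parse_sw_infos(demo))
```

Adding a new response layout then means adding a name to a table rather than another elif branch.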
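Deep dive
gRPC has become a favorite of modern microservice architectures thanks to its performance and cross-language support. Between learning gRPC's basic concepts and building a client that runs reliably in the real world, however, there are still plenty of pitfalls to work through.
This article uses a full-featured Python firmware-upgrade client, simple_firmware_client.py, as a case study: it examines the code structure, the gRPC communication patterns, and the problems we hit in practice together with their solutions. Whether you are new to gRPC or already have some experience with it, we hope you will find something useful here.
Blueprint: the architecture of a well-designed client
A good architecture is the foundation of successful software. Rather than lumping all of its logic together, our firmware-upgrade client uses a clear, object-oriented design that splits the project into three clearly separated parts.
1. Startup and dependencies: getting rid of ImportError
In any non-trivial Python project, module imports are a potential pain point. Our client settles the problem with a few lines of setup code.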
# Add the proto_python directory to the Python path
current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, os.path.join(current_dir, 'proto_python'))
try:
    # The generated modules can now be imported safely
    import test_interface_pb2 as pb2
    import service_ms_test_interface_pb2_grpc as pb2_grpc
    # ...
except ImportError:
    print("[ERROR] Protobuf modules not found; compile the .proto files first")
    sys.exit(1)
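Core idea: at runtime, prepend the proto_python directory, which holds the Protobuf-generated code, to the interpreter's module search path sys.path. Because the path is built from the script's own location (__file__) rather than the current working directory, the generated modules are found no matter which directory the script is run from, which eliminates path-related ImportErrors.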
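One refinement worth considering is to make the path setup idempotent and to fail with a clearer message when the generated-code directory is missing altogether, instead of only seeing the later ImportError. A minimal sketch, assuming proto_python sits next to the script as in the snippet above; ensure_on_sys_path is a hypothetical helper name, not part of the original client:

```python
import os
import sys


def ensure_on_sys_path(directory: str) -> bool:
    """Prepend directory to sys.path unless it is already there; return True if it was added.

    Raises FileNotFoundError when the directory does not exist, which points straight at
    the real problem (the .proto files were never compiled) instead of a later ImportError.
    """
    directory = os.path.abspath(directory)
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"generated code directory not found: {directory}")
    if directory in sys.path:
        return False  # already present: avoid duplicate entries
    sys.path.insert(0, directory)
    return True

# Typical use at the top of the client script (assumed layout):
#   ensure_on_sys_path(os.path.join(os.path.dirname(os.path.abspath(__file__)), "proto_python"))
```

Keep in mind that a directory placed first on sys.path takes precedence over everything else, so its module names should not collide with installed packages.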
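2. The core engine: the CompleteFirmwareClient class
This class is the heart of the client and encapsulates all of the business logic. It breaks the complex firmware-upgrade flow into a series of clear, maintainable methods, each with a single responsibility:
- __init__(): constructor; initializes the server address, logging options and other settings.
- connect(): establishes the gRPC connection and creates the stubs used to call the remote services.
- upload_with_manual_info(): uploads the firmware; a textbook use of a gRPC client-streaming RPC.
- wait_for_device_and_get_sw_info(): polls the device status; an example of unary RPCs combined with defensive design.
- firmware_upgrade(): the orchestration method. Like a project manager, it calls the upload and polling methods in order, which makes the overall business flow explicit.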
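The orchestration role of firmware_upgrade() can be reduced to a small, testable pattern: run named steps in order, report timing, and stop at the first failure. A sketch under the assumption that each step is a zero-argument callable returning bool; run_steps is a hypothetical helper, whereas the real method calls upload_with_manual_info and wait_for_device_and_get_sw_info directly. It measures elapsed time with time.monotonic rather than time.time, because monotonic time is unaffected by wall-clock adjustments.

```python
import time
from typing import Callable, List, Tuple

# A step is a human-readable name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_steps(steps: List[Step], clock: Callable[[], float] = time.monotonic) -> bool:
    """Run steps in order, print elapsed time, and stop at the first failing step."""
    start = clock()
    for index, (name, action) in enumerate(steps, start=1):
        print(f"Step {index}: {name}...")
        if not action():
            print(f"[FAIL] {name} failed after {clock() - start:.2f} seconds")
            return False
        print(f"[OK] {name} finished at {clock() - start:.2f} seconds")
    return True


if __name__ == "__main__":
    # Stand-ins for client.upload_with_manual_info(...) and client.wait_for_device_and_get_sw_info()
    run_steps([
        ("Uploading firmware", lambda: True),
        ("Waiting for device to restart and come back online", lambda: True),
    ])
```

Injecting the clock and the step callables lets the flow be tested without a device or a server.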
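3. The command center: main and argparse
This is where the user interacts with the program. Using Python's built-in argparse module, we give the client a capable, user-friendly command-line interface (CLI).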
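def main():
    parser = argparse.ArgumentParser(description="gRPC Firmware Upgrade Client")
    subparsers = parser.add_subparsers(dest="command", required=True)
    upgrade_parser = subparsers.add_parser("upgrade", help="Upload firmware and wait for the restart")
    upgrade_parser.add_argument("server", help="Server address (e.g., localhost:50050)")
    upgrade_parser.add_argument("file", help="Firmware file path")
    # ... other subcommands and arguments
    args = parser.parse_args()
    # ... dispatch on args.command and return an exit code
if __name__ == "__main__":
    sys.exit(main())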
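Users only have to supply the required arguments on the command line; main() then drives CompleteFirmwareClient through all of the complex work.
In practice: the two core gRPC communication patterns
Theory alone is dry, so let's see what gRPC does in practice.
Pattern 1: Client Streaming RPC - transferring large files efficiently
Firmware files are usually large, so reading an entire file into memory before sending it is not acceptable. Client-streaming RPCs are designed for exactly this case.
How it works: when the client calls the RPC method, it passes an iterator of request messages (here, a generator) instead of a single request object.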
# 1. Define a generator that reads and sends as it goes
def generate_upload_requests(filepath, ...):
    # First request: the file's metadata
    yield create_metadata_request(...)
    # Subsequent requests: read the file chunk by chunk and send each one
    with open(filepath, 'rb') as f:
        while chunk := f.read(CHUNK_SIZE):  # the walrus operator needs Python 3.8+
            yield create_data_chunk_request(chunk)
# 2. Pass the generator when making the RPC call
response = firmware_stub.FirmwareUpdate(
    generate_upload_requests(filepath, ...),
    timeout=600  # a generous timeout
)
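Advantage: with streaming, the client reads the file one chunk at a time instead of loading it whole, so memory use stays small even for GB-sized files, and the points between yields are a natural place to report progress. Note that timeout is a deadline for the entire call, so it has to cover the whole upload, not just a single chunk.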
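The chunking and progress reporting can be isolated from gRPC so that they run and can be tested on their own. In this sketch iter_chunks is a hypothetical helper, and CHUNK_SIZE = 1 MiB is an assumption, though it is consistent with the upload log at the start of this post (137,553,920 bytes sent in 132 chunks, with a progress line every 10,485,760 bytes). The percentage uses integer division, which reproduces the 7% and 99% lines in that log. In the real client each chunk would still be wrapped in a request message before being yielded to FirmwareUpdate; that wrapping needs the generated protobuf classes and is left out here.

```python
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 1024 * 1024  # assumed 1 MiB per chunk


def iter_chunks(stream: BinaryIO, total_size: int, chunk_size: int = CHUNK_SIZE,
                report_every: int = 10 * CHUNK_SIZE) -> Iterator[bytes]:
    """Yield chunks lazily (one chunk held at a time), printing progress every report_every bytes."""
    if chunk_size <= 0 or report_every <= 0:
        raise ValueError("chunk_size and report_every must be positive")
    sent = 0
    next_report = report_every
    while chunk := stream.read(chunk_size):
        sent += len(chunk)
        if sent >= next_report or sent == total_size:
            percent = sent * 100 // total_size if total_size else 100
            print(f"Progress: {percent}% ({sent}/{total_size} bytes)")
            while next_report <= sent:  # skip any thresholds a large chunk jumped over
                next_report += report_every
        yield chunk


if __name__ == "__main__":
    payload = b"\x00" * 3_000_000
    count = sum(1 for _ in iter_chunks(io.BytesIO(payload), len(payload), report_every=CHUNK_SIZE))
    print(f"{count} chunks")
```

Because the function only needs a readable binary stream, the same code works with an open file in the client and with io.BytesIO in tests.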
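Pattern 2: Unary RPC - the classic request-response
This is the simplest and most common pattern, much like an ordinary function call. We use it to query the device's software information.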
request = empty_pb2.Empty()
# Send one request and wait for one response
response = config_stub.GetSWInfos(request, timeout=5)
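Simple and direct, it is a good fit for status queries, fetching configuration, and similar operations.
Challenges and beyond: solving the "zombie connection" problem
The biggest challenge in our project comes from one fact: after a firmware upgrade, the server restarts.
The problem: after the server restarts, the old gRPC channel held by the client is effectively broken, but the client has no direct way of knowing it. Any RPC issued on that old channel then fails, or blocks until its deadline (indefinitely if no deadline is set). We call this a "zombie connection".
The solution: in the loop that polls the device status, we use a simple but effective strategy: create a brand-new, temporary channel for every attempt.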
def wait_for_device_and_get_sw_info(self):
    for attempt in range(self.MAX_POLL_ATTEMPTS):
        try:
            # The 'with' statement makes sure the temporary channel is closed
            with grpc.insecure_channel(self.server_addr) as channel:
                # Create a temporary stub on this brand-new channel
                temp_stub = pb2_grpc.TestInterfaceConfigurationServiceStub(channel)
                # Issue the RPC
                response = temp_stub.GetSWInfos(empty_pb2.Empty(), timeout=2)
                # If the call succeeded, the device is online again and we can leave the loop
                print("✓ Device is back online!")
                return True
        except grpc.RpcError:
            # Failures are expected while the device restarts, so ignore them
            print(f"Attempt {attempt+1}: Device not ready yet...")
        time.sleep(self.POLLING_INTERVAL_SECONDS)
    print("✗ Device did not respond in time.")
    return False
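Because every iteration uses a throwaway channel, every poll is a completely fresh connection attempt, which sidesteps the inconsistent connection state left behind by the server restart. This is a practical and robust pattern for any client that has to talk to a service that may restart.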
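The "fresh attempt per iteration" idea can also be separated from gRPC entirely, which makes the retry loop testable without a server. In this sketch poll_with_fresh_attempts is a hypothetical helper, and attempt is any callable that opens its own connection, uses it once and closes it; for the client above it would wrap the with grpc.insecure_channel(...) block. Two small design choices differ from the loop above: errors that are not expected are re-raised instead of being swallowed, and there is no sleep after the final attempt.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_with_fresh_attempts(
    attempt: Callable[[], T],
    max_attempts: int,
    interval_s: float,
    is_expected_error: Callable[[Exception], bool] = lambda exc: True,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call attempt() up to max_attempts times; return its first result, or None if all fail.

    Each call is expected to create and dispose of its own connection, so no
    connection state carries over from one attempt to the next.
    """
    for n in range(1, max_attempts + 1):
        try:
            return attempt()
        except Exception as exc:
            if not is_expected_error(exc):
                raise  # surface unexpected failures instead of retrying them
        if n < max_attempts:
            sleep(interval_s)
    return None

# With grpc, attempt() could look like this (stub and message names from the client above):
#
# def get_sw_infos_once():
#     with grpc.insecure_channel(server_addr) as channel:
#         stub = pb2_grpc.TestInterfaceConfigurationServiceStub(channel)
#         return stub.GetSWInfos(empty_pb2.Empty(), timeout=2)
#
# and is_expected_error could accept grpc.RpcError whose code() is UNAVAILABLE or DEADLINE_EXCEEDED.
```

Passing sleep as a parameter is what lets a test run the full retry sequence instantly.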
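Conclusion
Through this walk-through of simple_firmware_client.py, we have seen how to organize a clear, maintainable gRPC client project, how to apply different RPC patterns to real problems, and in particular how creating temporary channels handles the tricky "zombie connection" problem.
gRPC's strength lies not only in its performance but also in the complete toolset it provides for building distributed systems that are both elegant and robust. We hope the experience shared here clears a few obstacles from your own path with gRPC.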
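This article looks at a subtle but important problem in practical gRPC work, and should help other developers understand the small differences between language implementations.
gRPC in practice: a deep dive
Why can C++ "just wait" while Python has to "start over"?
When building distributed systems with gRPC, we often run into a tricky scenario: the client has to talk to a service that may restart. Firmware upgrades are a typical example: after the client uploads the firmware, the server must restart to apply it. How should the client wait gracefully for the server to come back and confirm that the upgrade succeeded?
While developing a firmware-upgrade tool recently, we found that our C++ and Python gRPC clients handled this scenario with quite different strategies, which appears to reflect deeper differences between gRPC's language implementations.
The scenario: waiting for a server that restarts
Our workflow is as follows:
1. The client connects to the device (the server) over gRPC.
2. The client uploads the firmware file through a client-streaming RPC.
3. After receiving the file, the server applies the upgrade and restarts.
4. The client polls the server until it is back online, then fetches the updated software version information through a unary RPC (GetSWInfos).
The crux of the problem is step 4: how should the client handle a gRPC connection that the server restart has broken?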
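Two strategies: C++ "waits in place" vs. Python "asks again"
Analyzing the two clients' code, we found two quite different implementation patterns.
The C++ client: the persistent waiter
The C++ client's polling logic looks roughly like this: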
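// config_stub_ is reused for the whole lifetime of the client object
bool SimpleFirmwareClient::WaitForDeviceAndGetSWInfo() {
    // ... initial wait ...
    for (int attempt = 1; attempt <= MAX_POLL_ATTEMPTS; ++attempt) {
        // A grpc::ClientContext must not be reused across RPCs,
        // so every attempt gets a fresh one
        grpc::ClientContext context;
        try {
            // The same stub object is reused on every iteration
            grpc::Status status =
                config_stub_->GetSWInfos(&context, request, &response);
            if (status.ok()) {
                // Success: the server is back online
                return true;
            }
        } catch (...) {
            // Ignore all exceptions and try again (gRPC C++ itself reports
            // RPC failures through grpc::Status rather than exceptions)
        }
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }
    return false;
}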
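Core behavior: throughout polling, the C++ client keeps reusing the same stub object, created when the client was initialized, and with it the same underlying channel. Even when an RPC fails because the server is unreachable (the status is not OK), it simply ignores the error and moves on to the next iteration. It seems to trust that, as long as it keeps trying, the channel behind this stub will eventually reconnect. We call this the "wait in place" strategy.
The Python client: the flexible challenger
Unlike the C++ client, the Python client ran into the "zombie connection" problem when it used the same strategy: after the server restarted, the channel ended up in a failed state from which it did not recover on its own. We therefore settled on a different, more robust strategy: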
# The key change is inside the polling for loop
def wait_for_device_and_get_sw_info(self):
    for attempt in range(self.MAX_POLL_ATTEMPTS):
        try:
            # Key point: create a brand-new, temporary channel on every iteration
            with grpc.insecure_channel(self.server_addr) as channel:
                # Create a temporary stub on top of this new channel
                temp_stub = pb2_grpc.TestInterfaceConfigurationServiceStub(channel)
                # Issue the call
                response = temp_stub.GetSWInfos(empty_pb2.Empty(), timeout=2)
                # Reaching this line means the connection succeeded
                return True
        except grpc.RpcError:
            # Connection failures are expected; ignore them and wait for the next retry
            pass
        time.sleep(self.POLLING_INTERVAL_SECONDS)
    return False
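Core behavior: on every polling attempt, the Python client creates a brand-new channel and stub. Rather than trusting the old connection, it "starts over" each time with a completely new connection attempt. We call this the "ask again" strategy.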
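Why does this difference exist?
Our working explanation is that the C++ client can "wait in place" because its gRPC implementation has a stronger, more persistent background reconnection mechanism.
We can look at this from several angles:
1. Differences in the underlying core library
gRPC's core library, grpc-core, is written in C/C++ and underpins both the C++ and the Python implementations (gRPC for Java and Go, by contrast, are independent implementations). The C++ client can use the full functionality of the core library more directly, which may include a more aggressive and smarter background reconnection strategy.
When the server disconnects and the channel enters the TRANSIENT_FAILURE state, the C++ implementation may keep trying to reconnect automatically in the background, with a backoff strategy designed to cover outages lasting several minutes. Note that an RPC is only held until the reconnection succeeds if it is issued with wait_for_ready enabled; by default, an RPC issued while the channel is in TRANSIENT_FAILURE fails fast with UNAVAILABLE, and in the C++ loop above the next iteration simply tries again.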
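gRPC's connection-backoff specification (doc/connection-backoff.md in the grpc/grpc repository) gives a concrete picture of what "a backoff covering several minutes" means. Its documented defaults are an initial backoff of 1 second, a multiplier of 1.6, ±20% jitter and a cap of 120 seconds. The sketch below reproduces that schedule without jitter and ignores the time each connect attempt itself takes; backoff_schedule is a hypothetical helper, and whether a given release or language binding uses exactly these defaults is something to verify rather than assume. It shows that after roughly five minutes of downtime the gap between reconnect attempts reaches the 120-second cap, so a long-lived channel that has been failing for a while may still be in a long backoff when the device comes back.

```python
from itertools import accumulate
from typing import List

# Defaults documented in gRPC's connection-backoff specification.
INITIAL_BACKOFF_S = 1.0
MULTIPLIER = 1.6
MAX_BACKOFF_S = 120.0


def backoff_schedule(attempts: int, initial: float = INITIAL_BACKOFF_S,
                     multiplier: float = MULTIPLIER, cap: float = MAX_BACKOFF_S) -> List[float]:
    """Delay before each of the first `attempts` reconnects, ignoring jitter and connect time."""
    delays: List[float] = []
    current = initial
    for _ in range(attempts):
        delays.append(current)
        current = min(current * multiplier, cap)
    return delays


if __name__ == "__main__":
    delays = backoff_schedule(14)
    for n, (delay, elapsed) in enumerate(zip(delays, accumulate(delays)), start=1):
        print(f"retry {n:2d}: wait {delay:6.1f}s (total {elapsed:6.1f}s)")
```

The first eleven delays sum to (1.6^11 - 1) / 0.6, about 291 seconds, after which every further delay is the 120-second cap.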
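By contrast, the Python gRPC library, which wraps the same core, may behave more "conservatively" during long outages: it may stop trying after a few failures, leaving the channel permanently broken.
2. Different default channel configuration
A gRPC channel's behavior can be tuned through a large number of parameters. The C++ library may come with defaults that favor long reconnection windows, for example unlimited retries or a very long retry timeout, while Python's defaults may lean toward failing fast so that the application is not blocked for long periods.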
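Instead of guessing at the defaults, the reconnect window can be set explicitly. grpc.initial_reconnect_backoff_ms and grpc.max_reconnect_backoff_ms are standard gRPC core channel arguments that Python passes through the options parameter of grpc.insecure_channel. This sketch is an alternative to the throwaway-channel approach, not a confirmed fix: whether a capped backoff alone lets a reused channel recover reliably after this particular device restarts is an assumption that would need testing. reconnect_options and make_polling_channel are hypothetical helper names, and the 5-second cap is an illustrative value chosen to match the polling interval.

```python
from typing import Any, List, Tuple


def reconnect_options(initial_ms: int = 1000, max_ms: int = 5000) -> List[Tuple[str, int]]:
    """Channel arguments that bound how long gRPC waits between reconnect attempts."""
    if not 0 < initial_ms <= max_ms:
        raise ValueError("expected 0 < initial_ms <= max_ms")
    return [
        ("grpc.initial_reconnect_backoff_ms", initial_ms),
        ("grpc.max_reconnect_backoff_ms", max_ms),
    ]


def make_polling_channel(target: str, max_backoff_ms: int = 5000) -> Any:
    """Create a long-lived channel whose reconnect backoff never exceeds max_backoff_ms."""
    import grpc  # imported here so reconnect_options() can be used without grpcio installed

    options = reconnect_options(initial_ms=min(1000, max_backoff_ms), max_ms=max_backoff_ms)
    return grpc.insecure_channel(target, options=options)
```

With such a channel, a polling call could pass wait_for_ready=True, for example stub.GetSWInfos(empty_pb2.Empty(), timeout=4, wait_for_ready=True): while the device is down, the call then waits up to its deadline for the channel to reconnect instead of failing immediately with UNAVAILABLE.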
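3. An analogy
Let's use waiting for an elevator as an analogy for the two strategies:
- The C++ client (the persistent employee): he walks up to the elevator and finds it under maintenance (the server is restarting). Instead of leaving, he waits right where he is. A smart assistant behind him (the gRPC core library) keeps watching the elevator and tells him the moment it is working again. His strategy is to "wait in place", relying on strong support from the layer below.
- The Python client (the flexible employee): he finds the elevator under maintenance, waits a moment, and then goes off to do something else (time.sleep). Later he comes back and presses the button again, as if arriving for the first time. His strategy is to "stop waiting and ask again", relying on active, application-level retries to get through.
Conclusions and practical advice
This comparison shows that even with the same gRPC protocol, implementations and best practices can differ noticeably between languages.
- The C++ "wait in place" strategy relies on the strong automatic reconnection underneath it, so its code looks simpler.
- The Python "ask again" strategy needs a little more code (a temporary channel has to be created by hand), but it is a more deterministic and robust engineering practice that works around the limits of the default reconnection behavior. For Python developers dealing with a service that may restart, it is the safer, more controllable choice.
Key takeaway: when building a gRPC client that talks to an unreliable service that may restart, do not blindly trust a long-lived connection. If the library's automatic reconnection is not strong enough, or does not behave the way you expect, controlling the retry logic at the application level with single-use connections ("create on demand, discard after use") is a dependable choice.
Understanding these underlying differences helps us write more robust and reliable distributed applications.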
