Root-Cause Analysis and Fix for Segmentation Faults in a Flask Large-Model Inference Service Under High Concurrency
I. Background
While recently load-testing a Python + Flask deployment of a BERT-style NLP intent-classification API, the service process crashed unexpectedly once QPS climbed: the console and logs reported a Segmentation fault, the service port briefly disappeared, and K8S automatically restarted the Pod. The problem took days to pin down, and several colleagues reported hitting the same thing, so this post walks through the troubleshooting and distills the lessons, in the hope of making online NLP inference services more highly available.
II. Reproducing the Symptom
Error log:
10.0.73.226 - - [20/Jun/2025 14:39:13] "POST / HTTP/1.1" 200 -
----------- in default func ----------
*******************Look Predicted class for '分析载荷 GET /news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot HTTP/1.1
X-Forwarded-For: 31.221.44.209
Client-IP: 31.221.44.209
REMOTE_ADDR: 31.221.44.209
Accept: text/html, application/xhtml+xml, */*
Content-Type: text/html
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0
Referer: http://218.3.16.55:9070/news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot
Host: 218.3.16.55:9070': PAYLOAD_EXPLAIN_QA [3, [6.356480639624351e-07, 4.797419137503312e-07, 1.981759851332754e-06, 0.9999969005584717]]
10.0.73.226 - - [20/Jun/2025 14:39:13] "POST / HTTP/1.1" 200 -
/home/ma-user/infer/start.sh: line 4: 23 Segmentation fault python main.py
begin downloading small file ... (the lines below show the restarted service automatically re-downloading its resources)
Download progress: 1/51 - 1%:
Download progress: 2/51 - 3%: ▋
...
Download progress: 50/51 - 98%: ▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋
Download progress: 51/51 - 100%: ▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋
small file download completed, cost time : 0 h 00 m 00 s
begin downloading 4 big files ...
Reproduction script:
import requests
import threading
import multiprocessing
import time

# Configuration
NUM_PROCESSES = 4
NUM_THREADS = 5
REQUESTS_PER_THREAD = 10
URL = "http://onlineservice.cn-nxxxxxxmodelarts-infer.com"
HEADERS = {
    "csb-token": "XXXXX",
    "Content-Type": "application/json"
}
DATA = {
    "message": """**************************分析载荷 GET /news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot HTTP/1.1
X-Forwarded-For: 31.221.44.209
Client-IP: 31.221.44.209
REMOTE_ADDR: 31.221.44.209
Accept: text/html, application/xhtml+xml, */*
Content-Type: text/html
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0
Referer: http://218.3.16.55:9070/news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot
Host: 218.3.16.55:9070"""
}

def worker(shared_dict, lock):
    session = requests.Session()
    for _ in range(REQUESTS_PER_THREAD):
        try:
            resp = session.post(URL, headers=HEADERS, json=DATA, timeout=10)
            if resp.status_code == 200:
                with lock:
                    shared_dict['success'] += 1
            else:
                with lock:
                    shared_dict['fail'] += 1
        except Exception:
            with lock:
                shared_dict['fail'] += 1

def process_func(shared_dict, lock):
    threads = []
    for _ in range(NUM_THREADS):
        t = threading.Thread(target=worker, args=(shared_dict, lock))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

if __name__ == "__main__":
    multiprocessing.freeze_support()  # required on Windows
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    shared_dict['success'] = 0
    shared_dict['fail'] = 0
    lock = manager.Lock()
    procs = []
    start = time.time()
    for _ in range(NUM_PROCESSES):
        p = multiprocessing.Process(target=process_func, args=(shared_dict, lock))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
    end = time.time()
    total = NUM_PROCESSES * NUM_THREADS * REQUESTS_PER_THREAD
    print(f"Total requests: {total}")
    print(f"Successes: {shared_dict['success']}")
    print(f"Failures: {shared_dict['fail']}")
    print(f"Elapsed: {end - start:.2f} s")
1. Error manifestations
- The API returns 502/504, or the connection is refused outright.
- The backend log shows Segmentation fault (core dumped).
- K8S/the cloud platform restarts the Pod in place after the liveness probe fails several times in a row.
- The model-serving code produces no Python traceback; even a try...except catches nothing.
2. Environment
- Python 3.9+/3.10+
- Flask + transformers + torch (or torch_npu)
- The inference API holds the large model (BERT/XLM-R, etc.) as a global
- Development used app.run(...) for debugging; prod and dev were never strictly separated
- Load-testing tools: homegrown scripts / Locust / JMeter
III. Finding the Technical Root Cause
Backend inference service code:
from bert_classify import predict
from flask import Flask, request, jsonify
import json
import logging

app = Flask(__name__)
log = logging.getLogger('werkzeug')
log.setLevel(logging.INFO)

@app.route('/health', methods=['GET'])
def check_health():
    # Health-check endpoint
    log.info("----------- in health check func ----------")
    return '\n Health!\n'

@app.route('/', methods=['POST'])
def default_func():
    # Put your own inference logic here; this is only a minimal example
    log.info("----------- in default func ----------")
    try:
        # Use request.get_json() to parse the incoming JSON directly;
        # force=True makes it ignore the Content-Type header
        data = request.get_json(force=True)
        if data is None:
            log.error("No JSON data found in the request")
            return 'Failed to parse JSON: No data\n', 400
        # Check that the 'message' key exists
        if 'message' not in data:
            log.error("Key 'message' not found in JSON data")
            return 'Failed to parse JSON: key "message" missing\n', 400
        test_text = data['message']
        predicted_label, prediction = predict(test_text)
        log.info(f"*******************Look Predicted class for '{test_text}': {predicted_label} {prediction}")
        return jsonify({'success': True, 'msg': '', 'result': {'predicted_label': predicted_label, 'details': prediction}}), 200
    except Exception as e:
        log.error("Error processing JSON: %s", e)
        return f'Failed to process data: {e}\n', 400

@app.route('/', methods=['GET'])
def hello_world():
    # GET requests just return Hello World
    log.info("----------- in hello world func ----------")
    return 'Hello World! Please use POST method to use BERT classifier.'

# host must be set to "0.0.0.0"; the port is typically 8080
if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8080)
1. Initial lines of investigation
- Resource exhaustion? Physical/virtual memory, GPU memory, and file-handle counts all stayed within ulimit.
- Unit tests and low-concurrency runs were fine; everything worked as long as the load stayed light.
- With the multi-process, multi-threaded load script, concurrency above 10 made the crash reproducible within minutes.
- Are Flask/torch/transformers safe under multiple threads and processes?
- The code showed no infinite loops, yield, fork, or temporary-variable memory leaks.
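A useful companion step when chasing native crashes: Python's built-in faulthandler prints each thread's Python stack when the process receives a fatal signal such as SIGSEGV, which narrows down which request was in flight. A minimal sketch (placing it in main.py is an assumption about the service layout):

# Enable at the very top of main.py, before the model loads.
# On SIGSEGV, faulthandler dumps the Python stack of every thread to stderr.
import faulthandler
faulthandler.enable()

# Equivalent without code changes:
#   python -X faulthandler main.py
# or export PYTHONFAULTHANDLER=1 before starting the service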
2. Key clues
A segmentation fault is, at bottom, an illegal memory access at the C/C++ level, typically triggered by:
- Thread/process synchronization getting out of control
- A C-level buffer being read and written by multiple threads at once, producing wild pointers
- Native libraries such as PyTorch/transformers not being thread-safe
- The model only being safe when one process owns it exclusively; multi-threaded access to the same model is uncontrolled
- Inference memory (RAM/GPU) spiking and not being released in time, blowing up the native layer
- An external kill signal (e.g., ulimit exceeded, too many concurrent sockets), which usually leaves an OOM entry in syslog
3. Verifying the hypotheses
(1) Multiple threads/processes accessing the same model object
- The Flask development server (app.run(...)) handles requests concurrently via the threaded parameter
- Large transformers/torch models must not be called concurrently from different threads; doing so triggers races on native buffers
- Evidence: switching threaded to False under the same simulated high concurrency produced no crash (see the sketch after this list)
(2) Resource exhaustion
- Watching top, htop, nvidia-smi, etc. under heavy load showed RAM and GPU memory surging
- But there was no clear OOM log entry; the process mostly went straight to core dumped
(3) Independent of the load-generation strategy
- Locust, Python multiprocessing, or a curl for-loop: any of them triggers the crash
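For reference, the access pattern suspected in (1) looks like the sketch below. This is a hypothetical illustration, not the production code: the model id and payload are placeholders, and whether it actually faults depends on the exact torch/transformers (or torch_npu) build.

# Hypothetical illustration of the suspect pattern: many threads calling
# one shared model instance concurrently. Model id and text are placeholders.
import threading
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese")
model.eval()

def infer(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits

# 20 threads sharing ONE model object -- the pattern under suspicion
threads = [threading.Thread(target=infer, args=("some payload",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()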
IV. Root-Cause Summary
The main cause of the Segmentation fault: "when the model object is accessed concurrently by multiple threads/processes, the C/C++ layer of PyTorch/transformers misbehaves, memory gets trampled, and the native layer crashes."
- With Flask threading, or multiple processes sharing a global, Python-level locks cannot protect the C-level objects.
- Multi-threading, especially Flask threaded=True or gunicorn threads>1, is dangerous here.
- The official docs describe this risk, but it is easy to overlook.
- OOM has protective mechanisms; a segfault is usually far nastier.
V. The Fixes That Worked
1. Run the service explicitly single-process, single-threaded
In the core service code, make sure of the following:
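A minimal sketch of that configuration, reusing the entry point from section III (threaded=False and processes=1 are standard Werkzeug dev-server options):

# One process, one handler thread: requests are serialized,
# so no two calls ever enter the model concurrently.
if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8080, threaded=False, processes=1)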
Alternatively, in production use only gunicorn with multiple workers, each worker being one single-threaded process.
2. Do not share the model object across threads/processes
Each process should instantiate its own model, so a global instance is never reused by accident (a sketch follows).
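A sketch of what this means for the bert_classify module used above, assuming (as the service code suggests) that the model is created at import time; the model id and return format are placeholders:

# bert_classify.py (sketch). Each gunicorn worker imports this module
# on its own -- as long as --preload is NOT used, every process builds
# a private model instance and nothing is shared across workers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")   # placeholder model id
_model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese")
_model.eval()

def predict(text):
    inputs = _tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(_model(**inputs).logits, dim=-1)[0]
    return int(probs.argmax()), probs.tolist()

Avoiding gunicorn's --preload flag matters here: with it, the model is constructed in the master before the fork, which is exactly the kind of pre-fork shared state (CUDA/NPU handles in particular) that misbehaves across processes.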
3. For high concurrency, use multiple processes, never multiple threads
The recommended gunicorn launch:
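A sketch of such a command line (the main:app target assumes the main.py module from section III):

# 4 single-threaded sync workers; tune -w to the CPU core count
gunicorn -w 4 --threads 1 -b 0.0.0.0:8080 main:app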
Match the number of worker processes to the CPU core count; keep threads at 1 without exception.
4. Rate limiting and load balancing up front
Under high QPS, rate-limit at the ingress (Nginx or similar) so the model processes are never hit by a burst they cannot absorb.
5. Further notes
If high throughput is a hard requirement, consider a dedicated inference server such as TorchServe or Triton Inference Server.
Revised concurrency test script:
import requests
import threading
import multiprocessing
import time
import matplotlib.pyplot as plt

# Configuration
NUM_PROCESSES = 4
NUM_THREADS = 5
REQUESTS_PER_THREAD = 100
URL = "http://onlineservixxxs-cn-north-9.modelarts-infer.com"
HEADERS = {
    "csb-token": "XXX",  # replace with the real token
    "Content-Type": "application/json"
}
DATA = {
    "message": """**************************分析载荷 GET /news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot HTTP/1.1
X-Forwarded-For: 31.221.44.209
Client-IP: 31.221.44.209
REMOTE_ADDR: 31.221.44.209
Accept: text/html, application/xhtml+xml, */*
Content-Type: text/html
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0
Referer: http://218.3.16.55:9070/news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot
Host: 218.3.16.55:9070"""
}

def worker(shared_dict, lock, response_times):
    session = requests.Session()
    for _ in range(REQUESTS_PER_THREAD):
        try:
            start_time = time.time()
            resp = session.post(URL, headers=HEADERS, json=DATA, timeout=10)
            elapsed = time.time() - start_time
            response_times.append(elapsed)
            if resp.status_code == 200:
                with lock:
                    shared_dict['success'] += 1
            else:
                with lock:
                    shared_dict['fail'] += 1
        except Exception:
            elapsed = time.time() - start_time
            response_times.append(elapsed)
            with lock:
                shared_dict['fail'] += 1

def process_func(shared_dict, lock, response_times):
    threads = []
    for _ in range(NUM_THREADS):
        t = threading.Thread(target=worker, args=(shared_dict, lock, response_times))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

if __name__ == "__main__":
    multiprocessing.freeze_support()
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    shared_dict['success'] = 0
    shared_dict['fail'] = 0
    lock = manager.Lock()
    response_times = manager.list()  # multiprocessing-safe list
    procs = []
    start = time.time()
    for _ in range(NUM_PROCESSES):
        p = multiprocessing.Process(target=process_func, args=(shared_dict, lock, response_times))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
    end = time.time()
    total = NUM_PROCESSES * NUM_THREADS * REQUESTS_PER_THREAD
    times = list(response_times)  # convert manager.list() to a plain list
    if times:
        avg_time = sum(times) / len(times)
        print(f"Total requests: {total}")
        print(f"Successes: {shared_dict['success']}")
        print(f"Failures: {shared_dict['fail']}")
        print(f"Elapsed: {end - start:.2f} s")
        print(f"Average response time: {avg_time:.3f} s")
        # Latency histogram (saved to a file so it also works headless)
        plt.hist(times, bins=50)
        plt.xlabel("response time (s)")
        plt.ylabel("request count")
        plt.title("Response time distribution")
        plt.savefig("response_times.png")
