Root-Cause Analysis and Fix for Segmentation Faults in a High-Concurrency Flask Large-Model Inference Service


I. Background

While recently load-testing a BERT-style NLP intent-classification API deployed with Python + Flask, we hit unexpected process crashes at high QPS: the console and logs reported a Segmentation fault, the service port disappeared briefly, and K8S restarted the Pod automatically. The problem took several days to pin down, and a number of colleagues reported the same symptom, so this post retraces the investigation and distills the lessons, in the hope of improving the availability of online NLP inference services.


II. Reproducing the Symptom

Error log:

10.0.73.226 - - [20/Jun/2025 14:39:13] "POST / HTTP/1.1" 200 -
----------- in default func ----------
*******************Look Predicted class for '分析载荷 GET /news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot HTTP/1.1
X-Forwarded-For: 31.221.44.209
Client-IP: 31.221.44.209
REMOTE_ADDR: 31.221.44.209
Accept: text/html, application/xhtml+xml, */*
Content-Type: text/html
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0
Referer: http://218.3.16.55:9070/news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot
Host: 218.3.16.55:9070': PAYLOAD_EXPLAIN_QA [3, [6.356480639624351e-07, 4.797419137503312e-07, 1.981759851332754e-06, 0.9999969005584717]]
10.0.73.226 - - [20/Jun/2025 14:39:13] "POST / HTTP/1.1" 200 -
/home/ma-user/infer/start.sh: line 4:    23 Segmentation fault      python main.py
begin downloading small file ...  (the lines below show the restarted service automatically re-downloading its resources)
 
Download progress: 1/51 - 1%:  
Download progress: 2/51 - 3%:  ▋
...(progress lines 3/51 through 50/51 omitted)...
Download progress: 51/51 - 100%:  ▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋▋
small file download completed, cost time : 0 h 00 m 00 s
begin downloading 4 big files ...

Reproduction script:

import requests
import threading
import multiprocessing
import time

# configuration
NUM_PROCESSES = 4
NUM_THREADS = 5
REQUESTS_PER_THREAD = 10

URL = "http://onlineservice.cn-nxxxxxxmodelarts-infer.com"
HEADERS = {
    "csb-token": "XXXXX",
    "Content-Type": "application/json"
}
DATA = {
    "message": """**************************分析载荷 GET /news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot HTTP/1.1
X-Forwarded-For: 31.221.44.209
Client-IP: 31.221.44.209
REMOTE_ADDR: 31.221.44.209
Accept: text/html, application/xhtml+xml, */*
Content-Type: text/html
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0
Referer: http://218.3.16.55:9070/news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot
Host: 218.3.16.55:9070"""
}

def worker(shared_dict, lock):
    session = requests.Session()
    for _ in range(REQUESTS_PER_THREAD):
        try:
            resp = session.post(URL, headers=HEADERS, json=DATA, timeout=10)
            if resp.status_code == 200:
                with lock:
                    shared_dict['success'] += 1
            else:
                with lock:
                    shared_dict['fail'] += 1
        except Exception:
            with lock:
                shared_dict['fail'] += 1

def process_func(shared_dict, lock):
    threads = []
    for _ in range(NUM_THREADS):
        t = threading.Thread(target=worker, args=(shared_dict, lock))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

if __name__ == "__main__":
    multiprocessing.freeze_support()  # required on Windows
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    shared_dict['success'] = 0
    shared_dict['fail'] = 0
    lock = manager.Lock()

    procs = []
    start = time.time()
    for _ in range(NUM_PROCESSES):
        p = multiprocessing.Process(target=process_func, args=(shared_dict, lock))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
    end = time.time()

    total = NUM_PROCESSES * NUM_THREADS * REQUESTS_PER_THREAD
    print(f"总请求数: {total}")
    print(f"成功数:   {shared_dict['success']}")
    print(f"失败数:   {shared_dict['fail']}")
    print(f"耗时:     {end - start:.2f} 秒")

 

1. Error symptoms

  • The API returns 502/504, or connections are refused outright.
  • The backend log shows Segmentation fault (core dumped).
  • K8S/the cloud platform restarts the Pod in place; the liveness probe fails repeatedly.
  • The service code produces no Python traceback; even try...except catches nothing (see the faulthandler sketch below).
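
Since try...except can never intercept a SIGSEGV, the one way to get a Python-side traceback at crash time is the standard-library faulthandler module. A minimal sketch of what can be added at the top of main.py:

import faulthandler
import sys

# Dump the Python traceback of every thread when the process receives
# SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL. This runs at signal level,
# so it fires even though no Python exception is ever raised.
faulthandler.enable(file=sys.stderr, all_threads=True)

The same effect is available without code changes via python -X faulthandler main.py.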

2. Environment

  • Python 3.9+/3.10+
  • Flask + transformers + torch (or torch_npu)
  • The inference API holds the large model (BERT/XLM-R, etc.) as a global object
  • app.run(...) was used for debugging during development; prod and dev were never strictly separated
  • Load generators: home-grown scripts / Locust / JMeter

III. Finding the Root Cause

Backend inference service code:

from bert_classify import predict
from flask import Flask, request, jsonify
import json
import logging

app = Flask(__name__)
log = logging.getLogger('werkzeug')
log.setLevel(logging.INFO)


@app.route('/health', methods=['GET'])
def check_health():
    # health-check endpoint
    log.info("----------- in health check func ----------")
    return '\n Health!\n'


@app.route('/', methods=['POST'])
def default_func():
    # put your own inference logic here; this is only a minimal example
    log.info("----------- in default func ----------")
    try:
        # Using request.get_json() to directly parse incoming JSON data
        # set force=True to ignore Content-Type
        data = request.get_json(force=True)
        if data is None:
            log.error("No JSON data found in the request")
            return 'Failed to parse JSON: No data\n', 400

        # Check if 'message' key exists
        if 'message' not in data:
            log.error("Key 'message' not found in JSON data")
            return 'Failed to parse JSON: key "message" missing\n', 400

        test_text = data['message']
        predicted_label, prediction  = predict(test_text)
        log.info(f"*******************Look Predicted class for '{test_text}': {predicted_label} {prediction}")

        return jsonify({'success': True, 'msg': '', 'result': {'predicted_label': predicted_label, 'details': prediction}}), 200


    except Exception as e:
        log.error("Error processing JSON: %s", e)
        return f'Failed to process data: {e}\n', 400


@app.route('/', methods=['GET'])
def hello_world():
    # GET requests just print Hello World
    log.info("----------- in hello world func ----------")
    return 'Hello World! Please use POST method to use BERT classifier.'


# host must be "0.0.0.0"; the port is conventionally 8080
if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8080)

  

1. Initial lines of investigation

  • Resource exhaustion? Physical/virtual memory, GPU memory, and handle counts all stayed within ulimit.
  • Unit tests and low-concurrency traffic were fine; everything ran normally as long as the load stayed light.
  • With the multi-process, multi-thread load script, concurrency above 10 made the crash reproducible within minutes.
  • Are Flask/torch/transformers actually safe under multi-threading and multi-processing?
  • No infinite loops, misuse of yield/fork, or temporary-variable memory leaks were found in the code.

2. Key clues

A segmentation fault is, at bottom, an illegal memory access at the C/C++ level. It is typically triggered by one of the following (a core-dump recipe for confirming the native stack follows this list):

  • Broken thread/process synchronization
    • The same underlying C buffer is read and written by several threads at once, producing dangling pointers
  • Native libraries such as PyTorch/transformers are not thread-safe
    • They are only safe when one process owns the model exclusively; several threads calling into the same model is uncontrolled
  • A spike in inference memory (RAM/GPU) that is not released in time blows up the native layer
  • An external kill signal (e.g. exceeding ulimit, too many concurrent sockets); this usually leaves an OOM entry in syslog
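
To confirm which native library is faulting, enable core dumps and read the C-level backtrace. A minimal recipe (assumes gdb is installed in the container image):

# allow core files, then run the service until it crashes
ulimit -c unlimited
python main.py
# after the crash, print the native stack from the core file
# (the core file's name/location depends on kernel.core_pattern)
gdb -q $(which python) core -ex bt -ex quit

If the backtrace lands inside libtorch or another native extension, concurrent access to the model is the prime suspect.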

3. Testing the hypotheses

(1) Multiple threads/processes touching the same model object

  • The Flask development server (app.run(...)) handles requests concurrently via its threaded option (enabled by default in recent Flask versions)
  • Large transformers/torch models must not be called concurrently from different threads; doing so races on native buffers
  • Evidence: with threaded switched to False, the same simulated high-concurrency load produced no crash (the toggle is shown below)
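
The A/B toggle is just the development server's threaded flag:

# crashes under load: worker threads call into the shared global model concurrently
app.run(host="0.0.0.0", port=8080, threaded=True)

# survives the same load: requests are handled serially in a single thread
app.run(host="0.0.0.0", port=8080, threaded=False)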

(2) Resource exhaustion

  • Under heavy load, top, htop, and nvidia-smi show RAM and GPU memory climbing sharply
  • But there is no clear OOM log entry; the process mostly dies with a straight core dump

(3) Independent of the load-generation strategy

  • Locust, Python multiprocessing, or even a plain curl for-loop can all trigger the crash

IV. Root Cause Summary

The root cause of the segmentation fault: when multiple threads/processes access the model object concurrently, the C/C++ layer of PyTorch/transformers misbehaves, threads trample each other's memory, and the native layer crashes.

  • With Flask threading, or multiple processes sharing a global variable, Python-level locking does not reliably shield the underlying C-level objects.
  • Multi-threading is the dangerous configuration: Flask threaded=True or gunicorn threads>1 in particular!
  • The official documentation notes this risk, but it is easy to overlook.
  • OOM at least has protection mechanisms; a segfault is usually far nastier.

V. Effective Fixes

1. Run the server explicitly single-threaded and single-process
In the core service code, make sure:

app.run(host="0.0.0.0", port=8080, threaded=False, processes=1)

Or, in production, use only gunicorn with multiple workers, where each worker is a single-threaded process.

2. Never share one model object across threads or processes
Each process should instantiate its own model, so that a global object is never reused by accident. A sketch follows.
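
A minimal sketch of what bert_classify can look like with per-process lazy loading; the ./model path, the max length, and the return shape are placeholders, since the real loading code is not shown in this post:

import os
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

_MODEL = None
_TOKENIZER = None
_OWNER_PID = None  # PID of the process that loaded the current model

def _get_model():
    """Load the model lazily, once per process (i.e. once per gunicorn worker)."""
    global _MODEL, _TOKENIZER, _OWNER_PID
    if _MODEL is None or _OWNER_PID != os.getpid():
        # (Re)load after fork so no two processes ever share native buffers.
        _TOKENIZER = AutoTokenizer.from_pretrained("./model")
        _MODEL = AutoModelForSequenceClassification.from_pretrained("./model")
        _MODEL.eval()
        _OWNER_PID = os.getpid()
    return _TOKENIZER, _MODEL

def predict(text):
    tokenizer, model = _get_model()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    return int(torch.argmax(probs)), probs.tolist()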

3. Scale high concurrency with processes, not threads
Recommended gunicorn launch:

gunicorn -w 4 -b 0.0.0.0:8080 app:app --threads 1

Match the worker count to the number of CPU cores, and always keep threads at 1.
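
The same settings can also live in a config file. A sketch (the timeout value is an assumption; tune it to your model's worst-case latency):

# gunicorn.conf.py
import multiprocessing

bind = "0.0.0.0:8080"
workers = multiprocessing.cpu_count()  # one single-threaded process per core
threads = 1                            # keep each worker single-threaded
timeout = 120                          # model inference can exceed the 30 s default

Start it with gunicorn -c gunicorn.conf.py app:app.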

4. Rate-limit and load-balance in front of the service
At high QPS, throttle at the Nginx (or equivalent) entry point so the model processes are never hit by an instantaneous burst.
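
A minimal Nginx sketch; the 50 r/s rate, burst size, and upstream address are illustrative placeholders, not values from our deployment:

# throttle per client IP before traffic reaches the inference workers
limit_req_zone $binary_remote_addr zone=infer_limit:10m rate=50r/s;

server {
    listen 80;
    location / {
        limit_req zone=infer_limit burst=20 nodelay;  # absorb short spikes, reject the excess
        proxy_pass http://127.0.0.1:8080;             # gunicorn workers behind
    }
}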

5. Going further
If you genuinely need high throughput, consider a dedicated inference server such as TorchServe or Triton Inference Server; a TorchServe sketch follows.
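
For reference, serving the same classifier with TorchServe looks roughly like this; the model name, file paths, and use of the built-in text_classifier handler are illustrative, not taken from our deployment:

# package the checkpoint into a .mar archive
torch-model-archiver --model-name intent_bert --version 1.0 \
    --serialized-file ./model/pytorch_model.bin \
    --extra-files ./model/config.json \
    --handler text_classifier \
    --export-path model_store
# serve it; TorchServe manages its own pool of worker processes
torchserve --start --model-store model_store --models intent_bert=intent_bert.mar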

 

Revised load-testing script:

import requests
import threading
import multiprocessing
import time
import matplotlib.pyplot as plt

# configuration
NUM_PROCESSES = 4
NUM_THREADS = 5
REQUESTS_PER_THREAD = 100

URL = "http://onlineservixxxs-cn-north-9.modelarts-infer.com"
HEADERS = {
    "csb-token": "XXX",   # 替换为实际token
    "Content-Type": "application/json"
}
DATA = {
    "message": """**************************分析载荷 GET /news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot HTTP/1.1
X-Forwarded-For: 31.221.44.209
Client-IP: 31.221.44.209
REMOTE_ADDR: 31.221.44.209
Accept: text/html, application/xhtml+xml, */*
Content-Type: text/html
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0
Referer: http://218.3.16.55:9070/news/js.php?f_id=1)%20UnIoN%20SeLeCt%201,Md5(1234),3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51%23&type=hot
Host: 218.3.16.55:9070"""
}

def worker(shared_dict, lock, response_times):
    session = requests.Session()
    for _ in range(REQUESTS_PER_THREAD):
        try:
            start_time = time.time()
            resp = session.post(URL, headers=HEADERS, json=DATA, timeout=10)
            elapsed = time.time() - start_time
            response_times.append(elapsed)
            if resp.status_code == 200:
                with lock:
                    shared_dict['success'] += 1
            else:
                with lock:
                    shared_dict['fail'] += 1
        except Exception:
            elapsed = time.time() - start_time
            response_times.append(elapsed)
            with lock:
                shared_dict['fail'] += 1

def process_func(shared_dict, lock, response_times):
    threads = []
    for _ in range(NUM_THREADS):
        t = threading.Thread(target=worker, args=(shared_dict, lock, response_times))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

if __name__ == "__main__":
    multiprocessing.freeze_support()
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    shared_dict['success'] = 0
    shared_dict['fail'] = 0
    lock = manager.Lock()
    response_times = manager.list()  # process-safe shared list

    procs = []
    start = time.time()
    for _ in range(NUM_PROCESSES):
        p = multiprocessing.Process(target=process_func, args=(shared_dict, lock, response_times))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
    end = time.time()

    total = NUM_PROCESSES * NUM_THREADS * REQUESTS_PER_THREAD
    times = list(response_times)  # convert the manager.list() proxy to a plain list
    if times:
        avg_time = sum(times) / len(times)
        max_time = max(times)
        min_time = min(times)
        print(f"最快响应时间: {min_time:.3f} s")
        print(f"最慢响应时间: {max_time:.3f} s")
        print(f"平均响应时间: {avg_time:.3f} s")
        # plot the response-time distribution
        plt.figure(figsize=(10, 6))
        plt.hist(times, bins=60, edgecolor='black')
        plt.xlabel("Response Time (s)")
        plt.ylabel("Count")
        plt.title("Time Histogram")
        plt.grid()
        plt.tight_layout()
        plt.savefig("response_time_hist.png")  # save the figure
        print("响应时间分布图已保存为 response_time_hist.png")
    else:
        print("没有响应时间数据")

    print(f"总请求数: {total}")
    print(f"成功数:   {shared_dict['success']}")
    print(f"失败数:   {shared_dict['fail']}")
    print(f"总耗时:   {end - start:.2f} 秒")

 

Intent-classification latency results (histogram saved as response_time_hist.png):

 

Fastest response: 0.080 s
Slowest response: 0.749 s
Mean response:    0.709 s
Response-time histogram saved as response_time_hist.png
Total requests: 2000
Successes:      2000
Failures:       0
Total elapsed:  72.24 s


VI. Lessons Learned

  • Safe concurrency for large-model inference services means multiple processes, never multiple threads!
  • When deploying inference APIs with Flask/gunicorn, threads=1 is the baseline safety line
  • Every model-related resource (model, tokenizer, session, etc.) must be process-isolated and accessed serially, never from multiple threads
  • Multi-threading saves effort in development; in production it is a landmine

VII. Postscript

Similar crashes occasionally show up with TensorFlow, ONNX, and other Python packages as well; in the vast majority of cases the root cause is the same: C++/CUDA objects that are not thread-safe. Hopefully this write-up of our pain points helps anyone facing a similar problem locate and fix it quickly, without the detours.

posted @ 2025-06-23 19:48  bonelee