dify-2: Adjusting the Question Classifier

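  While running the question classifier, I happened to notice a large chunk of output in the data-processing section (shown below). It is the prompt sent to the LLM. On a whim I had DeepSeek analyze it, and it pointed out several problems: format pollution, redundant input, too few Chinese examples, and a category design that could be improved. I also looked through the api source code and found that it is quite easy to change.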

{
  "model_mode": "chat",
  "prompts": [
    {
      "role": "system",
      "text": "\n    ### Job Description',\n    You are a text classification engine that analyzes text data and assigns categories based on user input or automatically determined categories.\n    ### Task\n    Your task is to assign one categories ONLY to the input text and only one category may be assigned returned in the output. Additionally, you need to extract the key words from the text that are related to the classification.\n    ### Format\n    The input text is in the variable input_text. Categories are specified as a category list with two filed category_id and category_name in the variable categories. Classification instructions may be included to improve the classification accuracy.\n    ### Constraint\n    DO NOT include anything other than the JSON array in your response.\n    ### Memory\n    Here are the chat histories between human and assistant, inside <histories></histories> XML tags.\n    <histories>\n    \n    </histories>\n",
      "files": []
    },
    {
      "role": "user",
      "text": "\n    { \"input_text\": [\"I recently had a great experience with your company. The service was prompt and the staff was very friendly.\"],\n    \"categories\": [{\"category_id\":\"f5660049-284f-41a7-b301-fd24176a711c\",\"category_name\":\"Customer Service\"},{\"category_id\":\"8d007d06-f2c9-4be5-8ff6-cd4381c13c60\",\"category_name\":\"Satisfaction\"},{\"category_id\":\"5fbbbb18-9843-466d-9b8e-b9bfbb9482c8\",\"category_name\":\"Sales\"},{\"category_id\":\"23623c75-7184-4a2e-8226-466c2e4631e4\",\"category_name\":\"Product\"}],\n    \"classification_instructions\": [\"classify the text based on the feedback provided by customer\"]}\n",
      "files": []
    },
    {
      "role": "assistant",
      "text": "\n```json\n    {\"keywords\": [\"recently\", \"great experience\", \"company\", \"service\", \"prompt\", \"staff\", \"friendly\"],\n    \"category_id\": \"f5660049-284f-41a7-b301-fd24176a711c\",\n    \"category_name\": \"Customer Service\"}\n```\n",
      "files": []
    },
    {
      "role": "user",
      "text": "\n    {\"input_text\": [\"bad service, slow to bring the food\"],\n    \"categories\": [{\"category_id\":\"80fb86a0-4454-4bf5-924c-f253fdd83c02\",\"category_name\":\"Food Quality\"},{\"category_id\":\"f6ff5bc3-aca0-4e4a-8627-e760d0aca78f\",\"category_name\":\"Experience\"},{\"category_id\":\"cc771f63-74e7-4c61-882e-3eda9d8ba5d7\",\"category_name\":\"Price\"}],\n    \"classification_instructions\": []}\n",
      "files": []
    },
    {
      "role": "assistant",
      "text": "\n```json\n    {\"keywords\": [\"bad service\", \"slow\", \"food\", \"tip\", \"terrible\", \"waitresses\"],\n    \"category_id\": \"f6ff5bc3-aca0-4e4a-8627-e760d0aca78f\",\n    \"category_name\": \"Experience\"}\n```\n",
      "files": []
    },
    {
      "role": "user",
      "text": "\n    '{\"input_text\": [\"你是谁\"],',\n    '\"categories\": [{\"category_id\": \"1\", \"category_name\": \"需要运营分析\"}, {\"category_id\": \"2\", \"category_name\": \"执行数据同步\"}, {\"category_id\": \"1748438042043\", \"category_name\": \"用户意图不明\"}], ',\n    '\"classification_instructions\": [\"\"]}'\n",
      "files": []
    },
    {
      "role": "user",
      "text": "你是谁",
      "files": []
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "prompt_unit_price": "0",
    "prompt_price_unit": "0",
    "prompt_price": "0",
    "completion_tokens": 0,
    "completion_unit_price": "0",
    "completion_price_unit": "0",
    "completion_price": "0",
    "total_tokens": 0,
    "total_price": "0",
    "currency": "USD",
    "latency": 2.3074390459805727
  },
  "finish_reason": "stop"
}

  1. Locate the source by keyword: the prompt templates are in template_prompts.py, and the code that runs them is in question_classifier_node.py.

[app@localhost api]$ pwd
/app/dify/api
[app@localhost api]$ grep -rl "You are a text classification engine"
core/workflow/nodes/question_classifier/template_prompts.py
[app@localhost api]$ ls core/workflow/nodes/question_classifier/
entities.py  exc.py  __init__.py  question_classifier_node.py  template_prompts.py

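  2. Format pollution: wrapping the payload in single quotes, as in '{\"input_text\":...}', may break JSON parsing. Fix: edit the prompt templates (the changed parts were marked in red bold in the original post), which removes the single quotes.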

QUESTION_CLASSIFIER_SYSTEM_PROMPT = """
    ### Job Description',
    You are a text classification engine that analyzes text data and assigns categories based on user input or automatically determined categories.
    ### Task
    Your task is to assign one categories ONLY to the input text and only one category may be assigned returned in the output. Additionally, you need to extract the key words from the text that are related to the classification.
    ### Format
    The input text is in the variable input_text. Categories are specified as a category list with two filed category_id and category_name in the variable categories. Classification instructions may be included to improve the classification accuracy.
    ### Constraint
    DO NOT include anything other than the JSON array in your response.
    ### Memory
    Here are the chat histories between human and assistant, inside <histories></histories> XML tags.
    <histories>
    {histories}
    </histories>
"""  # noqa: E501
...
QUESTION_CLASSIFIER_USER_PROMPT_3 = """
    {{"input_text": ["{input_text}"],
    "categories": {categories},
    "classification_instructions": ["{classification_instructions}"]}}
"""
...
### Memory
Here are the chat histories between human and assistant, inside <histories></histories> XML tags.
<histories>
{histories}
</histories>
### User Input
{{"input_text" : ["{input_text}"], "categories" : {categories},"classification_instruction" : ["{classification_instructions}"]}}
### Assistant Output
"""  # noqa: E501

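To check the effect of removing the single quotes, the sketch below renders both template variants and tries to parse the result as JSON. It assumes the template is filled with `str.format` (the doubled braces suggest this, but it is not shown in the post). The quoted variant is reconstructed from the prompt dump above, and `render` and `is_valid_json` are hypothetical helpers, not Dify functions.

```python
import json

# Modified template text, copied from the post (no single quotes).
USER_PROMPT_3 = """
    {{"input_text": ["{input_text}"],
    "categories": {categories},
    "classification_instructions": ["{classification_instructions}"]}}
"""

# Assumption: the original template, reconstructed from the prompt dump,
# where every line was wrapped in single quotes.
QUOTED_USER_PROMPT_3 = """
    '{{"input_text": ["{input_text}"],',
    '"categories": {categories}, ',
    '"classification_instructions": ["{classification_instructions}"]}}'
"""


def render(template, input_text, categories, instructions=""):
    """Fill a template the way str.format would (hypothetical helper)."""
    return template.format(
        input_text=input_text,
        categories=json.dumps(categories, ensure_ascii=False),
        classification_instructions=instructions,
    )


def is_valid_json(text):
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


categories = [
    {"category_id": "1", "category_name": "需要运营分析"},
    {"category_id": "2", "category_name": "执行数据同步"},
    {"category_id": "1748438042043", "category_name": "用户意图不明"},
]

clean = render(USER_PROMPT_3, "你是谁", categories)
quoted = render(QUOTED_USER_PROMPT_3, "你是谁", categories)

print(is_valid_json(clean))   # True
print(is_valid_json(quoted))  # False
```

The input text is interpolated as-is, so a question that itself contains a double quote would still produce invalid JSON even with the cleaned template.
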
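  3. Redundant input: two consecutive user messages (one structured, one plain text) may cause ambiguity. Fixing this requires a change in question_classifier_node.py (marked in red bold in the original post): the sys_query="" argument below was originally sys_query=query.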

import json
from collections.abc import Mapping, Sequence
from typing import Any, Optional, cast

...

prompt_messages, stop = self._fetch_prompt_messages(
    prompt_template=prompt_template,
    sys_query="",  # originally sys_query=query; emptied to avoid the extra plain-text user message
    memory=memory,
    model_config=model_config,
    sys_files=files,
    vision_enabled=node_data.vision.enabled,
    vision_detail=node_data.vision.configs.detail,
    variable_pool=variable_pool,
    jinja2_variables=[],
)

...
        else:
            raise InvalidModelTypeError(f"Model mode {model_mode} not support.")
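
One way to confirm the redundant message is gone is to scan the `prompts` list from the node's output for adjacent messages with the same role. This is a minimal sketch: `consecutive_same_role` is a hypothetical helper, the message texts are abbreviated from the dump above, and the `after` list only assumes that an empty `sys_query` drops the trailing plain-text message, which should be verified against a real run.

```python
def consecutive_same_role(prompts):
    """Return index pairs (i, i + 1) where two adjacent messages share a role."""
    return [
        (i, i + 1)
        for i in range(len(prompts) - 1)
        if prompts[i]["role"] == prompts[i + 1]["role"]
    ]


# Roles in the order they appear in the dump; texts are abbreviated.
before = [
    {"role": "system", "text": "### Job Description ..."},
    {"role": "user", "text": "few-shot example 1"},
    {"role": "assistant", "text": "few-shot answer 1"},
    {"role": "user", "text": "few-shot example 2"},
    {"role": "assistant", "text": "few-shot answer 2"},
    {"role": "user", "text": '{"input_text": ["你是谁"], ...}'},
    {"role": "user", "text": "你是谁"},
]

# Assumed shape after the change: the trailing plain-text message is absent.
after = before[:-1]

print(consecutive_same_role(before))  # [(5, 6)]
print(consecutive_same_role(after))   # []
```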

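  4. Too few Chinese examples: I have not changed this part yet. Later I may consider whether to rewrite all of the English prompts in Chinese.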

  These are all the changes so far. Whether they improve the final results remains to be seen.

 

posted @ 2025-05-29 13:23  badwood