A Record of an AI-Assisted Debugging Session (2024 article, reposted)


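Because of network restrictions at home, I could not install ollama online, so I installed it offline. The offline install, however, does not ship a systemd unit file by default.
So I wrote a service file by hand, but it did not work, and I asked WindSurf + Claude 3.5 Sonnet to help me fix the configuration.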


Debugging process

Me: This configuration file doesn't work. Can you help me figure out why?

Let me look at the contents of the ollama.service file to see what is wrong with the configuration.

Analyzed ollama.service
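I found a few possible issues:

  1. The command path in the ExecStart line is unnecessarily wrapped in quotes. In a systemd service file, the command path should not be quoted. It should be changed to: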

ExecStart=/usr/bin/ollama serve

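  2. You also need to make sure that:
  • the ollama user and group have been created
  • /usr/bin/ollama exists and is executable
  • the ollama user has appropriate permissions on the directories and files it needs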
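
The first two checks in this list can be bundled into one small helper. This is only a sketch: `precheck_service` is a hypothetical name, and `[ -x ... ]` tests executability for whoever runs the script, not for the service account itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a service binary and a service
# account look usable before systemd tries to start the unit.
precheck_service() {
  local bin=$1 user=$2 ok=0
  # -x is true for searchable directories too, so exclude them explicitly.
  if [ -x "$bin" ] && [ ! -d "$bin" ]; then
    echo "ok: $bin exists and is executable"
  else
    echo "fail: $bin is missing or not executable"; ok=1
  fi
  # id -u exits non-zero when the account does not exist.
  if id -u "$user" >/dev/null 2>&1; then
    echo "ok: user $user exists"
  else
    echo "fail: user $user does not exist"; ok=1
  fi
  return "$ok"
}
```

For the setup in this post it would be called as `precheck_service /usr/bin/ollama ollama`.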

Let me fix this configuration file for you:

Edited ollama.service +1 -2 18 Open diff

Let's check whether ollama is installed correctly:

Ran terminal command
/data/ollama

$ which ollama

Stdout

/usr/bin/ollama

Ran terminal command
/data/ollama

$ id ollama

Stdout

uid=997(ollama) gid=984(ollama) 组=984(ollama)

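OK, I have:

  1. Fixed the syntax error in the service file (removed the unnecessary quotes in ExecStart)
  2. Verified that the ollama command does exist at /usr/bin/ollama
  3. Confirmed that the ollama user and group have both been created correctly

Now you can try reloading and starting the service: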

Ran terminal command
/data/ollama

$ sudo systemctl daemon-reload

Ran terminal command
/data/ollama

$ sudo systemctl restart ollama

Ran terminal command
/data/ollama

$ sudo systemctl status ollama

Stdout

● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Tue 2024-11-26 19:47:31 CST; 1s ago
    Process: 4210 ExecStart=/usr/bin/ollama serve (code=exited, status=203/EXEC)
   Main PID: 4210 (code=exited, status=203/EXEC)
        CPU: 7ms

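It looks like the service still fails to start, with error code 203/EXEC. This usually indicates a permission problem with the executable, or that the file does not exist. Let's check the permissions on the ollama executable: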

Ran terminal command
/data/ollama

$ ls -l /usr/bin/ollama

Stdout

-rwxr-xr-x 1 root root 47605784 Nov 22 02:50 /usr/bin/ollama

The permissions look correct. Let's see whether the logs contain more detailed error information:

Ran terminal command
/data/ollama

$ sudo journalctl -u ollama -n 50

(31 lines above hidden)

Stdout

Nov 26 19:47:35 ff-gpu systemd[1]: Started ollama.service - Ollama Service.
Nov 26 19:47:35 ff-gpu systemd[1]: ollama.service: Main process exited, code=exited, status=203/EXEC
Nov 26 19:47:35 ff-gpu systemd[1]: ollama.service: Failed with result 'exit-code'.
Nov 26 19:47:38 ff-gpu systemd[1]: ollama.service: Scheduled restart job, restart counter is at 111.
Nov 26 19:47:38 ff-gpu systemd[1]: Started ollama.service - Ollama Service.
Nov 26 19:47:38 ff-gpu systemd[1]: ollama.service: Main process exited, code=exited, status=203/EXEC
Nov 26 19:47:38 ff-gpu systemd[1]: ollama.service: Failed with result 'exit-code'.
Nov 26 19:47:41 ff-gpu systemd[1]: ollama.service: Scheduled restart job, restart counter is at 112.
Nov 26 19:47:41 ff-gpu systemd[1]: Started ollama.service - Ollama Service.
Nov 26 19:47:41 ff-gpu systemd[1]: ollama.service: Main process exited, code=exited, status=203/EXEC
Nov 26 19:47:41 ff-gpu systemd[1]: ollama.service: Failed with result 'exit-code'.
Nov 26 19:47:44 ff-gpu systemd[1]: ollama.service: Scheduled restart job, restart counter is at 113.
Nov 26 19:47:44 ff-gpu systemd[1]: Started ollama.service - Ollama Service.
Nov 26 19:47:44 ff-gpu systemd[1]: ollama.service: Main process exited, code=exited, status=203/EXEC
Nov 26 19:47:44 ff-gpu systemd[1]: ollama.service: Failed with result 'exit-code'.
Nov 26 19:47:48 ff-gpu systemd[1]: ollama.service: Scheduled restart job, restart counter is at 114.
Nov 26 19:47:48 ff-gpu systemd[1]: Started ollama.service - Ollama Service.
Nov 26 19:47:48 ff-gpu systemd[1]: ollama.service: Main process exited, code=exited, status=203/EXEC
Nov 26 19:47:48 ff-gpu systemd[1]: ollama.service: Failed with result 'exit-code'.
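
A restart loop like this repeats the same few lines hundreds of times, so it can help to collapse the journal into the distinct `status=` values. The sketch below assumes bash; `explain_status` and `summarize_failures` are hypothetical helper names. Codes of 200 and above are assigned by systemd itself before the program runs, and the hints paraphrase systemd.exec(5).

```shell
#!/usr/bin/env bash
# Map a systemd "N/NAME" exit status to a short hint.
explain_status() {
  case "$1" in
    200/CHDIR) echo "could not change to WorkingDirectory" ;;
    203/EXEC)  echo "could not execute the ExecStart binary" ;;
    217/USER)  echo "could not switch to the configured User" ;;
    1/FAILURE) echo "the program started, then exited with an error of its own" ;;
    *)         echo "see the process exit codes section of systemd.exec(5)" ;;
  esac
}

# Read journal lines on stdin and print each distinct status once, with its hint.
summarize_failures() {
  local s code
  grep -o 'status=[0-9]*/[A-Z]*' | sort -u | while IFS= read -r s; do
    code=${s#status=}
    printf '%s: %s\n' "$code" "$(explain_status "$code")"
  done
}
```

Usage would look like `sudo journalctl -u ollama -n 50 | summarize_failures`.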

Let's try running the ollama command manually to see whether we can get more information:

Canceled terminal command
/data/ollama

$ sudo -u ollama /usr/bin/ollama serve

Stdout

Couldn't find '/usr/share/ollama/.ollama/id_ed25519'. Generating new private key.
Your new public key is: 

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPXQYcZzcST34AiKlMQOAKIRP7ILWd0n2vDjjHwWkFYY

Stderr

2024/11/26 19:47:58 routes.go:1197: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-11-26T19:47:58.849+08:00 level=INFO source=images.go:756 msg="total blobs: 0"
time=2024-11-26T19:47:58.849+08:00 level=INFO source=images.go:763 msg="total unused blobs removed: 0"
time=2024-11-26T19:47:58.849+08:00 level=INFO source=routes.go:1248 msg="Listening on 127.0.0.1:11434 (version 0.4.3)"
time=2024-11-26T19:47:58.851+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama1910340466/runners
time=2024-11-26T19:47:59.051+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm]"
time=2024-11-26T19:47:59.051+08:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-26T19:47:59.204+08:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-0898685e-ccfe-27f5-5ce2-1c93f67e332a library=cuda variant=v12 compute=7.5 driver=12.4 name="NVIDIA GeForce RTX 2080 Ti" total="21.7 GiB" available="21.3 GiB"

Me: no problem when run in terminal

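If it runs fine directly in the terminal but fails when run through the systemd service, this may be related to environment variables or directory permissions. Let's modify the service file and add some necessary settings: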

Edited ollama.service +2 -0 19 Open diff

Main changes:

  1. Added Type=simple to specify the service type
  2. Added WorkingDirectory=/home/ollama to make sure the service has a fixed working directory
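
The post never shows the unit file itself, so the following is a reconstruction of what it might have looked like at this step, not the author's actual file. Only Description, ExecStart, Type and WorkingDirectory come from the transcript. User/Group, the Restart settings and the [Install] target are assumptions, and `write_unit_sketch` is a hypothetical helper that writes to a scratch path rather than to /etc/systemd/system.

```shell
#!/usr/bin/env bash
# Write a sketch of the unit file to the path given as $1.
write_unit_sketch() {
  cat > "$1" <<'EOF'
[Unit]
Description=Ollama Service

[Service]
# From the transcript: unquoted path, explicit type and working directory.
Type=simple
ExecStart=/usr/bin/ollama serve
WorkingDirectory=/home/ollama
# Assumed: the transcript only confirms that an ollama user and group exist.
User=ollama
Group=ollama
# Assumed: the journal shows restarts roughly every 3 seconds.
Restart=always
RestartSec=3

[Install]
# Assumed: the unit is reported as enabled, but its target is not shown.
WantedBy=multi-user.target
EOF
}
```

Running `write_unit_sketch ./ollama.service.sketch` gives a file that can be compared with a hand-written unit before anything is copied into place.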

Let's make sure the working directory exists and has the correct permissions:

Ran terminal command
/data/ollama

$ sudo mkdir -p /home/ollama

Canceled terminal command
/data/ollama

$ sudo chown ollama:ollama /home/ollama

Your modified files:
ollama.service /data/ollama +2 -2

(At this point I stepped in and changed /home/ollama to my actual directory, /data/ollama.)

Me: llama.service: Changing to the requested working directory failed: Permission d>
a.service: Main process exited, code=exited, status=200/CHDIR
a.service: Failed with result 'exit-code'.

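It looks like the problem lies in directory permissions. The error message shows that the service cannot change to the working directory /data/ollama, because the ollama user does not have access to that directory. Let's check and fix this permission problem: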

Ran terminal command
/data/ollama

$ ls -ld /data/ollama

Stdout

drwxr-xr-x 4 ollama ollama 4096 Nov 26 19:34 /data/ollama

Ran terminal command
/data/ollama

$ ls -l /data/ollama/bin/ollama

Stdout

-rwxr-xr-x 1 ollama ollama 47605784 Nov 22 02:50 /data/ollama/bin/ollama

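It looks like the directory and the file are already owned by the ollama user, but we also need to make sure the parent directory /data has appropriate permissions: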

Ran terminal command
/data/ollama

$ ls -ld /data

Stdout

drwx------ 5 ff ff 4096 Nov 23 19:32 /data

Found the problem! The parent directory /data has permissions drwx------ (700), so only its owner, ff, has access. That is why the ollama user cannot reach the /data/ollama directory.
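
Checking one ancestor at a time generalizes to walking the whole path. The sketch below assumes bash and GNU coreutils (`stat -c`, `realpath`); `find_blocked_dirs` is a hypothetical helper. It looks only at the execute bit for "others", which matches this case because ollama is neither the owner of /data nor, as far as the `id` output shows, in its group. It ignores ACLs and group membership.

```shell
#!/usr/bin/env bash
# Print every directory on the way to a path (the path itself included, if it
# is a directory) that users other than owner and group cannot traverse.
find_blocked_dirs() {
  local dir mode
  dir=$(realpath -- "$1") || return 2
  [ -d "$dir" ] || dir=$(dirname -- "$dir")
  while :; do
    mode=$(stat -c '%a' -- "$dir") || return 2
    # The last octal digit is the "others" permission set; bit 1 is execute.
    if (( (8#$mode & 1) == 0 )); then
      printf '%s\n' "$dir"
    fi
    [ "$dir" = "/" ] && break
    dir=$(dirname -- "$dir")
  done
}
```

With the permissions shown in this session, `find_blocked_dirs /data/ollama` would be expected to print /data. On systems with util-linux, `namei -l /data/ollama` lists the permissions of every path component in one command.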

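We have two options:

  1. Add execute permission to the /data directory (recommended):

sudo chmod 711 /data

  2. Or change the service's working directory to another location that the ollama user can access.

I suggest the first option, because it keeps the existing directory layout. Let's apply this change: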

Ran terminal command
/data/ollama

$ sudo chmod 711 /data

Now let's restart the service:

Ran terminal command
/data/ollama

$ sudo systemctl restart ollama

Ran terminal command
/data/ollama

$ sudo systemctl status ollama

Stdout

● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Tue 2024-11-26 20:00:29 CST; 1s ago
    Process: 8970 ExecStart=/data/ollama/bin/ollama serve (code=exited, status=1/FAILURE)
   Main PID: 8970 (code=exited, status=1/FAILURE)
        CPU: 17ms

Now the error has changed to status code 1, which indicates an error in the program itself. Let's look at the detailed logs:

Ran terminal command
/data/ollama

$ sudo journalctl -u ollama -n 50

(31 lines above hidden)

Stdout

Nov 26 20:00:29 ff-gpu systemd[1]: Started ollama.service - Ollama Service.
Nov 26 20:00:29 ff-gpu ollama[8970]: Error: listen tcp 0.0.0.0:11434: bind: address already in use
Nov 26 20:00:29 ff-gpu systemd[1]: ollama.service: Main process exited, code=exited, status=1/FAILURE
Nov 26 20:00:29 ff-gpu systemd[1]: ollama.service: Failed with result 'exit-code'.
Nov 26 20:00:32 ff-gpu systemd[1]: ollama.service: Scheduled restart job, restart counter is at 349.
Nov 26 20:00:32 ff-gpu systemd[1]: Started ollama.service - Ollama Service.
Nov 26 20:00:32 ff-gpu ollama[8983]: Error: listen tcp 0.0.0.0:11434: bind: address already in use
Nov 26 20:00:32 ff-gpu systemd[1]: ollama.service: Main process exited, code=exited, status=1/FAILURE
Nov 26 20:00:32 ff-gpu systemd[1]: ollama.service: Failed with result 'exit-code'.
Nov 26 20:00:36 ff-gpu systemd[1]: ollama.service: Scheduled restart job, restart counter is at 350.
Nov 26 20:00:36 ff-gpu systemd[1]: Started ollama.service - Ollama Service.
Nov 26 20:00:36 ff-gpu ollama[9001]: Error: listen tcp 0.0.0.0:11434: bind: address already in use
Nov 26 20:00:36 ff-gpu systemd[1]: ollama.service: Main process exited, code=exited, status=1/FAILURE
Nov 26 20:00:36 ff-gpu systemd[1]: ollama.service: Failed with result 'exit-code'.
Nov 26 20:00:39 ff-gpu systemd[1]: ollama.service: Scheduled restart job, restart counter is at 351.
Nov 26 20:00:39 ff-gpu systemd[1]: Started ollama.service - Ollama Service.
Nov 26 20:00:39 ff-gpu ollama[9011]: Error: listen tcp 0.0.0.0:11434: bind: address already in use
Nov 26 20:00:39 ff-gpu systemd[1]: ollama.service: Main process exited, code=exited, status=1/FAILURE
Nov 26 20:00:39 ff-gpu systemd[1]: ollama.service: Failed with result 'exit-code'.

Now we see a new error: Error: listen tcp 0.0.0.0:11434: bind: address already in use

This means port 11434 is already in use. Let's see which process is using this port:

Ran terminal command
/data/ollama

$ sudo lsof -i :11434

Stdout

COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
ollama  4234 ollama    3u  IPv4  28285      0t0  TCP localhost:11434 (LISTEN)

Found it! An old ollama process (PID 4234) is still running. Let's stop this process first:

Ran terminal command
/data/ollama

$ sudo kill 4234
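
The lookup step can be scripted by pulling the listener PIDs out of the lsof table. This is a sketch with a hypothetical helper name, `listening_pids`, and it assumes the column layout shown in the output above.

```shell
#!/usr/bin/env bash
# Read `lsof -i :PORT` output on stdin and print the PID of every
# process in the LISTEN state, one per line.
listening_pids() {
  # Skip the header row; the state is the last field and the PID the second.
  awk 'NR > 1 && $NF == "(LISTEN)" { print $2 }' | sort -u
}
```

It would be used as `sudo lsof -i :11434 | listening_pids`. In this session the listener is owned by ollama and bound to localhost, which matches the `ollama serve` that was started by hand earlier, so it appears to be that instance rather than a leftover from the service.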

Now let's try starting the service again:

Ran terminal command
/data/ollama

$ sudo systemctl restart ollama

Ran terminal command
/data/ollama

$ sudo systemctl status ollama

Stdout

● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
     Active: active (running) since Tue 2024-11-26 20:01:54 CST; 7s ago
   Main PID: 9249 (ollama)
      Tasks: 10 (limit: 28703)
     Memory: 68.3M (peak: 80.4M)
        CPU: 659ms
     CGroup: /system.slice/ollama.service
             └─9249 /data/ollama/bin/ollama serve

Nov 26 20:01:54 ff-gpu systemd[1]: Started ollama.service - Ollama Service.
Nov 26 20:01:54 ff-gpu ollama[9249]: 2024/11/26 20:01:54 routes.go:1197: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Nov 26 20:01:54 ff-gpu ollama[9249]: time=2024-11-26T20:01:54.120+08:00 level=INFO source=images.go:756 msg="total blobs: 0"
Nov 26 20:01:54 ff-gpu ollama[9249]: time=2024-11-26T20:01:54.120+08:00 level=INFO source=images.go:763 msg="total unused blobs removed: 0"
Nov 26 20:01:54 ff-gpu ollama[9249]: time=2024-11-26T20:01:54.120+08:00 level=INFO source=routes.go:1248 msg="Listening on [::]:11434 (version 0.4.3)"
Nov 26 20:01:54 ff-gpu ollama[9249]: time=2024-11-26T20:01:54.121+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama1798818721/runners
Nov 26 20:01:54 ff-gpu ollama[9249]: time=2024-11-26T20:01:54.343+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm]"
Nov 26 20:01:54 ff-gpu ollama[9249]: time=2024-11-26T20:01:54.344+08:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
Nov 26 20:01:54 ff-gpu ollama[9249]: time=2024-11-26T20:01:54.485+08:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-0898685e-ccfe-27f5-5ce2-1c93f67e332a library=cuda variant=v12 compute=7.5 driver=12.4 name="NVIDIA GeForce RTX 2080 Ti" total="21.7 GiB" available="21.3 GiB"

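Great! The service has now started successfully and is running. Let's summarize the problems we solved:

  1. Fixed the permission problem on the parent directory /data by adding execute permission (711), so that the ollama user can access the /data/ollama directory
  2. Found and stopped the old ollama process that was occupying port 11434

The service is now running normally, and it:

  • is listening on 0.0.0.0:11434
  • successfully detected the NVIDIA GeForce RTX 2080 Ti GPU
  • uses CUDA v12 as the inference backend
  • is at version 0.4.3

You should now be able to use the ollama service normally. If you run into any other problems, feel free to let me know.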

Me: THNX

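You're welcome! I'm glad I could help you resolve the problem with the ollama service. The service is now configured correctly and running with GPU support. If you need any further help, just let me know!

Summary

Throughout the process I communicated only in natural language. The AI first analyzed the possible causes of the problem. Then, with my authorization, it ran the relevant commands on its own and inspected their output to verify its hypotheses. It narrowed down the cause further by modifying the configuration file, and in the end it solved the problem.

I think this is an early prototype of how we will work in the future. People only need to use their own language, and all the technical, tedious work can be delegated to AI. Programming languages, and especially their syntax, no longer matter, because as long as the AI understands the intent it can produce the correct syntax. The only thing people need to do is describe what they need, and the AI takes care of the rest.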


Posted on 2026-06-04 08:53 by ffb
