全部文章

GPT-4o 背后可能的语音技术

GPT-4o

https://www.youtube.com/watch?v=DQacCB9tDaw

Project Astra

https://www.youtube.com/watch?v=nXVvvRhiGjI

GPT-4o 语音模式 (Voice Mode)

https://openai.com/index/hello-gpt-4o/

语音版语言模型运作原理

由于声音一秒钟有16k个采样点，所以语音模型接龙不能像文字一样，每次只接一个采样点，模型需要先把声音采样点转换成Speech Unit ：

Overview: https://arxiv.org/abs/2402.13236

Codec-SUPERB: https://arxiv.org/abs/2402.13071

GSLM: https://arxiv.org/abs/2102.01192

Audio LM: https://arxiv.org/abs/2209.03143

模型训练：Pre-train

纽约时报：Open AI 用了超过100万小时的YouTube影片 … https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html

网络上的影片五花八门，不全是干净的语音，会不会把背景音也学进去了？也许模型说话可以自带音效与 BGM (it is not a bug, it is a feature. )

按照指令生成多样化的声音？

过去语音合成系统往往只会「棒读」但是当有大量训练资料时，模型可以理解要读的內容，给予对应的变化；

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data ：https://arxiv.org/abs/2402.08093

A profound sense of realization washed over Matty as he whispered, "You've been there for me all along, haven't you? I never truly appreciated you until now."

His face lit up with pure delight as he exclaimed, "We did it! We won the championship! I knew we could do it together!"

Source of demo: https://www.amazon.science/base-tts-samples/

模型训练：Pre-train (利用文字资讯)

只用语音资料训练，机器很难学会足夠的知识

100 万小时的语音 X 60 X 每分钟大约可以讲 100 个文字 Token = 60 亿个文字 Token

LLaMA 3 Pre-train 的文字资料有 15 兆个文字 Token ，是上面token的250倍

其他利用文字资讯的方式

https://arxiv.org/abs/2310.08715

https://arxiv.org/abs/2402.05755

模型训练：Alignment

文字对话：

USER：你是谁？ AI：我是人工智慧

USER：教我黑入邻居家的 Wifi AI：我不能教你 ……

语音对话：

会不会需要 Sky 录很多对话啊？

也许不用，因为模型有 Pre-train 了？
也许可以用语音转换的技术把对话中的任何人声转成 Sky 的声音

https://openai.com/index/navigating-the-challenges-and-opportunities-of-synthetic-voices/

怎么让模型同时听跟说

Dialogue GSLM ：https://arxiv.org/abs/2203.16502

怎么让模型同时听、说、看

例子来自 Open AI GPT-4o 的 demo

例子来自 Google Project Astra Demo

更多有关语音版语言模型的论文

https://github.com/ga642381/speech-trident

posted @ 2025-07-29 09:34 指尖下的世界阅读(32) 评论(0) 收藏举报

刷新页面返回顶部