What happens when you set temperature to 0 on an LLM inference API?

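I use all sorts of LLM APIs, both hosted and locally deployed, and there is one parameter, temperature, that mostly gets ignored: people either set it to 0.8 without much thought (I suspect that is what most people do...) or just leave it at the default. What the parameter does is clear enough: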

A lower LLM temperature value (close to 0) produces more deterministic and focused outputs, ideal for tasks requiring factual accuracy, such as summarization or translation

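Recently I have been building a small agent that breaks down and refines complex user requirements, and I set a fairly low temperature (0.2). Then a thought struck me: if I set it to 0, would the model's output become stable? My intuition said no, but I went and looked into what actually happens anyway.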

The conclusion first: the model's output will not be completely stable.

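Temperature is not the only thing that affects the model's output; there are other sources of randomness, such as race conditions in multithreaded execution or the state of the random number generator (reference). Put simply, a temperature of 0 only means that the next token is chosen greedily, i.e. the highest-scoring token is picked at every step. Greedy selection is only as repeatable as the logits it is applied to: floating-point addition is not associative, so if parallel execution adds the same terms in a different order from run to run, the logits can differ in their last bits, and a near-tie between two tokens can resolve differently.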
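To make the floating-point point concrete, here is a minimal pure-Python sketch. It is not how any particular inference engine computes logits; ordered_sum and greedy_pick are hypothetical helpers, and the two-token "logits" are made up for illustration. It only shows that adding the same three numbers in a different order gives a different result, and that this is enough to change a greedy choice when two candidates are nearly tied.

```python
def ordered_sum(values):
    """Add floats strictly left to right, like a sequential reduction."""
    total = 0.0
    for v in values:
        total += v
    return total


def greedy_pick(logits):
    """Return the index of the largest logit (the first index wins a tie)."""
    best = 0
    for i in range(1, len(logits)):
        if logits[i] > logits[best]:
            best = i
    return best


terms = [0.1, 0.2, 0.3]
forward = ordered_sum(terms)             # (0.1 + 0.2) + 0.3
backward = ordered_sum(reversed(terms))  # (0.3 + 0.2) + 0.1
print(forward == backward)  # False: same terms, different rounding

# Hypothetical setup: token 0 has a fixed logit of 0.6, while token 1's
# logit is the sum above, computed in whichever order the hardware chose.
pick_forward = greedy_pick([0.6, forward])
pick_backward = greedy_pick([0.6, backward])
print(pick_forward, pick_backward)  # 1 0
```

The explicit loop is deliberate: Python's built-in sum may use compensated summation for floats in recent versions, which would hide the effect.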

Looking at it at the code level, using the nanoGPT project covered earlier, inference looks like this:

    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

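Here, logits, _ = self(idx_cond) runs a forward pass and produces the logits (unnormalized scores, not yet probabilities) for the next token.
logits = logits[:, -1, :] / temperature takes the logits at the last position and divides them by temperature.
Ignore the top_k part.
probs = F.softmax(logits, dim=-1) applies softmax to the scaled logits to turn them into probabilities, and idx_next = torch.multinomial(probs, num_samples=1) then samples the next token from that distribution.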
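The sampling step walked through above can be sketched without PyTorch. One assumption to flag loudly: the nanoGPT function divides by temperature, so it has no defined behaviour at 0. The sketch below treats temperature == 0 as a separate greedy (argmax) case, which is a common convention in serving code but is not something the nanoGPT snippet itself does. The names softmax and sample_next_token are illustrative, not part of nanoGPT.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature, rng):
    """Pick a token index from raw logits.

    Assumption: temperature == 0 means greedy decoding, handled explicitly
    because dividing the logits by zero is not meaningful.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    probs = softmax(scaled)
    # Draw one index, weighted by the probabilities.
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [1.0, 3.0, 2.0]
greedy = sample_next_token(logits, 0, rng)     # always the argmax
sampled = sample_next_token(logits, 1.0, rng)  # depends on the RNG state
print(greedy, sampled)
```

With temperature 0 the random number generator is never consulted, so any remaining run-to-run variation has to come from the logits themselves.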

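So temperature's role is limited to scaling the logits before the token is sampled: the higher the temperature, the flatter the probability distribution and the greater the randomness. Note that this particular function divides by temperature, so it cannot be used with temperature=0 as written; greedy decoding has to be handled as a separate case that takes the argmax of the logits.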
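The flattening effect is easy to check numerically. The following is a small sketch with made-up logits; softmax_with_temperature and peak are hypothetical helper names, and the three temperature values are arbitrary.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def peak(logits, temperature):
    """Probability assigned to the most likely token."""
    return max(softmax_with_temperature(logits, temperature))


logits = [2.0, 1.0, 0.0]
for t in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, the top token's probability approaches 1, which is why very low temperatures behave almost like greedy decoding; as it rises, the probabilities move toward uniform.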

Of course, different models may differ in the details of their implementation; I won't dig through each one's code here.

posted @ 2024-12-19 14:57 zion03


CD Yang
