[Repost] Am I allowed to say that? I kinda hate Reinforcement Learning
Discussion
All of my machine learning work experience has been in supervised learning. I appreciate how simple it is to build and test a Torch model: I don't have to worry about adding new layers or tweaking the dataset. Reinforcement learning is a different story. Recently I had the "pleasure" of experiencing its workflow.
First, you can't train a good model without parallelizing the environments. That not only demands a capable CPU, it also eats more GPU memory, because all those states have to be stored.
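For concreteness, here is a minimal sketch of what that parallelization boilerplate typically looks like, assuming Gymnasium's vector API with SyncVectorEnv and CartPole-v1 as a stand-in task (none of this is the poster's actual setup):

```python
# Minimal sketch of environment parallelization with Gymnasium's vector API.
# CartPole-v1 and num_envs are placeholders; N env copies step in lockstep
# and return batched observations/rewards.
import gymnasium as gym

num_envs = 8
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(num_envs)]
)

obs, info = envs.reset(seed=0)             # obs has shape (num_envs, obs_dim)
for _ in range(100):
    actions = envs.action_space.sample()   # one action per parallel env
    obs, rewards, terminated, truncated, info = envs.step(actions)
    # A rollout buffer would store obs/actions/rewards here; with many envs
    # and long rollouts, that buffer is exactly the memory cost complained about.
envs.close()
```

Swapping SyncVectorEnv for AsyncVectorEnv runs each copy in its own process, which is where the CPU demand comes from.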
Second, building your own model is a nightmare. I'm talking about the current SOTA, actor-critic style models. You have to train two models that depend on each other, which makes the training loss jump around like crazy. And I still don't understand how to actually compute the loss, let alone backpropagate it, since reinforcement learning has no clear "right" or "wrong" answer. To me it feels like magic.
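For readers wondering what that loss can look like in practice, here is a minimal advantage actor-critic (A2C-style) sketch: the critic's bootstrapped target stands in for the missing "right answer", and the actor is trained on a policy-gradient term weighted by the advantage. The network shape, the one-step return, and the 0.5 weight are illustrative assumptions, not the poster's model:

```python
# Minimal sketch of an advantage actor-critic (A2C-style) loss in PyTorch.
# All inputs are batched tensors collected from rollouts.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.policy = nn.Linear(64, n_actions)  # actor head: action logits
        self.value = nn.Linear(64, 1)           # critic head: state value

    def forward(self, obs):
        h = self.body(obs)
        return self.policy(h), self.value(h).squeeze(-1)

def a2c_loss(model, obs, actions, rewards, next_obs, dones, gamma=0.99):
    logits, values = model(obs)
    with torch.no_grad():
        _, next_values = model(next_obs)
        # Bootstrapped one-step target replaces the missing "correct answer".
        targets = rewards + gamma * next_values * (1.0 - dones)
    advantages = targets - values                               # better or worse than expected?
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    actor_loss = -(log_probs * advantages.detach()).mean()      # policy gradient term
    critic_loss = advantages.pow(2).mean()                      # value regression term
    return actor_loss + 0.5 * critic_loss                       # single scalar to backprop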
Finally, every notebook example I've come across uses Gym to create the environments, but that becomes nearly pointless the moment you want to write your own reward or change some of the input features fed to the model in step(). The only QUESTIONABLE advantage RL seems to have over supervised learning is the ability to adapt to chaotically changing real-time data.
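To be fair to Gym, custom rewards and tweaked input features are usually handled with wrappers rather than by editing step() directly. A minimal sketch, with made-up shaping and clipping values, might look like this:

```python
# Sketch: two standard ways to customize an env without abandoning Gym(nasium).
# The per-step penalty and the clipping range are made-up examples.
import gymnasium as gym
import numpy as np

class ShapedReward(gym.RewardWrapper):
    """Overrides only the reward computation of an existing env."""
    def reward(self, reward):
        return reward - 0.01          # illustrative per-step penalty

class TweakedObs(gym.ObservationWrapper):
    """Modifies the features the model sees from step()/reset()."""
    def observation(self, obs):
        return np.clip(obs, -5.0, 5.0)  # illustrative feature clipping

env = TweakedObs(ShapedReward(gym.make("CartPole-v1")))
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```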
I'm starting to understand why everyone prefers supervised learning.
posted on 2024-12-14 12:35 by Angry_Panda