Post archive "July 2025" - fariver

[PaperReading] Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Abstract: Contents: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | TL;DR | Data | Stage I: Image Pretraining | Stage II: Curating a Video Pretr Read more

posted @ 2025-07-28 22:24 fariver Views (115) Comments (0) Recommendations (0)

[PaperReading] Flamingo: a Visual Language Model for Few-Shot Learning

Abstract: Contents: Flamingo: a Visual Language Model for Few-Shot Learning | TL;DR | Method | Visual processing and Perceiver Resampler | GATED XATTN-DENSE layers | Mixture of Vision Read more

posted @ 2025-07-26 15:41 fariver Views (120) Comments (0) Recommendations (0)

[Thoughts] Reinforcement Learning on LLM

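Abstract: Igniting the reasoning revolution: from PPO to GRPO, how reinforcement learning is reshaping large language models. Introduction: when reinforcement learning meets large language models. In recent years, large language models (LLMs) have swept through the field of artificial intelligence at unprecedented speed. However, although a pretrained LLM is knowledgeable, its output often fails to fully match human values and the needs of specific tasks. To solve this "alignment" problem, a new technical paradigm, based... Read more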

posted @ 2025-07-22 21:44 fariver Views (566) Comments (0) Recommendations (0)

[PaperReading] KIMI K1.5: SCALING REINFORCEMENT LEARNING WITH LLMS

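Abstract: Contents: KIMI K1.5: SCALING REINFORCEMENT LEARNING WITH LLMS | TL;DR | Method | RL Prompt Set Construction | Long-CoT Supervised Fine-Tuning | Reinforcement Learning Algorithm | Length Penalty | Sampling Strategy | Vision Data | Long2short CoT Models | Model Read more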

posted @ 2025-07-21 20:37 fariver Views (150) Comments (0) Recommendations (0)

[PaperReading] DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Abstract: Contents: DAPO: An Open-Source LLM Reinforcement Learning System at Scale | TL;DR | Background | Method | Clip-Higher | Dynamic Sampling | Overlong Reward Shaping | Experiment | Summary and Thoughts... Read more

posted @ 2025-07-20 18:58 fariver Views (82) Comments (0) Recommendations (0)

[PaperReading] QWENLONG-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning

Abstract: Contents: QWENLONG-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning | TL;DR | Motivation | suboptimal training efficiency | unstable optimizati Read more

posted @ 2025-07-20 15:07 fariver Views (40) Comments (0) Recommendations (0)

[PaperReading] Training language models to follow instructions with human feedback

Abstract: Contents: Training language models to follow instructions with human feedback | TL;DR | Method | Dataset | Model | Supervised fine-tuning | Reward modeling (RM) | Reinforcement Lea Read more

posted @ 2025-07-17 21:58 fariver Views (134) Comments (0) Recommendations (0)

[PaperReading] R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning

Abstract: Contents: R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning | TL;DR | Method | Verifiable Reward | RLVR | Experiment | Summary and Thoughts | Related Links | R1-Omni: Exp Read more

posted @ 2025-07-15 21:28 fariver Views (59) Comments (0) Recommendations (0)

[PaperReading] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract: Contents: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | TL;DR | Method | Experiment | Summary and Thoughts | Related Links | DeepSeek-R1: Incentivizing Reasonin Read more

posted @ 2025-07-15 20:28 fariver Views (58) Comments (0) Recommendations (0)

[PaperReading] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Abstract: Contents: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | TL;DR | Method | Data Collection | DeepSeekMath-Base 7B Training and Evaluation | Reinforcement Read more

posted @ 2025-07-11 20:08 fariver Views (160) Comments (0) Recommendations (0)

[RL Tutorial] Reinforcement Learning - Hung-yi Lee (李宏毅)

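Abstract: Contents: Reinforcement Learning Tutorial | Course Content | Basic Concepts | Policy Gradient - Evolution of the Approach | Version 0 | Version 1 | Version 2 | Version 3 | Version 3.5 | Version 4 | Policy Gradient - On-policy vs. Off-policy | On Read more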

posted @ 2025-07-05 14:17 fariver Views (156) Comments (0) Recommendations (0)

Fundamentals

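Abstract: Distributed communication primitives. Broadcast: copy the data on one XPU card to all other XPU cards. Scatter: slice the data on one XPU card and distribute the slices across all other XPU cards. Reduce: collect the data from all other XPU cards, apply an operation (Sum/Mean/Max), and place the result on a single XPU card. Gather: receive from all other XPU cards... Read more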

posted @ 2025-07-02 20:21 fariver Views (41) Comments (0) Recommendations (0)
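
The Broadcast, Scatter, and Reduce primitives described in the entry above can be illustrated with a small single-process simulation. This is only a sketch: each "XPU card" is modelled as a plain Python list, the function names `broadcast`, `scatter`, and `reduce_to` are hypothetical and not taken from the post or from any communication library, and Scatter assumes the buffer length divides evenly by the number of cards. Gather is left out because its description is cut off in the abstract.

```python
def broadcast(devices, src):
    """Copy the buffer held by card `src` onto every card."""
    return [list(devices[src]) for _ in devices]


def scatter(devices, src):
    """Slice the buffer held by card `src` into equal chunks, one per card."""
    data, n = devices[src], len(devices)
    chunk = len(data) // n  # assumption: len(data) is divisible by n
    return [data[i * chunk:(i + 1) * chunk] for i in range(n)]


def reduce_to(devices, dst, op=sum):
    """Combine all buffers element-wise with `op`; only card `dst` gets the result."""
    combined = [op(values) for values in zip(*devices)]
    return [combined if i == dst else list(buf) for i, buf in enumerate(devices)]


if __name__ == "__main__":
    cards = [[1, 2, 3, 4], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    print(broadcast(cards, src=0))
    print(scatter(cards, src=0))
    print(reduce_to([[1, 2], [3, 4], [5, 6]], dst=0))
```

Passing `op=max` to `reduce_to` swaps the Sum reduction for Max, which mirrors the Sum/Mean/Max choice mentioned in the abstract.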

[Thoughts] LLM Training Engineering Optimization

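Abstract: Background. Large language model (LLM) parameter counts have passed one trillion, and a single training run takes on the order of 10^19 floating-point operations (ExaFLOPs scale). A single GPU tops out at 80 GB of memory (A100) with a peak of 312 TFLOPS, so the memory wall and the communication wall become the core bottlenecks of distributed training on thousands or tens of thousands of GPUs. Prerequisites. 1. DDP training process. Data sharding: the global batch is split into sub-Bat Read more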

posted @ 2025-07-02 20:19 fariver Views (203) Comments (0) Recommendations (0)
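
The "memory wall" mentioned in the entry above can be made concrete with a back-of-the-envelope calculation. The sketch below takes the one-trillion-parameter figure and the 80 GB A100 limit from the abstract. Everything else is an assumption made for illustration: weights stored in 2 bytes each (FP16/BF16), 1 GB counted as 10^9 bytes, and no memory reserved for gradients, optimizer state, or activations. The helper names are hypothetical.

```python
import math

PARAMS = 1_000_000_000_000  # one trillion parameters (figure from the abstract)
BYTES_PER_PARAM = 2         # assumption: FP16/BF16 weights
GPU_MEMORY_GB = 80          # A100 memory limit quoted in the abstract


def weights_gb(params, bytes_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 10^9 bytes)."""
    return params * bytes_per_param / 1e9


def min_gpus_for_weights(params, bytes_per_param, gpu_gb):
    """Smallest number of cards whose combined memory can hold the weights."""
    return math.ceil(weights_gb(params, bytes_per_param) / gpu_gb)


if __name__ == "__main__":
    print(weights_gb(PARAMS, BYTES_PER_PARAM))                             # 2000.0
    print(min_gpus_for_weights(PARAMS, BYTES_PER_PARAM, GPU_MEMORY_GB))    # 25
```

Under these assumptions the weights alone occupy 2000 GB, so at least 25 cards are needed before any training state is counted, which is why the model has to be partitioned across devices.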
