Agentic RL Study Notes
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Ways to train:
PPO
- Relies on a learned critic (value network) to estimate advantages.
DPO
- Trained directly on a human-preference dataset, without a separate reward model.
GRPO
- Rewards are normalized within a group of sampled responses to form advantages, so no critic is needed (see the sketch after this list).
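As a concrete anchor, here is a minimal sketch of the GRPO idea: rewards inside one group of responses sampled for the same prompt are normalized into advantages (no critic), which then feed a PPO-style clipped objective. Function names and the epsilon constant are illustrative choices, not taken from the survey.

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantages: normalize each reward against its own group.

    group_rewards: scalar rewards for G responses sampled from the same prompt.
    No critic network is involved.
    """
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_surrogate(advantage, ratio, eps=0.2):
    """PPO-style clipped objective (to be maximized) for one response.

    ratio = pi_new(a|s) / pi_old(a|s); eps is the clipping range.
    """
    return min(ratio * advantage, np.clip(ratio, 1 - eps, 1 + eps) * advantage)

# Example: 4 sampled answers to one prompt, rewarded 1 if correct else 0.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # correct answers get positive advantage, wrong ones negative
```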
Agentic RL Components:
Plan
- External: the LLM proposes candidate actions, RL-trained modules evaluate them, and MCTS guides the final action choice (see the sketch after this list).
- Internal: fine-tune the LLM itself, so planning ability is written into the model's parameters.
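A toy sketch of the external variant, with `propose` standing in for the LLM and `value` for an RL-trained scorer; a real system would expand and back up values over an MCTS tree rather than take a single greedy pick.

```python
from typing import Callable, List

def external_plan_step(
    propose: Callable[[str], List[str]],   # LLM: state -> candidate actions
    value: Callable[[str, str], float],    # learned value/reward module
    state: str,
) -> str:
    """One external planning step: the LLM proposes, a value module scores,
    and the best-scored action is taken (greedy stand-in for MCTS guidance)."""
    candidates = propose(state)
    return max(candidates, key=lambda a: value(state, a))

# Toy stand-ins for the proposer and the value model.
propose = lambda s: ["search the web", "ask a clarifying question", "answer now"]
value = lambda s, a: {"search the web": 0.7,
                     "ask a clarifying question": 0.5,
                     "answer now": 0.2}[a]
print(external_plan_step(propose, value, "user asks an ambiguous question"))
```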
Memory
- explicit: human-readable text, e.g. stored summaries of past interactions (see the sketch after this list).
- implicit: not human-readable; encoded in tokens or model parameters as an internal capability.
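A minimal sketch of an explicit (human-readable) memory: write short text summaries, read them back by keyword overlap. The class and the retrieval rule are illustrative; practical systems usually retrieve with embeddings.

```python
class ExplicitMemory:
    """Human-readable memory: store text summaries and retrieve the ones
    that share the most words with the current query (toy overlap scoring)."""

    def __init__(self):
        self.summaries = []

    def write(self, summary: str):
        self.summaries.append(summary)

    def read(self, query: str, k: int = 2):
        q = set(query.lower().split())
        ranked = sorted(self.summaries,
                        key=lambda s: len(q & set(s.lower().split())),
                        reverse=True)
        return ranked[:k]

mem = ExplicitMemory()
mem.write("User prefers concise answers.")
mem.write("Last session: debugged a CUDA out-of-memory error.")
print(mem.read("why did my CUDA job run out of memory?"))
```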
Perception
- Grounding-Driven: strengthen the model's ability to locate and retrieve the relevant parts of the input.
- Tool-Driven: give the model tools that let it operate on or edit images, e.g. crop or zoom (see the sketch after this list).
- Generation-Driven: generate intermediate sketches/images that assist the model's reasoning.
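A toy example of a tool-driven perception step, assuming a hypothetical "zoom" tool built on Pillow's crop/resize: the agent can call it to re-inspect a region of the observation at higher resolution.

```python
from PIL import Image

def zoom_tool(image: Image.Image, box):
    """Hypothetical 'zoom in' perception tool: crop a region of interest
    and enlarge it so the model can inspect fine detail on a second pass."""
    left, upper, right, lower = box
    region = image.crop(box)
    return region.resize(((right - left) * 2, (lower - upper) * 2))

# Toy usage with a blank image standing in for a real observation.
img = Image.new("RGB", (640, 480), color="white")
zoomed = zoom_tool(img, (100, 100, 200, 200))
print(zoomed.size)  # (200, 200)
```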
Self-correction
- Verbal: prompt-based; the model critiques and revises its own output in natural language (see the sketch after this list).
- Internal: gradient-based; the correction behaviour is trained into the model's weights.
- Iterative: search-guided (e.g. MCTS) or execution-guided curriculum generation (low-level to high-level tasks).
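A sketch of the verbal (prompt-based) variant: draft, self-critique, revise, all through prompting. `llm` is a hypothetical text-in/text-out callable, not an API from the survey.

```python
def verbal_self_correct(llm, task: str, max_rounds: int = 3) -> str:
    """Prompt-based self-correction loop: draft, self-critique, revise."""
    answer = llm(f"Solve the task:\n{task}")
    for _ in range(max_rounds):
        critique = llm(f"Task: {task}\nAnswer: {answer}\n"
                       "List any mistakes, or reply OK if it is correct.")
        if critique.strip() == "OK":
            break
        answer = llm(f"Task: {task}\nPrevious answer: {answer}\n"
                     f"Critique: {critique}\nWrite a corrected answer.")
    return answer

# Toy stub model: always approves the first draft.
stub = lambda prompt: "OK" if "List any mistakes" in prompt else "42"
print(verbal_self_correct(stub, "What is 6*7?"))  # 42
```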
Tool-Integrated RL
- Differs from a plain-text generator: the policy interleaves text with tool calls during rollouts (see the sketch after this list).
- Better emergent task-solving ability.
- Stronger task-oriented chaining of steps (A -> B -> C).
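A minimal sketch of a tool-integrated rollout, assuming a made-up `<calc>...</calc>` call format and a stub model: the environment executes the call and feeds the result back into the context before generation continues.

```python
import re

def run_with_tools(llm, task: str, max_turns: int = 4) -> str:
    """Tool-integrated rollout: the model may emit <calc>expr</calc> calls,
    the environment executes them, and the result is appended to the context."""
    context = task
    output = ""
    for _ in range(max_turns):
        output = llm(context)
        call = re.search(r"<calc>(.*?)</calc>", output)
        if not call:
            return output                  # final answer, no tool use
        result = str(eval(call.group(1)))  # toy calculator; never eval untrusted input
        context += f"\n{output}\nTool result: {result}"
    return output

# Toy stub model: first asks the calculator, then answers with the result.
def stub(ctx):
    if "Tool result:" in ctx:
        return "Answer: " + ctx.split("Tool result: ")[-1]
    return "<calc>17*23</calc>"

print(run_with_tools(stub, "What is 17*23?"))  # Answer: 391
```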
RAG RL
- Works like querying a database, but the agent learns to fetch only the data relevant to the current query (see the sketch after this list).
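A toy retriever to make the idea concrete: the agent issues a query as an action and gets back only the top-k relevant passages rather than the whole corpus; in RAG-style RL, the reward on the final answer shapes when and what to retrieve. The overlap scoring is illustrative only.

```python
def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Toy retriever: rank passages by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(corpus,
                  key=lambda p: len(q & set(p.lower().split())),
                  reverse=True)[:k]

corpus = [
    "GRPO normalizes rewards within a group of sampled responses.",
    "PPO relies on a critic network for advantage estimation.",
    "DPO is trained directly on preference pairs.",
]
print(retrieve("how does grpo compute its reward baseline", corpus, k=1))
```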
Problem of Integrating Slow Reasoning Mechanisms into Agentic Reasoning:
- Stability: excessive latency and overthinking.
Way to solve: test-time scaling (see the sketch below).
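One common test-time scaling recipe is best-of-N sampling with a verifier under a fixed sample budget; a toy sketch, with `generate` and `score` as hypothetical stand-ins for the policy and the verifier.

```python
import random

def best_of_n(generate, score, prompt: str, budget: int = 4) -> str:
    """Test-time scaling via best-of-N: spend a fixed sample budget at
    inference and keep the candidate the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(budget)]
    return max(candidates, key=score)

# Toy usage: random drafts scored by length (stand-in for a real verifier).
print(best_of_n(lambda p: random.choice(["short draft", "a much longer draft"]),
                lambda c: len(c), "prove the claim", budget=4))
```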
Ways to support long-horizon reasoning:
- Integration of process-based supervision with final outcome rewards (see the sketch after this list).
- Extending preference optimization from single-turn responses to multi-step segments.
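A sketch of the first point: mix dense per-step (process) rewards with a final outcome reward and compute per-step returns for one trajectory. The weighting `beta` and the reward values are illustrative assumptions, not values from the survey.

```python
def combined_return(process_rewards, outcome_reward, gamma=1.0, beta=0.5):
    """Mix dense process rewards with a final outcome reward and return
    the discounted return G_t for every step of one trajectory."""
    rewards = [beta * r for r in process_rewards]
    rewards[-1] += (1.0 - beta) * outcome_reward   # outcome paid at the last step
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Three reasoning steps; the last step also receives the task-level outcome.
print(combined_return([0.2, 0.5, 1.0], outcome_reward=1.0))
```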
Two different reward methods:
- Process reward: denser and better suited to long horizons, but may encourage overthinking.
- Outcome reward: sparser, but less prone to reward hacking.