The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Ways to train:
PPO

  • Based on a critic (value) network that estimates advantages.

DPO

  • Based on a human-preference dataset (optimizes directly on preference pairs, no critic needed).

GRPO

  • Rewarded by group: advantages are computed relative to a group of sampled responses, so no critic is needed (see the sketch after this list).
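
To make the GRPO bullet concrete, here is a minimal illustrative sketch (not code from the survey) of group-relative advantage computation; the reward values are made up.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: normalize each sampled response's reward
    against the mean/std of its own group, so no learned critic is needed."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Rewards for 4 completions sampled from the same prompt (toy numbers).
print(grpo_advantages([1.0, 0.0, 0.5, 1.0]))
# PPO would instead estimate advantages with a separate critic (value network);
# DPO skips explicit rewards and optimizes directly on preference pairs.
```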

Agentic RL Components:
Plan

  • External: the LLM proposes candidate actions, RL modules evaluate them, and MCTS guides the final action selection (see the sketch after this list).
  • Internal: fine-tune the LLM itself (modify the model's parameters) so planning becomes a built-in capability.
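
A toy sketch of the external-planning flow, assuming hypothetical `llm_propose` and `value_score` callables; a greedy pick stands in for the full MCTS search.

```python
def plan_step(state, llm_propose, value_score, n_candidates=5):
    """One external-planning step: the LLM proposes candidate actions, an
    RL-trained value module scores them, and the best-scoring one is chosen.
    (A real system would run MCTS over these candidates instead of a greedy pick.)"""
    candidates = llm_propose(state, n=n_candidates)
    return max(candidates, key=lambda a: value_score(state, a))

# Toy usage with stand-in callables:
propose = lambda state, n: [f"action_{i}" for i in range(n)]
score = lambda state, action: len(action)   # pretend value estimate
print(plan_step("start", propose, score))   # all candidates tie, so the first is returned
```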

Memory

  • Explicit: human-readable text, e.g. stored summaries of past interactions (see the sketch after this list).
  • Implicit: not human-readable; token-based and internal to the model.
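
A small illustrative sketch of explicit memory; the class is hypothetical and `summarize` is an assumed helper (in practice an LLM call).

```python
class ExplicitMemory:
    """Explicit memory: human-readable summaries stored as text the agent can re-read.
    (Implicit memory, by contrast, lives in hidden states or weights and is not
    directly readable.)"""

    def __init__(self, summarize):
        self.summarize = summarize      # assumed helper, e.g. an LLM summarization call
        self.entries = []

    def write(self, interaction: str) -> None:
        self.entries.append(self.summarize(interaction))

    def read(self) -> str:
        return "\n".join(self.entries)  # concatenated back into the agent's prompt

# Trivial stand-in summarizer just to keep the sketch runnable:
mem = ExplicitMemory(summarize=lambda text: text[:60])
mem.write("User asked for a flight to Tokyo; the agent booked seat 12A.")
print(mem.read())
```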

Perception

  • Grounding-Driven: strengthen the model's ability to retrieve and ground relevant information.
  • Tool-Driven: provide tools that let the model operate on or edit images.
  • Generation-Driven: generate sketches or intermediate images to assist the model's thinking.

Self-correction

  • Verbal: prompt-based self-critique and revision (see the sketch after this list).
  • Internal: gradient-based (updates the model's parameters).
  • Iterative: search-guided (MCTS) or execution-guided curriculum generation (low-level -> high-level).
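
A minimal sketch of the verbal (prompt-based) variant, assuming a hypothetical text-in/text-out `llm` callable:

```python
def self_correct(task: str, llm, max_rounds: int = 3) -> str:
    """Verbal self-correction: generate an answer, ask the model to critique it,
    and revise until the critique says it is OK (or the round budget runs out)."""
    answer = llm(f"Solve the task:\n{task}")
    for _ in range(max_rounds):
        critique = llm(f"Task:\n{task}\nAnswer:\n{answer}\n"
                       "List any errors, or reply 'OK' if the answer is correct.")
        if critique.strip() == "OK":
            break
        answer = llm(f"Task:\n{task}\nPrevious answer:\n{answer}\n"
                     f"Critique:\n{critique}\nWrite a corrected answer.")
    return answer
```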

Tool-integrated RL

  1. Different from a "plain-text generator": the model interleaves tool calls with text generation (see the sketch after this list).
  2. Better emergent task-solving ability.
  3. Stronger task-oriented, multi-step capability (chaining A -> B -> C).
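
A rough sketch of how a tool-integrated loop differs from a plain-text generator; the `TOOL:` convention and the `llm` and `tools` callables are assumptions for illustration.

```python
import json

def run_with_tools(prompt: str, llm, tools: dict, max_steps: int = 8) -> str:
    """Interleave model generation with tool calls: tool results are appended to
    the context so the model can condition on them before answering."""
    context = prompt
    out = ""
    for _ in range(max_steps):
        out = llm(context)
        if out.startswith("TOOL:"):                       # e.g. 'TOOL: {"name": "search", "args": {...}}'
            call = json.loads(out[len("TOOL:"):])
            result = tools[call["name"]](**call["args"])  # execute the named tool
            context += f"\n[tool {call['name']} -> {result}]"
        else:
            return out                                    # final plain-text answer
    return out
```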

RAG RL

  • Like a database lookup, but the agent learns to fetch the data relevant to the specific query (see the sketch below).
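
An illustrative retrieval step; the bag-of-characters "embedding" is purely a stand-in so the sketch runs, and a real system would use a learned encoder.

```python
import numpy as np

def retrieve(query, corpus, embed, k=2):
    """Return the k passages most similar to the query, so the agent fetches
    only data relevant to the current question rather than a fixed dump."""
    q = embed(query)
    scored = sorted(((float(np.dot(q, embed(doc))), doc) for doc in corpus), reverse=True)
    return [doc for _, doc in scored[:k]]

# Toy stand-in embedding (character counts), not a real retriever:
embed = lambda s: np.array([s.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"], float)
docs = ["GRPO uses group rewards", "PPO uses a critic", "DPO uses preference pairs"]
print(retrieve("which method uses a critic?", docs, embed))
```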

Problem of Integrating Slow Reasoning Mechanisms into Agentic Reasoning:

  • Stability
  • Excessive latency / overthinking

    Way to solve: test-time scaling (adaptively allocating inference-time compute).


Ways to support long-horizon reasoning:

  • Integrate process-based supervision with final outcome rewards (see the sketch after this list).
  • Extend preference optimization from single-turn responses to multi-step segments.
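
One simple way to combine the two signals is sketched below; the weighting `alpha` is an assumption for illustration, not a value from the survey.

```python
def trajectory_reward(step_scores, outcome_correct, alpha=0.3):
    """Blend dense per-step (process) supervision with a sparse final outcome reward."""
    process = sum(step_scores) / max(len(step_scores), 1)   # average per-step score
    outcome = 1.0 if outcome_correct else 0.0               # did the trajectory succeed?
    return alpha * process + (1 - alpha) * outcome

print(trajectory_reward([0.8, 0.6, 0.9], outcome_correct=True))   # toy numbers
```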

Two different reward methods:

  • Process reward: better suited to long horizons, but may cause overthinking.
  • Outcome reward: less prone to reward hacking, but the signal is sparser.
