The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Ways to train:
PPO

  • Based on a critic (value) network that estimates advantages.

DPO

  • Based on a human-preference dataset (optimizes directly on preference pairs, no critic needed).

GRPO

  • Rewarded by group: advantages are computed relative to a group of sampled responses, so no critic is needed (see the sketch after this list).
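
To make the GRPO bullet concrete, here is a minimal illustrative sketch (not code from the survey) of group-relative advantage computation; the reward values are made up.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: normalize each sampled response's reward
    against the mean/std of its own group, so no learned critic is needed."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Rewards for 4 completions sampled from the same prompt (toy numbers).
print(grpo_advantages([1.0, 0.0, 0.5, 1.0]))
# PPO would instead estimate advantages with a separate critic (value network);
# DPO skips explicit rewards and optimizes directly on preference pairs.
```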

Agentic RL Components:
Plan

  • External: the LLM proposes candidate actions, RL modules evaluate them, and MCTS guides the final action selection (see the sketch after this list).
  • Internal: fine-tune the LLM itself (modify the model's parameters) so planning becomes a built-in capability.
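
A toy sketch of the external-planning flow, assuming hypothetical `llm_propose` and `value_score` callables; a greedy pick stands in for the full MCTS search.

```python
def plan_step(state, llm_propose, value_score, n_candidates=5):
    """One external-planning step: the LLM proposes candidate actions, an
    RL-trained value module scores them, and the best-scoring one is chosen.
    (A real system would run MCTS over these candidates instead of a greedy pick.)"""
    candidates = llm_propose(state, n=n_candidates)
    return max(candidates, key=lambda a: value_score(state, a))

# Toy usage with stand-in callables:
propose = lambda state, n: [f"action_{i}" for i in range(n)]
score = lambda state, action: len(action)   # pretend value estimate
print(plan_step("start", propose, score))   # all candidates tie, so the first is returned
```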

Memory

  • Explicit: human-readable text, e.g. stored summaries of past interactions (see the sketch after this list).
  • Implicit: not human-readable; token-based and internal to the model.
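
A small illustrative sketch of explicit memory; the class is hypothetical and `summarize` is an assumed helper (in practice an LLM call).

```python
class ExplicitMemory:
    """Explicit memory: human-readable summaries stored as text the agent can re-read.
    (Implicit memory, by contrast, lives in hidden states or weights and is not
    directly readable.)"""

    def __init__(self, summarize):
        self.summarize = summarize      # assumed helper, e.g. an LLM summarization call
        self.entries = []

    def write(self, interaction: str) -> None:
        self.entries.append(self.summarize(interaction))

    def read(self) -> str:
        return "\n".join(self.entries)  # concatenated back into the agent's prompt

# Trivial stand-in summarizer just to keep the sketch runnable:
mem = ExplicitMemory(summarize=lambda text: text[:60])
mem.write("User asked for a flight to Tokyo; the agent booked seat 12A.")
print(mem.read())
```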

Perception

  • Grounding-Driven: strengthen the model's ability to retrieve and ground relevant information.
  • Tool-Driven: provide tools that let the model operate on or edit images.
  • Generation-Driven: generate sketches or intermediate images to assist the model's thinking.

Self-correction

  • Verbal: prompt-based self-critique and revision (see the sketch after this list).
  • Internal: gradient-based (updates the model's parameters).
  • Iterative: search-guided (MCTS) or execution-guided curriculum generation (low-level -> high-level).
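
A minimal sketch of the verbal (prompt-based) variant, assuming a hypothetical text-in/text-out `llm` callable:

```python
def self_correct(task: str, llm, max_rounds: int = 3) -> str:
    """Verbal self-correction: generate an answer, ask the model to critique it,
    and revise until the critique says it is OK (or the round budget runs out)."""
    answer = llm(f"Solve the task:\n{task}")
    for _ in range(max_rounds):
        critique = llm(f"Task:\n{task}\nAnswer:\n{answer}\n"
                       "List any errors, or reply 'OK' if the answer is correct.")
        if critique.strip() == "OK":
            break
        answer = llm(f"Task:\n{task}\nPrevious answer:\n{answer}\n"
                     f"Critique:\n{critique}\nWrite a corrected answer.")
    return answer
```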

Tool-integrated RL

  1. Different from a "plain-text generator": the model interleaves tool calls with text generation (see the sketch after this list).
  2. Better emergent task-solving ability.
  3. Stronger task-oriented, multi-step capability (chaining A -> B -> C).
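
A rough sketch of how a tool-integrated loop differs from a plain-text generator; the `TOOL:` convention and the `llm` and `tools` callables are assumptions for illustration.

```python
import json

def run_with_tools(prompt: str, llm, tools: dict, max_steps: int = 8) -> str:
    """Interleave model generation with tool calls: tool results are appended to
    the context so the model can condition on them before answering."""
    context = prompt
    out = ""
    for _ in range(max_steps):
        out = llm(context)
        if out.startswith("TOOL:"):                       # e.g. 'TOOL: {"name": "search", "args": {...}}'
            call = json.loads(out[len("TOOL:"):])
            result = tools[call["name"]](**call["args"])  # execute the named tool
            context += f"\n[tool {call['name']} -> {result}]"
        else:
            return out                                    # final plain-text answer
    return out
```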

RAG RL

  • Like a database lookup, but the agent learns to fetch the data relevant to the specific query (see the sketch below).
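
An illustrative retrieval step; the bag-of-characters "embedding" is purely a stand-in so the sketch runs, and a real system would use a learned encoder.

```python
import numpy as np

def retrieve(query, corpus, embed, k=2):
    """Return the k passages most similar to the query, so the agent fetches
    only data relevant to the current question rather than a fixed dump."""
    q = embed(query)
    scored = sorted(((float(np.dot(q, embed(doc))), doc) for doc in corpus), reverse=True)
    return [doc for _, doc in scored[:k]]

# Toy stand-in embedding (character counts), not a real retriever:
embed = lambda s: np.array([s.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"], float)
docs = ["GRPO uses group rewards", "PPO uses a critic", "DPO uses preference pairs"]
print(retrieve("which method uses a critic?", docs, embed))
```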

Problem of Integrating Slow Reasoning Mechanisms into Agentic Reasoning:

  • Stability
  • Excessive latency / overthinking

    Way to solve: test-time scaling (adaptively allocating inference-time compute).


Ways to support long-horizon reasoning:

  • Integrate process-based supervision with final outcome rewards (see the sketch after this list).
  • Extend preference optimization from single-turn responses to multi-step segments.
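
One simple way to combine the two signals is sketched below; the weighting `alpha` is an assumption for illustration, not a value from the survey.

```python
def trajectory_reward(step_scores, outcome_correct, alpha=0.3):
    """Blend dense per-step (process) supervision with a sparse final outcome reward."""
    process = sum(step_scores) / max(len(step_scores), 1)   # average per-step score
    outcome = 1.0 if outcome_correct else 0.0               # did the trajectory succeed?
    return alpha * process + (1 - alpha) * outcome

print(trajectory_reward([0.8, 0.6, 0.9], outcome_correct=True))   # toy numbers
```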

Two different reward methods:

  • Process reward: better suited to long horizons, but may cause overthinking.
  • Outcome reward: less prone to reward hacking, but the signal is sparser.
