[Paper Review] Don’t throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding (PPO-MCTS)

maotter 2024. 12. 11. 17:38

paper: Liu, Jiacheng, et al. "Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding." First Conference on Language Modeling. 2024.

link: https://arxiv.org/abs/2309.15028

[Abstract]

The value model obtained from PPO training can serve as useful guidance during text decoding.

1. Preliminaries

Guided Decoding
- For a given goal, score the partial output s_t = (w, x_<t) (the prompt w followed by the tokens x_<t generated so far) with an evaluation function, and continue decoding based on that score
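
A minimal sketch of the idea, assuming a greedy variant in which every candidate extension is scored and the best one is kept. The names `guided_greedy_decode`, `propose`, and `evaluate` are hypothetical and only stand in for a language model and an evaluation function.

```python
from typing import Callable, List, Sequence


def guided_greedy_decode(
    prompt: List[str],
    propose: Callable[[List[str]], Sequence[str]],
    evaluate: Callable[[List[str]], float],
    max_steps: int,
) -> List[str]:
    """Extend the partial output one token at a time, keeping the best-scored extension."""
    state = list(prompt)
    for _ in range(max_steps):
        candidates = propose(state)
        if not candidates:
            break
        # Score each extended partial output with the evaluation function.
        state = max((state + [tok] for tok in candidates), key=evaluate)
    return state


# Toy stand-ins (assumptions): two candidate tokens and a sentiment-like score.
def toy_propose(state: List[str]) -> Sequence[str]:
    return ["good", "bad"]


def toy_evaluate(state: List[str]) -> float:
    return state.count("good") - state.count("bad")


print(guided_greedy_decode(["the"], toy_propose, toy_evaluate, max_steps=3))
```

PPO-MCTS replaces this one-step greedy choice with a tree search, but the role of the evaluation function is the same.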
PPO
- PPO optimizes two objectives: a policy objective and a value objective
- The policy objective maximizes a surrogate objective
- The value objective minimizes the difference between the return and the estimated value
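
For reference, the two objectives can be written in the standard clipped-PPO form. This is the textbook formulation, not necessarily the exact variant used in the paper; the symbols \hat{A}_t (advantage estimate), G_t (empirical return), and \epsilon (clip range) are notation assumed here.

```latex
\begin{align}
\rho_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \\
L^{\text{policy}}(\theta) &= \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right] \\
L^{\text{value}}(\phi) &= \mathbb{E}_t\left[\left(V_\phi(s_t) - G_t\right)^2\right]
\end{align}
```

The policy term is maximized and the value term is minimized; V_\phi is the value model that PPO-MCTS later reuses at decoding time.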

2. Method

1. Select

Selection uses the PUCT algorithm.
The prior probability comes from the PPO policy model.
Q(s,a) is the Q-function of edge (s,a), derived from the PPO value model.
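
As a rough illustration, the sketch below uses the AlphaZero-style PUCT score Q(s,a) + c_puct * p(a|s) * sqrt(N(s)) / (1 + N(s,a)). The exact variant and constants in the paper may differ; `puct_select`, `c_puct`, taking N(s) as the sum of child visit counts, and the toy numbers are all assumptions made for this example.

```python
import math
from typing import Dict


def puct_select(
    q: Dict[str, float],
    prior: Dict[str, float],
    visits: Dict[str, int],
    c_puct: float = 1.0,
) -> str:
    """Return the action maximizing Q(s,a) plus a prior-weighted exploration bonus."""
    total_visits = sum(visits.values())  # assumed stand-in for N(s)

    def score(action: str) -> float:
        bonus = c_puct * prior[action] * math.sqrt(total_visits) / (1 + visits[action])
        return q[action] + bonus

    return max(q, key=score)


# Toy numbers (assumptions): "a" has the higher Q, "b" has a high prior and no visits.
q = {"a": 0.5, "b": 0.4}
prior = {"a": 0.2, "b": 0.8}
visits = {"a": 3, "b": 0}
print(puct_select(q, prior, visits))
```

With these numbers the unvisited, high-prior action "b" wins; setting `c_puct=0` removes the exploration bonus and selection falls back to the highest Q.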

2. Expand

The prior policy distribution is computed and the top-k actions are generated as child nodes.
Each child node's V_bar is zero-initialized.
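
A minimal sketch of the expansion step, assuming the policy distribution is available as a token-to-probability dict. The `Node` class and its field names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    prior: float = 1.0    # p(a|s) from the policy model
    visit_count: int = 0
    v_bar: float = 0.0    # zero-initialized, as described above
    children: Dict[str, "Node"] = field(default_factory=dict)


def expand(node: Node, policy_probs: Dict[str, float], k: int) -> None:
    """Create child nodes for the k most probable next tokens."""
    top_k = sorted(policy_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    for token, prob in top_k:
        node.children[token] = Node(prior=prob)


root = Node()
expand(root, {"great": 0.5, "fine": 0.3, "awful": 0.2}, k=2)
print(list(root.children))
```

Restricting children to the top-k tokens keeps the branching factor small, since the full vocabulary is far too large to search.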

3. Evaluate

No Monte-Carlo rollout is performed; V(s*) is evaluated directly with the value model.
Q(s*,a) for each child edge is initialized to V(s*).
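
The evaluation step could be sketched as follows. `evaluate_leaf` and `value_fn` are hypothetical names; `value_fn` stands in for a forward pass of the PPO value model.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def evaluate_leaf(
    state: List[str],
    child_actions: Sequence[str],
    value_fn: Callable[[List[str]], float],
) -> Tuple[float, Dict[str, float]]:
    """Score a leaf with the value model and seed its child edges with that score."""
    v = value_fn(state)  # one model call replaces a full rollout to the end of the text
    q_init = {action: v for action in child_actions}
    return v, q_init


# Toy value function (assumption): longer partial outputs score higher.
v, q_init = evaluate_leaf(["the", "movie"], ["was", "is"], lambda s: float(len(s)))
print(v, q_init)
```

Seeding Q(s*,a) with V(s*) rather than zero means unvisited edges start from the parent's estimate instead of looking artificially bad during selection.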

4. Backup

Inference

Once the simulations finish, the next action is decoded by sampling among the root node's children in proportion to their visit counts.
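
A small sketch of visit-count-proportional decoding. The `temperature` parameter is an assumption added for illustration (the value 1.0 gives plain proportional sampling), and the function names are hypothetical.

```python
import random
from typing import Dict


def visit_distribution(visits: Dict[str, int], temperature: float = 1.0) -> Dict[str, float]:
    """Turn root-child visit counts into a probability distribution."""
    weights = {a: n ** (1.0 / temperature) for a, n in visits.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one child must have been visited")
    return {a: w / total for a, w in weights.items()}


def sample_action(visits: Dict[str, int], rng: random.Random, temperature: float = 1.0) -> str:
    """Sample the next token according to the visit-count distribution."""
    probs = visit_distribution(visits, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


print(visit_distribution({"great": 3, "fine": 1}))
print(sample_action({"great": 3, "fine": 1}, random.Random(0)))
```

Visit counts summarize the whole search, so tokens that selection kept returning to get proportionally more probability mass.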

3. Experiments

Tasks
- sentiment steering
- toxicity reduction
- knowledge introspection
- helpful and harmless chatbots
Baselines
- direct decoding from the same PPO policy model (nucleus sampling)
- best-of-N decoding (N = 20, 50)
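
For comparison, the best-of-N baseline can be sketched as below. `sample_fn` and `reward_fn` are hypothetical stand-ins for sampling a full completion from the policy and scoring it; which scorer the paper uses for ranking is not stated in these notes.

```python
import random
from typing import Callable


def best_of_n(
    sample_fn: Callable[[random.Random], str],
    reward_fn: Callable[[str], float],
    n: int,
    rng: random.Random,
) -> str:
    """Sample n complete outputs and return the one with the highest score."""
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=reward_fn)


# Toy stand-ins (assumptions): three canned completions and a lookup-table score.
options = ["it was bad", "it was ok", "it was great"]
scores = {"it was bad": -1.0, "it was ok": 0.0, "it was great": 1.0}
print(best_of_n(lambda r: r.choice(options), scores.__getitem__, n=20, rng=random.Random(0)))
```

Best-of-N only scores finished outputs, whereas PPO-MCTS applies guidance at every decoding step through the value model.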

4. Notes

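It is interesting that the value model trained during PPO is not discarded but reused later as guidance at decoding time.
MCTS is used for look-ahead, and evaluating nodes directly with the value model instead of running rollouts saves time and cost.
It may be worth thinking about ways to keep using MCTS with the value model while cutting its time and cost.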
