RL (4)
[Paper Review] Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (ReST^EM)
paper: Singh, Avi, et al. "Beyond human data: Scaling self-training for problem-solving with language models." arXiv preprint arXiv:2312.06585 (2023).
link: https://arxiv.org/abs/2312.06585
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models ..
[Paper Review] AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training (TS-LLM)
paper: Feng, Xidong, et al. "Alphazero-like tree-search can guide large language model decoding and training." arXiv preprint arXiv:2309.17179 (2023).
link: https://arxiv.org/abs/2309.17179
Recent works like Tree-of-Thought (ToT) and Reasoning via Planning (RAP) aim to augment the reasoning capabilities of LLMs by usin..
[Paper Review] Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding (PPO-MCTS)
paper: Liu, Jiacheng, et al. "Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding." First Conference on Language Modeling. 2024.
link: https://arxiv.org/abs/2309.15028
[Abstract] The value model obtained from PPO serves as useful guidance for text decoding. 1. Preliminaries: Guided Decoding. For a given goal, s_t = (w, x_.. PPO Policy objective & Value objective. Polic..
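The preview above only states the premise that a PPO value model can steer decoding. As a rough illustration of that idea (not the paper's actual Monte-Carlo Tree Search procedure), here is a minimal value-reranked decoding step; `policy_log_probs` and `value` are hypothetical callables standing in for the policy and value models.

```python
# Minimal sketch of value-guided decoding (illustrative only, not PPO-MCTS
# itself): the policy proposes top-k next tokens and a PPO-style value model
# re-scores each resulting partial sequence s_{t+1}.
# policy_log_probs(state) -> {token: log p(token | state)} and
# value(state) -> float are hypothetical callables.

def guided_decode_step(state, policy_log_probs, value, top_k=10, beta=1.0):
    """Pick the next token by policy log-prob plus beta * value of the extended state."""
    proposals = policy_log_probs(state)
    # Keep only the top-k candidates under the policy.
    candidates = sorted(proposals.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Re-rank candidates by combining log-prob with the value model's score.
    scored = [(tok, lp + beta * value(state + [tok])) for tok, lp in candidates]
    return max(scored, key=lambda kv: kv[1])[0]
```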
[Paper Review] ReFT: Reasoning with Reinforced Fine-Tuning
paper: ReFT: Reasoning with Reinforced Fine-Tuning (Trung et al., ACL 2024)
Luong Trung, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. aclanthology.org
[Abstract] Doing only SFT on CoT data leads to weak generalization. ReFT: after warming up with SFT, o..
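The excerpt sketches the ReFT recipe: warm up with SFT on CoT data, then continue training with RL (the paper uses PPO). Below is a rough two-stage loop illustrating that recipe with a generic policy-gradient update and an exact-match answer reward; all helper callables are hypothetical stand-ins, not the paper's implementation.

```python
# Rough sketch of the ReFT two-stage recipe: SFT warmup on
# (question, CoT, answer) triples, then reinforced fine-tuning with reward 1
# when the sampled CoT's final answer matches the gold answer.
# sft_step / sample_cot / extract_answer / policy_gradient_step are
# hypothetical callables; the paper itself uses PPO for the RL stage.

def reft_train(model, cot_data, sft_step, sample_cot, extract_answer,
               policy_gradient_step, warmup_epochs=2, rl_epochs=3):
    # Stage 1: supervised warmup on gold chains of thought.
    for _ in range(warmup_epochs):
        for question, gold_cot, gold_answer in cot_data:
            sft_step(model, question, gold_cot)

    # Stage 2: on-policy RL. Sample a CoT from the current model and reward
    # it only if its final answer matches the gold answer.
    for _ in range(rl_epochs):
        for question, _gold_cot, gold_answer in cot_data:
            sampled = sample_cot(model, question)
            reward = 1.0 if extract_answer(sampled) == gold_answer else 0.0
            policy_gradient_step(model, question, sampled, reward)
```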