[Paper Review] Multimodal Chain-of-Thought Reasoning in Language Models (MM-CoT)

paper: Zhang, Zhuosheng, et al. "Multimodal chain-of-thought reasoning in language models." arXiv preprint arXiv:2302.00923 (2023).

link: https://arxiv.org/abs/2302.00923

[Abstract]

Analyzes how chain-of-thought (CoT) reasoning behaves in multimodal models
Builds a model that reasons by making good use of visual features
Uses a two-stage framework: rationale generation, then answer inference

1. Introduction

Existing CoT work relies too heavily on the language modality
Generating a caption from the image and reasoning over the caption plus the text loses information
Models under 100B parameters tend to hallucinate when doing CoT

2. Multimodal-CoT Analysis

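QCM->AR (answer first, then rationale) actually performs better than QCM->RA (rationale first, then answer)
Confirmed that the problem is not simply the maximum token limit
Even when split into two steps (QCM->R, then QCMR->A), performance still drops
- Error cases show that the model fails to use image features properly, so hallucinations appear during rationale generation
Using vision features instead of plain captions improves performance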
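The formats compared above (Q = question, C = context, M = multiple options, R = rationale, A = answer) can be sketched as plain string templates. This is a minimal illustration only: the `Example` class, the function names, and the exact wording of the templates ("Solution:", "BECAUSE:", "The answer is ...") are assumptions made for this sketch and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """Hypothetical container for one multiple-choice sample."""
    question: str
    context: str
    options: List[str]
    rationale: str
    answer: str  # option letter, e.g. "B"


def build_source(ex: Example, rationale: Optional[str] = None) -> str:
    """Build the text input: QCM, or QCMR when a rationale is appended."""
    letters = "ABCDEFGH"
    opts = " ".join(f"({letters[i]}) {o}" for i, o in enumerate(ex.options))
    src = f"Question: {ex.question}\nContext: {ex.context}\nOptions: {opts}"
    if rationale is not None:
        # Second step of the two-step setup: a generated rationale is appended.
        src += f"\nSolution: {rationale}"
    return src


def build_target(ex: Example, fmt: str) -> str:
    """Build the output string the model is trained to produce for a format."""
    answer = f"The answer is ({ex.answer})."
    targets = {
        "QCM->A": answer,                                 # no CoT
        "QCM->RA": f"Solution: {ex.rationale} {answer}",  # rationale, then answer
        "QCM->AR": f"{answer} BECAUSE: {ex.rationale}",   # answer, then rationale
        "QCM->R": f"Solution: {ex.rationale}",            # step 1 of two-step
        "QCMR->A": answer,                                # step 2 of two-step
    }
    return targets[fmt]
```

In the two-step setup, `build_source(ex)` paired with the `"QCM->R"` target would train the rationale step, and `build_source(ex, rationale=generated)` paired with the `"QCMR->A"` target would train the answer step.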

3. Method

MM-CoT

Process
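- Step 1. Feed text and vision as input to generate the rationale
- Step 2. Concatenate the rationale with the text and feed it, together with vision, as input to generate the answer
Architecture
- The two stage models share the same architecture and differ only in their inputs and outputs
- Gated fusion is used to mix the image and language features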
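A rough sketch of the gated fusion step, written in dependency-free Python so the arithmetic is easy to follow. It assumes single-head cross-attention with language states as queries and vision patch features as keys and values, followed by a sigmoid gate that blends the two. The function names are hypothetical, the vision features are assumed to be already projected to the language hidden size, and the weight matrices `w_l` and `w_v` stand in for parameters that would be learned in a real model.

```python
import math


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_fusion(h_lang, h_vis, w_l, w_v):
    """Fuse language states (n x d) with vision patch features (m x d).

    w_l and w_v are (d x d) gate projections (placeholders for learned weights).
    """
    d = len(h_lang[0])
    # Cross-attention: language tokens are queries, vision patches are keys/values.
    h_vis_t = [list(col) for col in zip(*h_vis)]           # (d x m)
    scores = matmul(h_lang, h_vis_t)                       # (n x m)
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    h_attn = matmul(attn, h_vis)                           # (n x d)
    # Gate: lam = sigmoid(H_lang W_l + H_attn W_v), computed element-wise.
    proj_l = matmul(h_lang, w_l)
    proj_v = matmul(h_attn, w_v)
    fused = []
    for i in range(len(h_lang)):
        row = []
        for j in range(d):
            lam = sigmoid(proj_l[i][j] + proj_v[i][j])
            # Blend: (1 - lam) * language + lam * attended vision.
            row.append((1.0 - lam) * h_lang[i][j] + lam * h_attn[i][j])
        fused.append(row)
    return fused
```

With all-zero gate weights the gate is 0.5 everywhere, so the output is simply the average of each language state and its attended vision feature; learned weights let the model shift that balance per token and per dimension.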

4. Experiments

Benchmark
- ScienceQA: large-scale multimodal science question dataset with annotated lectures and explanations
- A-OKVQA: knowledge-based visual question answering benchmark requiring a broad base of commonsense and world knowledge
Model
- T5 encoder-decoder architecture
- FLAN-Alpaca to initialize LM
- frozen ViT-large encoder to obtain vision features

5. Notes

- Better ways to make use of vision features

- A unified one-stage framework instead of two stages

- Using RL to automatically learn how much weight to put on vision versus language
