[Paper Review] Android in the Zoo: Chain-of-Action-Thought for GUI Agents (AITZ)

paper: Zhang, Jiwen, et al. "Android in the zoo: Chain-of-action-thought for gui agents." arXiv preprint arXiv:2403.02713 (2024)

link: https://arxiv.org/abs/2403.02713

1. Introduction

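Prior work focuses on surface-level details such as coordinates rather than on capturing the underlying logic
Chain-of-Action-Thought (CoAT) uses a screen description, action think, next action, and action result
To improve the performance of small models, the authors build AITZ, a dataset processed from AITW, and train on it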

2. Chain-of-Action-Thought (CoAT)

Screen description
- Describes the main elements of the given screenshot
- screen type, primary apps, widgets ...
Action Think
- Considers the user query, the current screen, and the history information together to infer the possible actions for achieving the goal
Next Action Description
- Describes the operation on a UI element or screen function, e.g. "click on the shopping cart icon"
Action Result
- Links the current observation to the next observation that follows from taking the action
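
To make the four components concrete, here is a minimal sketch of how one CoAT step could be held in memory and serialized as text. The class name `CoATStep`, the field names, and the `render_step` helper are assumptions made for illustration; they are not the paper's data format or prompt template.

```python
from dataclasses import dataclass


@dataclass
class CoATStep:
    """One step of a Chain-of-Action-Thought trace (illustrative field names)."""
    screen_description: str       # main elements of the current screenshot
    action_think: str             # reasoning over the query, screen, and history
    next_action_description: str  # e.g. "click on the shopping cart icon"
    action_result: str            # how the action leads to the next observation


def render_step(query: str, step: CoATStep) -> str:
    """Serialize one step as plain text, in the order the components are listed above."""
    return "\n".join([
        f"Query: {query}",
        f"Screen: {step.screen_description}",
        f"Think: {step.action_think}",
        f"Next action: {step.next_action_description}",
        f"Result: {step.action_result}",
    ])


# Hypothetical example values, not taken from the dataset.
step = CoATStep(
    screen_description="Shopping app home page with a search bar and a cart icon",
    action_think="The user wants to review the cart, so the cart page must be opened",
    next_action_description="click on the shopping cart icon",
    action_result="The cart page opens and lists the selected items",
)
print(render_step("show me what is in my cart", step))
```

Keeping the components as separate fields makes it easy to drop one of them from the rendered text, which is the kind of comparison an ablation over CoAT components would need.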

3. Android in the Zoo (AITZ)

Data collection
- Cleans and samples the AITW dataset
  - using clustering, annotators, and so on
- Collects episodes covering 5147 unique instructions in total
Semantic Annotation
- GPT-4V generates the screen description, action think, next action description, and action result
- Verified by experts

4. Experimental Setup

Baseline Models
- CogAgent
- Auto-UI
Metrics
- Atomic Metrics
  - action-matching score: both the action type and the action details match
- Episodic Metrics
  - goal progress: the relative position of the first error in the sequence, i.e. where the first mismatch occurs
    - e.g. if the reference is [a,b,c,d] and the prediction is [a,b,e], the first error occurs at the third position
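
The two metrics above can be sketched as follows. This is a simplified reading, not the paper's evaluation code: actions are assumed to be `(type, detail)` tuples compared by exact equality, and goal progress is assumed to be the number of correct steps before the first error divided by the reference length. The function names are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed representation: (action type, action detail).
Action = Tuple[str, str]


def action_match(pred: Action, gold: Action) -> bool:
    """Atomic metric: both the action type and the action details must match."""
    return pred[0] == gold[0] and pred[1] == gold[1]


def goal_progress(pred_seq: Sequence[Action], gold_seq: Sequence[Action]) -> float:
    """Episodic metric: fraction of the reference completed before the first error.

    A prediction that ends early counts as an error at the first missing step.
    """
    if not gold_seq:
        return 0.0
    correct = 0
    for pred, gold in zip(pred_seq, gold_seq):
        if not action_match(pred, gold):
            break
        correct += 1
    return correct / len(gold_seq)


# The example from the text: reference [a,b,c,d], prediction [a,b,e].
gold = [("click", "a"), ("click", "b"), ("click", "c"), ("click", "d")]
pred = [("click", "a"), ("click", "b"), ("click", "e")]
print(goal_progress(pred, gold))  # 0.5: the first error is at the third position
```

Under this reading, every step after the first mismatch is ignored, so two predictions that fail at the same position get the same score no matter what they do afterwards.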

5. Experiments

Zero-shot Evaluation
Fine-tuning Evaluation
Ablation study
Qualitative Analysis
- Providing the history as descriptions yields better performance than providing only coordinates and actions
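
The difference between the two kinds of history can be illustrated with a small sketch. The dictionary keys (`action_type`, `x`, `y`, `description`) and both helper names are assumptions for illustration only; the paper's actual input format may differ.

```python
from typing import Dict, List, Union

Step = Dict[str, Union[str, float]]


def history_as_coordinates(steps: List[Step]) -> str:
    """History given only as action types and screen coordinates."""
    return "; ".join(
        f"{s['action_type']}({s['x']:.2f}, {s['y']:.2f})" for s in steps
    )


def history_as_descriptions(steps: List[Step]) -> str:
    """History given as textual action descriptions."""
    return "; ".join(str(s["description"]) for s in steps)


# Hypothetical two-step history.
steps = [
    {"action_type": "click", "x": 0.50, "y": 0.25,
     "description": "click on the search bar"},
    {"action_type": "type", "x": 0.50, "y": 0.25,
     "description": "type 'running shoes' into the search bar"},
]
print(history_as_coordinates(steps))
print(history_as_descriptions(steps))
```

The coordinate form says where the agent acted but not why, while the description form carries the intent of each past step in words.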

Notes

It seems hard to conclude from goal progress alone that the method is effective for decision-sequence problems
Textual descriptions, as in CoT, do seem clearly more effective than raw actions and coordinates
