Posts

Flamingo: a Visual Language Model for Few-Shot Learning

Google DeepMind, NeurIPS 2022, "Flamingo: a Visual Language Model for Few-Shot Learning" 💡 Keeps existing large models frozen and injects visual information through Gated Cross-Attention, so that new, untrained visual tas...
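The gated cross-attention idea in the summary above can be sketched in a few lines. This is a single-head, NumPy-only illustration; the function name, weight shapes, and scalar gate are assumptions for the sketch, not Flamingo's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text, visual, w_q, w_k, w_v, gate=0.0):
    """text: (T, d) tokens from the frozen LM; visual: (V, d) vision features."""
    q, k, v = text @ w_q, visual @ w_k, visual @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, V) attention weights
    # tanh gate initialized at 0: the frozen LM's stream is untouched at first
    return text + np.tanh(gate) * (attn @ v)
```

With `gate=0.0` the block acts as an identity on the text stream, which is what lets training start from the frozen LM's unchanged behavior; the gate is then learned so visual information is injected gradually.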

Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)

NeurIPS 2023 oral. Abstract: the first systematic study to investigate the design choices of LLMs within the LLaVA framework in a controlled setting. LLaVA's original **fully connected vision-language connector** is far more powerful and data...
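The "vision-language connector" discussed above is simply a projection from vision-encoder features into the LLM's embedding space. A NumPy sketch of both variants, assuming illustrative dimensions and names (LLaVA-1.5 actually uses a GELU MLP; ReLU is used here for brevity):

```python
import numpy as np

def linear_connector(vision_feats, w, b):
    # original LLaVA: a single linear projection into the LLM embedding space
    return vision_feats @ w + b

def mlp_connector(vision_feats, w1, b1, w2, b2):
    # LLaVA-1.5: a two-layer MLP connector (GELU in the paper; ReLU here)
    h = np.maximum(vision_feats @ w1 + b1, 0.0)
    return h @ w2 + b2
```

Both map `(num_vision_tokens, vision_dim)` features to `(num_vision_tokens, llm_dim)` token embeddings; the paper's finding is that the extra nonlinearity of the MLP variant improves multimodal performance.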

Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

Abstract: proposes Chain-of-Visual-Thought (COVT), a new framework for overcoming the limitations of VLMs. Problem: existing VLMs excel at linguistic reasoning, but struggle on tasks that require dense visual perception, such as spatial reasoning or geometric recognition; they reduce visual information to a limited tex...

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Abstract: MLLMs have advanced across many VQA tasks, but their interpretability is weak, and they struggle with complex visual inputs where the answer-bearing region is small. To address this, the work **collects and presents a large-scale visual CoT dataset**: 438k question-an...

Lecture 14: Reasoning

Lecture notes. Reasoning: what is it? "The process of using information you already know as clues to discover facts you did not know"; the ability to find the links between pieces of information and think logically. See also: second letter concatenation; the first...
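The "second letter concatenation" task mentioned above is a standard toy probe for multi-step symbolic reasoning: presumably, take the second letter of each word and join them. A minimal sketch (the function name is ours):

```python
def second_letter_concatenation(words):
    # take the 2nd character of each word and concatenate them
    return "".join(w[1] for w in words)
```

Such tasks are easy to compute programmatically but tend to require a model to spell out intermediate steps (chain-of-thought) rather than answer in one shot.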

Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

https://github.com/Tencent-Hunyuan/Hunyuan3D-2.1 Abstract: 3D AIGC (AI-generated content) has diverse application areas. Many models have appeared, but collecting, processing, and training on 3D data remains complex, so the field is still accessible only to researchers/developers/designers. Hunyua...

Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models

CVPR 2025 Workshop. Newly proposes the task of emotion interpretation, together with a benchmark dataset (+ evaluation criteria). Focuses on the emotion of "a single person", not the emotion of the whole scene. No training, evaluation only (the novelty is the evaluation dataset + metrics). Abstract: existing emotion analysis...

LLaVA: Large Language and Vision Assistant

Visual Instruction Tuning, NeurIPS 2023 → notable for adding image understanding to an LLM. (Reference: [Paper Review] LLaVA, LLaVA-1.5) Abstract: instruction tuning LLMs with machine-generated instruction-following data ... new...