Posts

Flamingo: a Visual Language Model for Few-Shot Learning

Google DeepMind, NeurIPS 2022, "Flamingo: a Visual Language Model for Few-Shot Learning" 💡 Keeps existing large models frozen and injects visual information through Gated Cross-Attention, so that new, untrained visual tas...
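The gated cross-attention idea in the summary above can be sketched in a few lines. This is a single-head, NumPy-only illustration; the function name, weight shapes, and scalar gate are assumptions for the sketch, not Flamingo's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text, visual, w_q, w_k, w_v, gate=0.0):
    """text: (T, d) tokens from the frozen LM; visual: (V, d) vision features."""
    q, k, v = text @ w_q, visual @ w_k, visual @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, V) attention weights
    # tanh gate initialized at 0: the frozen LM's stream is untouched at first
    return text + np.tanh(gate) * (attn @ v)
```

With `gate=0.0` the block acts as an identity on the text stream, which is what lets training start from the frozen LM's unchanged behavior; the gate is then learned so visual information is injected gradually.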

Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)

NeurIPS 2023 oral. Abstract: the first systematic study to investigate the design choices of LLMs within the LLaVA framework in a controlled setting. LLaVA's original **fully connected vision-language connector** is far more powerful and data...
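The "vision-language connector" discussed above is simply a projection from vision-encoder features into the LLM's embedding space. A NumPy sketch of both variants, assuming illustrative dimensions and names (LLaVA-1.5 actually uses a GELU MLP; ReLU is used here for brevity):

```python
import numpy as np

def linear_connector(vision_feats, w, b):
    # original LLaVA: a single linear projection into the LLM embedding space
    return vision_feats @ w + b

def mlp_connector(vision_feats, w1, b1, w2, b2):
    # LLaVA-1.5: a two-layer MLP connector (GELU in the paper; ReLU here)
    h = np.maximum(vision_feats @ w1 + b1, 0.0)
    return h @ w2 + b2
```

Both map `(num_vision_tokens, vision_dim)` features to `(num_vision_tokens, llm_dim)` token embeddings; the paper's finding is that the extra nonlinearity of the MLP variant improves multimodal performance.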

Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

Abstract: proposes Chain-of-Visual-Thought (COVT), a new framework for overcoming the limitations of VLMs. Problem: existing VLMs excel at linguistic reasoning, but struggle on tasks that require dense visual perception, such as spatial reasoning or geometric recognition; they reduce visual information to a limited tex...

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Abstract: MLLMs have advanced across many VQA tasks, but their interpretability is weak, and they struggle with complex visual inputs where the answer-bearing region is small. To address this, the work **collects and presents a large-scale visual CoT dataset**: 438k question-an...

Lecture 14: Reasoning

Lecture notes. Reasoning: what is it? "The process of using information you already know as clues to discover facts you did not know"; the ability to find the links between pieces of information and think logically. See also: second letter concatenation; the first...
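The "second letter concatenation" task mentioned above is a standard toy probe for multi-step symbolic reasoning: presumably, take the second letter of each word and join them. A minimal sketch (the function name is ours):

```python
def second_letter_concatenation(words):
    # take the 2nd character of each word and concatenate them
    return "".join(w[1] for w in words)
```

Such tasks are easy to compute programmatically but tend to require a model to spell out intermediate steps (chain-of-thought) rather than answer in one shot.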

Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

https://github.com/Tencent-Hunyuan/Hunyuan3D-2.1 Abstract: 3D AIGC (AI-generated content) has diverse application areas. Many models have appeared, but collecting, processing, and training on 3D data remains complex, so the field is still accessible only to researchers/developers/designers. Hunyua...

Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models

CVPR 2025 Workshop. Newly proposes the task of emotion interpretation, together with a benchmark dataset (+ evaluation criteria). Focuses on the emotion of "a single person", not the emotion of the whole scene. No training, evaluation only (the novelty is the evaluation dataset + metrics). Abstract: existing emotion analysis...

LLaVA: Large Language and Vision Assistant

Visual Instruction Tuning, NeurIPS 2023 → notable for adding image understanding to an LLM. (Reference: [Paper Review] LLaVA, LLaVA-1.5) Abstract: instruction tuning LLMs with machine-generated instruction-following data ... new...