Paper Review 19
- World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
- SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
- When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
- MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Flamingo: a Visual Language Model for Few-Shot Learning
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)
- Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
- Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
- Uni3D: Exploring Unified 3D Representation at Scale
- ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
- Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
- Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models
- Visual Instruction Tuning (LLaVA: Large Language and Vision Assistant)
- EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning