Paper Review 19
- World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
- SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
- When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
- MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Flamingo: a Visual Language Model for Few-Shot Learning
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)
- Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
- Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
- Uni3D: Exploring Unified 3D Representation at Scale
- ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
- Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
- Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models
- Visual Instruction Tuning (LLaVA: Large Language and Vision Assistant)
- EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning