
Uni3D: Exploring Unified 3D Representation at Scale

ICLR 2024 Spotlight / Authors from Beijing Academy of Artificial Intelligence, Tsinghua U, Peking U

https://github.com/baaivision/Uni3D

Abstract

  • Image/text representation learning has advanced explosively thanks to large-scale scaling, but there is almost no work that scales 3D objects/scenes to a comparable size.
  • This work proposes Uni3D, a 3D foundation model with a unified 3D representation capability.
    • Uses a pretrained ViT as the initialization.
    • Aligns 3D point cloud features end-to-end with the image-text aligned feature space.
    • → a design that pulls representation power already learned in the 2D world into the 3D world.
    • "Just as the image/text fields achieved breakthroughs with very large models, let's scale 3D up with large models and improve performance!"
  • This lets Uni3D leverage the pretrained knowledge of 2D models and the semantic space of multimodal models such as CLIP → scaling up 3D representation.
  • The architecture stays simple; instead, the parameter count is scaled up to 1B → 3D representation ability keeps improving as the scale grows.
    • Tasks: sets new records on zero-shot classification, few-shot classification, open-world understanding, and part segmentation.

Introduction

  • The importance of 3D representation learning.
  • However, existing 3D research has remained at small scale.
    • The number of trained parameters, the data size, and the task diversity are all limited.
    • In contrast, images/text achieved performance breakthroughs through scaling up.
      • NLP → LLMs
      • vision → giant ViTs, CLIP, …
    • The goal is to bring this success to the 3D world.
    • "If we scale up models and do large-scale pretraining in 3D as well, will performance improve dramatically?"
    • There have been earlier attempts to scale 3D, but they were not sufficient.
      • 3D backbones were small, or training failed to scale properly beyond a certain size (at most 72M parameters).

        image.png

        • Compared with ULIP and OpenShape, Uni3D has far more parameters.
  • Approach
    • The 3D encoder is initialized from a 2D ViT.
    • 3D point cloud features are aligned to the image-text feature space.
    • The architecture and pretext task are simple.
      • 2D models can easily be reused as initialization.
      • CLIP/BLIP-style image-text aligned models can be used as the alignment target.
  • Scaling experiments
    • Scale is expanded along three axes:
      • model size: 6M → 1B
      • initialization source: visual self-supervised → text-supervised
      • target multimodal model: 150M → 5B
    • Performance keeps rising as scale increases along every axis.
  • Results
    • 88.2% zero-shot accuracy on ModelNet, on par with some supervised methods.
    • SOTA on several tasks, including few-shot classification, part segmentation, and open-world understanding.
    • Strong applicability to other downstream tasks as well.

image.png

⇒ The first large-scale experimental demonstration that, just as scaling drove breakthroughs in the 2D and language worlds, scaling dramatically improves performance in 3D as well.

Method

image.png

3.1. Unified 3D representation

  • The core of Uni3D is bringing the 2D ViT architecture into 3D as-is.
    • The backbone is a vanilla transformer; only the part that converts the 3D input into tokens the ViT can process is replaced.
      • patch embedding → point tokenizer!
  • Point tokenizer
    • FPS → KNN → PointNet → Transformer → 3D representation
    • FPS samples representative center points.
    • KNN groups the neighboring points around each center point so that each local region becomes one 3D patch.
    • Tiny PointNet encoder
      • Extracts a feature vector from each 3D patch.
      • Why is it needed?
        • A standard transformer expects input of the form
          • [token1, token2, token3, …]
        • whereas each patch looks like
          • patch1 = (point 1, point 2, point 3, …) → N points (count not fixed); patch2 = (point 1, point 2, point 3, …) → another N points; …
        • The PointNet step is needed to even out the point counts and preserve the spatial structure.
      • It plays the same role as patch embedding in a ViT.
    • These 3D tokens are then fed into the transformer → transformer → 3D representation is extracted (a minimal sketch of this pipeline is shown below).

Scaling up Uni3D

  • Previous failed attempts at 3D scaling up:
    • Mostly built on small datasets and stuck at small model sizes.
    • The research focus was on designing model architectures.
    • After Objaverse appeared there were scaling attempts, but the backbones were still too small.
  • Why?
    • 3D backbones are not unified, so a consistent scaling strategy cannot be applied.
    • Some backbones directly model local patterns over the entire point set (DGCNN, PointMLP, etc.).
      • Compute cost explodes as the model grows → scaling up is practically impossible.
  • Uni3D's approach
    • Other 3D backbones each require their own scaling strategy.
    • Uni3D uses the ViT architecture as-is → the already-verified scaling-up recipes can be reused.
    • Uni3D is scaled exactly the way ViT is scaled.
      • Tiny (6M), Small (23M), Base (88M), Large (307M), giant (1B)
      • Scaling up is done simply by swapping in a larger ViT (see the sketch after this list).
  • Experiments confirm that performance keeps rising as the model size grows.
    • Demonstrates that <scale = performance> also holds in 3D.
    • Computational efficiency and training stability are maintained as well.
  • Final outcome
    • The first 3D representation model with 1B parameters.
    • Multimodal alignment trained on 1 million 3D shapes, 10 million images, and 70 million texts.
    • Strong transfer performance confirmed on many downstream tasks.
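
A small sketch of what "scaling by swapping the ViT" amounts to in code. The width/depth/head numbers are the usual ViT-Tiny/Small/Base/Large/g settings and are my approximation of the five sizes named above, not values copied from the released code.

```python
import torch.nn as nn

# Approximate standard ViT configurations (assumption: the paper follows the usual ViT family).
VIT_CONFIGS = {
    "tiny":  dict(embed_dim=192,  depth=12, num_heads=3),    # ~6M params
    "small": dict(embed_dim=384,  depth=12, num_heads=6),    # ~23M
    "base":  dict(embed_dim=768,  depth=12, num_heads=12),   # ~88M
    "large": dict(embed_dim=1024, depth=24, num_heads=16),   # ~307M
    "giant": dict(embed_dim=1408, depth=40, num_heads=16),   # ~1B
}


def build_point_vit(scale: str = "base") -> nn.TransformerEncoder:
    """Plain pre-norm transformer trunk at the requested scale; the point tokenizer
    from the earlier sketch would feed its tokens into this encoder."""
    cfg = VIT_CONFIGS[scale]
    layer = nn.TransformerEncoderLayer(
        d_model=cfg["embed_dim"],
        nhead=cfg["num_heads"],
        dim_feedforward=4 * cfg["embed_dim"],
        activation="gelu",
        batch_first=True,
        norm_first=True,          # pre-norm, as in ViT
    )
    return nn.TransformerEncoder(layer, num_layers=cfg["depth"])
```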

Initializing Uni3D

  • Another problem that shows up in existing 3D pretraining
    • Making the model bigger makes the backbone hard to train: overfitting, unstable convergence, etc.
    • The usual remedy is to first pretrain with a 3D-specific pretext task.
      • Limitation: pretraining is expensive, and the data scale is too small to build a strong prior.
  • Uni3D's approach
    • Because the 3D backbone is a ViT, no 3D-specific pretraining is needed.
    • Giant pretrained models from the image/multimodal world are used directly as the initialization point.
      • Leverages the large-scale knowledge and strong representation power they have already learned.
      • e.g. 2D self-supervised models (DINO, EVA, etc.), text-image aligned models (CLIP, etc.)
      • Any transformer model can be plugged in!!
    • In short, training starts from a pretrained ViT and fine-tunes it so it applies to the 3D world (see the sketch below).
    • As a result,
      • overfitting and training instability are greatly alleviated even for large 3D backbones;
      • cross-modal contrastive learning becomes tractable even at giant model scales.

3.2. Multi-Modal Alignment

  • ulip, openshape์˜ ํŒจ๋Ÿฌ๋‹ค์ž„๊ณผ ์œ ์‚ฌํ•˜๊ฒŒ language, image, point cloud ์‚ฌ์ด์˜ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ •๋ ฌ์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•จ

Datasets

  • ๋™์ผํ•œ ์กฐ๊ฑด์—์„œ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด์„œ, openshape์ด ์ œ๊ณตํ•œ ์•™์ƒ๋ธ” 3d ๋ฐ์ดํ„ฐ์…‹์„ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•ด ํ•™์Šตํ•จ
    • objaverse, shapeNet, 3D-FUTURE, ABO
    • 4๊ฐœ์˜ ๋ฐ์ดํ„ฐ์…‹์„ ํ•ฉ์ณ ๊ฑฐ๋Œ€ 3d ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์‚ฌ์šฉ
  • ์ „์ฒ˜๋ฆฌ
    • pc 10000๊ฐœ ์ƒ˜ํ”Œ๋ง (rgb ํฌํ•จ)
    • 10๊ฐœ์˜ ๋ Œ๋”๋ง ์ด๋ฏธ์ง€ ์ƒ์„ฑ
    • openshape๊ณผ ๋™์ผํ•˜๊ฒŒ triplet์„ ๊ตฌ์„ฑํ•˜์˜€์Œ

Objective

  • Training goal: train the 3D encoder f_p so that 3D point cloud features are aligned with CLIP's image-text feature space.
  • Trainable parts: 🔥 3D encoder only; ❄️ the image/text encoders are frozen.
  • Input: triplets (point cloud, image, text)
  • Features are L2-normalized → e_p, e_i, e_t
    • so cosine similarity can be computed directly as a dot product.

image.png

  • ์ด 4๊ฐœ์˜ ์ •๋ ฌ ๋ชฉํ‘œ (openshape, ulip2์™€ ๋™์ผํ•จ)
    • 3d๋ฅผ ๊ณ ์ •, ํ…์ŠคํŠธ๋ฅผ ๋ณ€ํ™”: ์ •๋‹ต ํ…์ŠคํŠธ์™€ ๊ฐ€๊นŒ์›Œ์ง€๊ณ , ์˜ค๋‹ต ํ…์ŠคํŠธ์™€ ๋ฉ€์–ด์ง
    • ํ…์ŠคํŠธ๋ฅผ ๊ณ ์ •, 3d๋ฅผ ๋ณ€ํ™”: ์ •๋‹ต 3d์™€ ๊ฐ€๊นŒ์›Œ์ง€๊ณ , ์˜ค๋‹ต 3d์™€ ๋ฉ€์–ด์ง
    • 3d๋ฅผ ๊ณ ์ •, ์ด๋ฏธ์ง€๋ฅผ ๋ณ€ํ™”: ์ •๋‹ต ์ด๋ฏธ์ง€์™€ ๊ฐ€๊นŒ์›Œ์ง€๊ณ , ์˜ค๋‹ต ์ด๋ฏธ์ง€์™€ ๋ฉ€์–ด์ง
    • ์ด๋ฏธ์ง€๋ฅผ ๊ณ ์ •, 3d๋ฅผ ๋ณ€ํ™”: ์ •๋‹ต 3d์™€ ๊ฐ€๊นŒ์›Œ์ง€๊ณ , ์˜ค๋‹ต 3d์™€ ๋ฉ€์–ด์ง
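
A minimal sketch of this objective (my paraphrase of the standard CLIP-style symmetric InfoNCE used by ULIP/OpenShape-type methods, not the authors' exact code); only the 3D embeddings carry gradients, since the image/text embeddings come from the frozen CLIP.

```python
import torch
import torch.nn.functional as F


def alignment_loss(e_p: torch.Tensor, e_i: torch.Tensor, e_t: torch.Tensor,
                   logit_scale: float = 100.0) -> torch.Tensor:
    """e_p, e_i, e_t: (B, D) L2-normalized point / image / text embeddings of matching triplets."""
    labels = torch.arange(e_p.shape[0], device=e_p.device)   # diagonal pairs are the positives
    logits_pt = logit_scale * e_p @ e_t.t()                  # point-vs-text similarity matrix
    logits_pi = logit_scale * e_p @ e_i.t()                  # point-vs-image similarity matrix
    # four terms: point->text, text->point, point->image, image->point
    return (F.cross_entropy(logits_pt, labels) + F.cross_entropy(logits_pt.t(), labels)
            + F.cross_entropy(logits_pi, labels) + F.cross_entropy(logits_pi.t(), labels)) / 4
```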

Image-Text aligned target

  • Uni3D is not tied to a specific CLIP model; any CLIP teacher can be used (a usage sketch follows).
  • The larger the teacher CLIP, the stronger Uni3D's alignment becomes and the higher the performance.

Experiment

4.1. Zero-shot Shape Classification

  • ๋ฐ์ดํ„ฐ์…‹: ModelNet (15 ์นดํ…Œ๊ณ ๋ฆฌ), ScanObjNN (40), Objaverse-LVIS (1,156)
    • openshape์˜ ์„ธํŒ…์„ ๋”ฐ๋ฆ„
    • objaverse-lvis: 10,000 colored point ์ƒ˜ํ”Œ๋ง
    • ModelNet40: 10,000 ํฌ์ธํŠธ ์ƒ˜ํ”Œ๋ง, ์ƒ‰์€ x
    • ScanObjNN: ์ƒ‰ ์—†๋Š” 2048 ํฌ์ธํŠธ ์ƒ˜ํ”Œ๋ง, obj_only version
  • ๋ฒ ์ด์Šค๋ผ์ธ ๋ชจ๋ธ: PointCLIP, PointCLIP V2, ULIP, OpenShape
    • PointCLIP, PointCLIP V2: ํฌ์ธํŠธ ํด๋ผ์šฐ๋“œ๋ฅผ ์ด๋ฏธ์ง€์ฒ˜๋Ÿผ ํˆฌ์˜ํ•ด์„œ 2d cilp์œผ๋กœ ์ง์ ‘ ๋ถ„๋ฅ˜
    • ULIP, OpenShape: 3d ๋ฐฑ๋ณธ์„ ํ•™์Šตํ•œ ํ›„ 3d โ†’ clip์— ์ •๋ ฌ
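
A hedged sketch of the zero-shot protocol itself: embed each class name with the frozen CLIP text encoder (with a prompt template) and pick the class whose embedding is most similar to the Uni3D shape embedding. `uni3d_encode` and `clip_encode_text` are stand-in callables for the trained encoders, not actual API names.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(uni3d_encode, clip_encode_text, points, class_names):
    """points: whatever the 3D encoder expects; returns a predicted class index per shape."""
    prompts = [f"a 3D model of a {c}" for c in class_names]
    e_t = F.normalize(clip_encode_text(prompts), dim=-1)   # (num_classes, D) text embeddings
    e_p = F.normalize(uni3d_encode(points), dim=-1)        # (B, D) shape embeddings
    return (e_p @ e_t.t()).argmax(dim=-1)                  # nearest class name in the shared space
```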

image.png

  • "ensembled": trained on all four 3D datasets
  • "ensembled (no LVIS)": the same data with the LVIS subset excluded
  • In both the ensembled and no-LVIS settings, Uni3D clearly surpasses the previous SOTA.
  • The † symbol is said to mark the best score that model recorded on each benchmark (evaluation dataset).. does it refer to the larger model variant?

4.2. Few-shot Linear Probing

  • Linear probing?
    • A standard way to evaluate the quality of a model's representations.
    • Procedure
      • Freeze the trained representation model (here, Uni3D).
      • Train only a linear classifier on a small amount of labeled data.
    • Tests how well the representation has been learned, with no further training of the model itself (see the sketch after this list).
    • Assumption: the better the representation, the higher the accuracy a linear classifier can reach from little labeled data → well suited to measuring few-shot performance.
  • Performed on Objaverse-LVIS.
    • Few-shot settings with 1, 2, 4, 8, or 16 labels per class.
    • In the 1-shot setting, only one labeled sample per category is provided.
    • Unlike few-shot, zero-shot evaluation compares against text embeddings by similarity; few-shot evaluation is based on training a linear classifier.
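
A minimal linear-probing sketch: the frozen Uni3D embeddings are treated as fixed features, and only a linear classifier is fit on the few labeled examples (here scikit-learn's logistic regression, my choice of linear model).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def linear_probe(train_feats: np.ndarray, train_labels: np.ndarray,
                 test_feats: np.ndarray, test_labels: np.ndarray) -> float:
    """feats: (N, D) arrays of frozen Uni3D embeddings; in k-shot probing,
    train_feats contains only k embeddings per class."""
    clf = LogisticRegression(max_iter=1000)      # the only trainable component
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)    # few-shot classification accuracy
```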

image.png

  • ๊ฒฐ๊ณผ์ ์œผ๋กœ, ๋ชจ๋“  few-shot ์„ค์ •์—์„œ uni3d๊ฐ€ ๋‹ค๋ฅธ ๋ฒ ์ด์Šค๋ผ์ธ ๋ชจ๋ธ์„ ํฐ ํญ์œผ๋กœ ๋Šฅ๊ฐ€ํ•จ
  • ์ ์€ ๋ผ๋ฒจ ๋ฐ์ดํ„ฐ ํ™˜๊ฒฝ์—์„œ๋„ ๋›ฐ์–ด๋‚œ ์ „์ด ์„ฑ๋Šฅ์„ ๊ฐ€์ง

4.3. Open-World Understanding

  • Evaluates how well Uni3D understands real-world 3D scenes and objects.
  • Dataset: ScanNet, a large-scale 3D dataset of about 1,500 indoor scenes scanned in the real world.
  • Goal: recognize the category of each object instance in a zero-shot manner.
    • No instance segmentation is performed; only category classification is evaluated.

image.png

  • Most existing methods are additionally trained on real-world data (the ones marked TP are further trained on real-world point cloud-image-text triplets).
  • Uni3D, however, never sees real-world data and is trained only on synthetic data, yet achieves the highest zero-shot performance ⇒ it generalizes to real-world 3D.
  • Why?
    • Uni3D inherits CLIP's large-scale real-world multimodal knowledge → strong real-world generalization.
    • Thanks to the large-scale model, its representation capacity is high.

image.png

  • Instance segmentation results are already provided, and each instance is then classified zero-shot.

4.4. Open-Vocabulary / few-shot part segmentation

  • [Part segmentation]
    • In 2D, it has already been shown that transferring CLIP's vision-language knowledge to downstream tasks improves them; in 3D there is almost no such work.
    • The goal is to show that CLIP-grounded representations can also boost part segmentation performance in 3D.
    • Dataset: ShapeNet
    • Comparison of 1-shot and 2-shot experimental results.

      image.png

    • In the 1- and 2-shot settings, Uni3D beats PointBERT by a large margin.
    • Even when the baselines' training data is increased to 10-20%, Uni3D is still better in almost every case.
    • In other words, Uni3D achieves with 1-2 shots the level of performance that other models need 10-20% of the labeled data to reach.

    โ†’ ๊ทธ์ •๋„๋กœ uni3d์˜ ํ‘œํ˜„๋ ฅ์ด ๊ฐ•๋ ฅํ•ด์„œ task-specific supervision์ด ์ ์–ด๋„ task๋ฅผ ์ž˜ ์ˆ˜ํ–‰๊ฐ€๋Šฅ

  • [Open-vocabulary part segmentation]
    • Evaluates whether the model can understand part-level semantics and segment parts whose names it has never seen before.
      • Does Uni3D understand local 3D geometry + semantic cues at a fine level?
      • Can it generalize 3D part-level concepts to an open vocabulary?
    • Checks whether even fine-grained part semantics inside an object generalize to an open vocabulary.
    • The ShapeNet dataset is split into seen and unseen (the split is at the category level).
      • "Uni3D sees some part names during training and encounters the remaining part names for the first time at test time."

    image.png

    • ๊ฒฐ๊ณผ์ ์œผ๋กœ seen์—์„œ๋Š” ๋ฌผ๋ก , unseen์—์„œ๋„ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ž„
    • clip์—์„œ ์ฆ๋ฅ˜๋œ ์‹ค์„ธ๊ณ„ ์ง€์‹ ๋•๋ถ„์— ๊ฐ์ฒด ์ „์ฒด์˜ semantic๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ํŒŒํŠธ ์ˆ˜์ค€์˜ ์ •๊ตํ•œ local 3d ํŒจํ„ด๊นŒ์ง€ ํ‘œํ˜„ ๋‚ด๋ถ€์— ํ•™์Šตํ•ด๋ฒ„๋ฆผโ€ฆ

    โ‡’ Uni3D๋Š” open vocab 3d part ์ธ์‹์„ ํ•  ์ˆ˜ ์žˆ๋Š” ์ฒซ 3d foundation backbone
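
A hedged sketch of the open-vocabulary inference step. It assumes some head that yields per-point features aligned to the CLIP text space (`point_feats`), which is my simplification of however the paper produces dense features; each point is then labeled with the nearest part-name embedding, including names never seen during training.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def open_vocab_part_seg(point_feats: torch.Tensor, clip_encode_text, part_names):
    """point_feats: (N, D) per-point features in the CLIP-aligned space."""
    prompts = [f"the {p} of an object" for p in part_names]   # hypothetical prompt template
    e_t = F.normalize(clip_encode_text(prompts), dim=-1)      # (num_parts, D)
    e_p = F.normalize(point_feats, dim=-1)                    # (N, D)
    return (e_p @ e_t.t()).argmax(dim=-1)                     # (N,) part index per point
```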

4.5. Point cloud painting

image.png

  • 3d ๊ฐ์ฒด์˜ ์„ธ๋ฐ€ํ•œ semantic ํŒจํ„ด์„ ์–ผ๋งˆ๋‚˜ ์ž˜ ์ดํ•ดํ•˜๊ณ  ์žˆ๋Š”์ง€๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ์ƒˆ๋กœ์šด ์‚ฌ๋ก€๋ฅผ ์ œ์‹œํ•จ
  • point cloud painting: ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ์— ๋งž๊ฒŒ pc์— ์ƒ‰์ƒ์„ ์ตœ์ ํ™”ํ•˜๋Š” ์ž‘์—…
  • pc์˜ ์ž„๋ฒ ๋”ฉ๊ณผ text ํ”„๋กฌํ”„ํŠธ์˜ ์ž„๋ฒ ๋”ฉ ์œ ์‚ฌ๋„๊ฐ€ ์ตœ๊ณ ๊ฐ€ ๋˜๋„๋ก rgb๊ฐ’์„ ์ตœ์ ํ™”
  • ๋ฐ”๋€Œ๋Š” ๋Œ€์ƒ์€ pc์˜ rgb๊ฐ’
  • ๊ฒฐ๊ณผ๋ฅผ ๋ณด๋ฉด, prompt๊ฐ€ ํฌํ•จํ•˜๋Š” ๋ณต์žกํ•œ ์˜๋ฏธ๋ฅผ ๋ฐ˜์˜ํ•ด์„œ ์ƒ‰์„ ์ž…ํž ์ˆ˜ ์žˆ์Œ
    • uni3d๊ฐ€ contrastive learning์„ ํ†ตํ•ด์„œ ํ”„๋กฌํ”„ํŠธ ๋‹จ์œ„์˜ ์˜๋ฏธ ๊ตฌ์กฐ๊นŒ์ง€ ํ•™์Šตํ–ˆ์Œ์„ ๋ณด์—ฌ์คŒ
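
A sketch of the painting loop as I read it: the colors are the only trainable parameters, and gradient ascent maximizes the cosine similarity between the Uni3D shape embedding and the frozen CLIP text embedding. `uni3d_encode` and the precomputed `e_text` are assumed stand-ins.

```python
import torch
import torch.nn.functional as F


def paint_point_cloud(uni3d_encode, xyz: torch.Tensor, e_text: torch.Tensor,
                      steps: int = 500, lr: float = 0.01) -> torch.Tensor:
    """xyz: (B, N, 3) fixed geometry; e_text: (B, D) normalized CLIP text embedding."""
    rgb_logits = torch.zeros(xyz.shape[0], xyz.shape[1], 3, requires_grad=True)  # only trainable tensor
    opt = torch.optim.Adam([rgb_logits], lr=lr)
    for _ in range(steps):
        rgb = rgb_logits.sigmoid()                              # keep colors in [0, 1]
        e_p = F.normalize(uni3d_encode(xyz, rgb), dim=-1)       # embedding of the colored cloud
        loss = -(e_p * e_text).sum(dim=-1).mean()               # maximize cosine similarity
        opt.zero_grad(); loss.backward(); opt.step()
    return rgb_logits.sigmoid().detach()                        # final painted colors
```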

4.6. Cross-modal Retrieval

  • Using multimodal embeddings for retrieval ← the application I am working on!!
  • Since images, text, and 3D live in one comparable space, 3D shapes can be retrieved directly.
  • Image → 3D retrieval

image.png

  • ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” ์‹ค์‚ฌ์ธ๋ฐ, ์‹ค์„ธ๊ณ„ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋„ ์ž˜ ๋™์ž‘ํ•จ~~
  • ์ฒซ๋ฒˆ์งธ ์—ด์€ image to 3d ๊ฒ€์ƒ‰
  • ๋‘๋ฒˆ์งธ ์—ด์€ ๋‘ ์žฅ์˜ ์ด๋ฏธ์ง€๋ฅผ ๋„ฃ๊ณ  ์ž„๋ฒ ๋”ฉ ํ‰๊ท ์œผ๋กœ ๊ฒ€์ƒ‰
    • ์—ฌ๋Ÿฌ signal์„ ์ทจํ•ฉํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์คŒ
  • ์„ธ๋ฒˆ์งธ ์—ด์€ text-to-3d ๊ฒ€์ƒ‰
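
A sketch of the retrieval setup: precompute Uni3D embeddings for a gallery of shapes, then rank them against a query embedding from the frozen CLIP image/text encoder; averaging several query embeddings is how multiple signals get fused. All tensors here are assumed to be precomputed.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def retrieve_shapes(query_embs: torch.Tensor, gallery_embs: torch.Tensor, topk: int = 5) -> torch.Tensor:
    """query_embs: (Q, D) CLIP image or text embeddings (Q > 1 means multiple query signals);
    gallery_embs: (G, D) Uni3D shape embeddings. Returns top-k gallery indices."""
    q = F.normalize(query_embs.mean(dim=0, keepdim=True), dim=-1)   # fuse multiple queries by averaging
    g = F.normalize(gallery_embs, dim=-1)
    return (q @ g.t()).topk(topk, dim=-1).indices                   # (1, topk) most similar shapes
```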

4.7. Ablation Study

  • Default setting
    • 3D backbone: ViT-Base
    • Backbone initialization weights: EVA pretrained weights
    • CLIP teacher: EVA-CLIP-E
    • Training data: Ensembled (no LVIS)

  1. Scaling up model size
  • How much does performance improve as the model gets bigger?
  • Structurally the same transformer as ViT is used.
    • The five versions mentioned earlier: Tiny, Small, Base, Large, giant (exactly as used for images).

image.png

  • ๊ฒฐ๊ณผ์ ์œผ๋กœ ๋ชจ๋ธ ๊ทœ๋ชจ๊ฐ€ ์ปค์งˆ์ˆ˜๋ก ์„ฑ๋Šฅ ํ–ฅ์ƒ
  • ํŠนํžˆ giant ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ๋Š” ์ด์ „ 3d ์—ฐ๊ตฌ์—์„œ๋Š” ๋ถˆ๊ฐ€๋Šฅํ•œ ์ˆ˜์ค€์˜ representation ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คŒ
    1. Switching / scaling up CLIP teachers
  • uni3d์˜ ์„ฑ๋Šฅ์ด ์–ด๋–ค clip ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋А๋ƒ์— ๋”ฐ๋ผ์„œ ์–ผ๋งˆ๋‚˜ ๋‹ฌ๋ผ์ง€๋Š”๊ฐ€?
    • clip์ด ๊ฐ•๋ ฅํ•  ์ˆ˜๋ก uni3d๋„ ๊ฐ•๋ ฅํ•ด์ง€๋Š”๊ฐ€?
  • clip ๋Œ€์‹  openai-clip, openclip, eva-clip ๋“ฑ ๋Œ€๊ทœ๋ชจ clip (openclip-bigG, eva-clip-e ..)

    image.png

  • clip์ด ๊ฐ•ํ•˜๋ฉด uni3d ์„ฑ๋Šฅ์ด ์ข‹์•„์ง โ†’ ๊ฐ€์žฅ ํฐ ํฌ๊ธฐ์˜ clip (๊ฐ€์žฅ ๋ฐ‘ ํ–‰)์„ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ ์ตœ๊ณ  ์„ฑ๋Šฅ
  • teacher๊ฐ€ ๊ฐ•ํ• ์ˆ˜๋ก 3d encoder์—๊ฒŒ ์ „๋‹ฌ๋˜๋Š” semantic signal๋„ ๋” ์ •๊ตํ•˜๊ณ  ํ’๋ถ€ํ•ด์ง
  • clip ๋ชจ๋ธ์ด ๋ฐœ์ „ํ•จ์— ๋”ฐ๋ผ์„œ ๋ชจ๋ธ์„ ๊ฐˆ๊ธฐ๋งŒ ํ•˜๋ฉด uni3d๋„ ํ•จ๊ป˜ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฐ€๋Šฅ์„ฑ์„ ๋ณด์—ฌ์คŒ
    1. Initializing Transformer
  • uni3d๋ฅผ ์–ด๋–ค ๋ฐฉ์‹์œผ๋กœ ์ดˆ๊ธฐํ™”ํ•˜๋А๋ƒ๊ฐ€ ์„ฑ๋Šฅ์— ์–ด๋–ค ์˜ํ–ฅ์„ ์ฃผ๋Š”๊ฐ€?
  • ์ดˆ๊ธฐํ™” x, 2d pretrained vit (DINO, EVA), ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ clip (EVA-CLIP), EVA + freeze vit - EVA๋กœ ์ดˆ๊ธฐํ™”ํ•˜๋˜ backbone์„ freeze, fine-tuning ์—†์ด ์‚ฌ์šฉ
  • DINO, EVA-CLIP, EVA ์„ธ ๊ฐ€์ง€ ๋ชจ๋‘ ์ดˆ๊ธฐ ๊ฐ€์ค‘์น˜๋กœ ์‚ฌ์šฉํ•œ ๋’ค Uni3D ๋ฐฉ์‹์œผ๋กœ ๋‹ค์‹œ ํ•™์Šต(fine-tuning)ํ•œ ๊ฒฐ๊ณผ

    image.png

  • EVA initialization works best, while EVA + frozen ViT is the worst → a 2D pretrained backbone must not be kept frozen (fine-tuning is essential).

Conclusion

  • Uni3D is a unified framework that scales 3D models to the 1B-parameter range.
  • Instead of designing a new 3D network architecture, it takes a ViT and uses it directly as the backbone.
    • Benefits of using the ViT architecture?
      • The scaling-up strategies already established in 2D can be reused as-is.
      • 2D pretrained weights can be used for initialization.
  • Training data: a large-scale collection of 1M 3D shapes, 10M images, and 70M texts.
    • Point cloud features are aligned into the already well-aligned image-text feature space.
  • Achieves SOTA across multiple tasks → a 3D multimodal foundation model.