Multimodal AI

AI that can understand or generate multiple modalities: text, images, audio, video, and 3D. Examples: GPT-4o, Claude 3.5 Sonnet, Gemini 2.5, Sora 2.

Main categories: vision-language (CLIP, Flamingo, GPT-4V), audio (Whisper, VALL-E), video (Sora, Veo), and 3D (NeRF, Gaussian Splatting). Unified models have become increasingly dominant since 2024.

Also known as: AI multimodal

Definition

Multimodal AI refers to systems that can understand, generate, or both understand and generate more than one modality, such as text, images, audio, video, 3D, and sensor data.

Milestones

  • 2021 — CLIP (OpenAI) — text-image
  • 2021 — DALL-E — text-to-image
  • 2023 — GPT-4V — vision input
  • 2024 — GPT-4o — unified multimodal
  • 2024 — Sora — text-to-video
  • 2024 — Claude 3 — vision
  • 2024 — Gemini 1.5 — long context multimodal
  • 2025 — Gemini 2.5 — unified
  • 2025 — Veo 3 — video + audio
  • 2025 — Sora 2 — high-fidelity
  • 2026 — Veo 3.5 — high-fidelity
