Step into the frontier of artificial intelligence with this advanced course designed to explore the latest models powering visual and multimodal intelligence. From foundational mathematical tools to state-of-the-art architectures, you'll gain the skills to understand and build systems that interpret images, text, and more, just like today's leading AI models.



Modern AI Models for Vision and Multimodal Understanding
This course is part of the Computer Vision Specialization

Instructor: Tom Yeh
What you'll learn
Apply Support Vector Machines (SVMs) with nonlinear kernels and Fourier transforms to analyze and process visual data.
Use probabilistic reasoning and implement Recurrent Neural Networks (RNNs) to model temporal sequences and contextual dependencies in visual data.
Explain the principles of transformer architectures and how Vision Transformers (ViT) perform image classification and visual understanding tasks.
Implement CLIP for multimodal learning, and utilize diffusion models to generate high-fidelity images.
Details to know

18 assignments
Build your subject-matter expertise
- Learn new concepts from industry experts
- Gain a foundational understanding of a subject or tool
- Develop job-relevant skills with hands-on projects
- Earn a shareable career certificate

There are 4 modules in this course
Welcome to Modern AI Models for Vision and Multimodal Understanding, the third course in the Computer Vision Specialization. In this first module, you'll explore foundational mathematical tools used in modern AI models for vision and multimodal understanding. You'll begin with Support Vector Machines (SVMs), learning how linear and radial basis function (RBF) kernels define decision boundaries and how support vectors influence classification. Then, you'll dive into the Fourier Transform, starting with 1D signals and progressing to 2D applications. You'll learn how to move between time/spatial and frequency domains using the Discrete Fourier Transform (DFT) and its inverse, and how these transformations reveal patterns and structures in data. By the end of this module, you'll understand how SVMs and Fourier analysis contribute to feature extraction, signal decomposition, and model interpretability in AI systems.
What's included
14 videos · 7 readings · 4 assignments
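
To make the module's two tools concrete, here is a minimal sketch (illustrative only, not course code) using scikit-learn and NumPy: an RBF-kernel SVM fit on toy data where no linear boundary exists, followed by a 1D DFT/inverse DFT round trip. The data, `gamma`, and signal frequency are made-up assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data whose label depends on distance from the origin, so no
# linear boundary separates the classes; an RBF kernel can.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print("number of support vectors:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))

# 1D Discrete Fourier Transform: a 5 Hz sine sampled at 100 Hz for 1 s
# shows a single dominant peak at 5 Hz in the frequency domain.
fs = 100
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.fft.fft(signal)
freqs = np.fft.fftfreq(len(t), d=1 / fs)
half = len(t) // 2
print("dominant frequency:", freqs[:half][np.argmax(np.abs(spectrum[:half]))], "Hz")

# The inverse DFT moves back to the time domain, recovering the signal.
assert np.allclose(np.fft.ifft(spectrum).real, signal)
```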
This module invites you to explore how probability theory and sequential modeling power modern AI systems. You'll begin by examining how conditional and joint probabilities shape predictions in language and image models, and how the chain rule enables structured generative processes. Then, you'll transition to recurrent neural networks (RNNs), learning how they handle sequential data through hidden states and feedback loops. You'll compare RNNs to feedforward models, explore architectures like one-to-many and sequence-to-sequence, and address challenges like vanishing gradients. By the end, you'll understand how probabilistic reasoning and temporal modeling combine to support tasks ranging from text generation to autoregressive image synthesis.
What's included
15 videos · 2 readings · 5 assignments
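
As a small illustration of both ideas in this module, the NumPy sketch below (not course code; all dimensions and weights are arbitrary assumptions) runs a vanilla RNN forward pass, where a hidden state carries context across time steps, and then applies the chain rule of probability to an autoregressive factorization.

```python
import numpy as np

# Minimal vanilla RNN forward pass. The hidden state h carries context
# across time steps via the recurrence:
#   h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 6

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))  # a toy input sequence
h = np.zeros(hidden_dim)                    # initial hidden state
for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)  # feedback loop over time
print("final hidden state:", h.round(3))

# Chain rule of probability: an autoregressive model factorizes a
# sequence as p(x_1..x_T) = prod_t p(x_t | x_<t), so log-probs add.
step_probs = np.array([0.5, 0.25, 0.8])        # p(x_t | x_<t) per step
print("joint probability:", np.exp(np.log(step_probs).sum()))  # 0.1
```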
This module explores how attention-based architectures have reshaped the landscape of deep learning for both language and vision. You'll begin by unpacking the mechanics of the Transformer, including self-attention, multi-head attention, and the encoder-decoder structure that enables parallel sequence modeling. Then, you'll transition to Vision Transformers (ViTs), where images are tokenized and processed using the same principles that revolutionized NLP. Along the way, you'll examine how normalization, positional encoding, and projection layers contribute to model performance. By the end, you'll understand how Transformers and ViTs unify sequence and spatial reasoning in modern AI systems.
What's included
15 videos · 2 readings · 5 assignments
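
The core operation shared by Transformers and ViTs is scaled dot-product attention. Below is a minimal NumPy sketch of self-attention over a handful of "patch tokens"; real models add learned Q/K/V projections, multiple heads, and positional encodings, and the sizes here are arbitrary assumptions.

```python
import numpy as np

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values

# A ViT flattens an image into patch tokens; here, 4 tokens of dim 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = attention(tokens, tokens, tokens)  # self-attention: Q = K = V
print(out.shape)  # (4, 8): each token attends to every token
```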
In this module, you'll explore two transformative approaches in multimodal and generative AI. First, you'll dive into CLIP, a model that learns a shared embedding space for images and text using contrastive pre-training. You'll see how CLIP enables zero-shot classification by comparing image embeddings to textual descriptions, without needing labeled training data. Then, you'll shift to diffusion models, which generate images through a gradual denoising process. You'll learn how noise prediction, time conditioning, and reverse diffusion combine to produce high-quality samples. This module highlights how foundational models can bridge modalities and synthesize data with remarkable flexibility.
What's included
11 videos · 2 readings · 4 assignments
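
To illustrate both mechanisms, here is a hypothetical sketch: CLIP-style zero-shot classification by cosine similarity between an image embedding and text-prompt embeddings (random vectors stand in for real encoder outputs), plus one forward-diffusion step mixing a sample with Gaussian noise under an assumed schedule value.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Zero-shot classification, CLIP-style: embed the image and several text
# prompts into a shared space, then pick the most similar prompt. No
# labeled training data for these classes is needed.
image_emb = rng.normal(size=dim)
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = rng.normal(size=(len(prompts), dim))

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

sims = normalize(text_embs) @ normalize(image_emb)  # cosine similarities
logits = 100.0 * sims                               # temperature scaling
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("prediction:", prompts[int(np.argmax(sims))], probs.round(3))

# One forward-diffusion step: q(x_t | x_0) mixes the clean sample with
# Gaussian noise according to the cumulative schedule term alpha_bar;
# reverse diffusion learns to undo this corruption step by step.
x0 = rng.normal(size=dim)
alpha_bar = 0.5  # assumed schedule value for illustration
noise = rng.normal(size=dim)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
print("noisy sample shape:", x_t.shape)
```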
Earn a career certificate
Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.
Build toward a degree
This course is part of the following degree program(s) offered by the University of Colorado Boulder. If you are admitted and enroll, your completed coursework may count toward your degree learning and your progress can transfer with you.