Microsoft’s VASA-1 can deepfake a person with one photo and one audio track
Microsoft’s VASA-1 is an AI model that generates lifelike videos of a person talking or singing from a single still photo and an audio track. It uses machine learning to synthesize realistic facial expressions, head movements, and lip-syncing. Trained on the VoxCeleb2 dataset, VASA-1 produces 512x512 video at up to 40 frames per second. While the technology shows promise for applications such as virtual avatars, Microsoft says it won’t be used for impersonation or malicious content creation, aiming instead at positive uses such as education and improved accessibility.