Mainstream adoption of generative AI technologies has, in large part, centered around the creation of text and images. But, as it turns out, the statistical techniques underpinning these models are just as capable of generating all manner of other media.
The latest example of this came on Monday when Google's AI lab DeepMind detailed its work on a video-to-audio model capable of generating sound to match video samples.
The model works by taking a video stream and encoding it into a compressed representation. This, alongside natural language prompts, acts as a guide for a diffusion model which, over the course of several steps, refines random noise into something resembling audio relevant to the input footage. This audio is then converted into a waveform and combined with the original video source.
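The pipeline described above — encode the video, condition a diffusion model on that encoding plus a text prompt, and iteratively refine noise into audio — can be sketched in miniature. DeepMind has not published its architecture in code, so everything here is illustrative: the function names, the toy "encoder," and the simple interpolation-style denoising step are all stand-ins, not the real system.

```python
import numpy as np

def encode_video(frames):
    # Stand-in encoder: average-pool each frame's pixels into a
    # compact per-frame embedding (a real system uses a learned network).
    return frames.mean(axis=(1, 2))

def denoise_step(latent, conditioning, step, total_steps):
    # Toy denoising update: progressively nudge the noisy latent
    # toward the conditioning signal as the schedule advances.
    weight = (step + 1) / total_steps
    return (1 - weight) * latent + weight * conditioning

def video_to_audio(frames, prompt_embedding, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    video_embedding = encode_video(frames)
    # Combine visual features with the (optional) text-prompt embedding.
    conditioning = video_embedding + prompt_embedding
    latent = rng.normal(size=conditioning.shape)  # start from pure noise
    for step in range(steps):
        latent = denoise_step(latent, conditioning, step, steps)
    # A real system would decode this latent into a waveform with a
    # neural vocoder; here the refined latent stands in for the audio.
    return latent
```

The point of the sketch is the control flow, not the math: random noise is shaped, step by step, by a signal derived jointly from the video and the prompt, which is why the model can run with or without text guidance.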
As we understand it, this approach isn't all that different from how image generation models work, but, rather than emit pictures or illustrations, it's been trained to reproduce audio patterns from video and text inputs.
Here's one of several samples DeepMind released this week showing the model in action:
DeepMind says it used a variety of datasets which not only included video and audio, as you might expect, but AI-generated annotations and transcriptions to help teach the model to associate various visual events with different sounds. This, the researchers explained, means that the model can generate audio with or without a text prompt and doesn't require manual alignment of the tracks. However, there are still some hurdles to overcome.
For one, because of how the audio is generated, the quality of the soundtrack depends on the source material: if the video quality is poor, the audio is likely to be as well. Lip sync has also proven quite challenging, to put it politely.
DeepMind expects the new model to pair nicely with those designed for video generation, including its own in-house Veo model.
According to the DeepMind team, one of the problems with the current crop of text-to-video models is that they are usually limited to generating silent films. Paired with the new video-to-audio model, the team claims, entirely AI-generated videos, complete with soundtracks and even dialogue, become possible.
Speaking of video-gen models, the category has grown considerably over the past year with more players entering the space.
ML juggernaut OpenAI unveiled its own video-generation model called Sora back in February. But Sora is just one of several models pushing the envelope of what's possible.
Among these models is one from Kling AI. Developed by (partially state-owned) Chinese tech firm Kuaishou, Kling uses a combination of diffusion transformers to generate the frames and a "3D time-space attention system" to model motion and physical interactions within the scenes. Here's the system in action:
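Kuaishou's "3D time-space attention" phrasing suggests attention computed jointly over a video's temporal and spatial dimensions, rather than factorized into separate spatial and temporal passes. Kling's actual implementation is not public, so the following is only a minimal sketch of what joint spatiotemporal attention looks like: the shapes and single-head, no-projection design are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatiotemporal_attention(tokens):
    # tokens: (T, H, W, D) video latents. Every time-space position
    # attends to every other, letting motion and spatial structure
    # be modeled in a single pass.
    T, H, W, D = tokens.shape
    x = tokens.reshape(T * H * W, D)      # flatten time and space together
    scores = x @ x.T / np.sqrt(D)         # scaled dot-product similarities
    out = softmax(scores, axis=-1) @ x    # attention-weighted mixture
    return out.reshape(T, H, W, D)
```

The design trade-off is cost: joint attention scales quadratically in T·H·W, which is presumably why longer, higher-resolution clips are the hard part.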
The results are videos that, if you don't look too closely, could easily be mistaken for real footage. On closer inspection, however, you'll quickly start to notice visual artifacts and incongruities. With that said, this seems to be a common theme with many of the AI video generators on the market today.
While details on Kling are scarce, its developers claim it's more capable than OpenAI's Sora. The model can supposedly produce videos up to two minutes in length at resolutions of 1080p and 30 frames per second. Unfortunately, access to the model is, for the moment, limited to China.
Another model builder working on video generation is Runway, which on Monday revealed its Gen-3 Alpha model. Runway has been shipping image and video generation models since early 2023.
According to Runway, Gen-3 Alpha is one of several models currently under development and was trained on a combination of videos and images paired with highly descriptive captions. According to the startup, this allowed them to achieve more immersive transitions and camera movements than was possible with previous models. Here's this one in action:
Created by Tan KW | Jun 27, 2024