Heaptalk, Jakarta — OpenAI has introduced an AI model that can create video from text. Called Sora, the model generates videos up to a minute long while, the company claimed, maintaining visual quality and adhering to the user’s prompt.
“We’re teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems that require real-world interaction,” OpenAI stated on its company page.
Sora is claimed to produce complex scenes with multiple characters, specific types of motion, and accurate details of both subject and background. The company added that the model understands not only what users ask for in their prompts, but also how those things exist in the physical world.
Researchers at OpenAI developed Sora with a deep understanding of language, which allows the model to interpret prompts accurately and generate compelling characters that express vivid emotions. Sora can also create multiple shots within a single generated video while keeping characters and visual style consistent.
Building on DALL·E and GPT models
Sora builds on previous research behind the DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for visual training data. As a result, OpenAI claimed, the model can follow the user’s text instructions in the generated video more faithfully.
However, the company acknowledged that the current model has several weaknesses, such as accurately simulating the physics of a complex scene and understanding specific instances of cause and effect. “For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark,” OpenAI mentioned.
In addition, Sora may confuse the spatial details of a prompt, for example mixing up left and right. The model may also struggle with precise descriptions of events that unfold over time, such as following a specific camera trajectory.