Heaptalk, Jakarta — OpenAI introduced its latest large language model (LLM), GPT-4o, with the "o" standing for omni (05/13). The model can reason across audio, vision, and text in real time. Users can access its text and image capabilities in ChatGPT starting on launch day, while the remaining capabilities will roll out gradually.
GPT-4o accepts any combination of text, audio, and image as input and can generate any combination of text, audio, and image as output, enabling much more natural human-computer interaction. The LLM can respond to audio input in as little as 232 milliseconds (ms), with an average of 320 ms, comparable to human response time in a conversation.
Its performance matches GPT-4 Turbo on text and code in English, with significant improvement on text in non-English languages, while being much faster and 50% cheaper in the API. OpenAI claimed that GPT-4o has an exceptional understanding of vision and audio compared to existing models.
Understanding 50 languages
“GPT-4o is our latest step in pushing the boundaries of deep learning, this time in the direction of practical usability. We spent a lot of effort over the last two years working on efficiency improvements at every layer of the stack. As a first fruit of this research, we’re able to make a GPT-4 level model available much more broadly. GPT-4o’s capabilities will be rolled out iteratively,” OpenAI stated in its official blog (05/13).
Moreover, GPT-4o understands 50 languages with improved speed and quality. The omni model delivers GPT-4 Turbo-level performance on text, reasoning, and coding intelligence at half the price of GPT-4 Turbo, and it sets new high-water marks on multilingual, audio, and vision capabilities. GPT-4 Turbo was released in November 2023 with a 128k context window.
The maker of the phenomenal AI chatbot ChatGPT trained this LLM as a single new model end-to-end across text, vision, and audio, so all inputs and outputs are processed by the same neural network. The company said, “Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.”