
Alibaba Qwen 2.5 Vision Language Model Released in a Smaller Size, Packs Agentic Capabilities

Alibaba’s Qwen team released another artificial intelligence (AI) model in the Qwen 2.5 family on Monday. Dubbed Qwen 2.5-VL-32B-Instruct, the AI model comes with improved performance and optimisations. It is a vision language model with 32 billion parameters, and joins the three billion, seven billion, and 72 billion parameter models in the Qwen 2.5 family. Like all previous models by the team, it is also an open-source AI model available under a permissive licence.

Alibaba Releases Qwen 2.5-VL-32B AI Model

In a blog post, the Qwen team detailed the company’s latest vision language model (VLM). It is more capable than the Qwen 2.5 3B and 7B models, and smaller than the foundation 72B model. The large language model’s (LLM) older versions outperformed DeepSeek-V3, and the 32B model is said to outperform similar systems from Google and Mistral.

Coming to its features, the Qwen 2.5-VL-32B-Instruct has an adjusted output style that provides more detailed and better-formatted responses. The researchers claimed that the responses are closely aligned with human preferences. Mathematical reasoning capability has also been improved, and the AI model can solve more complex problems.

The accuracy of image understanding and reasoning-focused analysis, including image parsing, content recognition, and visual logic deduction, has also been improved.


Qwen 2.5-VL-32B-Instruct
Photo Credit: Qwen

Based on internal testing, the Qwen 2.5-VL-32B is claimed to have surpassed the capabilities of comparable models, such as Mistral-Small-24B and Google’s Gemma-3-27B, on the MMMU, MMMU-Pro, and MathVista benchmarks. Interestingly, the LLM was also claimed to have outperformed the much larger Qwen 2-VL-72B model on the MM-MT-Bench benchmark.

The Qwen team highlights that the latest model can directly act as a visual agent that can reason and direct tools. It is inherently capable of computer use and phone use. It accepts text, images, and videos of more than one hour in duration as input. It also supports JSON and structured outputs.
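As a rough illustration (not from the article), multimodal prompts for Qwen-style vision models are typically assembled as chat messages that interleave text, image, and video entries. The exact field names below follow the convention used in Hugging Face chat-template examples and should be treated as an assumption, not the model's definitive API:

```python
# Hedged sketch: assembling a multimodal chat message for a Qwen-style VLM.
# The "type"/"image"/"video"/"text" keys mirror the Hugging Face
# chat-template convention; treat the exact schema as an assumption.

def build_vl_message(prompt, image_url=None, video_url=None):
    """Build one user turn mixing text with optional image/video inputs."""
    content = []
    if image_url:
        content.append({"type": "image", "image": image_url})
    if video_url:
        content.append({"type": "video", "video": video_url})
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}

# Example: a text prompt paired with a (hypothetical) image URL.
msg = build_vl_message("Describe this scene.",
                       image_url="https://example.com/photo.jpg")
```

A list of such messages would then be passed to the model's chat template or inference endpoint.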

The baseline architecture and training remain the same as the older Qwen 2.5 models. However, the researchers implemented dynamic FPS sampling to enable the model to comprehend videos at varying sampling rates. Another enhancement lets the model pinpoint specific moments in a video by gaining an understanding of temporal sequence and speed.

Qwen 2.5-VL-32B-Instruct is available to download from GitHub and its Hugging Face listing. The model comes with an Apache 2.0 licence, which allows both academic and commercial usage.
