Alibaba’s “Qwen with Questions” reasoning model beats o1-preview




Chinese e-commerce giant Alibaba has released the latest model in its ever-expanding Qwen family. Known as Qwen with Questions (QwQ), it is the latest open-source challenger to OpenAI’s o1 reasoning model.

Like other large reasoning models (LRMs), QwQ uses additional computation cycles during inference to review its answers and correct its errors, making it more suitable for tasks that require logical reasoning and planning such as mathematics and coding.

What is Qwen with Questions (QwQ) and can it be used for commercial purposes?

Alibaba released a 32 billion parameter version of QwQ with a 32,000 token context. The model is currently in preview, which means a higher-performance version will likely follow.

According to Alibaba’s tests, QwQ beats o1-preview on the AIME and MATH benchmarks, which evaluate mathematical problem-solving capabilities. It also outperforms o1-mini on GPQA, a benchmark for scientific reasoning. QwQ scores lower than o1 on the LiveCodeBench coding benchmark, but still outperforms other frontier models such as GPT-4o and Claude 3.5 Sonnet.

Example output from Qwen with Questions (QwQ)

QwQ does not come with a companion paper describing the data or the process used to train the model, making it difficult to reproduce its results. However, because the model is open, unlike OpenAI’s o1, its “thought process” is not hidden and can be inspected to understand how the model reasons through problems.

Alibaba has also released the model under the Apache 2.0 license, which means it can be used for commercial purposes.

“We discovered something profound”

According to a blog post published alongside the model’s release: “Through deep exploration and countless trials, we discovered something profound: when given time to ponder, to question, and to reflect, the model’s understanding of mathematics and programming blossoms like a flower opening to the sun… This process of careful reflection and self-questioning leads to remarkable progress in solving complex problems.”

This is very similar to what we know about how reasoning models work. By generating more tokens and reviewing previous answers, models are more likely to correct potential errors. Marco-o1, another reasoning model recently released by Alibaba, may also offer hints about how QwQ works. Marco-o1 uses Monte Carlo tree search (MCTS) and self-reflection at inference time to create different branches of reasoning and choose the best answers. The model was trained on a mix of chain-of-thought (CoT) examples and synthetic data generated with MCTS algorithms.
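To illustrate the general idea, here is a minimal Python sketch of inference-time reasoning with branching and self-reflection. It is a simplified best-of-N search, not Marco-o1’s actual MCTS implementation or QwQ’s method, and `llm_generate` is a hypothetical placeholder for any LLM completion call:

```python
# Minimal sketch: spend extra inference-time compute by generating several
# reasoning branches, self-scoring each one, and keeping the best answer.
# NOTE: `llm_generate` is a hypothetical placeholder for an LLM API call;
# this is a simplified best-of-N search, not Marco-o1's actual MCTS.
from typing import Callable, List, Tuple

def solve_with_reflection(
    question: str,
    llm_generate: Callable[[str], str],
    num_branches: int = 4,
) -> str:
    candidates: List[Tuple[float, str]] = []
    for _ in range(num_branches):
        # Each branch spends extra tokens thinking step by step before answering.
        reasoning = llm_generate(
            f"Question: {question}\nThink step by step, then state your final answer."
        )
        # Self-reflection: ask the model to critique its own reasoning and rate it 0-10.
        critique = llm_generate(
            "Rate the following solution from 0 to 10 for correctness. "
            f"Reply with a number only.\n\n{reasoning}"
        )
        try:
            score = float(critique.strip().split()[0])
        except (ValueError, IndexError):
            score = 0.0
        candidates.append((score, reasoning))
    # Return the branch the model itself judged most likely to be correct.
    return max(candidates, key=lambda c: c[0])[1]
```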

Alibaba points out that QwQ still has limitations, such as mixing languages or getting stuck in circular reasoning loops. The model is available for download on Hugging Face, and an online demo can be found on Hugging Face Spaces.
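For reference, here is a minimal sketch of loading the preview model with the Hugging Face transformers library. It assumes the repo id `Qwen/QwQ-32B-Preview` and a GPU with enough memory; check the model card for the exact identifier and recommended settings:

```python
# Minimal sketch: loading and querying the QwQ preview with Hugging Face transformers.
# Assumes the repo id "Qwen/QwQ-32B-Preview"; see the model card for exact details.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many r's are in the word 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models benefit from a generous token budget for their "thinking" steps.
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```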

The LLM era gives way to LRMs: large reasoning models

The release of o1 has sparked growing interest in creating LRMs, although not much is known about how the model works behind the scenes, other than that it uses inference-time scaling to improve its responses.

There are now several Chinese competitors to o1. Chinese AI lab DeepSeek recently released R1-Lite-Preview, its first o1 competitor, which is currently available only through the company’s online chat interface. R1-Lite-Preview reportedly beats o1 on several key benchmarks.

Another recently released model is LLaVA-o1, developed by researchers from several Chinese universities, which brings the inference-time reasoning paradigm to open-source vision language models (VLMs).

The focus on LRMs comes at a time of uncertainty about the future of model scaling laws. Reports indicate that AI labs like OpenAI, Google DeepMind, and Anthropic are getting diminishing returns from training larger models. And creating larger volumes of quality training data is becoming increasingly difficult as models are already trained on trillions of tokens collected from the internet.

Meanwhile, inference-time scaling offers an alternative that could provide the next breakthrough in improving the capabilities of the next generation of AI models. There are reports that OpenAI is using o1 to generate synthetic reasoning data to train the next generation of its LLMs. The release of open reasoning models is likely to spur progress and make the space more competitive.


