In November 2024, the Allen Institute for AI (Ai2) announced OLMo 2, a family of open-source large language models (LLMs) that it claims are on par with other leading open models like Meta’s Llama.
What makes OLMo 2 stand out from other LLM releases is that it is fully open source, giving users access to the data used to train the model, the AI model’s “secret sauce.”
Even widely used open models like Llama have not disclosed their data sources; their developers release only the model weights.
That mix of promising performance against leading models and full openness makes OLMo 2 one of the most important model families to watch in 2025.
We take a look at everything we know about the OLMo 2 models so far, from how they were trained to what they mean for the artificial intelligence community.
Key Takeaways
- Allen Institute for AI (Ai2) announces OLMo 2, a family of fully open-source LLMs.
- Ai2 claims OLMo 2 can outperform Meta’s Llama models on some benchmarks.
- Ai2 was founded in 2014 by Microsoft co-founder Paul Allen.
- OLMo 2 shows that fully open models can be competitive against other LLMs.
- Such models could eventually challenge proprietary models like ChatGPT.
Everything We Know About OLMo 2 So Far
Ai2 was founded by Microsoft co-founder Paul Allen in 2014 with a mission to conduct “high-impact research and engineering in the field of artificial intelligence, all for the common good.”
In February 2024, Ai2 released the first version of its OLMo models, which it has now updated with the launch of OLMo 2. The OLMo 2 family features two main models, with 7B and 13B parameters, trained on up to 5 trillion tokens, or around 3.75 trillion words.
The latest models meet the criteria recently set by the Open Source Initiative to qualify as legitimate open-source AI.
Meet OLMo 2, the best fully open language model to date, including a family of 7B and 13B models trained up to 5T tokens. OLMo 2 outperforms other fully open models and competes with open-weight models like Llama 3.1 8B — As always, we released our data, code, recipes and more
— Ai2 (@allen_ai) November 26, 2024
“Because fully open science requires more than just open weights, we are excited to share a new round of OLMo updates — including weights, data, code, recipes, intermediate checkpoints, and instruction-tuned models — with the broader language modeling community,” the announcement blog post said.
OLMo 2 is designed to support a range of use cases, including question answering, text summarization, content creation, code generation, translation, solving math problems, and more. OLMo 2 model weights and data can be downloaded for free via Hugging Face, and the training code can be accessed via GitHub.
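For anyone who wants to try the base model locally, the weights load with the standard Hugging Face transformers workflow. Below is a minimal sketch; the model ID is an assumption based on the naming in Ai2’s Hugging Face collection, so confirm the exact identifier there, and note that OLMo 2 support requires a recent transformers release.

```python
# Minimal sketch: loading and prompting an OLMo 2 base model via Hugging Face transformers.
# The model ID is an assumption based on Ai2's collection naming; verify it on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # assumed ID; confirm in the allenai collection
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Fully open language models matter because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```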
In terms of performance, the initial stats offered by Ai2 appear promising. For example, the OLMo 2 7B model is said to outperform Meta’s Llama 3.1 8B model on some English-language benchmarks.
It’s also worth noting that Ai2 used a post-training recipe known as Tülu 3 to build OLMo 2 Instruct, a variant of the models designed to follow user instructions more accurately.
The OLMo 2 13B Instruct model is available via the Ai2 Playground and is said to outperform Qwen 2.5 14B Instruct, Tülu 3 8B, and Llama 3.1 8B Instruct.
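The instruction-tuned variants follow the usual chat-template workflow in transformers. The sketch below assumes the Instruct checkpoint is published under a name following Ai2’s convention; confirm the exact ID on Hugging Face before use.

```python
# Minimal sketch: querying an OLMo 2 Instruct model through its chat template.
# The checkpoint name is an assumption based on Ai2's naming scheme; confirm on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-13B-Instruct"  # assumed ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Summarize why fully open models matter."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```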
How Was OLMo 2 Trained? Ai2’s Pre-Training Process
OLMo 2’s pre-training process had two main stages. During the first stage, Ai2 used a collection of 3.9 trillion tokens sourced from DCLM, Dolma, StarCoder, and Proof Pile 2.
During the second stage of the pre-training process, the researchers curated web data to further train the model.
This content was filtered so that only high-quality and domain-specific data, such as academic content, Q&A forums, instruction data, and math workbooks, was fed into the model.
This collection of synthetic and human-generated content is available via Hugging Face and consists of 843 billion tokens. Each of these stages is designed to ensure that OLMo 2 can respond to user inputs with greater accuracy.
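Because that stage-two mix is published openly (it appears in the references as the allenai/dolmino-mix-1124 collection on Hugging Face), researchers can inspect it directly. Here is a minimal sketch using the datasets library in streaming mode, which avoids downloading the full 843-billion-token corpus:

```python
# Minimal sketch: peeking at OLMo 2's stage-two training mix without a full download.
# Uses the allenai/dolmino-mix-1124 dataset listed in the references; depending on
# how the dataset is laid out, a specific subset/config name may be required.
from datasets import load_dataset

stream = load_dataset("allenai/dolmino-mix-1124", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example)   # each record is a dict of text and metadata fields
    if i >= 2:       # inspect just the first few records
        break
```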
Why Does OLMo 2 Matter?
Ai2’s OLMo 2 appears to be a critical release because it demonstrates how a fully open-source approach can be used to offer researchers more transparency about how a model was trained and why it generates the outputs it does.
Meta’s Llama models are very popular among AI researchers, but they aren’t fully open-source. Instead, Llama takes an open-weight approach that provides transparency over the parameters learned during the training process, but not the data the model was trained on.
This means that a developer using an open-weight model can’t fully understand why a model has chosen to produce the output that it has, which raises questions about how comprehensive the original training data was and whether it was subject to bias or prejudice.
At the same time, the more models like OLMo 2 emerge with a fully open approach, the more resources researchers are going to be able to call upon to train their own solutions. The more developers share pre-training techniques and datasets, the more these models can advance as a whole.
If enough researchers release fully open models like OLMo 2, then we could see open-source chatbots start to emerge that compete more effectively against proprietary AI solutions like OpenAI’s ChatGPT or Google Gemini, which offer less insight into how decisions are made.
The Bottom Line
OLMo 2 appears to be an interesting addition to the open-source AI landscape and gives users extensive insight into the data used to train the models.
If more AI researchers or research institutes like Ai2 go all-in and make weights and training data available to other researchers, then the gap between open source and proprietary AI is likely to close further as the community learns how to build better chatbots.
FAQs
What is Ai2’s OLMo 2 model?
OLMo 2 is a family of fully open-source large language models from the Allen Institute for AI (Ai2), released in November 2024 in 7B and 13B parameter sizes, with weights, training data, code, and recipes all published.
How does OLMo 2 compare to Meta’s Llama?
Ai2 says OLMo 2 7B outperforms Llama 3.1 8B on some English-language benchmarks, and unlike Llama, which releases only model weights, OLMo 2 also discloses its training data.
Why is OLMo 2 important for AI research?
Its fully open approach gives researchers transparency into how the model was trained and provides datasets, code, and recipes they can reuse to build and improve their own models.
What tasks can OLMo 2 perform?
Question answering, text summarization, content creation, code generation, translation, and solving math problems, among others.
Where can I access OLMo 2?
Model weights and data can be downloaded for free from Hugging Face, the training code is on GitHub, and the OLMo 2 13B Instruct model can be tried in the Ai2 Playground.
How does OLMo 2’s transparency benefit AI development?
Open weights, data, and code let developers understand why a model produces the outputs it does, check the training data for bias, and share techniques that help open models advance as a whole.
References
- Ai2 Playground (Playground.allenai)
- Perceptual Reasoning and Interaction Research (Prior.allenai)
- OLMo 2: The best fully open language model to date | Ai2 (Allenai)
- OLMo 2 – a allenai Collection (Huggingface)
- GitHub – allenai/OLMo: Modeling, training, eval, and inference code for OLMo (Github)
- mlfoundations/dclm-baseline-1.0 · Datasets at Hugging Face (Huggingface)
- allenai/dolma · Datasets at Hugging Face (Huggingface)
- bigcode/starcoderdata · Datasets at Hugging Face (Huggingface)
- EleutherAI/proof-pile-2 · Datasets at Hugging Face (Huggingface)
- allenai/dolmino-mix-1124 · Datasets at Hugging Face (Huggingface)