#opensourcellm

On Monday, a close-knit group of international experts met in Lausanne to advance open and collaborative AI development; I had the privilege of supporting and contributing to the proceedings in a workshop setting. The Open-Source LLM Builders Summit was hosted by the EPFL AI Center and the ETH AI Center on behalf of the Swiss AI Initiative. The notes in this blog are personal, and should not be taken as an official statement.

2nd Open Source LLM Builders Summit | Swiss AI

Visit the Swiss AI website for the latest info

Presentation recordings

Overview

Open-source LLMs are released under open-source licenses, allowing anyone to use, modify, and distribute them freely. Trained on large datasets, they can perform tasks such as text generation, translation, or question answering. These models offer powerful tools while promoting inclusivity and transparency, but their use requires careful consideration of technical, ethical, and legal factors. Early examples include GPT-J (Apache license), BLOOM (RAIL), XLM-R and GPT-2 (MIT).

A Comprehensive Overview of Large Language Models
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only provide a systematic survey but also a quick comprehensive reference for the researchers and practitioners to draw insights from extensive informative summaries of the existing works to advance the LLM research.

While the focus of the talks was mostly on the technical development of LLMs, among the participants were people collaborating on Public AI: an actively researched concept, and a platform for providing access to open models through a public inference utility. I spoke about this recently in an aiLights webinar: not needing to host the models themselves makes open-source LLMs more accessible and usable by the general public, especially for those who may not have the resources or expertise to deploy such infrastructure. We covered this in the later workshop sessions.

This was the second edition of the LLM Builders Summit (see 2025 PR, recap, slides, videos), emphasizing the importance of collaboration, open standards, and the growth of international projects. The presentations ranged from a historical view of CERN’s collaborative principles to persuasive arguments on the need for transparent governance in open models. The day highlighted the work of diverse teams using centralized training infrastructure (e.g., Apertus, EuroLLM, SEA-LION), covering their progress and challenges.

Photo (c) EPFL AI Center, used with permission

We discussed the value of standardized reporting, community building, and public infrastructure access in breakout sessions. The practical challenges of data sovereignty in the legal sense, and of sustaining long-term commitment in the financial one, were covered alongside the strategic importance of developer engagement. The event underscored growing international efforts toward transparent, collaborative, and open AI development for the public good.

Model teams

In my notes below, you'll find links to the publications and community hubs of the various LLMs that were represented today. Update 6.3.2026: a couple of glitches have been addressed, after this table was used in a presentation – with my apologies & gratitude for the feedback.

| Model Name | Number of Parameters | License |
| --- | --- | --- |
| Apertus | 8B, 70B | Apache 2.0 |
| EuroLLM | 1.7B, 9B, 22B | Apache 2.0 |
| OpenEuroLLM | 2.15B, 22B | Apache 2.0 |
| Olmo 3 | 7B, 32B | Apache 2.0 |
| Qwen 3 | 1.7B, 4B, 8B, 30B | Apache 2.0 |
| K2-V2 | 70B | Apache 2.0 |
| SEA-LION | 3B, 7B, 70B | MIT |
| GLM 5 | 355B (32B active), 744B (40B active) | MIT |

You can find further live coverage on Mastodon and LinkedIn tagged #opensourcellm


We started the day with a warm welcome from Prof. Antoine Bosselut, one of the Apertus leads and head of the famous NLP lab at EPFL. The group was encouraged to keep our minds open to each other, even where we hold quite contradictory ideas on methods and milestones.

The day's organizers, Federico Fanalista and co-lead Arnout Devos, representing the two AI Centers, chimed in with the ground rules before we quickly dived in. For the next four hours, save for a short break, we were glued to our seats by an extremely in-depth and challenging set of presentations.

Andreas Hoecker (CERN) helped to further set the tone of the day with "Lessons from International Collaboration", showing how trust, respect, and clear processes are key for complex, visionary projects. He pointed out that long-term collaborations are built on commitment to transparent governance, not just a shared set of goals.

« Successful large collaborations are built on mutual trust, respect, and transparent, fair procedures, with leadership grounded in excellence and expertise. Open and Inclusive governance - through clear structures, representative decision bodies, merit-based evaluation of ideas, consensus building, and collective ownership of major decisions - ensures that all members, including early-career researchers, have a voice and that innovation is encouraged and rewarded »

Dr. Imanol Schlag was up next to give everyone a quick deep dive on Apertus, starting with the broad perspective of unprecedented investments globally and in Switzerland, focusing on the technical and infrastructure challenges of training the first model, and closing with a roadmap for the next release, version 1.5.

With swift delivery, he put the world of Fully Open LLMs on one page, and outlined ways we already cooperate through the various components used in training and supporting open source models.

Dr. Schlag spoke in detail about the "fully open" path with Apertus, balancing legal and technical challenges to keep data and models transparent. The tough road ahead is one we need to stay on, with sights aimed squarely on the public good.

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today’s open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting `robots.txt` exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.

The EuroLLM team, represented by André F. T. Martins, next showcased their work on transparent, standardized training and highlighted the importance of reproducibility in AI research. Having a shared language across all the labs is an important task at the European level.

This was the first model that I ran in a production setting with the Farming Hackdays last year, and the group has come a long way. Their adherence to European principles is illustrated in their exciting 22B model release a few days ago. Let's get benchmarking!

EuroLLM-22B: Technical Report
This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B’s development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.

Kyle Lo (Allen AI) gave a quick and super practical talk, showing us new tools for data mixing that can boost model performance. The critical issue is not just throwing in more data, but making sure it is the right mix. It got some of us scribbling notes for our own projects.

His colleague Luca Soldaini joined in to give an in-depth talk on state-of-the-art reasoning.

Olmix: A Framework for Data Mixing Throughout LM Development
Data mixing -- determining the ratios of data from different domains -- is a first-order concern for training language models (LMs). While existing mixing methods show promise, they fall short when applied during real-world LM development. We present Olmix, a framework that addresses two such challenges. First, the configuration space for developing a mixing method is not well understood -- design choices across existing methods lack justification or consensus and overlook practical issues like data constraints. We conduct a comprehensive empirical study of this space, identifying which design choices lead to a strong mixing method. Second, in practice, the domain set evolves throughout LM development as datasets are added, removed, partitioned, and revised -- a problem setting largely unaddressed by existing works, which assume fixed domains. We study how to efficiently recompute the mixture after the domain set is updated, leveraging information from past mixtures. We introduce mixture reuse, a mechanism that reuses existing ratios and recomputes ratios only for domains affected by the update. Over a sequence of five domain-set updates mirroring real-world LM development, mixture reuse matches the performance of fully recomputing the mix after each update with 74% less compute and improves over training without mixing by 11.6% on downstream tasks.
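The mixture-reuse idea can be illustrated with a toy renormalization step (my own sketch, not the Olmix code): when the domain set changes, keep the existing ratios for untouched domains, assign fresh ratios only to the affected ones, and renormalize so everything still sums to one.

```python
def reuse_mixture(old_ratios, removed=(), added=None):
    """Toy sketch of mixture reuse (not the Olmix implementation).

    old_ratios: dict of domain -> sampling ratio (sums to 1).
    removed:    domains deleted from the corpus.
    added:      dict of new domains -> provisional ratios, e.g. from a
                small-scale proxy run over just those domains.
    Untouched domains keep their *relative* weights; only the affected
    domains get new values, then everything is renormalized.
    """
    kept = {d: r for d, r in old_ratios.items() if d not in removed}
    kept.update(added or {})
    total = sum(kept.values())
    return {d: r / total for d, r in kept.items()}
```

For example, dropping a "papers" domain and adding a "math" domain leaves the web-to-code ratio exactly where the earlier optimization put it, which is what saves the recomputation cost.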

Hector Liu (Kimi AI, MBZUAI) next described the "uphill battles" of open model development: how to sustain it despite funding and talent gaps. It felt like he was telling us to be realistic: share the load and be smarter about resources, one in particular being talent! The discussion included thoughts on large-scale distributed collaboration, open-source models, and "high TPP validation using small-scale proxies". Lots of great practical hacks.

K2-V2: A 360-Open, Reasoning-Enhanced LLM
We introduce K2-V2, a 360-open LLM built from scratch as a superior base for reasoning adaptation, in addition to functions such as conversation and knowledge retrieval from general LLMs. It stands as the strongest fully open model, rivals open-weight leaders in its size class, outperforms Qwen2.5-72B and approaches the performance of Qwen3-235B. We actively infuse domain knowledge, reasoning, long-context, and tool use throughout the training process. This explicitly prepares the model for complex reasoning tasks. We demonstrate this potential using simple supervised fine-tuning, establishing a strong baseline that indicates significant headroom for advanced alignment. By releasing the full training history and data composition, we maximize the effectiveness of continuous training, a key open source production scenario. We release the model weights and signature LLM360 artifacts, such as complete training data, to empower the community with a capable, reasoning-centric foundation.

Jian Gang Ngui (AI Singapore) shared the SEA-LION model’s journey, which was built on a public leaderboard: a friendly competition that drives everyone forward. The talk made me think about how we can do something similar for open models, especially with more diverse languages and data.

SEA-LION: Southeast Asian Languages in One Network
Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.

Matthias Bethge (Tübingen University) presented OpenEuroLLM: a series of foundation models for transparent AI in Europe. We were walked through their work on standardized reporting and the proposed OpenPipeline, which could solve a lot of our headaches when comparing models by giving us a common language for how we report results.

LLM generation novelty through the lens of semantic similarity
Generation novelty is a key indicator of an LLM’s ability to generalize, yet measuring it against full pretraining corpora is computationally challenging. Existing evaluations often rely on lexical overlap, failing to detect paraphrased text, or do not consider the full pretraining corpus. We frame novelty as a semantic retrieval problem. This framing enables us to address novelty with modern embedding and indexing pipelines, allowing for efficient analysis at pre-training scale. Specifically, we propose a three-stage framework that retrieves semantically similar samples, reranks them at varying subsequence lengths, and calibrates scores using a human novelty reference for interpretability. We apply this framework to the SmolLM model family and report three key findings: (1) models draw on pre-training data across much longer sequences than previously reported; (2) some task domains systematically promote or suppress generation novelty; and (3) instruction tuning not only alters style but also increases novelty. These results highlight the value of semantic novelty analysis for studying generalization. To support reproducibility and further research, we release ~20 TB of corpus chunks and index artifacts at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
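The core retrieval step can be sketched in a few lines (my own toy version, assuming generic embedding vectors rather than the paper's full pipeline): embed a generated passage, find its nearest pre-training chunks by cosine similarity, and treat one minus the best match as a raw novelty score.

```python
import numpy as np

def novelty_score(gen_vec, corpus_vecs):
    """Toy novelty-as-retrieval sketch (not the paper's full pipeline).

    gen_vec:     embedding of a generated passage, shape (d,).
    corpus_vecs: embeddings of pre-training chunks, shape (n, d).
    Returns 1 - max cosine similarity: 0 means a semantically identical
    chunk exists in the corpus; values near 1 mean nothing similar was
    found.  A real system would use an ANN index (e.g. FAISS) instead
    of a dense matrix product, then rerank and calibrate the scores.
    """
    q = gen_vec / np.linalg.norm(gen_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return 1.0 - float((c @ q).max())
```

The paper's reranking at varying subsequence lengths and human-reference calibration then sit on top of this raw score.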

We were invited to stay tuned for the next AI in Science (AIS) summit.

Post by @loleg@hachyderm.io
View on Mastodon

Yuxuan Zhang (Zhipu AI) next took us through their GLM series, emphasizing the modularity of the architecture and tips on developing multimodal features, which sparked some cool ideas on how we might adapt this for our own open models. This week was a big release (GLM 5), and the energy was tangible.

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents’ reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose \textbf{Citation-aware Rubric Rewards (CaRR)}, a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce \textbf{Citation-aware Group Relative Policy Optimization (C-GRPO)}, which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.

Junyang Lin (Alibaba Qwen) brought a fresh perspective on the developer community and how it drives the roadmap of open models. It’s not just about the tech; it’s about the people using it. He keeps a close eye on the needs and opportunities for collaboration, with case studies from the recent Coder Next update.

The models are being trained on a completely different scale here, and it was just fantastic to hear the exchange taking place between the teams. The Qwen team shared that hybrid architectures work, and that’s a good sign for the open models we’re all working on.

Parallel Scaling Law for Language Models
It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more inference-efficient scaling paradigm: increasing the model’s parallel computation during both training and inference time. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.
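The mechanism is simple enough to sketch as a toy numpy forward pass (my own simplification, not the paper's implementation): apply P learnable input transformations, run the same shared model forward P times, and aggregate the P outputs with learned softmax weights.

```python
import numpy as np

def parscale_forward(x, model, transforms, agg_logits):
    """Toy sketch of parallel scaling (ParScale), not the paper's code.

    x:          input vector.
    model:      a single shared forward function; its parameters are
                reused across all P streams, which is the point of the
                method (compute scales, parameters do not).
    transforms: P learnable input transformations.
    agg_logits: P learnable scalars; a softmax over them gives the
                aggregation weights for the P parallel outputs.
    """
    outs = np.stack([model(t(x)) for t in transforms])  # shape (P, d)
    w = np.exp(agg_logits - agg_logits.max())
    w = w / w.sum()                                     # softmax weights
    return np.tensordot(w, outs, axes=1)                # weighted sum
```

With identical transforms and uniform weights this collapses to a single forward pass, which makes the recycling claim plausible: an existing model is a valid starting point, and post-training only has to diversify the streams.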

Joost VandeVondele discussed production machine learning at scale at the Swiss National Supercomputing Centre (CSCS), in particular the ALPS infrastructure for world-class science where the Apertus model is being developed. He keeps a close watch on capacity, making sure the systems are neither over- nor under-utilized. He offered great perspectives on sustainability aspects and the open-source stack involved. There was discussion of container management, observability, and the ubiquitous problem of early job aborts.

Evolving HPC services to enable ML workloads on HPE Cray EX
The Alps Research Infrastructure leverages GH200 technology at scale, featuring 10,752 GPUs. Accessing Alps provides a significant computational advantage for researchers in Artificial Intelligence (AI) and Machine Learning (ML). While Alps serves a broad range of scientific communities, traditional HPC services alone are not sufficient to meet the dynamic needs of the ML community. This paper presents an initial investigation into extending HPC service capabilities to better support ML workloads. We identify key challenges and gaps we have observed since the early-access phase (2023) of Alps by the Swiss AI community and propose several technological enhancements. These include a user environment designed to facilitate the adoption of HPC for ML workloads, balancing performance with flexibility; a utility for rapid performance screening of ML applications during development; observability capabilities and data products for inspecting ongoing large-scale ML workloads; a utility to simplify the vetting of allocated nodes for compute readiness; a service plane infrastructure to deploy various types of workloads, including support and inference services; and a storage infrastructure tailored to the specific needs of ML workloads. These enhancements aim to facilitate the execution of ML workloads on HPC systems, increase system usability and resilience, and better align with the needs of the ML community. We also discuss our current approach to security aspects. This paper concludes by placing these proposals in the broader context of changes in the communities served by HPC infrastructure like ours.

Timothy Llewellynn (Euresearch) introduced the EuroHPC Joint Undertaking: a set of access calls and collaboration opportunities for the AI Factory model that is being adopted across Europe. The nearly one hundred high-performance computing sites represent a formidable capacity, if applied properly. I had a chance to chat with Timothy in the break about the various opportunities of the Cross-Cluster program and its network for projects in Switzerland.

AI Factories: It’s time to rethink the Cloud-HPC divide
The strategic importance of artificial intelligence is driving a global push toward Sovereign AI initiatives. Nationwide governments are increasingly developing dedicated infrastructures, called AI Factories (AIF), to achieve technological autonomy and secure the resources necessary to sustain robust local digital ecosystems. In Europe, the EuroHPC Joint Undertaking is investing hundreds of millions of euros into several AI Factories, built atop existing high-performance computing (HPC) supercomputers. However, while HPC systems excel in raw performance, they are not inherently designed for usability, accessibility, or serving as public-facing platforms for AI services such as inference or agentic applications. In contrast, AI practitioners are accustomed to cloud-native technologies like Kubernetes and object storage, tools that are often difficult to integrate within traditional HPC environments. This article advocates for a dual-stack approach within supercomputers: integrating both HPC and cloud-native technologies. Our goal is to bridge the divide between HPC and cloud computing by combining high performance and hardware acceleration with ease of use and service-oriented front-ends. This convergence allows each paradigm to amplify the other. To this end, we will study the cloud challenges of HPC (Serverless HPC) and the HPC challenges of cloud technologies (High-performance Cloud).

Blueprints for Global Collaboration

There were a lot of questions asked and a lively discussion took place in the breaks between the talks. The workshop sessions were where everyone had a chance to contribute in smaller groups. In the one I joined, facilitated by Valentina Pyatkin, we talked about shared infrastructure access – i.e., how to collaborate globally without losing our autonomy.

A workshop participant interacts with a Reachy Mini

Here is a summary of the round table, made with Apertus from my own notes:

We must recognize that collaboration and transparent sharing of data and research are critical for advancing AI, but strict European regulations pose hurdles. The CERN model emphasizes collaboration towards a common goal, but our own AI infrastructure needs to be viable for 40 years, and competing with hyperscalers requires strategic partnerships. The growth of Hugging Face and its ecosystem demonstrates the power of organic collaboration, enabling small teams to make significant impacts.
While we've leveraged existing LLM pipelines and large datasets (like Common Crawl and Transformers), competition with American labs is challenging due to current geopolitical tensions. European fragmentation in AI development—with many projects but limited innovation—remains an issue, unlike China’s more unified approach, where open-source contributions are often a prerequisite for cooperation, even within universities.
Photo by Daniel Naeff
Training LLMs locally and contributing to open-source projects, such as through fine-tuning, is recognized and valued. The most difficult aspect remains the training pipeline, especially when faced with diverse hardware (e.g., not just NVIDIA GPUs but also NPUs from Chinese companies) and data scarcity. Datasets are crucial but often proprietary, making it hard to open up without corporate cooperation. We share our training pipeline and code, but the pre-training design is even more valuable to share.
To address fragmentation, fostering open-source community engagement and collaboration across borders is key. A potential solution could involve aligning with larger entities like corporate acquisitions or integrating with existing smaller platforms, though proving impact through products or academic papers is essential. Ultimately, true success is measured by actual applications and community engagement (e.g., GitHub stars, repo usage), not just model popularity metrics.
A plate of snacks with a roundtable in the background

This year’s summit felt like a professional and empathetic reunion, with the Swiss AI Initiative doing a great job of hosting all the visitors. It was an energizing mix of tech talks, real talk, and tasty food (special thanks to Jocelyne for the Moroccan treats pictured above!). The day ended with a sense of community and engaged perseverance. The summit left me with a sense that the dark clouds of foreboding over the AI industry carry within them a silver lining of opportunity for these communities engaged in open development. I intend to carry the learnings and this conviction into my engagement here today, and in the days to come.

Sunset over the Jura on my way home to Bern