#apertus #mlx
This post describes the process of installing and using Apertus MLX builds on an Apple MacBook. It covers the installation of LM Studio, some basic performance testing, and links to more in-depth analysis and further references. If this all feels like too much and you would rather use a ready-to-go online service, see my previous post for more direct waypoints:
Still with me? 👌😁 OK!
The laptop on my desk has a current M4 chip and 24 GB of RAM (non-Pro), currently retailing at under 1500 CHF. The new MacBooks have been widely discussed (along with the Pro and mini M4 editions) as an excellent choice for users and developers. I have seen Apertus running impressively well on a Mac mini, and wondered how much mileage one could get out of a mid-range laptop with around 16 GB of memory for loading AI models. The laptop reviewed here is on loan from the AI Center at EPFL, where I've just started work 🎉
Note: the statements in my blog are my own and not those of my employer.
It's great to see Apertus making top headlines in January, let's keep this up!
Why use local AI?
2026 is the year of AI on the desktop, with every IT company rushing out updates to hardware and software that enable various levels of intelligent service. Some people find this an alarming trend - planned obsolescence is certainly not good for the environment. Meanwhile, the ability to use an AI model while offline, or to better understand and develop AI services, is a clear ask. Learning to become more proficient with local AI may also help to stem some of our wasteful dependence on heavy cloud services.
Some arguments for being aware of and minimizing the footprint of desktop AI agents were recently covered in an insightful talk by Whittaker and Tiwari, yet the actual practice of installing LLMs, and clarity around hardware requirements and capabilities, is not yet there, at least not at the consumer level. The dark patterns and underwhelming safeguards are part of a slippery slope towards losing our digital self-determination.
For many people, the “luxury” of cloud-less computing comes down to the hardware: do you have a machine that can run the model with the performance you need? Are you able to invest the time into building a Home Lab? Much progress is being made on tiny models and CPU (rather than GPU) optimization, which makes the technology available everywhere. Clearly, we need a lot of computational juice (and corresponding energy usage) to achieve some level of parity with what we are used to online – and in any case, we need to be ready to sacrifice some speed.
As an open hardware fan, I would have loved to go for a Swiss-made machine, like the Why! Laptop, whose gamer models with NVIDIA chips have enough video RAM. The StarLabs, Framework and Tuxedo laptops all take a while longer to ship. The Ryzen AI models are also ones to watch, like the Gen10 from Tuxedo: with 32 GB and the top-level HX 370 chip, it currently retails for around 1800 Euros. Over time, I aim to replace my Arch Linux-powered Latitude and home workstations with one of these slick and sustainably manufactured machines.
Being quite comfortable with Linux and Windows operating systems, I've decided to give macOS a fresh look, partly to broaden my interactions with the wider user community. Oh, that keyboard layout will take some getting used to.
Yea, yea, but how do you make a tilde (~) sign again?
What's the deal with MLX?
As announced by Apple last March, the M4 chip on the MacBook performs a cut above its predecessors, including the M3 Pro – at least in synthetic benchmarks. The edge on AI tasks is helped by interesting features, like the Scalable Matrix Extension - benchmarked and explored in more detail here:
MLX is a framework for Apple Silicon (from the M1 chip onwards) that leverages the unified memory architecture and Metal GPU acceleration for efficient CPU/GPU operations. It simplifies both inference and AI app development on the new chips, and enables local training of models on data that stays on the device, with support for low-rank adapters (LoRA) and quantized training for efficiency. Along with Apple Intelligence, it gives Mac users a ticket to secure, offline-capable AI applications for sensitive tasks. You can read all about it on the developer site:
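To get a feel for the programming model, here is a minimal sketch in Python, assuming only that the mlx package is installed: arrays live in unified memory, operations are recorded lazily, and the work is dispatched to the Metal GPU when you ask for the result.

import mlx.core as mx

# Arrays sit in unified memory, so there is no explicit copying between CPU and GPU.
a = mx.random.normal((2048, 2048))
b = mx.random.normal((2048, 2048))

# Operations are lazy: this line only records the computation.
c = a @ b

# Evaluation happens here, on the Metal GPU by default.
mx.eval(c)
print(c.shape, c.dtype)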
Through an optimization of the model weights, a file is produced in a format that works magically well on Apple's chips. As a user of LM Studio, you do not need to do this yourself, as community builds are available. As a developer, or an advanced user who wants to try a fresh model release, the procedure is relatively straightforward.
I converted the latest Apertus 8B model build to MLX using these commands; the process took about 25 minutes on the MacBook:
git clone https://github.com/ml-explore/mlx-lm.git
uv pip install mlx_lm mlx
uv run mlx_lm convert --hf-path swiss-ai/Apertus-8B-Instruct-2509

(The setuptools install didn't work quite right, hence the lazy pip install.)
In my tutorial below, I will use the mlx-community builds, which exist in several quantized forms and which you can download directly in LM Studio.
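If you would rather try an MLX build outside LM Studio, mlx_lm also offers a small Python API. A minimal sketch, assuming the mlx-community bf16 build named below (double-check the exact repository name on Hugging Face) and enough free memory to hold it:

from mlx_lm import load, generate

# Download (or reuse the cached) MLX weights from Hugging Face and load them into memory.
model, tokenizer = load("mlx-community/Apertus-8B-Instruct-2509-bf16")

# Wrap the question in the instruct model's chat template.
messages = [{"role": "user", "content": "Give me three facts about Switzerland."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True prints generation statistics such as tokens per second.
text = generate(model, tokenizer, prompt=prompt, verbose=True)
print(text)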
Getting started
In the benchmarking business, one likes to start with a clean slate: by making sure to disable Siri and Apple Intelligence on first boot, I can help to ensure that as much memory and compute capacity as possible is leveraged by the services of my choice.
I must give a hat-tip to Apple for making it quite easy to disable or re-enable these at any point: just be aware that data will be collected and shared with third parties (as in the case of the ChatGPT Extension) as soon as you accept the T&Cs.
For more tips on setting up a highly performant and secure Mac, find yourself a good Hardening Guide, and follow it. Or just ask your friendly local IT desk 😺
Installing LM Studio
Get LM Studio for free by navigating to the official website: https://lmstudio.ai/ or the open source repositories: https://github.com/lmstudio-ai (MIT license)
Note that everything I cover here is also possible with the Windows and Linux editions. You can even run LM Studio on Raspberry Pi and other ARM devices.
Once you go through the initial wizard, you should be presented with Mission Control, where you can search for models of your choice from the huge catalog at Hugging Face. Type 'Apertus' to see a list of 14 versions (at time of writing).
For use on this MacBook, I picked Apertus-8B-Instruct-2509-bf16, a remix of the model published by mlx-community. Click Download, noting that it is unquantized, offered here in the same BF16 (Brain Floating Point) precision as the stock 8B model.
Load model
It took about 15 minutes to download the model on a fast connection. When I click the Load model button in the screenshot below, a "Failed to load the model" dialog cautions me about insufficient system resources; no great surprise, since an 8B-parameter model in BF16 needs roughly 16 GB for the weights alone, which is tight on a 24 GB machine. I have disabled this guardrail (set to OFF) and have not yet seen issues with system performance. Watching my memory consumption, LM Studio has not yet used up the available 15-20 GB, even with a full context window. This warning is an issue I'll investigate further.
If all goes well, after you select the model, LM Studio will begin to load it into the memory of your machine. This warm-up process takes me about 15-20 seconds.
Now you can start chatting, uploading files, heating up the room – all in the privacy of your local GPU. It may not work as quickly or comprehensively as you are used to, but at all times you remain in control of your algorithm.
A few other gotchas to be aware of:
When I run out of context window (uploading too many files into the RAG, using a very long prompt or system prompt, or calling tools), the message "The AI has nothing to say" appears. Keep an eye on the status bar, where it is clear what is going on - that's the 143.5% full message in my screenshot:
Another is the memory usage mentioned above. I am using the btop++ utility to keep track of it in real time. Here you can see the marked drop-off in memory usage as models are unloaded and reloaded:
LM Studio carefully manages memory, in particular when multiple models come into play. There are some relevant options at the bottom of the Settings panel:
The Local LLM Service makes it possible to chat with Apertus from the LM Studio CLI, which gives me a handy way (the --stats option) to test performance. Here you can see that it takes about half a second to the first token, and the M4 is pumping out a modest but reasonable 6.46 tokens per second (for comparison, we humans on average read at 5 tokens per second):
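If you want to reproduce these numbers from a script rather than the CLI, the local server also speaks the OpenAI API (on port 1234 by default), so a few lines of Python with the openai package can time the first token and the streaming rate. A rough sketch; the model identifier below is a placeholder, use whatever LM Studio lists for your download:

import time
from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server; the API key is ignored.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.time()
first_token = None
chunks = 0

# Stream the reply so we can time the first token and count the pieces as they arrive.
stream = client.chat.completions.create(
    model="apertus-8b-instruct-2509",  # placeholder identifier
    messages=[{"role": "user", "content": "Explain MLX in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token is None:
            first_token = time.time() - start
        chunks += 1

elapsed = time.time() - start
print(f"time to first token: {first_token:.2f} s")
print(f"~{chunks / elapsed:.1f} chunks per second (roughly tokens per second)")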
You can also use other command line tools, like the llm utility:
In the talk, Simon Willison discusses LLMs on the command line, describes the LLM CLI tool in more detail, and explains how it compares to others like aichat.
Or, better yet, tap into the local power of Apertus from your own programs:
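As a sketch, not an official snippet from LM Studio's docs: with the local server running, any HTTP client can reach the OpenAI-compatible endpoint, so a plain requests call is all your program needs. The model identifier is again a placeholder, use whichever one LM Studio shows for your download.

import requests

# LM Studio's local server exposes an OpenAI-compatible REST API on port 1234 by default.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "apertus-8b-instruct-2509",  # placeholder identifier
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "What is the Apertus language model?"},
        ],
        "temperature": 0.7,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])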
System prompts
As mentioned in my previous blog post, I grew up with the early Apple machines, so as a fun exercise I decided to teach Apertus the anachronistic language of BASIC. You can do things like this by editing the System Prompt – just option click on your chat window.
Here you can insert a (carefully examined) set of instructions. These could be the recommended starting prompts from the Apertus Tech Report, that you are free to copy from the PDF or in my pre-formatted notebook:
Or, you could add a creative take, like this System Prompt for turning your LLM into a BASIC emulator:
Extra points for enabling dark mode 😎 or a cool retro terminal.
Further alternatives
While this post focuses on LM Studio, there are several other good alternatives for local inference. This overview of local LLM software features is from running-llms-locally:
Ollama is one that I already use a lot – it has a rather minimalistic design, and relies on configuration files and CLIs that you can tweak exactly to your liking. vLLM is used by the Apertus developers, and I got both running fine on the MacBook.
Right now, Apertus can only be loaded in Ollama on the command line from community Hugging Face remixes (GGUF), as described in my earlier blog post:
LM Studio also features a powerful CLI, and there are instructions out there on how to get started for your workflows. For coders, I recommend using the API as a Local Model Connection in Tabby:
Here is a guide to do the same thing with opencode:
Some other posts that were helpful for this guide:
What aspects of AI self-determination would you like me to cover in future blog posts? Feel free to get in touch if you'd like help, or want to share your own local prompting setup!