#apertus #mlx
This post describes the process of installing and using Apertus MLX builds on an Apple MacBook. It covers the installation of LM Studio, some basic performance testing, and links to more in-depth analysis and further references. If this all feels like too much and you would rather use a ready-to-go online service, see my previous post for more direct waypoints:
Still with me? 👌😁 OK!
The laptop on my desk has a current M4 chip and 24 GB of RAM (non-Pro), currently retailing at under 1500 CHF. The new MacBooks have been widely discussed (along with the Pro and mini M4 editions) as an excellent choice for users and developers. I have seen Apertus running impressively well on a Mac mini, and wondered how much mileage one could get out of a mid-range laptop with around 16 GB of memory for loading AI models. The laptop reviewed here is on loan from the AI Center at EPFL, where I've just started work 🎉
Note: the statements in my blog are my own and not those of my employer.
It's great to see Apertus making top headlines in January, let's keep this up!
Why use local AI?
2026 is the year of AI on the desktop, with every IT company rushing out updates to hardware and software that enable various levels of intelligent service. Some people find this an alarming trend - planned obsolescence is certainly not good for the environment. Meanwhile, the ability to use an AI model while offline, or to better understand and develop AI services, is a clear ask. Learning to become more proficient with local AI may also help to stem some of our wasteful dependence on heavy cloud services.
Some arguments for being aware of and minimizing the footprint of desktop AI agents were recently covered in an insightful talk by Whittaker and Tiwari, yet the actual practice of installing LLMs, and clarity around hardware requirements and capabilities, is not yet there, at least not at the consumer level. The dark patterns and underwhelming safeguards are part of a slippery slope towards losing our digital self-determination.
For many people, the “luxury” of cloud-less computing comes down to the hardware: do you have a machine that can run the model with the performance you need? Are you able to invest the time into building a Home Lab? Much progress is being made on tiny models and CPU (rather than GPU) optimization, which makes the technology available everywhere. Clearly, we need a lot of computational juice (and corresponding energy usage) to achieve some level of parity with what we are used to online – and in any case, we need to be ready to sacrifice some speed.
As an open hardware fan, I would have loved to go for a Swiss-made machine, like the Why! Laptop, whose gamer models with NVIDIA chips have enough video RAM. The StarLabs, Framework and Tuxedo laptops all take a while longer to ship. The Ryzen AI models are also ones to watch, like the Gen10 from Tuxedo: with 32 GB and the top-level HX 370 chip, it currently retails for around 1800 Euros. Over time, I aim to replace my Arch Linux-powered Latitude and home workstations with one of these slick and sustainably manufactured machines.
Being quite comfortable with Linux and Windows operating systems, I've decided to give macOS a fresh look, partly to broaden my interactions with the wider user community. Oh, that keyboard layout will take some getting used to.
Yea, yea, but how do you make a tilde (~) sign again?
What's the deal with MLX?
As announced by Apple last March, the M4 chip on the MacBook performs a cut above its predecessors, including the M3 Pro – at least in synthetic benchmarks. The edge on AI tasks is helped by interesting features, like the Scalable Matrix Extension - benchmarked and explored in more detail here:
MLX is a framework for Apple Silicon (from the M1 chip onwards) that leverages the unified memory architecture and Metal GPU acceleration for efficient CPU/GPU operations. It simplifies both inference and AI app development on the new chips, and enables local training of models on data that stays on the device, with support for low-rank adapters (LoRA) and quantized training for efficiency. Along with Apple Intelligence, it gives Mac users a ticket to secure, offline-capable AI applications for sensitive tasks. You can read all about it on the developer site:
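To get a feel for the programming model, here is a minimal sketch in Python, assuming only that the mlx package is installed: arrays live in unified memory, operations are recorded lazily, and the work is dispatched to the Metal GPU when you ask for the result.

import mlx.core as mx

# Arrays sit in unified memory, so there is no explicit copying between CPU and GPU.
a = mx.random.normal((2048, 2048))
b = mx.random.normal((2048, 2048))

# Operations are lazy: this line only records the computation.
c = a @ b

# Evaluation happens here, on the Metal GPU by default.
mx.eval(c)
print(c.shape, c.dtype)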
Through an optimization of the model weights, a file is produced in a format that works magically well on Apple's chips. As a user of LM Studio, you do not need to do this yourself, as community builds are available. As a developer, or an advanced user who wants to try a fresh model release, the procedure is relatively straightforward.
I converted the latest Apertus 8B model build to MLX using these commands; the process took about 25 minutes on the MacBook:
git clone https://github.com/ml-explore/mlx-lm.git
uv pip install mlx_lm mlx
uv run mlx_lm convert --hf-path swiss-ai/Apertus-8B-Instruct-2509

(The setuptools install didn't work quite right, hence the lazy pip install.)
In my tutorial below, I will use the mlx-community builds, which exist in several quantized forms and which you can download directly in LM Studio.
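If you would rather try an MLX build outside LM Studio, mlx_lm also offers a small Python API. A minimal sketch, assuming the mlx-community bf16 build named below (double-check the exact repository name on Hugging Face) and enough free memory to hold it:

from mlx_lm import load, generate

# Download (or reuse the cached) MLX weights from Hugging Face and load them into memory.
model, tokenizer = load("mlx-community/Apertus-8B-Instruct-2509-bf16")

# Wrap the question in the instruct model's chat template.
messages = [{"role": "user", "content": "Give me three facts about Switzerland."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True prints generation statistics such as tokens per second.
text = generate(model, tokenizer, prompt=prompt, verbose=True)
print(text)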
Getting started
In the benchmarking business, one likes to start with a clean slate: by making sure to disable Siri and Apple Intelligence on first boot, I can help to ensure that as much memory and compute capacity as possible is leveraged by the services of my choice.
I must give a hat-tip to Apple for making it quite easy to disable or re-enable these at any point: just be aware that data will be collected and shared with third parties (as in the case of the ChatGPT Extension) as soon as you accept the T&Cs.
For more tips on setting up a highly performant and secure Mac, find yourself a good Hardening Guide, and follow it. Or just ask your friendly local IT desk 😺
Installing LM Studio
Get LM Studio for free by navigating to the official website: https://lmstudio.ai/ or the open source repositories: https://github.com/lmstudio-ai (MIT license)
Note that everything I cover here is also possible with the Windows and Linux editions. You can even run LM Studio on Raspberry Pi and other ARM devices.
Once you go through the initial wizard, you should be presented with Mission Control, where you can search for models of your choice from the huge catalog at Hugging Face. Type 'Apertus' to see a list of 14 versions (at time of writing).
For use on this MacBook, I picked Apertus-8B-Instruct-2509-bf16, a remix of the model published by mlx-community. Click Download, noting that it is unquantized, offered here in the same BF16 (Brain Floating Point) precision as the stock 8B model.
Load model
It took about 15 minutes to download the model on a fast connection. When I click the Load model button in the screenshot below, a "Failed to load the model" dialog cautions me about insufficient system resources; no great surprise, since an 8B-parameter model in BF16 needs roughly 16 GB for the weights alone, which is tight on a 24 GB machine. I have disabled this guardrail (set to OFF) and have not yet seen issues with system performance. Watching my memory consumption, LM Studio has not yet used up the available 15-20 GB, even with a full context window. This warning is an issue I'll investigate further.
If all goes well, after you select the model, LM Studio will begin to load it into the memory of your machine. This warm-up process takes me about 15-20 seconds.
Now you can start chatting, uploading files, heating up the room – all in the privacy of your local GPU. It may not work as quickly or comprehensively as you are used to, but at all times you remain in control of your algorithm.
A few other gotchas to be aware of:
When I run out of context window (uploading too many files into the RAG, using a very long prompt or system prompt, or calling tools), the message "The AI has nothing to say" appears. Keep an eye on the status bar, where it is clear what is going on - that's the 143.5% full message in my screenshot:
Another is the memory usage mentioned above. I am using the btop++ utility to keep track of it in real time. Here you can see the marked drop-off in memory usage as models are unloaded and reloaded:
LM Studio carefully manages memory, in particular when multiple models come into play. There are some relevant options at the bottom of the Settings panel:
The Local LLM Service makes it possible to chat with Apertus from the LM Studio CLI, which gives me a handy way (the --stats option) to test performance. Here you can see that it takes about half a second to the first token, and the M4 is pumping out a modest but reasonable 6.46 tokens per second (for comparison, we humans on average read at 5 tokens per second):
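If you want to reproduce these numbers from a script rather than the CLI, the local server also speaks the OpenAI API (on port 1234 by default), so a few lines of Python with the openai package can time the first token and the streaming rate. A rough sketch; the model identifier below is a placeholder, use whatever LM Studio lists for your download:

import time
from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server; the API key is ignored.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.time()
first_token = None
chunks = 0

# Stream the reply so we can time the first token and count the pieces as they arrive.
stream = client.chat.completions.create(
    model="apertus-8b-instruct-2509",  # placeholder identifier
    messages=[{"role": "user", "content": "Explain MLX in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token is None:
            first_token = time.time() - start
        chunks += 1

elapsed = time.time() - start
print(f"time to first token: {first_token:.2f} s")
print(f"~{chunks / elapsed:.1f} chunks per second (roughly tokens per second)")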
You can also use other command line tools, like the llm utility:
In the talk, Simon Willison discusses LLMs on the command line, describes the LLM CLI tool in more detail, and explains how it compares to others like aichat.
Or, better yet, tap into the local power of Apertus from your own programs:
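As a sketch, not an official snippet from LM Studio's docs: with the local server running, any HTTP client can reach the OpenAI-compatible endpoint, so a plain requests call is all your program needs. The model identifier is again a placeholder, use whichever one LM Studio shows for your download.

import requests

# LM Studio's local server exposes an OpenAI-compatible REST API on port 1234 by default.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "apertus-8b-instruct-2509",  # placeholder identifier
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "What is the Apertus language model?"},
        ],
        "temperature": 0.7,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])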
System prompts
As mentioned in my previous blog post, I grew up with the early Apple machines, so as a fun exercise I decided to teach Apertus the anachronistic language of BASIC. You can do things like this by editing the System Prompt – just option click on your chat window.
Here you can insert a (carefully examined) set of instructions. These could be the recommended starting prompts from the Apertus Tech Report, that you are free to copy from the PDF or in my pre-formatted notebook:
Or, you could add a creative take, like this System Prompt for turning your LLM into a BASIC emulator:
Extra points for enabling dark mode 😎 or a cool retro terminal.
Further alternatives
While this post focuses on LM Studio, there are several other good alternatives for local inference. This overview of local LLM software features is from running-llms-locally:
Ollama is one that I already use a lot – it has a rather minimalistic design, and relies on configuration files and CLIs that you can tweak exactly to your liking. vLLM is used by the Apertus developers, and I got both running fine on the MacBook.
Right now, Apertus can only be loaded in Ollama on the command line from community Hugging Face remixes (GGUF), as described in my earlier blog post:
LM Studio also features a powerful CLI, and there are instructions out there on how to get started for your workflows. For coders, I recommend using the API as a Local Model Connection in Tabby:
Here is a guide to do the same thing with opencode:
Some other posts that were helpful for this guide:
What aspects of AI self-determination would you like me to cover in future blog posts? Feel free to get in touch if you'd like help, or want to share your own local prompting setup!