Run Language Models on Your Computer with LM-Studio
A practical guide to running local models and picking the right one for speed or accuracy.
Good Morning,
You can do so much with AI. The DIY, build-it-yourself side also keeps getting more nuanced and more powerful, and open-source AI is putting a new array of capabilities within reach even at the local, individual level.
I’m a huge fan of Benjamin Marie, and I’ve wanted to share more about his work for a long time. Today, we finally have the chance. Ben is an independent AI researcher (LLMs, NLP) with two really useful blogs, and I have huge respect for his work (don’t let the funny names fool you; these are serious resources):
The Kaitchup – AI on a Budget 🍅
Hands-on AI tutorials and news on adapting language models to your tasks and hardware in a DIY setting, using the most recent techniques and models.
The Kaitchup publishes invaluable weekly tutorials with info that’s hard to find elsewhere.
By being a paid subscriber to The Kaitchup, you also get access to all the AI notebooks (160+), hands-on tutorials, and more in-depth analyses of recently published scientific papers.
Read The Salt 🧂
Reviews and in-depth analysis of bleeding edge AI research and how-tos. The Salt is a newsletter for readers who are curious about the Science behind AI. If you want to stay informed of recent progress in AI without reading much, The Salt is for you! I do my best to offer articles that might be interesting for a wide variety of readers.
Benjamin’s technical and practical knowledge is invaluable however far down the DIY rabbit hole you want to go with models. His writing covers technical topics without being overly technical, which makes it useful for a wide range of readers experimenting with models locally, on their own or in small teams.
Selected Works
I asked him for a basic beginner’s tutorial on how to run LLMs locally (something I get questions about from time to time). He brings so much practical know-how and insight into the latest models that, for me, he is an authority: when a new model comes out, his opinion reflects both hands-on experience and an up-to-date reading of the latest scientific papers.
Qwen3-VL: DeepStack Fusion, Interleaved-MRoPE, and a Native 256K Interleaved Context Window.
Did the Model See the Benchmark During Training? Detecting LLM Contamination
Making LLMs Think Longer: Context, State, and Post-Training Tricks
Benjamin Marie is an independent researcher focused on hands-on AI and the tools around modern language models. He helps people and companies cut costs by adapting models to their specific tasks and hardware. While my own work doesn’t touch on machine-learning professionals that much, more and more individuals and small teams are playing with these open-source models locally. I hope you learn something from his guide.
So I’m very proud to be able to bring you a guide like this:
Run Language Models on Your Computer with LM-Studio
A practical guide to running local models and picking the right one for speed or accuracy.
Running large language models (LLMs) locally used to mean wrestling with the GPU’s software layer (like CUDA), scattered model formats, and a lot of trial-and-error. Today, it’s surprisingly approachable. With tools like Ollama or LM Studio, you can download a model, load it in a few clicks, and start chatting on your own machine, without sending prompts to a cloud service.
This article walks through the practical path from “installing the app” to “running my first local model,” and then zooms out to the part that really matters: what determines whether a model runs smoothly (or not) on your hardware. Along the way, we’ll cover installing LM Studio, the (simple) memory math behind model sizes, how to pick trustworthy GGUF builds and compression levels, how to sanity-check model output, and why “thinking” models can be dramatically better on hard prompts while also being noticeably slower.
The goal is not to turn you into an engineer. It’s to give you enough intuition to choose models confidently, understand what an application like LM Studio is telling you, and avoid the most common “why is this slow / why is this wrong” surprises.
By the end, you’ll (1) install LM Studio, (2) pick a model that fits your computer, and (3) understand the three parameters that control the experience: model size, quantization, and context length.
Note: I added a mini glossary at the end of the article to help you with the most technical terms.
Installing LM Studio
LM Studio is increasingly popular for running LLMs locally on your own computer. I find it very user-friendly, and it supports Windows, macOS, and Linux. It’s currently my top choice for Windows and macOS.
To install LM Studio, go to their website:
Download the installer (for example, on Windows, click “Download for Windows”). Open it, and you can complete the installation in just three clicks:
Next → Install → Finish
That’s the easy part.
Requirements to Run LLMs on Your Computer
Before we go further, it helps to understand what hardware you need to run LLMs locally. In practice, the compute and, especially, the available memory on your machine will be the main factor that determines which models you can run (and how comfortably).
Local LLM performance comes down to three things:
Memory (can it fit?)
Speed (how fast your GPU/CPU is)
Settings (context length + how long an answer you request)
LLM size in parameters
LLMs are neural networks, and their “size” is usually described by the number of parameters (think: learned weights). As a rule of thumb, more parameters often means a more capable model, though that’s not always true when comparing models across different generations.
Most of the time, the parameter count is included in the model name:
gemma-3-4b-it → ~4B (4 billion) parameters
Meta-Llama-3.1-8B-Instruct → ~8B parameters
Note: Some very large models (especially those not primarily intended for local deployment) don’t always include the parameter count in the name. In those cases, check the model card (the model’s documentation page: size, usage, limits, license) for the official number.
Model card of DeepSeek-V3.2. The red circle shows where you can find the number of parameters. DeepSeek-V3.2 has 685B parameters.
How parameters translate to memory
Bear with me as this part gets a bit technical, but we’ll get through it quickly.
Once you can estimate the parameter count, the next question is: how much memory do you need to load the model?
A simple approximation:
Many model parameters are stored in 16-bit precision.
16 bits = 2 bytes, so each parameter uses roughly 2 bytes of memory.
That means:
1B parameters → ~2 GB (just for the parameters)
4B parameters → ~8 GB
8B parameters → ~16 GB
And that’s only to load the model. To run it, you also need extra memory for things like the KV cache, temporary buffers, and runtime overhead.
KV cache is the model’s “working memory” for the current conversation; it grows as your context gets longer.
Buffers/overhead are the scratch space the program needs to do the math.
A solid rule of thumb is to budget ~20% extra on top of the model parameters.
So, for Gemma 3 4B IT, which has 4B parameters, you would need 4B × 2 bytes = 8 GB for the weights, plus ~20% overhead ≈ 10 GB of memory.
Takeaway: If a model barely fits in VRAM, it may run, but it’ll often be slower and less stable. Having a few GB of headroom helps.
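If you want to script this rule of thumb, the estimate is a one-liner. A quick sketch, where 2 bytes per parameter and 20% overhead are the approximations from this section, not exact measurements:

```python
def estimate_memory_gb(params_billions: float,
                       bytes_per_param: float = 2.0,  # 16-bit weights
                       overhead: float = 0.20) -> float:
    """Rough memory needed to load and run a model, in GB."""
    weights_gb = params_billions * bytes_per_param  # 1B params at 2 bytes ≈ 2 GB
    return weights_gb * (1 + overhead)              # ~20% extra for KV cache, buffers

for size in (1, 4, 8):
    print(f"{size}B params → ~{estimate_memory_gb(size):.1f} GB")
# → 1B ≈ 2.4 GB, 4B ≈ 9.6 GB, 8B ≈ 19.2 GB
```

The 4B result (~9.6 GB) matches the ≈10 GB figure for Gemma 3 4B IT above.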
Fortunately, we will see that LM Studio can tell whether you have enough memory to run a model.
Which “memory” matters?
This is about fast memory, not storage. Your SSD/HDD is far too slow to run LLMs.
Ideally, the model lives in GPU VRAM (graphics card memory). For example, an NVIDIA RTX 4090 has 24 GB of VRAM, which is enough for roughly a standard ~10B-class model (with some headroom for runtime overhead).
No GPU? Still possible.
You can also run LLMs on a CPU, loading the model into system RAM. It works, but it’s typically much slower, especially as models get larger.
Macs can be particularly good at this thanks to unified memory, which allows the GPU and CPU to share a large, high-bandwidth memory pool.
What I’ll use in this article
In this article, I’ll run models on a modest RTX 3060 GPU with 12 GB of VRAM. That means we’ll focus on smaller models (or quantized versions of larger ones). If 12 GB is more than your laptop or desktop has available, don’t worry, we’ll also see how to choose smaller variants of the same model family.
Launch your First LLM
After this technical aside, let’s go back to LM Studio. Start it.
The first time you launch it, if everything goes well, you’ll see this:
Don’t worry too much about the first screens, just click through. We’ll configure everything ourselves.
Once you reach the main screen, we’ll start by choosing a model. Press Ctrl+L (or click “Select a model to load”).
Let’s choose a model that can also process images.
Search for “gemma 3 4b it”.
Select the model gemma-3-4b-it-GGUF provided by Unsloth, then inspect the right panel. You’ll see there are actually several variants.
Q4_0, Q5_K_S, IQ4_NL… You’ll notice that many models come in multiple variants. These are GGUF builds: models packaged into a single file for easier local deployment. Think of GGUF as a “zip-friendly” packaging of a model designed for easy local use.
Important: I see people often forget this, but GGUF files are usually not the official releases from the original model provider. They’re most often community-made conversions, and they can behave slightly differently from the original model. They’re also typically quantized (i.e., compressed), which almost always causes some quality loss, and in rare cases can seriously degrade or even “break” a model if the quantization is too aggressive.
If a model looks unexpectedly bad in your tests, it doesn’t mean the original model is also bad.
Rule: Prefer GGUFs from well-known publishers (bartowski, unsloth, and ggml-org (the creators of GGUF)) or the original model maker.
You can see the publisher name under the model name in LM Studio (as in the screenshot above).
Sometimes, the original provider also publishes GGUF versions. For example, Google has released official GGUF builds for Gemma models, so you may find Gemma 3 GGUF published directly by Google.
Why use GGUF instead of the original model?
GGUF models are designed to be easy to use and to run well on consumer hardware, including machines without a GPU. By contrast, the “official” model is often distributed in 16-bit precision, which requires significantly more memory and typically benefits from a more optimized stack to run efficiently.
How do you choose among GGUF variants?
I explain the different quantization options in detail, and how to pick the right one, here:
Choosing a GGUF Model: K-Quants, IQ Variants, and Legacy Formats (October 13, 2025)
TL;DR: The number after “Q” in the model’s name (e.g., Q4_K_S) indicates the model’s quantization level (i.e., effective precision). Higher is usually better for quality, but it also uses more memory. LM Studio will show a green check for variants it estimates can run on your GPU (it’s a helpful guideline, but still only an estimate).
Beyond the Q level, the specific quantization format can affect speed and make a model better suited to certain use cases. The table below gives you a quick overview:
If the letters look scary, you can ignore most of them. For most people: start with Q4_K_S (balanced), then try a Q5 variant if you have spare VRAM.
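To get a feel for what the Q level costs in memory, you can estimate sizes from approximate bits per weight. The bits-per-weight figures in this sketch are rough rules of thumb of mine, not official numbers:

```python
# Approximate effective bits per weight for common GGUF quantization levels.
# These are rough rules of thumb, not official figures.
BITS_PER_WEIGHT = {"Q4_K_S": 4.5, "Q5_K_S": 5.5, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

def quantized_size_gb(params_billions: float, quant: str) -> float:
    """Approximate weight size of a quantized model, in GB."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8  # bits → bytes

for q in ("Q4_K_S", "Q5_K_S", "Q8_0", "F16"):
    print(f"Gemma 3 4B at {q}: ~{quantized_size_gb(4, q):.1f} GB")
```

This is the lever that makes a 4B model shrink from ~8 GB at 16-bit down to roughly 2–3 GB at Q4, which is why quantized builds fit comfortably on consumer GPUs.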
Let’s use Q4_K_S for Gemma 3 4B IT.
Click Download.
Then click “Load Model.” You’ll be taken to the chat interface, where you can test the model. For example, try a simple prompt: Hello!
The model responds, so it’s working.
Here’s what you’re seeing:
The answer, obviously 🙂
And some useful stats just below it:
91.30 tok/sec: the generation speed, measured in tokens per second. A token is a unit of text: an English word is often 1–2 tokens (sometimes more). Speed mainly depends on your GPU and the model size.
68 tokens: how many tokens the model generated for this reply.
0.28s to first token: roughly how long the model took to process your prompt before starting to generate.
Stop reason: EOS Token Found: the model decided it was done and emitted an end-of-sequence (EOS) token. This is usually what you want. Another common outcome is a hard stop when the model hits the maximum context length. In that case, the answer may be cut off.
Cheat sheet:
tok/sec = speed (higher is faster)
time to first token = “thinking/processing delay”
stop reason: EOS = normal ending
cut off = likely hit max context or max output
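These stats fit together with simple arithmetic: the total wait is roughly the time to first token plus the tokens generated divided by the speed. A sketch using the numbers from the reply above:

```python
def total_response_time(time_to_first_token: float,
                        tokens_generated: int,
                        tokens_per_sec: float) -> float:
    """Seconds from pressing Enter to the last token of the reply."""
    return time_to_first_token + tokens_generated / tokens_per_sec

# Numbers from the Gemma 3 4B example above.
t = total_response_time(0.28, 68, 91.30)
print(f"~{t:.2f} s end to end")  # → ~1.02 s end to end
```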
Now that we’ve covered the basics, let’s try something more fun.
Gemma 3 can read images, so you can ask questions about a picture. For example, I’ll ask it whether it recognizes this flag:
You can attach an image by clicking here, in the dialogue box:
Here is my interaction:
The model replied that this is the flag of Eritrea. If you know your flags, you’ll spot the mistake: it’s actually Equatorial Guinea.
I picked this example as a reminder that models, especially smaller local ones, can be extremely useful, but they can also be confidently wrong while still sounding perfectly plausible. Here, it seemed to read the central emblem reasonably well, but it got the color layout wrong.
When accuracy matters, do this:
Ask for sources or step-by-step checks (when appropriate)
Cross-check with a second model / quick search
Test with 3–5 small prompts before trusting a model
Let’s see whether it can recognize its mistake if we push it a bit:
The model quickly acknowledged a mistake, but then produced another wrong answer. Lesson learned: don’t rely on this model for flag recognition.
That said, it is good at many things. For example, it handled basic visual detail/character recognition reasonably well, since it could make sense of the central emblem.
Let’s try something else. On the right-hand panel, you can access a few useful model settings.
I won’t go too deep into the settings here. The most important one is probably Context. This is where you can add instructions about how you want the model to behave, for example:
details about you or your project,
the tone you want (concise, friendly, formal, etc.),
rules the model should follow.
A system prompt is the “highest priority instruction” that sets the model’s role.
Let’s try this as a system prompt:
You are a professional translator. Whatever I say, you translate it into Japanese. You write nothing else.
And it works:
In fact, Gemma 3 is quite strong in multilingual tasks. Even though it’s about a year old, it’s still one of the best models for translating English into other languages.
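A side note for readers who want to script interactions like this: LM Studio can also expose the loaded model through an OpenAI-compatible local HTTP server. The sketch below only builds the request body, so it runs without the server; the address shown (http://localhost:1234) is LM Studio’s default, and the model name must match whatever you loaded:

```python
import json

# Chat request in the OpenAI-compatible format that LM Studio's local server accepts.
payload = {
    "model": "gemma-3-4b-it",  # must match the model loaded in LM Studio
    "messages": [
        {"role": "system",
         "content": ("You are a professional translator. Whatever I say, "
                     "you translate it into Japanese. You write nothing else.")},
        {"role": "user", "content": "Good morning!"},
    ],
}

body = json.dumps(payload)
# To actually send it (requires the local server to be running):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:1234/v1/chat/completions",
#       data=body.encode(), headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
print(body[:80], "...")
```

The system message plays exactly the role of the Context field in the UI: it sets the model’s behavior before any user turn.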
Running Thinking Models
Now let’s try a more recent type of model that’s often much more accurate on difficult tasks: a thinking model.
Many commercial models today are “thinking” models. In practice, they generate an internal chain of thinking (a thinking trace), which is often hidden from the user. This extra internal work can dramatically improve answer quality.
There are a few common patterns behind the improvement:
Decomposition: The model breaks a problem into smaller pieces (even if you didn’t ask it to).
Search + self-check: It tries a path, notices something doesn’t fit, and revises.
Constraint tracking: It keeps more “rules” in mind (especially helpful for multi-step logic, coding, math word problems, and planning).
Less “first-token anchoring”: Fast models can commit early to an answer and then rationalize it. Thinking models are often better at not locking in too soon.
The main downside is speed. Some of the best thinking models may generate 50,000+ tokens internally before producing a final response. If your GPU runs at 100 tokens/sec, a quick calculation gives:
50,000 / 100 = 500 seconds → about 8 minutes
So… patience helps.
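The same arithmetic works for any thinking budget and GPU speed. A tiny sketch (the token count is illustrative):

```python
def thinking_wait_minutes(thinking_tokens: int, tokens_per_sec: float) -> float:
    """Minutes spent generating an internal thinking trace before the answer."""
    return thinking_tokens / tokens_per_sec / 60

print(f"{thinking_wait_minutes(50_000, 100):.1f} min")  # → 8.3 min
```

This is a useful sanity check before launching a big thinking model: halving the thinking budget or doubling the generation speed halves the wait.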
Let’s try one: Qwen3 4B Thinking. This is one of the best small thinking models that you can find today, made by the Qwen team (Alibaba).
Note: If you don’t want to wait for thinking, there’s also an “Instruct” version (similar in behavior to the Gemma 3 4B we just used). It doesn’t “think,” so it responds much faster, usually at the cost of lower accuracy on harder prompts.
Use Instruct (fast) if: chat, rewriting, summarizing, quick ideas
Use Thinking (slow) if: multi-step, constraints, debugging, planning, ambiguous tasks
If unsure: start Instruct → switch to Thinking if you get mistakes
Enter “Qwen3 4B Thinking” in the model search text box:
Let’s take the official one released by the Qwen team. I have selected the variant Q4_K_M as shown in the screenshot above. Feel free to try a different one.
Let’s press CTRL+R for a new chat:
It took much more time. The model “thought” about an appropriate answer first, and then answered. You can see its thinking trace by clicking on “> Thought…”:
As a general rule, don’t try to make sense of a model’s thinking trace. For most models today, these traces are readable tokens that look meaningful to humans, but making sense to us is not their purpose. A trace may include:
false starts
contradictions
unsafe content
“confident-sounding” nonsense
So: treat it like watching someone mutter while solving a puzzle. Interesting, sometimes useful, but not authoritative.
Let’s try to prompt it with something more interesting:
That’s a good and detailed answer. It took 5.8 seconds for the model to think about it.
When a thinking model is worth it
Thinking models shine when your prompt has at least one of these properties:
Multi-step (needs intermediate results)
Constraint-heavy (“must include X, avoid Y, format Z”)
Ambiguous (needs disambiguation and a cautious approach)
Debugging (code or logical reasoning with error checking)
Long-horizon planning (tradeoffs, ordering, dependencies)
Examples that often benefit:
“Here are 12 requirements. Design a plan that satisfies all of them.”
“Find the bug in this function and explain how you know.”
“Solve this logic puzzle; verify the result.”
When it’s not worth it
For many everyday tasks, thinking is wasted compute:
casual chat
simple rewriting
summarizing short text
“what is X?” definitions
brainstorming lots of ideas quickly
In those cases, the Instruct version is usually the better experience.
If waiting becomes annoying, you can often cut latency by changing how you ask:
Ask for a shorter output
Provide cleaner constraints (less ambiguity = less internal searching)
Use a smaller context (remove irrelevant text)
Choose a smaller thinking model (or a lighter quantization)
If your UI supports it: reduce max output tokens
A surprisingly common cause of slowness isn’t the “thinking” itself, it’s the model generating a huge answer you didn’t need (like in my last interaction, where the model gave me a lot of information I didn’t ask for).
Conclusion
At this point, you’ve done the essential thing: you’ve taken a local model from download to a working chat, and you’ve learned what’s happening under the hood well enough to make informed choices.
The big takeaway is that “can I run this model?” is mostly a memory question. Parameter count and precision translate directly into how much GPU VRAM (or CPU RAM) you need, and quantization is the lever that makes modern models usable on consumer hardware without requiring a data-center GPU.
You’ve also seen the tradeoffs that define local LLMs in practice. GGUF builds are convenient and often excellent, but they’re usually community conversions, and more aggressive quantization can quietly shave off quality or occasionally destabilize a model.
Finally, thinking models add a new dimension: they can be much more reliable on difficult tasks because they spend extra compute “thinking,” but that reliability comes with latency. In other words, local LLMs aren’t a single experience, they’re a toolkit. Sometimes you want fast and good-enough. Sometimes you want slower but more careful. Once you understand that, you can pick the right model for the job.
Mini glossary
LLM (Large Language Model): An AI model trained on large amounts of text that can generate and understand language.
Run locally: Running the model on your own computer instead of on a cloud server.
GPU: A processor optimized for parallel math; often much faster than a CPU for LLMs.
CPU: Your computer’s general-purpose processor; can run LLMs but usually slower.
VRAM: The GPU’s dedicated fast memory (critical for fitting and running models on the GPU).
RAM (system memory): The computer’s main memory used by the CPU (and sometimes for running models without a GPU).
Unified memory: A design (common on Macs) where CPU and GPU share one memory pool.
Compute: The amount of processing work required to run a model.
Model format: The file packaging standard for model weights (some tools support only certain formats).
Parameters / weights: The learned numbers inside the model that store its capabilities.
Parameter count (e.g., 4B, 8B, 685B): How many parameters the model has; “B” means billions.
16-bit precision / 16-bit weights: Storing model numbers with 16 bits each; higher quality but higher memory use.
Quantization: Compressing a model by storing weights with fewer bits (less memory, often faster, usually some quality loss).
Quantized model: A model that has been quantized (compressed).
Effective precision: The “real” precision after quantization (how many bits the weights effectively use).
GGUF: A popular single-file format for local LLMs, especially for easy desktop usage.
GGUF build: A specific packaged GGUF file/version of a model.
Community conversion: A GGUF (or other format) created by someone other than the original model maker by converting it.
Original/official release: The model files published by the model’s creators.
Publisher (in LM Studio): The account/group that uploaded that specific model file (not necessarily the original creator).
KV cache: Extra memory used during generation to “remember” the conversation efficiently; it grows with context length.
Temporary buffers: Scratch memory used for intermediate calculations while the model runs.
Runtime overhead: Extra memory the app/runtime needs beyond the model weights (caches, buffers, bookkeeping).
Optimized stack: The combination of software components that make inference fast (runtime + libraries + GPU kernels).
Token: A chunk of text the model processes (often ~1–2 English words, but varies).
Tokens/sec (tok/sec): How fast the model generates text (speed metric).
Time to first token: How long it takes before the model starts outputting (latency metric).
EOS token (end-of-sequence): A special token meaning “I’m done.”
Context: The text the model can “see” while answering (instructions + chat history + your prompt).
Maximum context length: The maximum number of tokens the model can consider at once; hitting it can cut off outputs.
System prompt: A high-priority instruction that sets the model’s role/rules.
Reasoning / thinking model: A model variant that does extra internal work before answering, often improving accuracy on hard tasks.
Reasoning trace / chain of reasoning: The model’s internal step-by-step thought process; sometimes shown, often hidden.
Decomposition: Breaking a problem into smaller steps.
Search + self-check: Trying an approach, checking it, and revising if it doesn’t fit.
Constraint tracking: Keeping multiple rules/requirements in mind while generating.
First-token anchoring: A failure mode where the model commits too early to an answer and then rationalizes it.
Instruct model: A variant tuned to follow instructions and respond quickly (often less “deliberate” than thinking models).
Latency: How long you wait for the answer (includes thinking time + generation time).