Running LLMs on your own hardware: what's possible

When most people think "AI," they think of a chatbot in a browser, talking to a giant model in someone else's data centre. But you can also run capable AI models entirely on your own machines — no cloud, no subscription, and not a single byte of your data leaving the building.

These are called local LLMs (large language models), and they've quietly become good enough for real work. For businesses handling sensitive data — legal, medical, financial, or anything covered by strict privacy rules — that's a big deal. Here's what's actually possible today, and what it takes to run it.

Why run a model locally at all?

Privacy. The data never leaves your premises. For confidential documents, client records, or anything you simply can't send to a third party, this is often the whole reason.
No per-use fees. Cloud AI typically bills per token or per request. A local model has an upfront hardware cost, then runs as much as you like for the price of electricity.
It works offline. No internet connection needed, no dependence on a provider's uptime, no "the API is down today."
Control. You decide which model, which version, and when it changes. Nothing updates underneath you and breaks a workflow.

The trade-off: a local model is usually a step behind the very best cloud models, and you're responsible for the hardware. For many tasks, that gap doesn't matter at all.
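The "no per-use fees" point is easy to sanity-check with simple arithmetic. The sketch below is a rough illustration, not a pricing model: the function name and every figure in it are placeholders chosen for the example, and it ignores things like maintenance time and hardware resale value.

```python
import math


def breakeven_months(hardware_cost, monthly_cloud_bill, monthly_power_cost):
    """Months until a one-off hardware purchase is repaid by avoided cloud fees.

    Returns math.inf when the cloud bill is not higher than the running cost,
    meaning the hardware never pays for itself on cost alone.
    """
    monthly_saving = monthly_cloud_bill - monthly_power_cost
    if monthly_saving <= 0:
        return math.inf
    return hardware_cost / monthly_saving


# Placeholder figures for illustration only (any single currency).
months = breakeven_months(
    hardware_cost=2400.0,
    monthly_cloud_bill=250.0,
    monthly_power_cost=50.0,
)
print(f"Break-even after about {months:.1f} months")
```

Plug in your own cloud bill and hardware quote. If usage is low, the break-even point can be years away, and privacy rather than cost becomes the deciding factor.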

What local models can realistically do

Today's open models — the Llama, Mistral, Qwen, and Gemma families, among others — handle a wide range of everyday business tasks well:

Summarising documents, meetings, and long email threads
Drafting and rewriting routine text
Answering questions over your own documents (private knowledge base)
Extracting structured data from messy text and forms
Classifying, tagging, and routing incoming information
Transcribing audio privately (speech models run locally too)

Where they're weaker is the very hardest reasoning, niche expert knowledge, and the largest, most complex tasks — that's still where the frontier cloud models pull ahead. The honest rule of thumb: if the task is routine and high-volume, local is often perfect. If it's a one-off that needs the absolute best answer, the cloud may still win.

The hardware you actually need

This is the part everyone wants a straight answer on. The single most important component is the GPU (graphics card), specifically how much VRAM (video memory) it has — because the whole model has to fit in memory to run quickly. Model size is measured in "parameters" (e.g. 7B = 7 billion), and bigger models are more capable but need more memory.

Models are usually "quantised" (stored at lower numeric precision, commonly 4 to 8 bits per parameter instead of 16) to run in less memory with little quality loss, so these are realistic working figures:
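As a back-of-the-envelope check, memory needed is roughly the parameter count times the bytes stored per parameter, plus some headroom. The sketch below assumes a flat 20% overhead for context and runtime buffers. That figure is an assumption for illustration, not a measured value, and real usage varies with context length and the software used.

```python
def estimate_model_memory_gb(params_billions, bits_per_param, overhead_fraction=0.2):
    """Rough memory footprint of a model in GB.

    params_billions:   model size, e.g. 7 for a 7B model
    bits_per_param:    16 for unquantised half precision, 8 or 4 when quantised
    overhead_fraction: ASSUMED headroom for context and buffers (20% by default)
    """
    # 1 billion parameters at 8 bits each is about 1 GB of weights.
    weights_gb = params_billions * bits_per_param / 8
    return weights_gb * (1 + overhead_fraction)


for size in (7, 13, 30, 70):
    row = ", ".join(
        f"{bits}-bit: {estimate_model_memory_gb(size, bits):.1f} GB"
        for bits in (16, 8, 4)
    )
    print(f"{size}B -> {row}")
```

The pattern matters more than the exact numbers: halving the bits per parameter halves the memory, which is why a quantised model can fit on a card that the unquantised version would overflow.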

Entry level — a single consumer GPU

Hardware: one gaming-grade GPU with 12–16 GB VRAM (e.g. an RTX 4060 Ti 16 GB or a 12 GB RTX 4070).
Runs: 7B–8B models comfortably, and 13B models when quantised.
Good for: a small team's everyday tasks — summaries, drafting, document Q&A. A capable workstation, roughly the cost of a decent laptop.

Serious workstation

Hardware: a high-end GPU with 24 GB VRAM (e.g. RTX 4090 / 3090), 32 GB+ system RAM, fast SSD.
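Runs: 30B-class models smoothly when quantised to around 4 bits; 70B models only with aggressive quantisation (roughly 2 to 3 bits) or by offloading part of the model to system RAM, at some cost in quality or speed.
Good for: a power user or a shared model serving a department. This is the sweet spot for most businesses getting serious about local AI.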

Small server — the whole company

Hardware: dual high-VRAM GPUs, or a professional card (48 GB+), running as a shared on-prem server.
Runs: large 70B models well, serving many people at once.
Good for: an organisation that wants one private AI service everyone uses, with no data ever leaving the network.

A note on Apple: modern Macs with Apple Silicon (M-series) and lots of "unified memory" — 32 GB, 64 GB, 128 GB — are surprisingly good for this, because the model can use that shared memory pool. A well-specced Mac is a genuinely practical way to run sizeable local models without building a GPU rig.

The rule to remember: VRAM is the ceiling. A bigger, faster GPU helps, but if the model doesn't fit in memory, it either won't run or will crawl. Buy memory first.
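The "VRAM is the ceiling" rule can be written as a simple fit check. This is a minimal sketch: the tier capacities are taken from the three tiers described above, while the tier labels, function names, and the 20% overhead allowance are assumptions made for the example.

```python
# VRAM per tier in GB, using the upper figure quoted for each tier above.
TIERS_GB = {
    "entry (16 GB)": 16,
    "workstation (24 GB)": 24,
    "small server (48 GB)": 48,
}


def required_gb(params_billions, bits_per_param, overhead_fraction=0.2):
    """Approximate memory needed: weights plus an ASSUMED 20% headroom."""
    return params_billions * bits_per_param / 8 * (1 + overhead_fraction)


def smallest_fitting_tier(params_billions, bits_per_param):
    """Return the cheapest tier whose VRAM holds the model, or None if none does."""
    need = required_gb(params_billions, bits_per_param)
    for name, vram in sorted(TIERS_GB.items(), key=lambda item: item[1]):
        if need <= vram:
            return name
    return None


for size, bits in [(8, 4), (13, 4), (30, 4), (70, 4), (70, 16)]:
    tier = smallest_fitting_tier(size, bits)
    print(f"{size}B at {bits}-bit: {tier or 'does not fit any tier listed'}")
```

Under these assumptions a 70B model at 4 bits needs the 48 GB tier, while the same model unquantised fits none of them. That is the practical meaning of "buy memory first".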

You don't have to host it yourself, either

"Local" doesn't have to mean a machine under your desk. A middle path is renting a private cloud GPU or using a self-hosted setup on infrastructure you control — you still get isolation and ownership, without buying hardware up front. We help clients pick whichever sits right for their privacy needs and budget.

The bottom line

Local LLMs have crossed the line from "interesting experiment" to "genuinely useful for real business work" — especially when privacy is non-negotiable. You don't need a data centre. For a lot of teams, a single good workstation runs a private AI assistant that never phones home. The trick is matching the model and the hardware to the job, rather than buying the biggest of everything.

Curious whether a private, local AI setup fits your business?

We've built fully offline AI tools before (it's literally what WhisperWindows does). Let's talk about what would work for your data and budget.
