How to run large language models at home with Ollama

If you’d asked the average person on the street 14 months ago what a large language model (LLM) was, you’d probably have been met with a slightly puzzled expression. But at the end of 2022, ChatGPT’s arrival smashed LLMs into the public consciousness. Now, you’re hard-pressed to find someone who hasn’t added “AI expert” to their LinkedIn bio, let alone someone who hasn’t used or experienced an LLM.

What many people don’t realise is that running an LLM locally on your own hardware is not only possible, but easier than you might imagine and capable of surprisingly impressive results. There are many reasons you might want to do this, be it cost or privacy, or just for the sheer fun of getting your computer to do impressively cool things.

Diving into the local LLM scene can be a little overwhelming. There are several backends for running LLMs and different frontends, and then you need to get familiar with Hugging Face (the GitHub of LLMs) and figure out the myriad of models, different formats and quantisation levels and, well, you get the picture.

Thankfully, there is a wonderful tool that simplifies this tangled journey into a pleasantly smooth ride. Ollama is a CLI tool that runs on Windows, Mac or Linux, and makes downloading and running LLMs on your machine decidedly uncomplicated.

Installing Ollama

Running an LLM does not necessarily require a high-end PC with state-of-the-art GPUs, but older and slower machines with less than 8GB of RAM are definitely going to struggle. In general, 8GB will allow you to run the smallest models, and from there the more RAM, CPU and GPU power you have at your disposal, the bigger and more impressive the models you’ll be able to run.
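If you’re not sure how much RAM your machine has, a quick check along these lines should tell you. This is only a sketch, not something Ollama provides: the function names are made up for illustration, and it assumes either a Linux-style /proc/meminfo (which WSL2 also has) or the macOS hw.memsize sysctl key is available.

```shell
# Convert a size in KiB to GiB, rounded to the nearest whole number.
kib_to_gib() {
  echo $(( ($1 + 524288) / 1048576 ))
}

# Print total RAM in GiB. Hypothetical helper, not part of Ollama.
total_ram_gib() {
  if [ -r /proc/meminfo ]; then
    # Linux and WSL2: MemTotal is reported in kB
    kib_to_gib "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
  elif bytes=$(sysctl -n hw.memsize 2>/dev/null); then
    # macOS: hw.memsize is reported in bytes
    kib_to_gib "$(( bytes / 1024 ))"
  else
    echo "unknown"
  fi
}

echo "Total RAM: $(total_ram_gib) GiB"
```

The operating system usually reserves a little memory for itself, which is why the helper rounds to the nearest whole number rather than rounding down.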

For now, let’s go ahead and get Ollama installed.

If you’re on Mac, installation couldn’t be easier. Simply download Ollama and install it in your applications. On Linux and Windows (via WSL2), you can install it with this one-liner:

curl | sh

If running bash scripts downloaded from the Internet gives you the heebie-jeebies, and you’re not afraid to get your hands dirty, you can of course install Ollama manually.
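Whichever route you take, it’s worth confirming that the ollama command is actually available before moving on. A minimal sketch, assuming the binary ends up on your PATH; the status variable is only there for illustration.

```shell
# Report whether the ollama binary is on PATH, and print its version if so.
if command -v ollama >/dev/null 2>&1; then
  ollama --version
  status="installed"
else
  echo "ollama not found on PATH" >&2
  status="missing"
fi
echo "Ollama status: ${status}"
```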

Finding models to download

Ollama is not just software for running LLMs locally. It is also an online repository of LLMs that you can “pull” down onto your machine.

When you browse Ollama’s repository, you’ll discover all the different models they’ve listed. There are quite a few to choose from. At the top of the list you’ll probably see llama2, which is Meta’s open source LLM, and the one that really paved the way for a lot of the innovation in the local LLM scene. If you click on it you’ll see an overview page offering a description of the model, and a tab for “tags” which lists many different variants of the model.

I plan to do a deeper dive into the naming conventions of the model tags and what it all means in a future post. For now, there are a couple of key patterns to bear in mind.

  • Number of parameters - labels like 7b, 13b and 70b in the name refer to the number of parameters in the model, in billions. The parameters are also known as the weights. They are the result of the training process and are indicative of the model’s size and complexity. Generally, bigger means better but also more demanding on system resources. You’ll probably be able to run a 7b model without much hassle, but to run a 70b model or bigger you’re going to need a high-end computer with at least 64GB of RAM and an expensive GPU.
  • Quantisation level - suffixes like Q3, Q4 and Q5 combined with K_S, K_M and K_L refer to the quantisation level. Quantisation is the process of converting the weights from higher precision values to lower precision - a kind of compression that allows larger models to be used on lower-specced machines, with a slight impact on the overall quality of the model. The lower the number, the higher the compression and the more noticeable the impact on the model.
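To get a feel for how these two numbers interact, you can do some back-of-the-envelope arithmetic: parameters multiplied by bits per weight, divided by eight, gives an approximate size in bytes. The sketch below is my own rough helper (estimate_gb is a made-up name), and it assumes the number after the Q is roughly the bits per weight and that unquantised weights are 16-bit. Real files tend to come out somewhat larger because quantised formats carry extra metadata, and a running model needs further memory for context.

```shell
# Rough size estimate in GB: parameters (in billions) x bits per weight / 8.
# Hypothetical helper for illustration; it ignores quantisation metadata
# and the extra memory needed while the model is running.
estimate_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.2f\n", params * bits / 8 }'
}

estimate_gb 7 16    # 7b at 16-bit precision  -> 14.00
estimate_gb 7 4     # 7b at roughly 4 bits    -> 3.50
estimate_gb 70 4    # 70b at roughly 4 bits   -> 35.00
```

Treat the output as a lower bound for deciding which tags are even worth downloading, not as a prediction of how well a model will run.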

At home, I run a Mac Studio with an M1 Max processor and 32GB RAM. I find that I can just about run some 33b models with Q3_K_M, but performance is slow, and I question whether the quality is noticeably better than a 13b model with Q6. Once you’re up and running with Ollama, I guarantee you’ll chew through your monthly bandwidth quota finding the right balance between size and performance for your hardware.

Download the Dolphin Phi model

Let’s start with a model most desktops and laptops should be able to handle. The Dolphin Phi model is a fine-tuned model by Eric Hartford, based on Microsoft’s 2.7b parameter Phi-2 model. I found some frustrating quirks running the base Phi-2 model that ruled it out for this post, but Dolphin Phi seems to work nicely out of the box.

Using Ollama, pull down Dolphin Phi with this command:

ollama pull dolphin-phi

This will pull down the 2.7b-v2.6-q4_0 tag, a 1.6GB file, giving you just enough time to top up your cuppa.
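If you’d rather be explicit about which variant you get, you can append the tag to the model name after a colon. Here is a small sketch using the tag named above; the variable names are just for illustration, and the guard skips the commands if Ollama isn’t installed.

```shell
MODEL="dolphin-phi"
TAG="2.7b-v2.6-q4_0"

if command -v ollama >/dev/null 2>&1; then
  # Pull one specific variant instead of relying on the default tag
  ollama pull "${MODEL}:${TAG}"
  # Show the models now stored locally, with their sizes
  ollama list
else
  echo "ollama not found on PATH" >&2
fi
```

Swapping TAG for another entry from the model’s “tags” page is how you would try a different quantisation level of the same model.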

Chat with Dolphin Phi

Once the download completes, you can use the Ollama REPL to run the model and chat with it in your terminal.

ollama run dolphin-phi

>>> Who are you and what is your purpose?

I am an artificial intelligence designed to assist humans in various tasks by providing information, answering questions, and offering support. My purpose is to enhance the lives of people by providing quick, accurate, and reliable information, while also being accessible and friendly. My main goal is to learn from human interactions so that I can better understand your needs and offer more personalized assistance.

>>> What is the difference between a flat white and a cappucino?

A flat white is an espresso-based coffee beverage made by pouring hot water over finely ground coffee beans. It has a rich, creamy texture, similar to that of a latte or cappuccino. The main difference between a flat white and a cappuccino lies in the preparation method and proportions of the ingredients.

A cappuccino is also an espresso-based drink but consists of equal parts espresso, steamed milk, and frothed milk. It has a distinct texture and taste due to the presence of more foam compared to a flat white. Cappuccinos are often enjoyed as a breakfast or brunch option.

In summary, a cappuccino is made with an equal balance of espresso, steamed milk, and frothed milk, while a flat white has a greater proportion of coffee and less froth.

Dolphin has understood my questions perfectly well and responded with text that seemingly answers the questions. The answer to the coffee question feels a bit wishy-washy, but I don’t think it’s totally wrong.
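You don’t have to use the interactive prompt for every question. As far as I’m aware, ollama run also accepts the prompt as an argument, prints a single response and then exits, which is handy for quick experiments or scripting. A minimal sketch, with illustrative variable names and a guard in case Ollama isn’t installed:

```shell
MODEL="dolphin-phi"
PROMPT="What is the difference between a flat white and a cappuccino?"

if command -v ollama >/dev/null 2>&1; then
  # Passing the prompt as an argument prints one response and exits
  ollama run "$MODEL" "$PROMPT"
else
  echo "ollama not found on PATH" >&2
fi
```

When you are in the interactive REPL, typing /bye at the >>> prompt should end the session.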

You’ll notice with the 2.7b models that it doesn’t take much to get them to veer into making-stuff-up territory. And personally, I think it’s good to see these smaller models so obviously talking out of their artificial backsides (or “hallucinating” as it is known). It’s funny, and also a good reminder that all of these models are just language models. Making stuff up is kinda what they do - the bigger models just do a more convincing job of sounding like they know what they’re talking about.

When you’re done playing with Dolphin Phi, you should try some larger models and see what you can squeeze out of your computer. I’d recommend trying the Nous Hermes 2 models, which come in 10.7b and 34b variants. Both should show a massive improvement over Dolphin Phi.

What’s next?

Being able to run LLMs on our own machines provides a unique way to explore the inner workings of language models. And as we’ve seen, it doesn’t require an advanced degree or expensive hardware; just a bit of time, some patience, and a healthy dose of curiosity.

Ollama is a great tool for this, and we’ve only really scratched the surface. In future posts I’ll show how to import models from Hugging Face into Ollama and customise them with our own system prompts to create specialised chat assistants and role play characters; I’ll take a look at multi-modal models that are capable of understanding images; and I’ll be digging into the Ollama API and learning how we can build apps and services on top of LLMs.

Stay tuned for all of that and more. Don’t forget to subscribe to the RSS Feed and share your own experiences and feedback with me over on X.