12 Ways To Run Local LLMs And Which One Works Best For You

Image: a local llama

Large language models (LLMs) like ChatGPT, Google Bard, and many others can be very helpful. But, if you would like to play with the technology on your own, or if you care about privacy and would like to chat with AI without the data ever leaving your own hardware — running LLMs locally can be a great idea. It’s surprisingly easy to get started, and there are many options available.

Here I’m going to list twelve easy ways to run LLMs locally, and discuss which ones are best for you.

Firstly, there is no single right answer for which tool you should pick. I found that there are a few aspects that differentiate these tools, and you can decide which of those aspects you care about.

Questions to Consider

To find the right tool for you, consider these questions:

  • Are you looking to develop an AI application?
  • Are you looking to chat locally with your own documents and have a nice UI?
  • Would you like to get deeper into the intricacies of machine learning and AI?
  • Do you have a Mac, Windows, or Linux machine?
  • How much do you care about inference speed?
  • How much do you care about ease of set up?
  • How much do you care about the breadth of model support?
  • Do you care if the project is open source?
  • Are you using LLMs for roleplay?

Summary Graphic

Here is a summary graphic comparing the different tools. I did some star ratings based on my quick and subjective experience – I hope it’s helpful for your comparison. For ease of reference, I also included the number of GitHub stars the project has, if it is open source:

Local LLM Tools

Ollama

Ollama is an extremely simple, command-line-based tool for running LLMs. It’s very easy to get started with, and it can be used to build AI applications. As of this writing, it only supports Mac and Linux, not Windows.

Streaming speed is fast, and setup is probably the easiest I’ve seen. You simply download and install it from their website. To run any model, you type the following command into your CLI –

ollama run [model name]

You can then start chatting directly within the command line.
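Because Ollama also runs a local server (on port 11434 by default), you can call it from your own code or scripts. Here is a quick sketch using the REST API from Ollama’s docs – the llama2 model name is just an example, and the model has to be pulled first:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'

The response streams back as JSON lines by default.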

You can also create custom models with a Modelfile, which allows you to give the model a system prompt, set temperature, etc.
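For example, a minimal Modelfile might look something like this (the base model and values here are just placeholders):

FROM llama2
PARAMETER temperature 0.8
SYSTEM """
You are a friendly pirate. Answer every question in pirate speak.
"""

You then build and run your custom model with ollama create pirate -f ./Modelfile followed by ollama run pirate.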

UIs for Ollama

There are many community UIs built for Ollama. A non-exhaustive list includes Bionic GPT, HTML UI, Chatbot UI, Typescript UI, Minimalistic React UI for Ollama Models, Web UI, Ollamac, big-AGI, Cheshire Cat assistant framework, Amica, chatd, Ollama-SwiftUI, and MindMac. Out of all of these, Ollama WebUI seems to be the most popular. The interface is very OpenAI-like. They also have an OllamaHub where you can discover different custom Modelfiles from the community.

🤗 Transformers

Huggingface is an open source platform and community for deep learning models across language, vision, audio, and multimodal tasks. They develop and maintain the transformers library, which simplifies the process of downloading and training state-of-the-art deep learning models.

This is the best library if you have a background in machine learning and neural networks, since it offers seamless integration with popular deep learning frameworks like PyTorch and TensorFlow.

Transformers works on top of PyTorch (or, alternatively, TensorFlow), so you need to install PyTorch along with transformers.

Installation of PyTorch depends on your hardware – on the PyTorch installation page you can pick what hardware you have, and whether you have an Nvidia GPU and CUDA:

What’s cool is that if you have a MacBook with an M1/M2/M3 chip, PyTorch also has support for training on Apple Silicon through Apple’s Metal Performance Shaders (MPS).
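A quick way to check whether the MPS backend is available on your machine – just a small sanity check, not tied to any particular model:

import torch

# Prefer Apple's MPS backend if available, otherwise fall back to CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")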

After installing pytorch, you can install transformers with:

pip install transformers

Running a model only takes a few lines of code. Below is an example to run the Mistral 7B Instruct model:

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # if you have a Nvidia GPU and cuda installed

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

In terms of the breadth of model support, Huggingface is probably your best bet thanks to the Hugging Face Hub – you can find pretty much any model out there. Huggingface even maintains different leaderboards ranking LLMs.

Langchain

Langchain is a framework for building AI applications that integrates many different AI libraries. So, you can run LLMs with Langchain using Ollama, using Huggingface, or using another library.

The utility of Langchain is that it offers templates and components for building context-aware applications – meaning you can give your own documents and files to the LLM. This process is called RAG, or retrieval augmented generation. So, Langchain is a good candidate if you are building AI applications that need access to a custom dataset.
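As a small illustration, here is a sketch of calling a local Ollama model through Langchain – this assumes Ollama is already running, the mistral model has been pulled, and the langchain-community package is installed:

from langchain_community.llms import Ollama

# Point Langchain at the locally running Ollama server
llm = Ollama(model="mistral")

print(llm.invoke("Explain retrieval augmented generation in one sentence."))

From there, the same llm object can be dropped into Langchain’s chains and RAG templates.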

llama.cpp

Llama.cpp is the library that inspired most of the other libraries for running models locally. Its creators also introduced the .gguf file format, which is now supported by most other libraries.

Llama.cpp implements LLMs in pure C/C++, so that inference is very fast. It supports Mac, Windows, Linux, Docker, and FreeBSD. Apple Silicon is a first class citizen, according to the creator. Also, despite the name, it actually supports many models outside of the llama family, like Mistral 7B, but model selection is a bit limited compared to some of the other libraries.

In terms of setup, you need to clone the repo and build the project. Then, you need to download a .gguf model from Huggingface. Here is a tiny model to get you started. If you have more time, you can download this Mistral 7B Instruct gguf. There are many other options as well.
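The build steps look roughly like this (check the repo’s README for your platform, since the flags for GPU acceleration vary):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make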

After that, you can run any model with this command –

./main -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -p "Hello"

This isn’t very convenient, so you can also run models in interactive mode or with a UI. To do this, first start a local server:

./server -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -c 2048
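Once the server is running (it listens on port 8080 by default), you can also query it directly over HTTP. Here is a quick sketch against the /completion endpoint – the parameter names follow the server’s README, so double-check them against your version:

curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'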

UIs for Llama.cpp

Llama.cpp has its own UI and interactive mode. There are also a lot of community-created UIs that build on llama.cpp.

To run in interactive mode, run

bash examples/server/chat.sh

Llama.cpp also has a nice frontend. For me, it took some time to get the frontend to work, and I finally got it working by making this change within examples/server/public/completion.js (this may not be necessary in future versions):

const response = await fetch("http://localhost:8080/completion", {
    method: 'POST',
    body: JSON.stringify(completionParams),
    headers: {
        'Connection': 'keep-alive',
        'Content-Type': 'application/json',
        'Accept': 'text/event-stream',
        ...(params.api_key ? {'Authorization': `Bearer ${params.api_key}`} : {})
    },
    signal: controller.signal,
});

After making this change, run this command in the public folder:

python3 -m http.server

And you’ll get a nice front-end like this:

There is also a list of community created UIs for llama.cpp on the project’s GitHub page.

textgen-webui

Oobabooga’s textgen-webui is a very popular frontend for running local LLMs. It is very easy to install, and it is designed for roleplay, since you can create your own characters with a name, context, and profile picture.

koboldcpp

Koboldcpp is another frontend with native support for roleplay. It builds on top of llama.cpp, adding a nice UI and API. Setup is extremely easy; you can follow the instructions on GitHub. Here is what the UI looks like:

As you can tell from the UI, this is very much designed for roleplaying and games. You can select scenarios like Dungeon Crawler or Post Apocalypse, import character cards, and have persistent stories.

GPT4All

GPT4All is a large open source project that can serve many purposes. From the GPT4All landing page you can download a desktop client that lets you run and chat with LLMs through a nice GUI — you can even upload your own documents and files in the GUI and ask questions about them. If you are looking to chat locally with your own documents, this is an out-of-the-box solution.

Here is how the UI looks:

Interestingly, the UI shows the inference speed as it is “typing”, which for me was about 7.2 tokens per second on my M1 16GB MacBook Air.

In addition to the GUI, it also offers bindings for Python and NodeJS, and has an integration with langchain, so it is possible to build applications as well.
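For instance, here is a minimal sketch using the Python bindings – the model filename is just an example from the GPT4All catalogue, and the bindings download it on first use:

from gpt4all import GPT4All

# Downloads the model on first run, then loads it locally
model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")

with model.chat_session():
    print(model.generate("Name three things llamas are good at.", max_tokens=200))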

LM Studio

Similar to GPT4All, LM Studio has a nice GUI for interacting with LLMs. It is the only project on this list that is not open source, but it is free to download.

Here is what the UI looks like:

LM Studio also shows the token generation speed at the bottom – it says 3.57 tok/s for me, noticeably lower than GPT4All’s 7.2 tok/s. On top of that, there is a processing time before any tokens come out at all, which was noticeably long for me. This made the whole experience feel slower.

I wasn’t able to find a way to upload your own documents and files. There are also no Python/NodeJS bindings for operating it with code.

jan.ai

Jan.ai is a relatively new tool, launched as “an open-source alternative to LM Studio”. Here is what the UI looks like — very clean! It is in dark mode because it’s night time as I’m writing this.

llm

llm is a CLI tool and Python library for interacting with large language models. It’s very easy to install using pip (pip install llm) or Homebrew (brew install llm). The default model is OpenAI’s ChatGPT, and the tool asks you to set your OpenAI key. However, you can also download local models via the llm-gpt4all plugin.
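A rough sketch of pulling in a local model through that plugin – the model name below is just an example, so run llm models to see what is actually available on your machine:

llm install llm-gpt4all
llm models
llm -m mistral-7b-instruct-v0 "Write a haiku about running LLMs locally"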

Having an llm as a CLI utility can come in very handy. The creator gives the example of explaining a script:

cat mycode.py | llm -s "Explain this code"

Another fun fact is that llm is developed by Simon Willison, the co-creator of the Django Web Framework.

h2oGPT

h2oGPT is by h2o.ai, a company that has been building distributed machine learning tools for many years. It has a nice UI, and it’s very easy to upload documents to the chat. Once you get it all set up, it works pretty nicely. This is what the UI looks like:

There are many ways to install h2oGPT – you can install from source and pip install a long list of requirements, or you can download one-click installers for Mac and Windows. The one-click installer is much faster. For me, though, I had to run these two commands before installing:

$ xattr -dr com.apple.quarantine {file-path}/h2ogpt-osx-m1-gpu
$ chmod +x {file-path}/h2ogpt-osx-m1-gpu

This library is not just a GUI – it’s actually chock-full of features. It is also a CLI utility and an inference server for applications. It even supports voice and vision models, not just text. That’s a lot to explore!

localllm

Lastly, I’d like to talk about a very new tool from Google Cloud, announced just yesterday, Feb 6, 2024! It’s called localllm. Despite the name, it is designed with Google Cloud Workstations in mind, but you can also use it locally. If you’d like to run LLMs locally and migrate to the cloud later, this could be a good tool for you.

I tried running locally following these lines of code:

# Install the tools
pip3 install openai
pip3 install ./llm-tool/.

# Download and run the model on local port 8000
llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF 8000

# Query the model through the OpenAI-compatible API
python3 querylocal.py

The CLI command (which is also called llm, like the other llm CLI tool above) downloads the model and serves it on your local port 8000, which you can then work with through an OpenAI-compatible API.
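For reference, querying such an endpoint looks roughly like this with the openai Python package – a sketch only, assuming the server exposes the usual /v1 chat completions route (which is presumably what querylocal.py does, given that the instructions install the openai package):

from openai import OpenAI

# Point the client at the locally served model; no real API key is needed locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder; local servers often ignore this field
    messages=[{"role": "user", "content": "Say hello from a local LLM."}],
)
print(response.choices[0].message.content)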

More Tools

There are a lot more local LLM tools that I would love to try. I’m keeping a list here from the community, and will try more of them when I have time:

  • Chat with RTX by Nvidia
  • ExLlamaV2
  • vllm
  • Dify.ai

Conclusions

Having tried all of these tools, I find they are solving a few different problems. So, depending on what you are looking to do, here are my conclusions:

  • If you are looking to develop an AI application, and you have a Mac or Linux machine, Ollama is great because it’s very easy to set up, easy to work with, and fast.
  • If you are looking to chat locally with documents, GPT4All is the best out-of-the-box solution that is also easy to set up.
  • If you are looking for advanced control and insight into neural networks and machine learning, as well as the widest range of model support, you should try transformers.
  • In terms of speed, I think Ollama or llama.cpp are both very fast.
  • If you are looking to work with a CLI tool, llm is clean and easy to set up.
  • If you want to use Google Cloud, you should look into localllm.
  • For native support for roleplay and gaming (adding characters, persistent stories), the best choices are textgen-webui by Oobabooga and koboldcpp. Alternatively, you can use Ollama with custom UIs such as ollama-webui.

There are still other tools for running local LLMs, and I’m still working on reviewing the rest of them. There are also more coming out every day. If there’s any you’re particularly interested in seeing, please comment down below.

11 responses to “12 Ways To Run Local LLMs And Which One Works Best For You”

  1. Great work! Thank you.

    Can you please add Nvidia RTX to this? They recently announced their local LLM.

    https://www.nvidia.com/en-us/ai-on-rtx/chat-with-rtx-generative-ai/

    1. Thanks Ramesh! I’m working on adding more to this, including Nvidia RTX and ExLlamaV2. I need to get a Windows Nvidia setup for Nvidia RTX (I have a Mac), so I’m still working on getting something set up on the cloud.

  2. Can you please add Dify (https://dify.ai/) to this? And I want to know which local-side LLM has the best integration with RAG, Thanks

    1. Hey Kenny, sure I’ll add Dify. As for local LLM with RAG, autogen by Microsoft is very interesting – you can build with GenAI agents (and use RAG) with no code, and about a month ago they added support for open source models.

  3. Can you please add Dify (https://dify.ai/) to this?

    And I want to know which local-side LLM has the best integration with RAG?

    Thanks a lot.

  4. […] In this post, we’ll give you a way to deploy a simple backend service which wraps around OpenAI’s API, and can then be easily extended to support functionalities like RAG. The same service can also be easily swapped with other LLM providers or open source models. […]

  5. Thank you for your excellent summary and contribution. Could you tell me which tools currently allow uploading files for RAG processing? Is GPT4all an option? In your opinion, which local tool offers the best security? Considering that one might use local LLM tools for inputting sensitive personal or company data, there seems to be a concern that some tools might still require internet connectivity.

    Lastly, have you used Anything LLM: https://useanything.com? I saw someone else using it, and it seems to allow uploading personal knowledge for RAG processing.

    1. Hey JJ! Last time I checked, GPT4All and h2oGPT allowed file upload. However, many others are adding this feature, so this could have changed. I’ll check out Anything LLM! It looks like a really cool app, and it loads fast too!

      Moyi

  6. Nice info, I have been running Ollama mostly, nice to see the comparison. I have a 2013 Mac Pro but I’d like a framework that supports its GPU, or perhaps an eGPU NVIDIA card running off of the Mac Pro, do you have any insight into this?
    Thanks and keep up the good work!

    1. Hey Mike,

      Regarding your 2013 Mac Pro, it doesn’t have an Apple Silicon GPU, but it might be using an older NVIDIA GPU. If that’s the case, you can check which CUDA version is compatible with your GPU. Based on that, you can run PyTorch or Hugging Face models that are compatible with your specific CUDA version. However, finding the right compatibility might be challenging.

      If you decide to get a new NVIDIA graphics card, you’ll have many more options available for running local LLMs effectively.
