Running Any GGUF Model from Hugging Face with Ollama

Introduction

The latest Ollama update makes it easier than ever to run quantized GGUF models from Hugging Face directly on your local machine. With a single command, you can run any GGUF model hosted on Hugging Face; the model no longer needs to be published separately on the Ollama Model Hub.


Step-by-Step Guide

1. Install Ollama

  • Download and install Ollama on your computer. Once installed, the ollama command will be accessible from your command line interface (CLI).
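To confirm the installation, you can print the installed version from your terminal:

ollama --version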

2. Select a Model from Hugging Face

  • Go to the Hugging Face Model Hub and choose a model that provides GGUF files. For the best performance on a local setup, consider selecting a smaller model.

3. Copy the Model Link

  • Find the model’s URL, which includes the username and model name (for example, https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF).

4. Run the Model with Ollama

  • Open your CLI and use this command to run the model directly:
ollama run hf.co/<username>/<model_name>:latest

# Example
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
>>> hi
How are you today? Is there something I can help you with or would you like to chat?

Replace <username> and <model_name> with the respective values from the link.

5. Specify a Different Version (Optional)

  • If you prefer a different quantized version, such as Q8_0, append the quantization tag after the model name in your command, as shown below.
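For example, to pull the Q8_0 quantization of the model used above (assuming the repository provides a file with that tag):

ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0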

6. Run and Download

  • Once you run the command, Ollama downloads the specified model from Hugging Face and makes it available for local use; you can confirm the download as shown below.
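To verify the model was pulled successfully, list your local models; the new entry should appear under its hf.co name:

ollama list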

Why This Update Matters

Being able to run Hugging Face models locally with Ollama gives developers far more flexibility for model experimentation and deployment. No additional steps or setup on the Ollama Model Hub are required, which saves time and simplifies the workflow.

Connecting to the Ollama API from Other Devices

To connect other devices to your Ollama instance and use the API, you’ll need to configure two environment variables that control the network interface Ollama listens on and the allowed origins for incoming requests. These steps apply to both Windows and Linux.

1. Setting Environment Variables

  • On Linux: Use the export command to set the environment variables (examples follow this list):
    • OLLAMA_HOST: Set this variable to "0.0.0.0" so Ollama listens on all network interfaces and devices on your local network can connect. For restricted access, specify a particular interface's IP address instead.
    • OLLAMA_ORIGINS: Set to "*" for unrestricted access from any origin, which is convenient for development. In production, restrict this to specific origins to improve security.
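A minimal example for Linux (Bash-style shell):

export OLLAMA_HOST="0.0.0.0"   # or a specific interface IP for restricted access
export OLLAMA_ORIGINS="*"      # tighten to specific origins in production

On Windows, one option is to set the same variables persistently with setx from a Command Prompt (open a new terminal session afterwards for the change to take effect):

setx OLLAMA_HOST "0.0.0.0"
setx OLLAMA_ORIGINS "*"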

2. Restart Ollama

  • Restart the Ollama service to apply these changes. This can usually be done through the Ollama interface or with the relevant command for your OS, as sketched below.
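On a Linux machine where Ollama was installed via the official script, it runs as a systemd service, so for example:

sudo systemctl restart ollama

If you start the server manually instead, stop it and relaunch it with ollama serve so the new environment variables are picked up.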

After restarting, other devices on your network can connect to the API using the IP address or hostname of the machine running Ollama, along with the port number (default is 11434).
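For example, from another device on the same network you can list the host's available models (192.168.1.50 is a placeholder; substitute the actual address of the machine running Ollama):

curl http://192.168.1.50:11434/api/tags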


Example: Calling the Chat Completion Endpoint with a GGUF Model

With Ollama configured to accept connections from other devices, you can interact with your models using the chat completion endpoint.

  • Endpoint: POST /api/chat
  • Base URL: The default is http://localhost:11434 (replace localhost with the host machine's IP address when calling from another device)

Example request using curl:

curl http://localhost:11434/api/chat \
  -d '{
    "model": "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather like today?"
      }
    ],
    "stream": false
  }'
 
 {"model":"hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF","created_at":"2024-11-01T15:48:46.168058Z","message":{"role":"assistant","content":"However, I'm a large language model, I don't have real-time access to current weather conditions. But I can suggest some options for you to find out the current weather:\n\n1. **Check online weather websites**: You can check websites like AccuWeather, Weather.com, or the National Weather Service (NWS) for current and forecasted weather conditions.\n2. **Use a search engine**: Type \"weather today\" or \"current weather in [your city/state]\" to find the latest information on the web.\n3. **Check your smartphone's weather app**: Many smartphones come with built-in weather apps that provide real-time weather updates.\n\nIf you'd like, I can give you some general information about typical weather conditions in different parts of the world. Just let me know!"},"done_reason":"stop","done":true,"total_duration":5455977456,"load_duration":23511786,"prompt_eval_count":17,"prompt_eval_duration":37940000,"eval_count":159,"eval_duration":5393094000}⏎

In this example, a POST request is sent to the /api/chat endpoint with JSON data specifying the model ("hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF") and a user message. The response contains the model's generated output in the message.content field.
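If you have jq installed, you can extract just the assistant's reply from that field; for example:

curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "What is the weather like today?"}],
    "stream": false
  }' | jq -r '.message.content'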
