In my last blog post, I reviewed various options for local LLM inferencing on an ARM-based Ampere A1 Compute instance with 4 cores and 24 GB RAM running Linux. I concluded that the Ampere optimized Ollama strikes a good balance of performance and ease of use in our scenario.
In this post, I will discuss the tooling: CLI tools, cURL for API interactions, and programmatic access via Python bindings. We'll run LLMs locally with Ollama (both the Ampere optimized version and the original) and Docker Model Runner.
Docker Model Runner
Install
Update system and install the Docker Model Runner plugin:
sudo dnf update
sudo dnf install docker-model-plugin
For Ubuntu, just replace dnf with apt-get.
We can verify the installation with docker model version.
The docker model command is now available alongside regular Docker commands like docker run, docker ps, etc.
Deploy an AI Model
Now let's deploy an AI model from Docker Hub with a single command, docker model run:
docker model run ai/qwen2.5:3B-Q4_K_M
This command automatically performs several actions:
- Downloads the qwen2.5:3B model with Q4_K_M quantization
- Starts an inference engine (llama.cpp)
- Serves the model on port 12434, accessible via a REST API
- Launches an interactive chat CLI
To exit the chat, type /bye.
Some useful commands (that bear similarity to docker):
- docker model ps: shows all running models, similar to how docker ps shows running containers. You should now see the ai/qwen2.5:3B-Q4_K_M model listed and running.
- docker model ls: lists models pulled to your local environment with more information. For example:
MODEL NAME PARAMETERS QUANTIZATION ARCHITECTURE MODEL ID CREATED SIZE
ai/qwen2.5:3B-Q4_K_M 3.09 B IQ2_XXS/Q4_K_M qwen2 41045df49cc0 5 months ago 1.79 GiB
ai/llama3.2:3B-Q4_0 3.21 B Q4_0 llama da80a841836d 4 months ago 1.78 GiB
- docker model rm <model>: remove a local model.
- docker model inspect <model>: display detailed information about one model. For example:
$ docker model inspect ai/qwen2.5:3B-Q4_K_M
{
"id": "sha256:41045df49cc0d72a4f8c15eb6b21464d3e6f4dc2899fe8ccd9e5b72bdf4d0bf9",
"tags": [
"ai/qwen2.5:3B-Q4_K_M"
],
"created": 1744119140,
"config": {
"format": "gguf",
"quantization": "IQ2_XXS/Q4_K_M",
"parameters": "3.09 B",
"architecture": "qwen2",
"size": "1.79 GiB"
}
}
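If you want that metadata programmatically, a small sketch like the following works, assuming the inspect output is plain JSON as shown above and the Docker CLI is on the PATH:
import json
import subprocess

# Run `docker model inspect` and parse its JSON output
raw = subprocess.run(
    ["docker", "model", "inspect", "ai/qwen2.5:3B-Q4_K_M"],
    capture_output=True, text=True, check=True
).stdout

info = json.loads(raw)
print(info["config"]["quantization"], info["config"]["size"])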
Inferencing
Now let's move on from chatting in the CLI to accessing the model via the REST API.
Construct the base URL
Model Runner’s OpenAI‑compatible endpoint has this format:
http://<host>:<port>/engines/<engine-name>/v1/...
We will fill in each part in the next steps.
<host>:<port>
Find the container that was started by docker model run with docker ps. Note the following:
- the <container-id>
- the PORTS column, e.g. 0.0.0.0:12434->8000/tcp, where 12434 is the port published on the host.
Check docker logs <container-id>. In the first few lines, Model Runner prints something like Listening on http://0.0.0.0:12345. That's your base URL; replace 0.0.0.0 with localhost if you're calling from the same machine. In our case the endpoint is http://localhost:12434/.
If you're accessing Model Runner from another container, use http://model-runner.docker.internal/ as the endpoint instead.
<engine-name>
Model Runner always scopes endpoints by <engine-name>. Currently it only supports llama.cpp, which covers most GGUF-format models. There is talk of future support for other engines, such as vllm for Hugging Face Transformers/PyTorch models.
If you're curious, you can check docker logs <container-id> for a line like Starting engine: llama.cpp or msg="Loading llama.cpp backend runner", which confirms that the engine name is llama.cpp.
Now we have the complete base URL: http://localhost:12434/engines/llama.cpp/v1/, with the /v1/... part following the same schema as OpenAI's API.
cURL
Now that we have figured out how to connect, let's do a simple curl to list all models.
curl http://localhost:12434/engines/llama.cpp/v1/models
You should get JSON back listing the model(s) you've loaded:
{
"object":"list",
"data":[
{
"id":"ai/qwen2.5:3B-Q4_K_M",
"object":"model",
"created":1742816981,
"owned_by":"docker"
},
{
"id":"ai/llama3.2:3B-Q4_0",
"object":"model",
"created":1745777589,
"owned_by":"docker"
}
]
}
We can then generate some text!
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json"\
-d '{
"model": "ai/qwen2.5:3B-Q4_K_M",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Write a haiku about Docker." }
]
}'
The /chat/completions endpoint follows the same schema as OpenAI's API, so you can reuse existing client code. You can also hit /completions for plain-text completion or /embeddings for vector embeddings.
Python
First, install the OpenAI Python client: pip install openai. The api_key can be any string, as Model Runner doesn't validate it.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:12434/engines/llama.cpp/v1",
api_key="not-needed"
)
resp = client.chat.completions.create(
model="ai/qwen2.5:3B-Q4_K_M",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about Docker."}
]
)
print(resp.choices[0].message.content)
Save this as app.py and run python app.py. It will send our prompt to Docker Model Runner, which loads the specified model if needed and returns the completion generated by the model.
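On a 4-core instance, full completions can take a while, so streaming the tokens as they are generated makes the wait feel shorter. A minimal sketch, assuming the same base URL and model as above:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed"
)

# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="ai/qwen2.5:3B-Q4_K_M",
    messages=[{"role": "user", "content": "Write a haiku about Docker."}],
    stream=True
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
The same client also covers the /embeddings endpoint mentioned earlier, assuming the llama.cpp backend has embeddings enabled for the loaded model (a dedicated embedding model would give more useful vectors):
emb = client.embeddings.create(
    model="ai/qwen2.5:3B-Q4_K_M",
    input="Docker Model Runner on an Ampere A1 instance"
)
print(len(emb.data[0].embedding))  # dimensionality of the returned vector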
Ampere optimized Ollama
Ampere has tweaked the original Ollama engine for performance improvements. We’ll see if the claim is true in our next post.
First, let’s get it running.
We will use the qwen2.5:3b model with the same quantization as the one from Docker Hub.
We will run it as a container
docker run --privileged=true --name ollama -p 11434:11434 amperecomputingai/ollama:latest
In a separate shell, run docker exec -it ollama bash, then ollama run qwen2.5:3b to pull and run the model. The model is stored in ~/.ollama/models.
Similar to Docker Model Runner, Ollama exposes a simple REST API (e.g., at http://localhost:11434) that's OpenAI-compatible. You can query it directly with Python libraries like openai or requests. We can reuse the above code and just change the base URL to
base_url="http://localhost:11434/v1"
There's also a Python binding for programmatic access.
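For instance, with the ollama Python package (pip install ollama), a minimal sketch looks like this; it assumes the container above is running on port 11434 and that qwen2.5:3b has been pulled:
from ollama import Client

# Point the client at the Ampere optimized Ollama container on port 11434
client = Client(host="http://localhost:11434")

resp = client.chat(
    model="qwen2.5:3b",
    messages=[{"role": "user", "content": "Write a haiku about Docker."}]
)

print(resp["message"]["content"])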
Original Ollama
For comparison, we can also run the original Ollama engine as a container
docker run -d --privileged=true -v ollama:/root/.ollama -p 11400:11434 --name ollama2 ollama/ollama
docker exec -it ollama2 ollama run qwen2.5:3b
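This container publishes Ollama's API on host port 11400, so the same OpenAI-compatible client pattern works; just point it at that port. A minimal sketch, assuming the container and model above are running:
from openai import OpenAI

# The original Ollama container above maps host port 11400 to Ollama's 11434
client = OpenAI(base_url="http://localhost:11400/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen2.5:3b",
    messages=[{"role": "user", "content": "Write a haiku about Docker."}]
)
print(resp.choices[0].message.content)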
To run it as a standalone app for ARM64:
- Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
- Run a model server: ollama serve (background) or ollama run <model> for interactive testing.
- For web app serving: run Ollama as a daemon (systemd service) for always-on availability. Your app can handle routing, authentication, etc., while Ollama manages inference.
- Customization: use the OLLAMA_NUM_PARALLEL env var to limit concurrent requests (e.g., 1-2 for A1 Flex's 4 cores).
A Python app to chat with both services
Now that we know the endpoints for both Ampere optimized Ollama and Docker Model Runner, we can build some interactive functionality into our previous Python script so we can pick a model to chat with. We will take advantage of the OpenAI API compatibility of both services to streamline and reuse code.
When you run the following script, it will:
- Query both Docker Model Runner and Ampere optimized Ollama for their available models using OpenAI's client.models.list(), which calls the GET /v1/models endpoint and returns the list of loaded models in OpenAI-style JSON.
- Display each model ID with a number and its client.
- Let you choose one, and use that model for the chat request.
- Prompt you for your message, which is inserted into the "role": "user" message.
- Send the request and print the reply.
- Let you pick a different model without restarting.
from openai import OpenAI
# --- Config ---
# Adjust MODEL_RUNNER_BASE depending on where you run it:
# From another container: use `host.docker.internal`
# On host: use `localhost`
MODEL_RUNNER_BASE = "http://localhost:12434/engines/llama.cpp/v1"
OLLAMA_OPENAI_BASE = "http://localhost:11434/v1" # Ollama in OpenAI mode
API_KEY = "not-needed"
# 1. Create OpenAI clients for both backends
mr_client = OpenAI(base_url=MODEL_RUNNER_BASE, api_key=API_KEY)
ollama_client = OpenAI(base_url=OLLAMA_OPENAI_BASE, api_key=API_KEY)
def get_models(client, source_name):
try:
resp = client.models.list()
return [{"id": m.id, "source": source_name, "client": client} for m in resp.data]
except Exception as e:
print(f"Error fetching models from {source_name}: {e}")
return []
# 2. Fetch available models from both sources
models = get_models(mr_client, "model_runner") + get_models(ollama_client, "ollama")
if not models:
print("No models found. Make sure Model Runner and/or Ollama are running in OpenAI mode.")
exit(1)
while True:
# 3. Show combined list and let user choose
print("\nAvailable models:")
for idx, m in enumerate(models, start=1):
print(f"{idx}. {m['id']} ({m['source']})")
choice = input(f"Select a model [1-{len(models)}] or 'q' to quit: ").strip()
if choice.lower() == 'q':
break
try:
model_idx = int(choice) - 1
if model_idx < 0 or model_idx >= len(models):
raise ValueError
except ValueError:
print("Invalid selection.")
continue
selected = models[model_idx]
print(f"Using model: {selected['id']} from {selected['source']}")
while True:
# 4. Ask for user prompt
user_prompt = input("Enter your prompt (or 'back' to choose another model): ").strip()
if user_prompt.lower() == 'back':
break
if not user_prompt:
continue
try:
# 5. Send the request
resp = selected["client"].chat.completions.create(
model=selected["id"],
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": user_prompt}
]
)
# 6. Print the model's reply
print("\nModel reply:\n")
print(resp.choices[0].message.content)
except Exception as e:
print(f"Error querying {selected['source']}: {e}")
We can run it with python app.py. Now you can:
- Pick model 1 and send multiple prompts
- Type back to return to model selection
- Pick another model without restarting the container
My result looks like this
Available models:
1. ai/qwen2.5:3B-Q4_K_M (model_runner)
2. ai/llama3.2:3B-Q4_0 (model_runner)
3. qwen2.5:3b (ollama)
4. hf.co/AmpereComputing/llama-3.2-3b-instruct-gguf:Llama-3.2-3B-Instruct-Q8R16.gguf (ollama)
Select a model [1-4] or 'q' to quit: 1
Using model: ai/qwen2.5:3B-Q4_K_M from model_runner
Enter your prompt (or 'back' to choose another model): write a haiku for local LLM
Model reply:
Local LLM whispers,
Infinite knowledge flows through,
Wordsmith of words.
Enter your prompt (or 'back' to choose another model):
Improve multi-model serving in Model Runner
Even though client.models.list() will show you multiple models, Docker Model Runner doesn't truly run them all at once. It loads one into memory at a time, and if you call a different one, it needs to perform these tasks sequentially:
- Unload the current model from the llama.cpp backend
- Load the new one from disk
- Initialize it in memory
Depending on model size and hardware, this can take a long time or appear to hang.
According to Docker's own issue tracker, true concurrent multi-model serving isn't fully supported yet. Right now, switching models mid-session can cause long cold-start delays or even timeouts if the backend doesn't handle the swap cleanly.
To avoid the delay, we have a few options:
- Restart Model Runner with the desired model: If you only need one model at a time, stop the container and start it again with the new model before running your client.
- Run each model in its own Model Runner container (multi-container mapping): in this approach, we will not swap models inside a single Model Runner container. Instead, we'll run each model in its own dedicated Model Runner container on a different port, and have the Python client map each model name to its own base_url.
For example, run these on different ports:
docker run -d \
  --name llama3.2 \
  -p 12434:12434 \
  docker/model-runner:latest ai/llama3.2:3B-Q4_0
docker run -d \
  --name qwen3b \
  -p 12435:12434 \
  -v docker-model-runner-models:/models \
  docker/model-runner:latest ai/qwen2.5:3B-Q4_K_M
Now each model is already loaded in its own container. In the Python client below, we hard-code a mapping of model IDs to their container base URLs. When we pick a model, the client automatically points to the correct container and port, so you can switch between models instantly, without restarting anything or paying the cold-start delay. That'll be the smoothest experience until Docker ships true multi-model support.
from openai import OpenAI

API_KEY = "not-needed"  # Model Runner doesn't validate the key

# Map model IDs to their dedicated container base URLs
MODEL_ENDPOINTS = {
    "ai/llama3.2:3B-Q4_0": "http://localhost:12434/engines/llama.cpp/v1",
    "ai/qwen2.5:3B-Q4_K_M": "http://localhost:12435/engines/llama.cpp/v1"
}
models = list(MODEL_ENDPOINTS.keys())
while True:
# Show models and let user choose
print("\nAvailable models:")
for idx, model_id in enumerate(models, start=1):
print(f"{idx}. {model_id}")
choice = input(f"Select a model [1-{len(models)}] or 'q' to quit: ").strip()
if choice.lower() == 'q':
break
try:
model_idx = int(choice) - 1
if model_idx < 0 or model_idx >= len(models):
raise ValueError
except ValueError:
print("Invalid selection.")
continue
selected_model = models[model_idx]
base_url = MODEL_ENDPOINTS[selected_model]
client = OpenAI(base_url=base_url, api_key=API_KEY)
print(f"Using model: {selected_model}")
Monitor resource usage
When we run the Python client in one terminal, we can monitor CPU/RAM usage from another terminal using top or htop.
If we are using Docker Model Runner to manage models, or the Ampere optimized Ollama container, we can also monitor their usage with docker stats, or filter by container name (e.g., docker stats qwen3b) in another terminal.
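To capture a snapshot programmatically, a small sketch like this works (assuming the Docker CLI is on the PATH; the container names will match whatever you used above):
import subprocess

# One-shot snapshot of CPU and memory usage for all running containers
result = subprocess.run(
    ["docker", "stats", "--no-stream", "--format",
     "{{.Name}}: CPU {{.CPUPerc}}, MEM {{.MemUsage}}"],
    capture_output=True, text=True, check=True
)
print(result.stdout)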