In my last blog post, I reviewed various options for local LLM inferencing on an ARM-based Ampere A1 Compute instance with 4 cores and 24 GB RAM running Linux. I concluded that the Ampere optimized Ollama strikes a good balance of performance and ease of use in our scenario.
In this post, I will discuss the tooling: CLI tools, cURL for API interactions, and programmatic access via Python bindings. We'll run LLMs locally with Ollama (both the Ampere optimized version and the original) and Docker Model Runner.
Docker Model Runner
Install
Update system and install the Docker Model Runner plugin:
sudo dnf update
sudo dnf install docker-model-plugin
For Ubuntu, just replace dnf with apt-get.
We can verify the installation with docker model version.
The docker model command is now available alongside regular Docker commands like docker run, docker ps, etc.
Deploy an AI Model
Now let's deploy an AI model from Docker Hub with a single command, docker model run:
docker model run ai/qwen2.5:3B-Q4_K_M
This command automatically performs several actions:
- Downloads the qwen2.5:3B model with Q4_K_M quantization
- Starts an inference engine (llama.cpp)
- Serves the model on port 12434, accessible via a REST API
- Launches an interactive chat CLI
To exit the chat, type /bye.
Some useful commands (that bear similarity to docker):
- docker model ps: shows all running models, similar to how docker ps shows running containers. You should now see the ai/qwen2.5:3B-Q4_K_M model listed and running.
- docker model ls: lists models pulled to your local environment with more information. For example:
MODEL NAME PARAMETERS QUANTIZATION ARCHITECTURE MODEL ID CREATED SIZE
ai/qwen2.5:3B-Q4_K_M 3.09 B IQ2_XXS/Q4_K_M qwen2 41045df49cc0 5 months ago 1.79 GiB
ai/llama3.2:3B-Q4_0 3.21 B Q4_0 llama da80a841836d 4 months ago 1.78 GiB
- docker model rm <model>: remove a local model.
- docker model inspect <model>: display detailed information about one model. For example:
$ docker model inspect ai/qwen2.5:3B-Q4_K_M
{
"id": "sha256:41045df49cc0d72a4f8c15eb6b21464d3e6f4dc2899fe8ccd9e5b72bdf4d0bf9",
"tags": [
"ai/qwen2.5:3B-Q4_K_M"
],
"created": 1744119140,
"config": {
"format": "gguf",
"quantization": "IQ2_XXS/Q4_K_M",
"parameters": "3.09 B",
"architecture": "qwen2",
"size": "1.79 GiB"
}
}
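If you want that metadata programmatically, a small sketch like the following works, assuming the inspect output is plain JSON as shown above and the Docker CLI is on the PATH:
import json
import subprocess

# Run `docker model inspect` and parse its JSON output
raw = subprocess.run(
    ["docker", "model", "inspect", "ai/qwen2.5:3B-Q4_K_M"],
    capture_output=True, text=True, check=True
).stdout

info = json.loads(raw)
print(info["config"]["quantization"], info["config"]["size"])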
Inferencing
Now let's move on from chatting in the CLI to accessing the model via the REST API.
Construct the base URL
Model Runner’s OpenAI‑compatible endpoint has this format:
http://<host>:<port>/engines/<engine-name>/v1/...
We will fill in each part in the next steps.
<host>:<port>
Find the container that was started by docker model run with docker ps. Note the following:
- the <container-id>
- the PORTS column, e.g. 0.0.0.0:12434->8000/tcp, where 12434 is the port published on the host.
Check docker logs <container-id>. In the first few lines, Model Runner prints something like Listening on http://0.0.0.0:12345. That's your base URL; replace 0.0.0.0 with localhost if you're calling from the same machine. In our case the endpoint is http://localhost:12434/.
If you're accessing Model Runner from another container, use http://model-runner.docker.internal/ as the endpoint instead.
<engine-name>
Model Runner always scopes endpoints by <engine-name>. Currently it only supports llama.cpp, which covers most GGUF-format models. There is talk of future support for other engines, such as vllm for Hugging Face Transformers/PyTorch models.
If you're curious, you can check docker logs <container-id> for a line like Starting engine: llama.cpp or msg="Loading llama.cpp backend runner", which confirms that the engine name is llama.cpp.
Now we have the complete base URL: http://localhost:12434/engines/llama.cpp/v1/, with the /v1/... part following the same schema as OpenAI's API.
cURL
Now that we have figured out how to connect, let's do a simple curl to list all models.
curl http://localhost:12434/engines/llama.cpp/v1/models
You should get JSON back listing the model(s) you've loaded:
{
"object":"list",
"data":[
{
"id":"ai/qwen2.5:3B-Q4_K_M",
"object":"model",
"created":1742816981,
"owned_by":"docker"
},
{
"id":"ai/llama3.2:3B-Q4_0",
"object":"model",
"created":1745777589,
"owned_by":"docker"
}
]
}
We can then generate some text!
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json"\
-d '{
"model": "ai/qwen2.5:3B-Q4_K_M",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Write a haiku about Docker." }
]
}'
The /chat/completions endpoint follows the same schema as OpenAI's API, so you can reuse existing client code. You can also hit /completions for plain-text completion or /embeddings for vector embeddings.
Python
First, install the OpenAI Python client: pip install openai. The api_key can be any string, as Model Runner doesn't validate it.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:12434/engines/llama.cpp/v1",
api_key="not-needed"
)
resp = client.chat.completions.create(
model="ai/qwen2.5:3B-Q4_K_M",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about Docker."}
]
)
print(resp.choices[0].message.content)
Save this as app.py and run python app.py. It will send our prompt to Docker Model Runner, which loads the specified model if needed and returns the completion generated by the model.
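On a 4-core instance, full completions can take a while, so streaming the tokens as they are generated makes the wait feel shorter. A minimal sketch, assuming the same base URL and model as above:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed"
)

# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="ai/qwen2.5:3B-Q4_K_M",
    messages=[{"role": "user", "content": "Write a haiku about Docker."}],
    stream=True
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
The same client also covers the /embeddings endpoint mentioned earlier, assuming the llama.cpp backend has embeddings enabled for the loaded model (a dedicated embedding model would give more useful vectors):
emb = client.embeddings.create(
    model="ai/qwen2.5:3B-Q4_K_M",
    input="Docker Model Runner on an Ampere A1 instance"
)
print(len(emb.data[0].embedding))  # dimensionality of the returned vector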
Ampere optimized Ollama
Ampere has tweaked the original Ollama engine for performance improvements. We’ll see if the claim is true in our next post.
First, let’s get it running.
We will use the qwen2.5:3b model with the same quantization as the one from Docker Hub.
We will run it as a container
docker run --privileged=true --name ollama -p 11434:11434 amperecomputingai/ollama:latest
In a separate shell, run docker exec -it ollama bash, then ollama run qwen2.5:3b to pull and run the model. The model is stored in ~/.ollama/models.
Similar to Docker Model Runner, Ollama exposes a simple REST API (e.g., at http://localhost:11434) that's OpenAI-compatible. You can query it directly with Python libraries like openai or requests. We can reuse the above code and just change the base URL to
base_url="http://localhost:11434/v1"
There's also a Python binding for programmatic access.
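For instance, with the ollama Python package (pip install ollama), a minimal sketch looks like this; it assumes the container above is running on port 11434 and that qwen2.5:3b has been pulled:
from ollama import Client

# Point the client at the Ampere optimized Ollama container on port 11434
client = Client(host="http://localhost:11434")

resp = client.chat(
    model="qwen2.5:3b",
    messages=[{"role": "user", "content": "Write a haiku about Docker."}]
)

print(resp["message"]["content"])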
Original Ollama
For comparison, we can also run the original Ollama engine as a container
docker run -d --privileged=true -v ollama:/root/.ollama -p 11400:11434 --name ollama2 ollama/ollama
docker exec -it ollama2 ollama run qwen2.5:3b
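This container publishes Ollama's API on host port 11400, so the same OpenAI-compatible client pattern works; just point it at that port. A minimal sketch, assuming the container and model above are running:
from openai import OpenAI

# The original Ollama container above maps host port 11400 to Ollama's 11434
client = OpenAI(base_url="http://localhost:11400/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen2.5:3b",
    messages=[{"role": "user", "content": "Write a haiku about Docker."}]
)
print(resp.choices[0].message.content)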
To run it as a standalone app for ARM64:
- Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
- Run a model server: ollama serve (background) or ollama run <model> for interactive testing.
- For web app serving: run Ollama as a daemon (systemd service) for always-on availability. Your app can handle routing, authentication, etc., while Ollama manages inference.
- Customization: use the OLLAMA_NUM_PARALLEL env var to limit concurrent requests (e.g., 1-2 for A1 Flex's 4 cores).
A Python app to chat with both services
Now that we know the endpoints for both Ampere optimized Ollama and Docker Model Runner, we can build some interactive functionality into our previous Python script so we can pick a model to chat with. We will take advantage of the OpenAI API compatibility of both services to streamline and reuse code.
When you run the following script, it will:
- Query both Docker Model Runner and Ampere optimized Ollama for their available models using OpenAI's client.models.list(), which calls the GET /v1/models endpoint and returns the list of loaded models in OpenAI-style JSON.
- Display each model ID with a number and its client.
- Let you choose one, and use that model for the chat request.
- Prompt you for your message, which is inserted into the "role": "user" message.
- Send the request and print the reply.
- Let you pick a different model without restarting.
from openai import OpenAI
# --- Config ---
# Adjust MODEL_RUNNER_BASE depending on where you run it:
# From another container: use `host.docker.internal`
# On host: use `localhost`
MODEL_RUNNER_BASE = "http://localhost:12434/engines/llama.cpp/v1"
OLLAMA_OPENAI_BASE = "http://localhost:11434/v1" # Ollama in OpenAI mode
API_KEY = "not-needed"
# 1. Create OpenAI clients for both backends
mr_client = OpenAI(base_url=MODEL_RUNNER_BASE, api_key=API_KEY)
ollama_client = OpenAI(base_url=OLLAMA_OPENAI_BASE, api_key=API_KEY)
def get_models(client, source_name):
try:
resp = client.models.list()
return [{"id": m.id, "source": source_name, "client": client} for m in resp.data]
except Exception as e:
print(f"Error fetching models from {source_name}: {e}")
return []
# 2. Fetch available models from both sources
models = get_models(mr_client, "model_runner") + get_models(ollama_client, "ollama")
if not models:
print("No models found. Make sure Model Runner and/or Ollama are running in OpenAI mode.")
exit(1)
while True:
# 3. Show combined list and let user choose
print("\nAvailable models:")
for idx, m in enumerate(models, start=1):
print(f"{idx}. {m['id']} ({m['source']})")
choice = input(f"Select a model [1-{len(models)}] or 'q' to quit: ").strip()
if choice.lower() == 'q':
break
try:
model_idx = int(choice) - 1
if model_idx < 0 or model_idx >= len(models):
raise ValueError
except ValueError:
print("Invalid selection.")
continue
selected = models[model_idx]
print(f"Using model: {selected['id']} from {selected['source']}")
while True:
# 4. Ask for user prompt
user_prompt = input("Enter your prompt (or 'back' to choose another model): ").strip()
if user_prompt.lower() == 'back':
break
if not user_prompt:
continue
try:
# 5. Send the request
resp = selected["client"].chat.completions.create(
model=selected["id"],
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": user_prompt}
]
)
# 6. Print the model's reply
print("\nModel reply:\n")
print(resp.choices[0].message.content)
except Exception as e:
print(f"Error querying {selected['source']}: {e}")
We can run it with python app.py. Now you can:
- Pick model 1 and send multiple prompts
- Type back to return to model selection
- Pick another model without restarting the container
My result looks like this
Available models:
1. ai/qwen2.5:3B-Q4_K_M (model_runner)
2. ai/llama3.2:3B-Q4_0 (model_runner)
3. qwen2.5:3b (ollama)
4. hf.co/AmpereComputing/llama-3.2-3b-instruct-gguf:Llama-3.2-3B-Instruct-Q8R16.gguf (ollama)
Select a model [1-4] or 'q' to quit: 1
Using model: ai/qwen2.5:3B-Q4_K_M from model_runner
Enter your prompt (or 'back' to choose another model): write a haiku for local LLM
Model reply:
Local LLM whispers,
Infinite knowledge flows through,
Wordsmith of words.
Enter your prompt (or 'back' to choose another model):
Improve multi-model serving in Model Runner
Even though client.models.list() will show you multiple models, Docker Model Runner doesn't truly run them all at once. It loads one into memory at a time, and if you call a different one, it needs to perform these tasks sequentially:
- Unload the current model from the llama.cpp backend
- Load the new one from disk
- Initialize it in memory
Depending on model size and hardware, this can take a long time or appear to hang.
According to Docker's own issue tracker, true concurrent multi-model serving isn't fully supported yet. Right now, switching models mid-session can cause long cold-start delays or even timeouts if the backend doesn't handle the swap cleanly.
To avoid the delay, we have a few options:
- Restart Model Runner with the desired model: If you only need one model at a time, stop the container and start it again with the new model before running your client.
- Run each model in its own Model Runner container (multi-container mapping): in this approach, we will not swap models inside a single Model Runner container. Instead, we'll run each model in its own dedicated Model Runner container on a different port, and have the Python client map each model name to its own base_url.
For example, run these on different ports:
docker run -d \
  --name llama3.2 \
  -p 12434:12434 \
  docker/model-runner:latest ai/llama3.2:3B-Q4_0
docker run -d \
  --name qwen3b \
  -p 12435:12434 \
  -v docker-model-runner-models:/models \
  docker/model-runner:latest ai/qwen2.5:3B-Q4_K_M
Now each model is already loaded in its own container. In the Python client below, we hard-code a mapping of model IDs to their container base URLs. When we pick a model, the client automatically points to the correct container and port, so you can switch between models instantly, without restarting anything or paying the cold-start delay. That'll be the smoothest experience until Docker ships true multi-model support.
from openai import OpenAI

API_KEY = "not-needed"  # Model Runner doesn't validate the key

# Map model IDs to their dedicated container base URLs
MODEL_ENDPOINTS = {
    "ai/llama3.2:3B-Q4_0": "http://localhost:12434/engines/llama.cpp/v1",
    "ai/qwen2.5:3B-Q4_K_M": "http://localhost:12435/engines/llama.cpp/v1"
}
models = list(MODEL_ENDPOINTS.keys())
while True:
# Show models and let user choose
print("\nAvailable models:")
for idx, model_id in enumerate(models, start=1):
print(f"{idx}. {model_id}")
choice = input(f"Select a model [1-{len(models)}] or 'q' to quit: ").strip()
if choice.lower() == 'q':
break
try:
model_idx = int(choice) - 1
if model_idx < 0 or model_idx >= len(models):
raise ValueError
except ValueError:
print("Invalid selection.")
continue
selected_model = models[model_idx]
base_url = MODEL_ENDPOINTS[selected_model]
client = OpenAI(base_url=base_url, api_key=API_KEY)
print(f"Using model: {selected_model}")
Monitor resource usage
When we run the Python client in one terminal, we can monitor CPU/RAM usage from another terminal using top or htop.
If we are using Docker Model Runner to manage models, or the Ampere optimized Ollama container, we can also monitor their usage with docker stats, or filter by container name (e.g., docker stats qwen3b) in another terminal.
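To capture a snapshot programmatically, a small sketch like this works (assuming the Docker CLI is on the PATH; the container names will match whatever you used above):
import subprocess

# One-shot snapshot of CPU and memory usage for all running containers
result = subprocess.run(
    ["docker", "stats", "--no-stream", "--format",
     "{{.Name}}: CPU {{.CPUPerc}}, MEM {{.MemUsage}}"],
    capture_output=True, text=True, check=True
)
print(result.stdout)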