llama.cpp is highly regarded for running LLM inference in a CPU-only setting. In this post, let's run the original llama.cpp and an Ampere-optimized version on our Ampere A1 instance on Oracle Linux.
By the end of this post, you will be able to
- build llama.cpp from source yourself, optimized for the Ampere A1 platform
- run an interactive CLI to chat with any supported models (GGUF files)
- serve such models to external apps
- run a prebuilt, optimized Docker container from the official Ampere repository and serve its optimized models
Let’s start!
Original llama.cpp
According to the official GitHub repo, there are four ways to install llama.cpp on your machine:
- Install llama.cpp using brew, nix or winget
- Run with Docker
  - Running the prebuilt Docker image per the documentation
    docker run -v /llama-model:/models ghcr.io/ggml-org/llama.cpp:full --all-in-one "/models/" 7B
    results in WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
  - The official doc lists the only ghcr.io/ggml-org/ images with linux/arm64 support as the three -rocm tagged versions. This is not true, as the repository hosts only CPU and CUDA images, not ROCm. The images are in fact in AMD's rocm/ namespace on Docker Hub. In any case, they are not feasible in our scenario because we do not have an AMD GPU in our environment, and the rocm/ images are also built for amd64.
- Download prebuilt binaries from the releases page
  - The prebuilt binaries do not include one for linux/arm64.
- Build from source
Thus I decided to build it myself.
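As a quick sanity check on why none of the amd64 images or binaries are usable here, you can confirm the architecture the A1 shape reports:
# Ampere A1 is 64-bit Arm, so this prints aarch64 rather than x86_64
uname -m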
build for Ampere A1
Install all the required tools and dependencies
sudo dnf install -y git cmake make gcc-c++ libcurl-devel
Clone the llama.cpp repo
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Configure the build with CMake for Arm64
cmake -B build -DGGML_CPU_KLEIDIAI=ON
- -B build generates build files in the build directory
- -DGGML_CPU_KLEIDIAI=ON: KleidiAI is a library of optimized microkernels for AI workloads on Arm CPUs. These microkernels enhance performance and can be enabled for use by the CPU backend.
CMake detects the system compiler, libraries, and hardware capabilities and then produces Makefiles for compilation.
The output shows hardware-specific optimizations for the Arm Ampere platform, enabling Arm64 CPU instructions:
- DOTPROD: hardware-accelerated dot product operations
- FMA: fused multiply-add for faster floating-point math
- FP16 vector arithmetic: reduced memory use with half-precision floats
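If you want to verify that the instance really exposes these instructions, you can check the Arm feature flags the kernel reports; on Ampere A1 the list should include asimddp (DOTPROD) and fphp/asimdhp (FP16 arithmetic), though exact flag names can vary with the kernel version:
# Print the Arm feature flags of the first CPU core
grep -m1 Features /proc/cpuinfo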
Compile llama.cpp
cmake --build build --config Release -j4
- --build build builds in the build directory
- --config Release enables compiler optimizations
- -j4 runs 4 parallel jobs for faster compilation. 4 is chosen because my A1 Flex shape has 4 cores.
The build produces Arm64-optimized binaries in under a minute.
To rebuild llama.cpp with other optimization settings later, you can remove the build files with rm -rf build, then run cmake -B and cmake --build again.
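For instance, a clean rebuild with KleidiAI turned off (purely as an illustration of the workflow) would look like this:
# Remove the previous build artifacts
rm -rf build
# Reconfigure with a different optimization setting (here: KleidiAI disabled)
cmake -B build -DGGML_CPU_KLEIDIAI=OFF
# Recompile with 4 parallel jobs
cmake --build build --config Release -j4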
After compilation, you'll find the key tools in the build/bin directory:
- llama-cli: the main inference executable
- llama-server: an HTTP server for model inference
- llama-quantize: a tool for quantization to reduce memory usage
- Additional utilities for model conversion and optimization
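A quick way to confirm the build is to list the binaries and ask one of them for its version; --version is a standard llama.cpp flag, although the exact output format differs between releases:
# Show everything that was built
ls build/bin
# Print the llama.cpp build number and compiler info, then exit
./build/bin/llama-cli --version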
llama-cli to interact
We’ll run a small model to test our installation:
./build/bin/llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
This will pull the ggml-org/gemma-3-1b-it-GGUF model from Hugging Face into cache, and let us interact with it in the CLI.
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
One handy feature is the set of performance metrics llama.cpp prints at the end of a session, such as latency, number of tokens, and throughput:
llama_perf_sampler_print: sampling time = 0.27 ms / 1 runs ( 0.27 ms per token, 3690.04 tokens per second)
llama_perf_context_print: load time = 645.73 ms
llama_perf_context_print: prompt eval time = 343.21 ms / 12 tokens ( 28.60 ms per token, 34.96 tokens per second)
llama_perf_context_print: eval time = 30703.19 ms / 738 runs ( 41.60 ms per token, 24.04 tokens per second)
llama_perf_context_print: total time = 236849.43 ms / 750 tokens
llama_perf_context_print: graphs reused = 735
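If you want more controlled numbers than these end-of-session stats, the build also includes a llama-bench utility. A minimal run might look like the following; the model path is a placeholder for any GGUF file you have on disk:
# Benchmark prompt processing (-p, 512 tokens) and generation (-n, 128 tokens) for a local GGUF model
./build/bin/llama-bench -m /path/to/model.gguf -p 512 -n 128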
llama-server to serve models
cache a new model
Just like the CLI, we can load a Hugging Face model directly with the server:
./build/bin/llama-server -hf bartowski/Qwen2.5-3B-GGUF:Q4_K_M --port 8081
We get more information, such as the platform optimizations used and the server endpoint.
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | KLEIDIAI = 1 | REPACK = 1 |
.........
main: server is listening on http://127.0.0.1:8081 - starting the main loop
srv update_slots: all slots are idle
Since llama-server provides an OpenAI-compatible API, we can use cURL in another terminal to send a prompt request to confirm the server is running and responding to API requests:
curl -s -X POST http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer llama" \
-d '{
"model": "Qwen2.5-3B",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about llama.cpp"}
],
"stream": false
}'
Command Breakdown:
- curl -X POST: Specifies that you are making a POST request, which is required for sending data to the server.
- http://127.0.0.1:8081/v1/chat/completions: The full URL of the API endpoint.
- -H "Content-Type: application/json": This header tells the server that the data you are sending is in JSON format.
- -d '{ ... }': This flag provides the JSON data to be sent in the request body. The JSON object contains a messages array, formatted according to the OpenAI chat completion API.
Our cURL command returns a valid JSON response with the model's answer, showing that llama-server is working correctly.
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Llama.cpp, Proudly written,
Code in hand ."}}],"created":1758243953,"model":"gpt-3.5-turbo","system_fingerprint":"b6517-69ffd891","object":"chat.completion","usage":{"completion_tokens":8,"prompt_tokens":26,"total_tokens":34},"id":"chatcmpl-SVw8VXmeZun342tdL6B1r0a5bhjK6DWY","timings":{"cache_n":0,"prompt_n":26,"prompt_ms":1479.27,"prompt_per_token_ms":56.894999999999996,"prompt_per_second":17.576236927673786,"predicted_n":8,"predicted_ms":609.225,"predicted_per_token_ms":76.153125,"predicted_per_second":13.131437482046863}}
At the same time, we can also see on the server side that it is processing the prompt.
srv params_from_: Chat format: Content-only
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 26
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 26, n_tokens = 26, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 26, n_tokens = 26
slot release: id 0 | task 0 | stop processing: n_past = 33, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 1479.27 ms / 26 tokens ( 56.89 ms per token, 17.58 tokens per second)
eval time = 609.23 ms / 8 tokens ( 76.15 ms per token, 13.13 tokens per second)
total time = 2088.49 ms / 34 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
For programmatic interaction, we can use any Python library that supports the OpenAI API, such as the official openai client.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8081/v1",
api_key="llama" # Dummy key for local access
)
response = client.chat.completions.create(
model="Qwen2.5-3B",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about llama.cpp"}
]
)
print(response.choices[0].message.content)
serve a local model
download model
The -hf flag in llama-cli or llama-server is used to interact with a Hugging Face hosted model directly, without local storage. A more common practice to serve a model with llama.cpp is to run llama-server with the -m flag to load a model from a local path on your machine. In this section, we will do exactly that.
Let's download the HF model bartowski/Qwen2.5-3B-GGUF:Q4_K_M locally so we can serve it.
First we create a local directory to store our models:
mkdir models
We then download models into this folder. The recommended way to download files from Hugging Face is with the official huggingface-cli command-line tool.
- First, install the necessary Python library:
pip install --upgrade huggingface_hub
- Download the model with the repository ID and the specific filename:
huggingface-cli download <repo_id> <filename>
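For the model used in this post, that would look roughly like this; --local-dir tells huggingface-cli where to put the file (adjust the path if your models folder lives elsewhere):
# Fetch only the Q4_K_M quantization of Qwen2.5-3B into our local models folder
huggingface-cli download bartowski/Qwen2.5-3B-GGUF Qwen2.5-3B-Q4_K_M.gguf --local-dir ./models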
Or we can simply use wget with the exact file location. Go to the model page, click Files, pick the quantized model of your choice, then click Copy download link.
wget https://huggingface.co/bartowski/Qwen2.5-3B-GGUF/resolve/main/Qwen2.5-3B-Q4_K_M.gguf -O ./models/Qwen2.5-3B-Q4_K_M.GGUF
run server
Now we can start the server with this model
./build/bin/llama-server -m ./models/Qwen2.5-3B-Q4_K_M.GGUF \
--host 0.0.0.0 \
--port 8081 \
--ctx-size 4096
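Before pointing clients at it, you can quickly check that the model finished loading; llama-server exposes a simple health endpoint for this (assuming the default endpoints have not been disabled):
# Returns a small JSON payload such as {"status":"ok"} once the model is loaded
curl -s http://localhost:8081/health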
We can use the same cURL command and Python snippet from the previous section to interact with it.
Ampere optimized llama.cpp container
Since Ampere ships an optimized build of llama.cpp as a container, along with new quantization methods (Q4_K_4 and Q8R16, giving 1.5-2x faster inference), let's take a look!
- Make sure you have Docker installed on your system.
- Ampere recommends using models in its custom quantization formats for best performance. Let's download a model to our previously created folder ./llama-cpp/models/.
Let's try AmpereComputing/qwen-3-4b-gguf and pick a specific quantized version from the Files tab. For this example, we'll use Qwen3-4B-Q8R16.gguf.
wget https://huggingface.co/AmpereComputing/qwen-3-4b-gguf/resolve/main/Qwen3-4B-Q8R16.gguf -O ./llama-cpp/models/Qwen3-4B-Q8R16.gguf
Note: You must use the Ampere optimized container to run these models. Attempting to run one with the original llama.cpp will fail with the error tensor 'token_embd.weight' has invalid ggml type 64 (NONE) gguf_init_from_file_impl: failed to read tensor info, indicating an incompatibility between the GGUF model file and the original llama.cpp. The GGUF variant introduced by Ampere uses new ggml_type codes for tensors that the original llama.cpp does not understand.
- Now we are ready to run the Ampere llama.cpp container.
By default, the Ampere container mirrors the original llama.cpp container in starting with llama-server as its entry point. Thus we can rely on this pre-configured setting and just pass a model for it to start serving.
This command will start a container named llama, mount your local models directory to /models inside the container, and run llama-server in the foreground so you can watch its logs.
docker run -it --rm \
-v /home/opc/llama-cpp/models/:/models \
--name llama \
-p 8081:8081 \
amperecomputingai/llama.cpp:latest \
-m /models/Qwen3-4B-Q8R16.gguf --host 0.0.0.0 --port 8081 --ctx-size 4096
Command Breakdown
- -it: Keeps the container's standard input open and allocates a pseudo-TTY, allowing you to see the server's log output.
- --rm: Automatically removes the container when it exits, keeping your system clean.
- -v /home/opc/llama-cpp/models/:/models: Maps the model directory created previously to a mount in the container.
  - /home/opc/llama-cpp/models/: The path to the directory on your host machine where your model is stored. Docker requires absolute paths for volume mounts. You can get the full, absolute path of a directory with pwd.
  - /models: The path inside the container where llama-server will find the model.
- -p 8081:8081: Maps port 8081 on the host machine to port 8081 inside the container. This allows you to access the server from outside the container.
- amperecomputingai/llama.cpp:latest: The official Docker image to pull and run. Make sure you only pull this tagged version, not the 3.2.1-ampereone image, which is incompatible with the A1 Flex shape.
- -m /models/Qwen3-4B-Q8R16.gguf: The -m flag and other server arguments are passed directly to the image's entrypoint, which defaults to llama-server. We first pass a model using /models/, which corresponds to the path inside the container that we created with the -v flag. Qwen3-4B-Q8R16.gguf is the file we downloaded in step 2.
- --host 0.0.0.0 --port 8081 --ctx-size 4096: These are the standard llama-server arguments. Note that we must use --host 0.0.0.0 to ensure the server listens on all network interfaces, allowing it to be accessible via the mapped port. We also set the context size and the port the server will listen on.
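As a quick check that the containerized server came up, you can list the running container and query the OpenAI-compatible model listing from another terminal on the host (endpoint names assumed to match upstream llama-server):
# Confirm the container named llama is up
docker ps --filter name=llama
# Ask the server which models it is serving
curl -s http://localhost:8081/v1/models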
interact with server
We can use the same cURL command and Python snippet from the previous section to interact with it.
For a more user-friendly experience, we can use a web-based chat interface that connects to the llama-server API. Open WebUI is a popular choice for this.
- Run the Open WebUI Docker container: You will need to run a separate Docker container for the UI and configure it to connect to your llama-server.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
--name open-webui --restart unless-stopped \
ghcr.io/open-webui/open-webui:main
- Configure the connection: Once the container is running, open your web browser and navigate to http://localhost:3000. In the settings or "Connections" section, add a new OpenAI-compatible connection.
  - URL: http://host.docker.internal:8081 (host.docker.internal is a special Docker DNS name that resolves to the host machine's IP, allowing the UI container to reach the server container.)
  - API Key: Leave this field empty.
- Start chatting: After saving the connection, you should be able to select your model and interact with it through the chat interface.
running shell in container
You can also override the image’s default entry point by running a shell instead:
docker run --privileged=true --name llama2 --entrypoint /bin/bash -it -v /home/opc/llama-cpp/models/:/models amperecomputingai/llama.cpp:latest
The shell will launch into the /llm folder. From here, you can run the following to launch an interactive CLI to chat:
./llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
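From the same shell you could also start the server manually against the mounted model; this assumes llama-server sits next to llama-cli in the image's /llm folder, and note that the llama2 container above was started without a -p mapping, so the server is only reachable from inside the container unless you add one:
# Serve the Ampere-quantized model mounted at /models (run from inside the container's /llm folder)
./llama-server -m /models/Qwen3-4B-Q8R16.gguf --host 0.0.0.0 --port 8081 --ctx-size 4096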
Switching models
Unlike Ollama, which can swap models in its local storage, the llama-server binary loads a single model file when it starts and does not have a built-in command to dynamically swap models while running. To change the model being served by llama-server, you must restart the server with the new model path specified in the -m flag.
If your goal is to have multiple models accessible at the same time, you have two primary options:
- Run a separate server for each model: Start a new llama-server container for each model, but make sure each one uses a different port to avoid conflicts, for example 8082, 8083, etc. (see the sketch after this list).
- Use a model management proxy: Tools like Llama-Swap or FlexLLama can act as a proxy. You configure them with a list of your models and their paths. These proxies then listen on a single port and automatically start and stop the appropriate llama-server process based on the model parameter you specify in your API request.
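A sketch of the first option, assuming a second, hypothetical GGUF file in the same models folder; only the container name, published port, and model path differ from the earlier command:
# Second container, second model, second port; the model filename here is a placeholder
docker run -it --rm \
  -v /home/opc/llama-cpp/models/:/models \
  --name llama-2 \
  -p 8082:8082 \
  amperecomputingai/llama.cpp:latest \
  -m /models/another-model.gguf --host 0.0.0.0 --port 8082 --ctx-size 4096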