Large Language Models (LLMs) have become increasingly accessible for local deployment, even on modest hardware. Local inference offers privacy, cost-effectiveness, and offline capability that third-party offerings cannot match.
Oracle Cloud Infrastructure’s Ampere A1 Flex compute shape with 4 Ampere A1 cores and 24GB of RAM has been a popular Compute choice in the OCI free tier. This setup uses ARM-based Ampere Altra processors, which are efficient for CPU-bound tasks but lack the parallel processing power of GPUs. The available memory also restricts the choice of LLMs we can run. Since we’re working on a headless Linux VM with no GUI desktop (unless you install VNC or other tools), we also need solutions that are flexible, efficient, and work entirely through the command line or an API.
In this post, I’ll outline the key options for running LLMs locally in this environment, comparing their pros and cons with an emphasis on meeting our system constraints, while being performant and easy to use. I’ll look at the largest models you can run with reasonable performance, and recommend the best choice.
In the next post, I will discuss the tooling: CLI tools, cURL for API interactions, and programmatic access via Python bindings. Finally, I will benchmark Docker Model Runner, Ollama, and Ampere-optimized Ollama, using the same qwen2.5:3B model with Q4_K_M quantization.
Let’s start!
Local LLM inferencing on Ampere A1 Flex
Several open-source tools allow you to deploy and run inference on LLMs in CPU-only environments like our ARM Ampere Linux instance. I will highlight the choices below based on their compatibility with ARM Linux, CLI- and API-driven workflows, efficiency on limited resources, and ease of use.
Ampere optimized Ollama
Ollama has gained immense popularity for its simplicity and ease of use. It is a user-friendly wrapper around llama.cpp, providing a lightweight tool for downloading, quantizing, running, and managing LLMs locally. It offers both a CLI and an OpenAI-compatible API.
For Ampere instances, the Ampere optimized Ollama container is a version specifically compiled and tuned to leverage the architectural strengths of Ampere Altra CPUs, resulting in significant performance gains.
How to use: Run the optimized Docker container, and from there you can pull and run models with a single command (ollama run qwen2.5:3B), which gives you an interactive prompt to start chatting. It also automatically exposes an OpenAI-compatible API on port 11434 and a Python binding for programmatic access. For example:
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Write a haiku about Ollama"}'
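The same endpoint can also be called from Python. Below is a minimal sketch using the requests library (the official ollama Python package provides equivalent helpers); it assumes the server is listening on the default port and the model has already been pulled:

```python
import requests

# Minimal sketch: call Ollama's native generate endpoint directly.
# Assumes the server runs on the default port 11434 and the model
# has already been pulled (e.g., `ollama pull qwen2.5:3b`).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:3b",
        "prompt": "Write a haiku about Ollama",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```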
If you prefer to work with the original flavor of Ollama on Ampere, you can either do a direct install or run the original Docker container.
Install the ARM64 binary via curl:
curl -L https://ollama.com/download/ollama-linux-arm64.tgz -o ollama-linux-arm64.tgz
sudo tar -C /usr -xzf ollama-linux-arm64.tgz
Or run it as a Docker container.
Pros:
- Unmatched simplicity: Easiest setup and model management with a single command to download and run a model.
- Vibrant community: A large and active community contributes to a rich library of available models and provides ample support.
- Performant for simple use cases: For a single user interacting with a model, performance is generally very good with low resource overhead.
- Inference customization: While not as rich as llama.cpp, the Modelfile provides options to tweak inference parameters and prompts.
- Works well on CPU-only systems with good quantization support.
- Built-in REST API for cURL/Python, and its own Python binding.
Cons:
- Abstraction overhead: Slightly slower than pure llama.cpp
- Not designed for concurrency: Ollama is not optimized for handling multiple requests simultaneously. Its performance degrades significantly under concurrent loads.
- Less performant for high throughput: Compared to vLLM, it has significantly lower throughput and higher latency when serving multiple users.
- Lacks detailed configuration: While easy to use, it offers less fine-grained control over the inference process compared to vLLM.
Best For:
- Local development and experimentation, or hobbyist projects on consumer hardware.
- Single-user applications or backends with a single-threaded workload.
- Quick prototyping with a web frontend, as it exposes an easy API.
- Users who prioritize ease of use and a quick setup over raw performance.
- CPU-only setups where simplicity is more important than raw speed.
Ampere optimized llama.cpp
llama.cpp is a lightweight C++ library optimized for CPU execution. It is renowned for its performance, minimal dependencies, and extensive configuration options over model parameters like thread count, NUMA settings, and quantization methods. It supports ARM architectures natively and is designed for running quantized models efficiently.
llama.cpp is also the backend engine for Ollama and Docker Model Runner (both services are essentially wrappers that add an API layer on top of llama.cpp).
How to Use: Run the optimized container. After that, convert models to GGUF format using the built-in tools, and run the CLI (e.g., ./llama-cli -m model.gguf --prompt "Your input") to interact with models. For API access, you can run the built-in server (./llama-server), which provides a powerful, OpenAI-compatible endpoint; Python bindings are also available. If you want the original llama.cpp, you can install the prebuilt binary, run it as a Docker container, or compile it yourself.
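For the Python bindings, a minimal sketch with the llama-cpp-python package might look like the following; the model path is a placeholder for whatever GGUF file you converted or downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Minimal sketch: load a quantized GGUF model and run a single completion.
# ./model.gguf is a placeholder path; n_threads matches the 4 cores of A1 Flex.
llm = Llama(model_path="./model.gguf", n_ctx=2048, n_threads=4)

output = llm(
    "Write a haiku about llama.cpp",
    max_tokens=64,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```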
Pros:
- Highly efficient CPU inference
- Extremely customizable (e.g., layers offload, quantization)
- Minimal overhead; fastest raw speed
- Low RAM usage
- OpenAI-compatible endpoint and Python bindings make programmatic access straightforward.
Cons:
- Requires manual model conversion
Docker Model Runner
Docker Model Runner is a relatively new Docker extension (launched in early 2025) providing a simple yet powerful way to manage, download, and run AI models locally using familiar Docker CLI commands.
Per Docker, the model in Docker Model Runner does not run inside a container. Instead, it installs an inference engine such as llama.cpp on the host’s hardware and runs models natively there. This approach prioritizes performance by avoiding the overhead associated with running models within a containerized environment.
- How it works: You use a docker run or docker model run command to start the service, specifying which model to load. It also provides a REST API endpoint for inference.
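Because the endpoint is OpenAI-compatible, the standard openai Python client can talk to it. The sketch below is illustrative only: the base URL and model reference are assumptions, so substitute the values your Model Runner setup actually exposes:

```python
from openai import OpenAI  # pip install openai

# Illustrative sketch: point the OpenAI client at Model Runner's local endpoint.
# The base_url and model name below are assumptions -- check `docker model ls`
# and your Docker configuration for the actual values.
client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="ai/qwen2.5",  # hypothetical model reference
    messages=[{"role": "user", "content": "Write a haiku about Docker Model Runner"}],
)
print(completion.choices[0].message.content)
```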
Pros:
- Seamless Docker integration: If you’re already using Docker and Docker Compose, Model Runner fits naturally into your existing development and deployment pipelines.
- Easy to use: Running a model is as simple as a docker model run command. The OpenAI-compatible API makes it straightforward to integrate with app development.
- Native performance: By running the inference engine directly on the host rather than within a container, it avoids virtualization overhead.
- Model management: Models are treated as OCI artifacts for portability, making them easy to version, share, and manage through registries like Docker Hub.
Cons:
- Limited fine-grained control: It offers less control over the underlying inference engine’s parameters compared to vLLM, llama.cpp or Ollama’s Modelfiles.
- Newer and less tested: As a more recent addition to the Docker ecosystem, it has a smaller community and fewer publicly available performance benchmarks, compared to the more established tools.
- Docker Desktop/Engine Dependency: Integration with Docker Desktop or CE introduces overhead if you’re not already in a Docker workflow.
- Limited backend: Currently it only supports llama.cpp, limiting flexibility or optimization if you need other engines. Support for other engines such as vLLM is under discussion.
Best Usage Scenarios:
- Developers already using Docker for app development who want quick local testing of models.
- Prototyping GenAI apps where models need to integrate into existing containerized workflows.
- Environments where model portability (via OCI) is key, like sharing models across teams.
Hugging Face Transformers
Besides being a well known hub for models and datasets, Hugging Face also provides a Python library for loading and running models from their hub. It supports CPU inference via PyTorch’s CPU backend and works on ARM with proper builds.
How to use: Install with pip install 'transformers[torch]', then write a Python script to load a model. The simplest way is to leverage pipeline, like:
from transformers import pipeline
pipe = pipeline('text-generation', model='meta-llama/Llama-2-7b-hf', device='cpu')
pipe("Write a haiku about Transformers")
You can also interact with models using the Chat CLI.
Pros:
- Vast model ecosystem (direct from HF Hub)
- Flexible for custom pipelines
- Easy integration with other Python tools
Cons:
- Slower inference on CPU (PyTorch overhead)
- Higher RAM usage for loading
- Less optimized for pure CPU runs
vLLM
vLLM is a high-throughput and memory-efficient inference engine, designed for performance in production environments. It supports OpenAI-compatible APIs and is built on PyTorch.
How to Use: Install from source per the instructions, then start an OpenAI-compatible server with:
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf --tensor-parallel-size 1 --dtype float16
Interact via cURL against the OpenAI-compatible API, or use the Python binding.
Alternatively, it can be run as a Docker container.
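Once the server is up, querying it from Python looks like the sketch below; it assumes vLLM’s default port 8000, and the api_key value is just a placeholder since no key is configured locally:

```python
from openai import OpenAI  # pip install openai

# Minimal sketch: query the vLLM OpenAI-compatible server started above.
# Assumes the default port 8000; "EMPTY" is a placeholder API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # same model the server was started with
    prompt="Write a haiku about vLLM",
    max_tokens=64,
)
print(completion.choices[0].text)
```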
Pros:
- Exceptional performance and scalability: Through innovations like PagedAttention, vLLM significantly reduces memory waste and allows for larger models or much higher batch sizes, leading to high throughput and lower latency with concurrent requests in production-like serving.
- Advanced features: Supports continuous batching, optimized CUDA kernels, tensor parallelism, and custom quantization.
Cons:
- Higher complexity: Setting up and configuring vLLM can be more involved than the other options, and it might be overkill for single-user, sequential workloads.
- Resource intensive: While it uses memory efficiently, it is designed for performance and may have higher baseline resource requirements for larger models.
- Primarily GPU-focused: Its performance optimizations are most pronounced on NVIDIA GPUs, and CPU support is less efficient.
Best For:
- Production environments where high throughput and the ability to handle concurrent users are critical, such as API servers handling multiple users.
- Serving large models where efficient memory management is crucial.
- GPU-heavy environments for scaling LLMs efficiently.
- Applications requiring low-latency responses, like chatbots or real-time AI services.
- Advanced users needing customization, such as fine-tuned quantization or distributed inference.
The Verdict: Ampere-optimized Ollama
Our scenario is an Oracle Linux VM on a 4-core, 24GB Ampere A1 compute shape with no GPU. We will be building a Python web app frontend and need an inference server that exposes an API (e.g., OpenAI-compatible) for the app to query the model. The goal is efficient, reliable local serving without high concurrency demands, given the modest resources.
In this case, Ampere Optimized Ollama is the best all-around choice. It offers the best balance of simplicity (one-command install, automatic model handling), solid CPU performance on ARM, and manageable resource consumption.
Because it’s been specifically tuned for Ampere hardware, it closes the performance gap with a manually compiled llama.cpp while being simpler to set up and manage. It’s efficient on ARM and allows running up to 13B models with reasonable latency for tasks like code generation or Q&A. The built-in API supports cURL or Python integration, allowing us to get the Python web app up and running with local LLM inference with the least amount of friction.
Runners-up
With only 4 cores, parallelism is limited, so larger models will feel sluggish compared to an x86 machine with more threads. All tools benefit from quantization, which reduces model size and computation at a minor accuracy cost. Nevertheless, llama.cpp edges out the others in raw speed due to its C++ implementation and CPU-focused optimizations; its quantization shrinks model size and speeds up inference without needing GPU tensors. Ollama, being a wrapper, should be nearly as fast while adding convenience. Hugging Face will be slower because of Python overhead and less aggressive CPU-specific tuning. vLLM shines in batched scenarios (e.g., multiple prompts) but offers similar single-query performance to Ollama.
- Docker Model Runner for existing Docker workflows: If you are heavily invested in a Docker-centric workflow, Docker Model Runner is a very viable alternative.
Model Runner requires Docker Engine, adding unnecessary layers (e.g., CLI integration) if we’re not containerizing our app. In comparison, Ollama’s lightweight nature and proven track record for simple, local CPU inference give it the edge in a resource-constrained environment where you want to minimize overhead.
In addition, models run on the host anyway, so isolation benefits are minimal.
- vLLM for scalability: If your web app grows to handle many users, you can migrate to vLLM later. For now, Ollama suffices.
While vLLM supports Arm CPUs (via its CPU backend with FP16/BF16 data types and NEON), its primary performance advantages are realized on GPUs. On a CPU-only server, many of its key optimizations will not be as impactful; PagedAttention, for example, is CUDA-centric, though CPU ports exist. On pure CPU/Arm it underperforms llama.cpp-based tools, with benchmarks showing 2-5x lower throughput than on GPUs.
Moreover, our scenario (a Python web app) is unlikely to be serving a high volume of concurrent users from the get-go. vLLM is overkill for such low-load, local scenario and could waste resources on complex configs. In comparison, Ollama is well-suited for a sequential workload like a Python web app processing user requests one at a time.
In summary, its higher complexity doesn’t justify the marginal performance gain it might offer over Ollama in a single-user context.
- llama.cpp for advanced customization: If you are an advanced user who needs to maximize performance and control every inference parameter, and you are comfortable with more setup, compiling llama.cpp from source is a fantastic runner-up. However, for most use cases, the convenience and optimized performance of Ampere-optimized Ollama make it the superior choice.
Largest model with reasonable performance
What models can we run reliably and comfortably in our current setting? With 24GB RAM, we need to account for the OS and other processes (~2-4GB reserved). Let’s assume we have about 20-22GB of usable RAM for our model’s weights and KV cache for context.
Quantization (e.g., to 4-bit or 5-bit) is essential: by reducing the precision of the model’s weights, it shrinks memory usage per parameter to ~0.5-0.625 bytes. This makes the file smaller and faster to run on CPUs, with minimal impact on quality for most tasks. The most common format is GGUF, with popular quantization levels like Q4_K_M.
RAM Estimation Formula: For a Q4-quantized model, RAM ≈ (parameters in billions × 0.5 GB) + KV cache (0.5-2GB for 512-2048 tokens) + overhead.
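As a quick sanity check, here is that rule of thumb as a small Python helper (rough estimates only; real usage varies with context length and runtime overhead):

```python
# Rough rule-of-thumb estimator for a Q4-quantized model's RAM footprint.
def estimate_ram_gb(params_billion: float,
                    gb_per_billion_params: float = 0.5,  # ~0.5 GB per B params at Q4
                    kv_cache_gb: float = 1.0,            # 0.5-2 GB for 512-2048 tokens
                    overhead_gb: float = 1.0) -> float:
    return params_billion * gb_per_billion_params + kv_cache_gb + overhead_gb

for size in (7, 13, 34, 70):
    print(f"{size}B @ Q4: ~{estimate_ram_gb(size):.1f} GB")
```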
- 7B Models (e.g., Llama 3 8B, Mistral 7B, Qwen2.5 7B): A Q4_K_M quantized 7B model is ~4-8 GB. This is the sweet spot for our machine. It will load quickly, leave plenty of RAM for a large context window, and provide good performance.
- 13B Models (e.g., Code Llama 13B): These models are ~7-10 GB when quantized. They will also run well on our system, offering a significant boost in reasoning capability over 7B models without pushing RAM limits.
- 34B+ Models: A quantized 34B model is ~15-25 GB. While it might technically load into our 24GB of RAM, it leaves very little room for the operating system or the context cache. Performance will likely be very slow, especially the Time to First Token (TTFT), as the system may struggle with memory management. This is not recommended for an interactive or responsive application.
- 70B Models: ~35-40GB, exceeding 24GB. Not feasible.
Reasonable Performance Threshold: Defined as >5 t/s for generation, with TTFT <5s. On 4 cores:
- A 30B Q4 model might achieve 3-8 t/s, depending on the tool (faster in llama.cpp).
- This is reasonable for interactive use like chatbots, but expect slower responses for long outputs. Smaller models (7-13B) might hit 10-20 t/s.
- Benchmarks on similar ARM servers (e.g., AWS Graviton or Ampere) show 7B models at 15-30 t/s on 4-8 cores, scaling down for larger ones.
- Running bigger models will require either disk offloading (slower) or upgrading to more RAM/cores via A1’s Flex shape.
Conclusion: For reasonable performance, 7B and 13B models are our best bet. Start with a 7-13B model like Llama 3.1 8B for testing, then scale up. Keep monitoring RAM usage with htop or free -h to avoid swapping, which kills performance.
Appendix: Comparison table
Feature | Ampere Optimized Ollama | Docker Model Runner | llama.cpp | vLLM | HF Transformers |
---|---|---|---|---|---|
Primary Focus | Simplicity and accessibility for local development and single‑user applications, tuned for Ampere Altra/Altra Max CPUs. | Ease of use and integration with the Docker ecosystem for running llama.cpp‑based models. | Efficient, lightweight inference engine for running LLMs on consumer hardware with quantization support. | High‑throughput and low‑latency inference for production environments. | Flexibility for model loading, fine‑tuning, and inference in Python environments using PyTorch. |
Best usage scenario | Single‑user tasks and local development where Ampere CPU optimization matters; not designed for high concurrency. | Single‑user or small‑team local development; performance depends on host CPU and Docker overhead. | Local development, experimentation, and efficient single‑user inference on diverse hardware including low‑resource devices. | Concurrent, batched inference in production; excels at serving many users with low latency. | Prototyping, research, fine‑tuning, and custom inference pipelines in Python workflows. |
Architecture | Wraps llama.cpp in a user‑friendly CLI and serves models via a local API; Ampere‑specific build flags. | Runs llama.cpp natively inside a container; models packaged as OCI artifacts. | Lightweight C/C++ library/server for LLM inference, supporting GGUF, quantization, CPU/GPU backends. | Optimized C++/CUDA engine with PagedAttention, tensor parallelism, and continuous batching. | Python‑based library using PyTorch to load and run transformer models, with pipeline abstractions. |
Performance | Ampere‑tuned build yields excellent TTFT and tokens/sec; >4× faster TTFT than generic containers after warmup. | Lower than native due to container overhead; may not fully exploit ARM64 SIMD. | Highest raw performance possible with tuned compilation and runtime flags. | Exceptional throughput in batched mode; competitive single‑request latency. | Slower than specialized engines due to Python overhead; improves with quantization and torch.compile. |
Flexibility | Modelfile customization; fewer runtime flags than llama.cpp. | Configurable via env vars or container args; less flexible than direct llama.cpp. | Full control over model loading, quantization, threading, and backends. | Flexible serving configs; less low-level control than llama.cpp. | Full access to model internals; integrates with PyTorch ecosystem. |
Ease of Use | Very easy: docker pull + ollama run. | Easy for Docker users; model volume management adds minor complexity. | Requires manual build and GGUF model download; straightforward for CLI users. | More complex; requires Python/CUDA setup. | Easy for Python devs; pip install + pipelines; manage venvs. |
API Access | Built‑in OpenAI‑compatible API. | OpenAI‑compatible API endpoint. | Built‑in fast, OpenAI‑compatible server. | Built‑in OpenAI‑compatible API server. | Transformers Python library |
ARM Support | Excellent; Ampere‑optimized binaries. | Multi‑arch Docker images; some overhead on ARM. | Excellent via NEON and cross‑compile. | Supported; may need source build or ARM wheels. | Supported via PyTorch ARM builds; works on Apple Silicon/Linux ARM. |
Resource Usage | Low for small models; scales with model size. | Moderate; Docker overhead present. | Very efficient with quantization; low CPU RAM use. | High for large models; efficient memory mgmt in batches. | High due to Python/full precision; quantization helps. |
Throughput (t/s on 7B Model) | ~12–20 t/s | ~10–18 t/s | ~15–25 t/s | ~10–18 t/s (batched) | ~5–10 t/s |
Latency (TTFT for 512 Tokens) | 0.8–2 s | 1–2.5 s | 0.5–1.5 s | 1–3 s | 2–5 s |
ARM/4‑Core Fit | Strong ARM perf; close to llama.cpp with easier scaling. | Good via multi‑arch; slight container penalty. | Excellent; NEON vector ops; scales well. | Works on ARM; throughput drops on few cores. | Works on ARM; slower due to Python; fine for dev. |