Installing Sherpa AIServer (CPU-only, no GPU)#

This guide describes how to install and run the stack when the language model runs on CPU (no NVIDIA GPU). Preparation follows the general installation; the differences are the LLM Docker image, the .env settings for the LLM, and the Compose profile used at startup.

Suitable for#

  • A server without an NVIDIA GPU, or one where a GPU is not needed for LLM inference.
  • Deployments that can accept significantly slower LLM responses than on GPU.

Important: Services such as Whisper and BGE Reranker in docker-compose.yml are designed for GPU. On a CPU-only host, do not enable the whisper, reranker, or full profiles; use the cpu profile instead. Decide separately about embed and pg, which do not require NVIDIA hardware.
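One way to avoid enabling a GPU profile by accident is to check the requested profile names before calling docker compose. The following is a minimal sketch, not part of the product: the function name check_cpu_profiles is hypothetical, and the blocked list contains only the profiles named in the note above.

```shell
#!/bin/sh
# Hypothetical helper: reject profiles that this guide says are GPU-oriented.
# The blocked names come from the note above; adjust them if your
# docker-compose.yml defines other GPU profiles.
check_cpu_profiles() {
    for profile in "$@"; do
        case "$profile" in
            whisper|reranker|full)
                echo "profile '$profile' is GPU-oriented; do not use it on a CPU-only host" >&2
                return 1
                ;;
        esac
    done
    return 0
}

# Usage: validate first, start only if the check passes.
# check_cpu_profiles cpu && docker compose -f docker-compose.yml --profile cpu up -d
```

The check is deliberately a deny-list, so profiles such as cpu pass through unchanged.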

What to download (same as regular installation, replacing the LLM image)#

Download the client archive, models, and other images.

Differences from GPU installation#

Component      | GPU installation                       | CPU installation
LLM image      | aiserver-llm-server                    | aiserver-llm-server-cpu
Compose launch | docker-compose.yml without cpu profile | docker-compose.yml with cpu profile

Everything else matches the general preparation list, provided you do not use the optional GPU services.

  • Client archive: client-files/latest
  • Images: aiserver, aiserver-pg, aiserver-embed, aiserver-code-interpreter, aiserver-nginx, aiserver-websocket, and optionally aiserver-whisper and aiserver-bge-reranker (only if you plan to use GPU profiles)
  • Models: embed_model_store.tar.gz (required for embed), and one LLM model of your choice

Download the aiserver-llm-server-cpu image instead of aiserver-llm-server, or in addition to it if you want both. In CPU scenarios the aiserver-llm-server image is optional; you may skip it to save disk space and bandwidth.

Unpack and load images into Docker#

  1. Unpack the client archive and prepare scripts.
  2. Load images via sudo ./sh_scripts/load_all_docker_images.sh.
  3. If there is an aiserver-llm-server-cpu_*.tar.gz archive and the loader does not pick it up, import it manually:
     docker load --input aiserver-llm-server-cpu_*.tar.gz
  4. Run extract_models.sh and extract_vllm.sh, then configure .env and certificates as in the general installation.
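After loading, it can help to confirm that the images you need are actually present. The sketch below is only an illustration: missing_images is a hypothetical helper, and it assumes the loaded repository names match the image names in this guide exactly, with no registry prefix.

```shell
#!/bin/sh
# Hypothetical helper: print every required image name that does not appear
# in the list of repository names read from stdin (one name per line).
missing_images() {
    loaded=$(cat)
    for image in "$@"; do
        # -x: match the whole line, -F: fixed string, -q: no output
        printf '%s\n' "$loaded" | grep -qxF -e "$image" || printf '%s\n' "$image"
    done
}

# Usage (assumes repository names carry no registry prefix):
# docker image ls --format '{{.Repository}}' | missing_images \
#     aiserver aiserver-pg aiserver-embed aiserver-llm-server-cpu
```

An empty result means every listed image was found. Any name printed still needs to be loaded, for example with the manual docker load command from step 3.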

Configure .env for CPU-LLM#

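The main aiserver service connects to the LLM via LLM_MODEL_API_BASE_URL / LLM_API_HOST and LLM_API_PORT. For the aiserver-llm-server-cpu container, the API listens on port 8000 inside the Docker network; docker-compose.yml maps that port to 3007 on the host. Because aiserver reaches the LLM over the Docker network, the settings below use the internal port 8000, not the host port 3007.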

Set:

LLM_API_HOST=aiserver-llm-server-cpu
LLM_API_PORT=8000
LLM_MODEL_API_BASE_URL=http://aiserver-llm-server-cpu:8000
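A mismatch between these three values is easy to miss, so a quick consistency check may be useful. This is a rough sketch under stated assumptions: the helper names get_env_value and check_llm_env are hypothetical, values in .env are unquoted with no trailing whitespace, and the base URL is expected to be exactly http://HOST:PORT, as in the values above.

```shell
#!/bin/sh
# Hypothetical helpers: read KEY=VALUE lines from an env file and confirm that
# LLM_MODEL_API_BASE_URL equals http://<LLM_API_HOST>:<LLM_API_PORT>.
get_env_value() {
    # $1 = file, $2 = key; prints the value of the last matching assignment
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

check_llm_env() {
    env_file=$1
    host=$(get_env_value "$env_file" LLM_API_HOST)
    port=$(get_env_value "$env_file" LLM_API_PORT)
    url=$(get_env_value "$env_file" LLM_MODEL_API_BASE_URL)
    expected="http://${host}:${port}"
    if [ "$url" = "$expected" ]; then
        echo "OK: $url"
    else
        echo "Mismatch: LLM_MODEL_API_BASE_URL='$url', expected '$expected'" >&2
        return 1
    fi
}

# Usage: check_llm_env .env
```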

Start with docker-compose.yml and cpu profile#

In the installation directory use the provided docker-compose.yml and explicitly enable the cpu profile so the aiserver-llm-server-cpu service starts and GPU profiles (gpu, gpu2) are not activated.

Basic start (CPU LLM only)#

docker compose -f docker-compose.yml --profile cpu up -d
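If repeating --profile cpu on every command is inconvenient, Docker Compose also reads the active profiles from the COMPOSE_PROFILES environment variable. The sketch below shows that standard Compose mechanism; whether it suits your deployment scripts is for you to judge.

```shell
#!/bin/sh
# Standard Docker Compose behavior: COMPOSE_PROFILES selects the active
# profiles, equivalent to passing --profile cpu on each command.
export COMPOSE_PROFILES=cpu

# With the variable exported, these commands act on the cpu profile:
# docker compose -f docker-compose.yml up -d
# docker compose -f docker-compose.yml ps
# docker compose -f docker-compose.yml down
```

The variable applies only to the current shell session unless you set it somewhere persistent.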

Stop#

docker compose -f docker-compose.yml --profile cpu down

Verify containers#

docker compose -f docker-compose.yml --profile cpu ps

The aiserver-llm-server-cpu container should be Up.

Optional host check for LLM#

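The compose file maps container port 8000 to port 3007 on the host. To verify from the host, query the models endpoint, falling back to the health endpoint (-f makes curl fail on HTTP errors so that the fallback actually runs):

curl -fsS "http://127.0.0.1:3007/v1/models" || curl -fsS "http://127.0.0.1:3007/health"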
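A container may report Up before the API inside it answers requests, so a single probe can fail even when nothing is wrong. A small retry loop is one way to handle that. The sketch below is illustrative only: retry_until_ok is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary values, not product recommendations.

```shell
#!/bin/sh
# Hypothetical helper: run a probe command up to N times, sleeping between
# attempts, and return success as soon as the probe succeeds.
# $1 = max attempts, $2 = seconds between attempts, rest = probe command
retry_until_ok() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Usage: poll the host-mapped port, 30 attempts 10 seconds apart.
# retry_until_ok 30 10 curl -fsS "http://127.0.0.1:3007/v1/models"
```

Passing the probe as arguments keeps the loop independent of curl, so the same helper works with either endpoint.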

Check the logs if needed: docker compose -f docker-compose.yml --profile cpu logs -f aiserver-llm-server-cpu.

Quick checklist#

  1. Download artifacts as for a normal installation, but use the aiserver-llm-server-cpu image instead of aiserver-llm-server.
  2. Unpack client files, load images, extract models.
  3. Set .env: LLM_MODEL_API_BASE_URL=http://aiserver-llm-server-cpu:8000 (and matching LLM_API_HOST/LLM_API_PORT).
  4. Start: docker compose -f docker-compose.yml --profile cpu up -d.
  5. Do not enable whisper / reranker / full if no GPU is available.

For CPU-only installs, the NVIDIA/Container Toolkit sections of the general installation can be skipped; they are needed only if you also run GPU containers.