Running the LLM Server on a Separate Server

This section describes the setup for running only the LLM server on a separate server, without the other components of Sherpa AI Server.

When This Is Needed

Running the LLM server on a separate server is useful when:

You need to distribute the load between servers
The LLM server requires powerful GPUs and is better placed on a dedicated machine
You need to scale out by running multiple LLM servers for load balancing
You need to isolate the LLM server from the main application

Requirements

A server with an NVIDIA GPU (CUDA 11.8+)
Docker and Docker Compose installed
NVIDIA Container Toolkit installed
LLM models downloaded to the llm-server/models/ directory

Setup

Step 1: Prepare the Server

Make sure all necessary components are installed on the server:

# Check GPU
nvidia-smi

# Check Docker
docker --version
docker compose version
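
The checks above can be combined into a small preflight script. This is only a sketch: the helper names require_cmd and preflight are hypothetical, and it verifies only that the commands are on PATH, not that the GPU driver or the NVIDIA Container Toolkit is configured correctly.

```shell
#!/bin/sh
# Preflight sketch: verify that the tools used in this guide are on PATH.
# require_cmd and preflight are illustrative names, not part of Sherpa AI Server.

# require_cmd NAME: succeed if NAME is an available command, otherwise report it.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# preflight: check every prerequisite and return non-zero if any is missing.
preflight() {
    rc=0
    for cmd in nvidia-smi docker; do
        require_cmd "$cmd" || rc=1
    done
    return "$rc"
}
```

Calling preflight prints one line per tool and returns a non-zero status if any tool is missing, so it can be chained as preflight && docker compose version.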

Step 2: Prepare Files

Copy the following files and directories to the server:

# Required files:
# - docker-compose.yml (or docker-compose.main.yml)
# - .env file with settings
# - llm-server/models/ - directory with models
# - llm-server/templates/ - directory with templates (if used)

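Step 3: Comment Out Unnecessary Services

Open the docker-compose.yml file and comment out all services except aiserver-llm-server. Keep the top-level networks section that defines llm-net, because the service still references that network.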

Example: Commented Services

services:

  # aiserver-pg:
  #   container_name: aiserver-pg
  #   image: aiserver-pg:latest
  #   ...

  # aiserver-embed:
  #   container_name: aiserver-embed
  #   ...

  # aiserver:
  #   container_name: aiserver
  #   ...

  # The only active service:
  aiserver-llm-server:
    container_name: aiserver-llm-server
    image: aiserver-llm-server:latest
    restart: always
    env_file:
      - .env
    ports:
      - "3003:8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
    volumes:
      - "./llm-server/models:/model-store"
      - "./llm-server/templates:/model-templates"
    networks:
      - llm-net

  # aiserver-code_interpreter:
  #   ...

  # aiserver-whisper:
  #   ...

  # aiserver-bge_reranker:
  #   ...

Step 5: Configure Environment Variables

Create or edit the .env file with LLM server settings:

# LLM server settings
LLM_CUDA_VISIBLE_DEVICES=0
LLM_TENSOR_PARALLEL_SIZE=1
LLM_GPU_MEMORY_UTILIZATION=0.90
LLM_COMPLETION_MODEL_NAME=/model-store/meta-llama/Meta-Llama-3-8B-Instruct
LLM_DTYPE=auto
LLM_TRUST_REMOTE_CODE=false
LLM_QUANTIZATION=false
LLM_MAX_MODEL_LEN=8192
LLM_HOST=0.0.0.0
LLM_PORT=8000
LLM_MAX_NUM_BATCHED_TOKENS=16384
LLM_MAX_NUM_SEQS=16
LLM_ENABLE_TOOLS=true
LLM_TOOL_CALL_PARSER=llama3_json
LLM_EXCLUDE_TOOLS_WHEN_NONE=true

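Important:

Make sure LLM_COMPLETION_MODEL_NAME points to the model's path inside the container. Because ./llm-server/models is mounted at /model-store, a model stored in llm-server/models/model-name has the path /model-store/model-name.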
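
To catch a wrong model path before starting the container, the mapping can be checked on the host. The following is a minimal sketch that assumes the volume mapping shown in this guide (./llm-server/models mounted at /model-store) and is run from the directory containing docker-compose.yml. The function names env_value, host_model_path, and check_model are hypothetical.

```shell
#!/bin/sh
# Sketch: translate the container model path from .env to its host location.
# Assumes ./llm-server/models is mounted at /model-store, as in this guide.

# env_value FILE KEY: print the last value assigned to KEY in a KEY=VALUE file.
env_value() {
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# host_model_path CONTAINER_PATH: map /model-store/... to llm-server/models/...
host_model_path() {
    case "$1" in
        /model-store/*) echo "llm-server/models/${1#/model-store/}" ;;
        *) echo "not under /model-store: $1" >&2; return 1 ;;
    esac
}

# check_model ENV_FILE: succeed if the configured model exists on the host.
check_model() {
    model=$(env_value "$1" LLM_COMPLETION_MODEL_NAME)
    path=$(host_model_path "$model") || return 1
    if [ -e "$path" ]; then
        echo "found: $path"
    else
        echo "missing: $path" >&2
        return 1
    fi
}
```

Running check_model .env prints the host path when the model is present and returns a non-zero status otherwise.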

Step 6: Check Configuration

Before starting, check the configuration:

# Check the syntax of the docker-compose file
docker compose -f docker-compose.yml config

# Check that port 3003 is free (no output means nothing is listening on it)
netstat -tuln | grep 3003

# Check that the model files are present
ls -la llm-server/models/

Step 7: Start the LLM Server

# Start only the LLM server
docker compose -f docker-compose.yml up -d aiserver-llm-server

# Or start everything (but only uncommented services will run)
docker compose -f docker-compose.yml up -d

# Check the status
docker compose -f docker-compose.yml ps

Expected Result: Only the aiserver-llm-server container should start.

Step 8: Check Operation

# Check the logs
docker logs aiserver-llm-server

# Check GPU usage
nvidia-smi

# Check API availability (should return model information)
curl http://localhost:3003/v1/models

Expected Result:

The container starts and keeps running
The logs contain no critical errors
The API responds to requests
nvidia-smi shows GPU memory in use once the model is loaded
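
Loading a model can take a while, so the API may not answer immediately after the container starts. One way to handle this is to poll the endpoint until it responds. The retry helper below is a hypothetical, generic sketch; the attempt count and delay in the usage note are arbitrary values, not recommendations from Sherpa AI Server.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY COMMAND...: run COMMAND until it succeeds
# or ATTEMPTS runs are used up, sleeping DELAY seconds between runs.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $n/$attempts failed" >&2
        # Skip the sleep after the final attempt.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}
```

For example, retry 30 10 curl -fsS http://localhost:3003/v1/models would probe the API every 10 seconds for up to 30 attempts and stop as soon as the request succeeds.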

Connecting from Another Server

If the LLM server is running on a separate server, configure the connection from the main server.

On the Main Server

In the .env file of the main server, specify the address of the LLM server:

# Address of the LLM server (replace with the IP or domain of your LLM server)
LLM_SERVER_URL=http://192.168.1.100:3003
# or, using a domain name (set only one of the two):
# LLM_SERVER_URL=http://llm-server.example.com:3003
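
Before restarting the main application, it can help to confirm from the main server that the configured address is reachable. The sketch below reuses the /v1/models endpoint from the checks earlier in this guide. The function names models_url and check_llm_server and the 5-second timeout are assumptions made for illustration.

```shell
#!/bin/sh
# models_url BASE: build the model-listing URL from an LLM_SERVER_URL value,
# tolerating a single trailing slash.
models_url() {
    echo "${1%/}/v1/models"
}

# check_llm_server BASE: succeed if the server answers on /v1/models
# within 5 seconds (the timeout is an arbitrary choice).
check_llm_server() {
    curl -fsS --max-time 5 "$(models_url "$1")" >/dev/null
}
```

For example, check_llm_server "http://192.168.1.100:3003" && echo reachable prints reachable only when the request succeeds. A failure here usually points at the firewall or port publishing issues described under Possible Issues.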

Minimal Docker-Compose Configuration

Example of a minimal docker-compose.yml for the LLM server only:

services:
  aiserver-llm-server:
    container_name: aiserver-llm-server
    image: aiserver-llm-server:latest
    restart: always
    env_file:
      - .env
    ports:
      - "3003:8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
    volumes:
      - "./llm-server/models:/model-store"
      - "./llm-server/templates:/model-templates"
    networks:
      - llm-net

networks:
  llm-net:
    name: llm-net
    driver: bridge

Save this file as docker-compose.llm-only.yml and start it with:

docker compose -f docker-compose.llm-only.yml up -d

Possible Issues

Container Does Not Start

Problem: The container crashes immediately after starting.

Solution:

Check the logs: docker logs aiserver-llm-server
Ensure the GPU is available: nvidia-smi
Check that the model exists: ls -la llm-server/models/
Check permissions for the model directory

Model Does Not Load

Problem: Errors when loading the model.

Solution:

Check the model path in .env: LLM_COMPLETION_MODEL_NAME
Ensure the model files are present: ls -la llm-server/models/
Check logs for loading errors (2>&1 includes messages written to stderr): docker logs aiserver-llm-server 2>&1 | grep -i error

Insufficient GPU Memory

Problem: The model does not fit in GPU memory.

Solution:

Reduce LLM_GPU_MEMORY_UTILIZATION in .env
Use the quantized version of the model (set LLM_QUANTIZATION=true)
Use a smaller model
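
Before changing these settings, it can be useful to see how much GPU memory is actually free, since other processes may already occupy part of it. The sketch below summarizes the output of nvidia-smi's CSV query mode. The function name gpu_free_percent is hypothetical, and the GPU numbering assumes that lines appear in device index order.

```shell
#!/bin/sh
# gpu_free_percent: read "used, total" lines (MiB) on stdin, one per GPU,
# and print the free share of memory for each GPU.
gpu_free_percent() {
    awk -F', *' '$2 > 0 {
        printf "GPU %d: %d%% free\n", NR - 1, ($2 - $1) * 100 / $2
    }'
}
```

It is meant to be fed from nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | gpu_free_percent.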

Port Not Accessible Externally

Problem: Cannot connect to the LLM server from another server.

Solution:

Check the firewall: sudo ufw status
Check that the port is published on the host: docker port aiserver-llm-server
Check Docker network settings

Monitoring

To monitor the operation of the LLM server:

# Container status
docker ps | grep llm-server

# Resource usage
docker stats aiserver-llm-server

# GPU usage
watch -n 1 nvidia-smi

# Real-time logs
docker logs -f aiserver-llm-server

# API check
curl http://localhost:3003/health
curl http://localhost:3003/v1/models

Performance Optimization

To optimize the performance of the LLM server:

Configure GPU Memory:

# Share of GPU memory the server may use (0.90 = 90%)
LLM_GPU_MEMORY_UTILIZATION=0.90

Batching Configuration:

LLM_MAX_NUM_BATCHED_TOKENS=16384
LLM_MAX_NUM_SEQS=16

Use Quantization:

# For models that support quantization
LLM_QUANTIZATION=true

After completing all steps, you should have the LLM server running on a separate server, which can be used from the main server or other applications via the API on port 3003.