Running the LLM Server on a Separate Server

This section describes the setup for running only the LLM server on a separate server, without the other components of Sherpa AI Server.

When This Is Needed

Running the LLM server on a separate server is useful when:

  • You need to distribute the load between servers
  • The LLM server requires powerful GPUs and is better hosted on a dedicated GPU machine
  • You need to scale out by running multiple LLM servers for load balancing
  • You need to isolate the LLM server from the main application

Requirements

  • A server with an NVIDIA GPU (CUDA 11.8+)
  • Docker and Docker Compose installed
  • NVIDIA Container Toolkit installed
  • LLM models downloaded to the llm-server/models/ directory

Setup

Step 1: Prepare the Server

Make sure all necessary components are installed on the server:

# Check GPU
nvidia-smi

# Check Docker
docker --version
docker compose version
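
The checks above can be wrapped in a small helper that fails fast when a tool is missing. This is a minimal sketch; the function name require_cmds is illustrative and not part of Sherpa AI Server. Because docker compose is a subcommand rather than a separate binary, it still needs the explicit version check shown above.

```shell
# Report every command from the argument list that is not on PATH.
# Returns 0 only if all of them are available.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: require_cmds nvidia-smi docker && docker compose version
```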

Step 2: Prepare Files

Copy the following files and directories to the server:

# Required files:
# - docker-compose.yml (or docker-compose.main.yml)
# - .env file with settings
# - llm-server/models/ - directory with models
# - llm-server/templates/ - directory with templates (if used)
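
Before copying, it can help to confirm that the bundle is complete. The sketch below assumes the file names listed above (with docker-compose.yml rather than docker-compose.main.yml); missing_paths is a hypothetical helper, and user@llm-host and /opt/aiserver/ in the example are placeholders.

```shell
# Print each required path that is absent under the given project directory.
# Templates are optional, so they are not checked.
missing_paths() {
    dir=$1
    for p in docker-compose.yml .env llm-server/models; do
        [ -e "$dir/$p" ] || echo "$p"
    done
}

# Example: copy only when nothing is missing.
# [ -z "$(missing_paths .)" ] && rsync -a docker-compose.yml .env llm-server user@llm-host:/opt/aiserver/
```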

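Step 3: Comment Out Unnecessary Services

Open the docker-compose.yml file and comment out all services except aiserver-llm-server. Keep the top-level networks section that defines llm-net, because the service still references it.

Example: Commented-Out Services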

services:

  # aiserver-pg:
  #   container_name: aiserver-pg
  #   image: aiserver-pg:latest
  #   ...

  # aiserver-embed:
  #   container_name: aiserver-embed
  #   ...

  # aiserver:
  #   container_name: aiserver
  #   ...

  # The only active service:
  aiserver-llm-server:
    container_name: aiserver-llm-server
    image: aiserver-llm-server:latest
    restart: always
    env_file:
      - .env
    ports:
      - "3003:8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
    volumes:
      - "./llm-server/models:/model-store"
      - "./llm-server/templates:/model-templates"
    networks:
      - llm-net

  # aiserver-code_interpreter:
  #   ...

  # aiserver-whisper:
  #   ...

  # aiserver-bge_reranker:
  #   ...
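
To confirm that exactly one service is still active after editing, the service list printed by docker compose config --services can be compared with the expected name. This is a minimal sketch; only_llm_service is a hypothetical helper name.

```shell
# Read service names (one per line) on stdin and succeed only if the
# single active service is aiserver-llm-server.
only_llm_service() {
    services=$(cat)
    [ "$services" = "aiserver-llm-server" ]
}

# Example:
# docker compose -f docker-compose.yml config --services | only_llm_service \
#     && echo "only the LLM server is active"
```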

Step 5: Configure Environment Variables

Create or edit the .env file with LLM server settings:

# LLM server settings
LLM_CUDA_VISIBLE_DEVICES=0
LLM_TENSOR_PARALLEL_SIZE=1
LLM_GPU_MEMORY_UTILIZATION=0.90
LLM_COMPLETION_MODEL_NAME=/model-store/meta-llama/Meta-Llama-3-8B-Instruct
LLM_DTYPE=auto
LLM_TRUST_REMOTE_CODE=false
LLM_QUANTIZATION=false
LLM_MAX_MODEL_LEN=8192
LLM_HOST=0.0.0.0
LLM_PORT=8000
LLM_MAX_NUM_BATCHED_TOKENS=16384
LLM_MAX_NUM_SEQS=16
LLM_ENABLE_TOOLS=true
LLM_TOOL_CALL_PARSER=llama3_json
LLM_EXCLUDE_TOOLS_WHEN_NONE=true

Important:

  • Ensure that the model path is correct: LLM_COMPLETION_MODEL_NAME=/model-store/model-name. The path is resolved inside the container, where ./llm-server/models is mounted as /model-store.
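
Translating the configured value back to a host path shows whether the model files sit where the volume mount expects them. The sketch below assumes the ./llm-server/models to /model-store mount from the compose file; host_model_path is a hypothetical helper.

```shell
# Print the host-side path for the model configured in an .env file,
# assuming ./llm-server/models is mounted at /model-store.
host_model_path() {
    env_file=$1
    value=$(grep '^LLM_COMPLETION_MODEL_NAME=' "$env_file" | tail -n 1 | cut -d= -f2-)
    case $value in
        /model-store/*) echo "llm-server/models/${value#/model-store/}" ;;
        *) echo "unexpected model path: $value" >&2; return 1 ;;
    esac
}

# Example: ls -d "$(host_model_path .env)"
```

If ls reports that the directory does not exist, either the model was copied to a different location or the value in .env does not match the directory name.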

Step 6: Check Configuration

Before starting, check the configuration:

# Check the syntax of the docker-compose file
docker compose -f docker-compose.yml config

# Check that the port is free (no output means nothing is listening on 3003)
netstat -tuln | grep 3003

# Check for the model
ls -la llm-server/models/

Step 7: Start the LLM Server

# Start only the LLM server
docker compose -f docker-compose.yml up -d aiserver-llm-server

# Or start everything (but only uncommented services will run)
docker compose -f docker-compose.yml up -d

# Check the status
docker compose -f docker-compose.yml ps

Expected Result: Only the aiserver-llm-server container should start.

Step 8: Check Operation

# Check the logs
docker logs aiserver-llm-server

# Check GPU usage
nvidia-smi

# Check API availability (should return model information)
curl http://localhost:3003/v1/models

Expected Result:

  • The container should start successfully
  • There should be no critical errors in the logs
  • The API should respond to requests
  • nvidia-smi should show GPU memory in use once the model is loaded
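
The API may not answer immediately after the container starts, because the model has to be loaded first. A polling loop along the following lines avoids checking too early. This is a sketch; wait_for_llm and its default attempt count and delay are assumptions, not part of the product.

```shell
# Poll the models endpoint until it answers or the attempts run out.
# Arguments: URL, number of attempts, seconds to sleep between attempts.
wait_for_llm() {
    url=$1
    attempts=${2:-60}
    delay=${3:-5}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl -sf -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "LLM server did not respond at $url" >&2
    return 1
}

# Example: wait_for_llm http://localhost:3003/v1/models 60 5
```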

Connecting from Another Server

If the LLM server is running on a separate server, configure the connection from the main server.

On the Main Server

In the .env file of the main server, specify the address of the LLM server:

# Address of the LLM server (replace with the IP or domain of your LLM server).
# Set exactly one value:
LLM_SERVER_URL=http://192.168.1.100:3003
# or, using a domain name:
# LLM_SERVER_URL=http://llm-server.example.com:3003
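
To verify the connection from the main server, the configured address can be read back from .env and queried with curl. This is a minimal sketch; llm_models_url is a hypothetical helper, and it uses the last uncommented LLM_SERVER_URL line in the file.

```shell
# Build the models endpoint from LLM_SERVER_URL in the given .env file.
# Commented-out lines are ignored; a trailing slash on the URL is dropped.
llm_models_url() {
    base=$(grep '^LLM_SERVER_URL=' "$1" | tail -n 1 | cut -d= -f2-)
    [ -n "$base" ] || return 1
    echo "${base%/}/v1/models"
}

# Example (run on the main server):
# curl -sf --max-time 5 "$(llm_models_url .env)"
```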

Minimal Docker-Compose Configuration

Example of a minimal docker-compose.yml for the LLM server only:

services:
  aiserver-llm-server:
    container_name: aiserver-llm-server
    image: aiserver-llm-server:latest
    restart: always
    env_file:
      - .env
    ports:
      - "3003:8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
    volumes:
      - "./llm-server/models:/model-store"
      - "./llm-server/templates:/model-templates"
    networks:
      - llm-net

networks:
  llm-net:
    name: llm-net
    driver: bridge

Save this file as docker-compose.llm-only.yml and start the server with:

docker compose -f docker-compose.llm-only.yml up -d

Possible Issues

Container Does Not Start

Problem: The container crashes immediately after starting.

Solution:

  1. Check the logs: docker logs aiserver-llm-server
  2. Ensure the GPU is available: nvidia-smi
  3. Check that the model exists: ls -la llm-server/models/
  4. Check permissions for the model directory

Model Does Not Load

Problem: Errors when loading the model.

Solution:

  1. Check the model path in .env: LLM_COMPLETION_MODEL_NAME
  2. Ensure the model files are present: ls -la llm-server/models/
  3. Check logs for loading errors: docker logs aiserver-llm-server 2>&1 | grep -i error

Insufficient GPU Memory

Problem: The model does not fit in GPU memory.

Solution:

  • Reduce LLM_GPU_MEMORY_UTILIZATION in .env if other processes share the GPU, so the server does not request more memory than is actually free
  • Use the quantized version of the model (set LLM_QUANTIZATION=true)
  • Use a smaller model

Port Not Accessible Externally

Problem: Cannot connect to the LLM server from another server.

Solution:

  1. Check the firewall: sudo ufw status
  2. Check that the port is published: docker port aiserver-llm-server
  3. Check Docker network settings
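
A quick TCP probe from the main server helps separate network problems from application problems. The sketch below relies on bash's /dev/tcp redirection and the coreutils timeout command, both assumed to be available; port_open is a hypothetical helper, and the address in the example is the placeholder used earlier.

```shell
# Succeed if a TCP connection to host:port can be opened within 3 seconds.
# Requires bash (for /dev/tcp) and coreutils timeout.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (run on the main server):
# port_open 192.168.1.100 3003 && echo "port 3003 reachable"
```

If the probe fails while curl on the LLM server itself succeeds, a firewall is the likely cause. When ufw is the active firewall, sudo ufw allow 3003/tcp opens the port.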

Monitoring

To monitor the operation of the LLM server:

# Container status
docker ps | grep llm-server

# Resource usage
docker stats aiserver-llm-server

# GPU usage
watch -n 1 nvidia-smi

# Real-time logs
docker logs -f aiserver-llm-server

# API check
curl http://localhost:3003/health
curl http://localhost:3003/v1/models
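
For unattended monitoring, the health check can be turned into a one-line status report that a scheduler such as cron appends to a log. This is a sketch; report_health and the log file name are assumptions.

```shell
# Print a timestamped "up" or "down" line for the given health URL.
# Returns non-zero when the server is down so callers can react.
report_health() {
    if curl -sf -o /dev/null --max-time 5 "$1"; then
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) up $1"
    else
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) down $1"
        return 1
    fi
}

# Example: report_health http://localhost:3003/health >> llm-health.log
```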

Performance Optimization

To optimize the performance of the LLM server:

  1. Configure GPU Memory:

    # Share of GPU memory the server may use (0.90 = 90%)
    LLM_GPU_MEMORY_UTILIZATION=0.90

  2. Configure Batching:

    LLM_MAX_NUM_BATCHED_TOKENS=16384
    LLM_MAX_NUM_SEQS=16

  3. Use Quantization:

    # Only for models that support quantization
    LLM_QUANTIZATION=true

After completing all steps, you should have the LLM server running on a separate server, which can be used from the main server or other applications via the API on port 3003.