Running the LLM Server on a Separate Server#
This section describes the setup for running only the LLM server on a separate server, without the other components of Sherpa AI Server.
When This Is Needed#
Running the LLM server on a separate server is useful when:
- You need to distribute the load between servers
- The LLM server requires powerful GPUs and is better off on a separate server
- You need to scale by running multiple LLM servers for load balancing
- You need to isolate the LLM server from the main application
Requirements#
- A server with an NVIDIA GPU (CUDA 11.8+)
- Docker and Docker Compose installed
- NVIDIA Container Toolkit installed
- LLM models placed in the llm-server/models/ directory
Setup#
Step 1: Prepare the Server#
Make sure all necessary components are installed on the server:
# Check GPU
nvidia-smi
# Check Docker
docker --version
docker compose version
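The checks above can be gathered into one preflight helper. This is a minimal sketch, not part of Sherpa AI Server: the function name missing_tools is hypothetical, and it only confirms that the commands are on PATH, not that the NVIDIA Container Toolkit is configured correctly.

```shell
#!/bin/sh
# Hypothetical helper: print each command from the argument list that is
# not found on PATH, and return non-zero if any of them is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Example call on the LLM host (commented out so the snippet has no side effects):
# missing_tools nvidia-smi docker && docker compose version
```

To confirm that containers can actually see the GPU, a commonly used smoke test is `docker run --rm --gpus all ubuntu nvidia-smi`, which should print the same GPU table as running nvidia-smi on the host.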
Step 2: Prepare Files#
Copy the following files and directories to the server:
# Required files:
# - docker-compose.yml (or docker-compose.main.yml)
# - .env file with settings
# - llm-server/models/ - directory with models
# - llm-server/templates/ - directory with templates (if used)
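One possible way to move the files listed above is to pack them into a single archive and copy that. The sketch below is an assumption about workflow, not a documented procedure: bundle_llm_files is a hypothetical helper, and it assumes the compose file is named docker-compose.yml.

```shell
#!/bin/sh
# Hypothetical helper: pack the files listed above into one tar archive.
# $1 = project directory on the source machine, $2 = output archive path.
bundle_llm_files() {
    src_dir=$1
    archive=$2
    # Required items, given relative to the project directory.
    set -- docker-compose.yml .env llm-server/models
    # Templates are optional, so include them only when the directory exists.
    if [ -d "$src_dir/llm-server/templates" ]; then
        set -- "$@" llm-server/templates
    fi
    # No compression: model weights usually compress poorly.
    tar -cf "$archive" -C "$src_dir" "$@"
}
```

For example, `bundle_llm_files . /tmp/llm-bundle.tar` followed by `scp /tmp/llm-bundle.tar user@llm-host:/opt/sherpa/` (host and target path are placeholders), then `tar -xf llm-bundle.tar` on the LLM server.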
Step 3: Comment Out Unnecessary Services#
Open the docker-compose.yml file and comment out all services except aiserver-llm-server.
Example: Commented Services
services:
  # aiserver-pg:
  #   container_name: aiserver-pg
  #   image: aiserver-pg:latest
  #   ...
  # aiserver-embed:
  #   container_name: aiserver-embed
  #   ...
  # aiserver:
  #   container_name: aiserver
  #   ...
  # The only active service:
  aiserver-llm-server:
    container_name: aiserver-llm-server
    image: aiserver-llm-server:latest
    restart: always
    env_file:
      - .env
    ports:
      - 3003:8000
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
    volumes:
      - "./llm-server/models:/model-store"
      - "./llm-server/templates:/model-templates"
    networks:
      - llm-net
  # aiserver-code_interpreter:
  #   ...
  # aiserver-whisper:
  #   ...
  # aiserver-bge_reranker:
  #   ...
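After commenting out the other services, it can help to confirm that Compose sees exactly one active service. The sketch below reads the list printed by `docker compose config --services`; the helper name only_llm_service is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read service names (one per line) on stdin and succeed
# only if the single active service is aiserver-llm-server.
only_llm_service() {
    # Drop blank lines and sort so the comparison is order-independent.
    services=$(grep -v '^[[:space:]]*$' | sort)
    if [ "$services" = "aiserver-llm-server" ]; then
        echo "OK: only aiserver-llm-server is active"
    else
        echo "Unexpected active services:"
        echo "$services"
        return 1
    fi
}
```

Usage: `docker compose -f docker-compose.yml config --services | only_llm_service`.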
Step 5: Configure Environment Variables#
Create or edit the .env file with LLM server settings:
# LLM server settings
LLM_CUDA_VISIBLE_DEVICES=0
LLM_TENSOR_PARALLEL_SIZE=1
LLM_GPU_MEMORY_UTILIZATION=0.90
LLM_COMPLETION_MODEL_NAME=/model-store/meta-llama/Meta-Llama-3-8B-Instruct
LLM_DTYPE=auto
LLM_TRUST_REMOTE_CODE=false
LLM_QUANTIZATION=false
LLM_MAX_MODEL_LEN=8192
LLM_HOST=0.0.0.0
LLM_PORT=8000
LLM_MAX_NUM_BATCHED_TOKENS=16384
LLM_MAX_NUM_SEQS=16
LLM_ENABLE_TOOLS=true
LLM_TOOL_CALL_PARSER=llama3_json
LLM_EXCLUDE_TOOLS_WHEN_NONE=true
Important:
- Ensure that the model path is correct. It is the path inside the container, where llm-server/models/ is mounted as /model-store:
LLM_COMPLETION_MODEL_NAME=/model-store/model-name
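Because the model path refers to the /model-store mount inside the container, a quick host-side check can catch a wrong path before the server starts. The following is a rough sketch under two assumptions: the value in .env is unquoted, and the model is stored as a directory. check_model_path is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: map LLM_COMPLETION_MODEL_NAME from a .env file to the
# host directory behind the /model-store volume and check that it exists.
# $1 = path to .env, $2 = host models directory (llm-server/models).
check_model_path() {
    env_file=$1
    models_dir=$2
    # If the variable appears more than once, this helper uses the last occurrence.
    model=$(sed -n 's/^LLM_COMPLETION_MODEL_NAME=//p' "$env_file" | tail -n 1)
    case $model in
        /model-store/*) host_path=$models_dir/${model#/model-store/} ;;
        *) echo "not under /model-store: $model"; return 1 ;;
    esac
    if [ -d "$host_path" ]; then
        echo "found: $host_path"
    else
        echo "missing: $host_path"
        return 1
    fi
}
```

Usage from the directory that holds docker-compose.yml: `check_model_path .env llm-server/models`.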
Step 6: Check Configuration#
Before starting, check the configuration:
# Check the syntax of the docker-compose file
docker compose -f docker-compose.yml config
# Check if the port is free (no output means nothing is listening on it)
netstat -tuln | grep 3003
# Check that the model is present
ls -la llm-server/models/
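A plain `grep 3003` also matches ports such as 13003 or 30030, and netstat is not installed on every distribution. As a hedged alternative, the sketch below matches the port exactly and works on the output of either `netstat -tuln` or `ss -tuln`. The helper name port_in_use is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read a listening-socket listing (netstat -tuln or
# ss -tuln output) on stdin and succeed if the given port is already bound.
port_in_use() {
    port=$1
    # Match ":PORT" followed by whitespace, so 3003 does not match 13003 or 30030.
    grep -Eq ":${port}[[:space:]]"
}
```

Usage: `if ss -tuln | port_in_use 3003; then echo "port 3003 is already taken"; fi`.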
Step 7: Start the LLM Server#
# Start only the LLM server
docker compose -f docker-compose.yml up -d aiserver-llm-server
# Or start everything (but only uncommented services will run)
docker compose -f docker-compose.yml up -d
# Check the status
docker compose -f docker-compose.yml ps
Expected Result: Only the aiserver-llm-server container should start.
Step 8: Check Operation#
# Check the logs
docker logs aiserver-llm-server
# Check GPU usage
nvidia-smi
# Check API availability (should return model information)
curl http://localhost:3003/v1/models
Expected Result:
- The container should start successfully
- There should be no critical errors in the logs
- The API should respond to requests
- nvidia-smi should show GPU memory in use once the model is loaded
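Loading a model can take a while, so the API may refuse connections for some time after the container starts. Rather than running curl by hand, a small retry loop can wait for the endpoint. This is a generic sketch (retry_until is a hypothetical helper), and the attempt count and delay in the usage line are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it succeeds.
# $1 = maximum attempts, $2 = seconds between attempts, rest = command to run.
retry_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Output of the probed command is discarded; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "not ready after $attempts attempt(s)"
    return 1
}
```

Usage: `retry_until 60 5 curl -sf http://localhost:3003/v1/models` polls the API every 5 seconds for up to 5 minutes.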
Connecting from Another Server#
If the LLM server is running on a separate server, configure the connection from the main server.
On the Main Server#
In the .env file of the main server, specify the address of the LLM server:
# Address of the LLM server (replace with the IP or domain of your LLM server)
LLM_SERVER_URL=http://192.168.1.100:3003
# or
LLM_SERVER_URL=http://llm-server.example.com:3003
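Before restarting the main application, it may be worth checking from the main server that the configured address is reachable. The sketch below builds the /v1/models URL from the LLM_SERVER_URL value and tolerates a trailing slash. models_endpoint is a hypothetical helper, and whether the application itself accepts a trailing slash is not covered here.

```shell
#!/bin/sh
# Hypothetical helper: build the /v1/models URL from an LLM_SERVER_URL value,
# stripping one trailing slash if present.
models_endpoint() {
    base=${1%/}
    echo "$base/v1/models"
}
```

Usage on the main server: `curl -sf "$(models_endpoint http://192.168.1.100:3003)"` should return the model information if the LLM server is reachable.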
Minimal Docker-Compose Configuration#
Example of a minimal docker-compose.yml for the LLM server only:
services:
  aiserver-llm-server:
    container_name: aiserver-llm-server
    image: aiserver-llm-server:latest
    restart: always
    env_file:
      - .env
    ports:
      - "3003:8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
    volumes:
      - "./llm-server/models:/model-store"
      - "./llm-server/templates:/model-templates"
    networks:
      - llm-net
networks:
  llm-net:
    name: llm-net
    driver: bridge
Save this file as docker-compose.llm-only.yml and use:
docker compose -f docker-compose.llm-only.yml up -d
Possible Issues#
Container Does Not Start#
Problem: The container crashes immediately after starting.
Solution:
- Check the logs:
docker logs aiserver-llm-server
- Ensure the GPU is available:
nvidia-smi
- Check that the model exists:
ls -la llm-server/models/
- Check permissions for the model directory
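When the container exits immediately, its exit code can narrow down the cause before reading the full logs. The code can be read with `docker inspect --format '{{.State.ExitCode}}' aiserver-llm-server`. The mapping below is a rough first guess based on general Unix conventions (128 plus the signal number), not on anything specific to the LLM server; explain_exit_code is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: translate a container exit code into a first guess.
explain_exit_code() {
    case $1 in
        0)   echo "exited cleanly; check the configured command" ;;
        137) echo "killed by SIGKILL; often the host ran out of memory" ;;
        139) echo "segmentation fault (SIGSEGV)" ;;
        *)   echo "application error; read the container logs" ;;
    esac
}
```

Usage: `explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' aiserver-llm-server)"`.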
Model Does Not Load#
Problem: Errors when loading the model.
Solution:
- Check the model path (LLM_COMPLETION_MODEL_NAME) in .env
- Ensure the model is present:
ls -la llm-server/models/
- Check logs for loading errors (2>&1 is needed because docker logs prints the container's stderr on stderr):
docker logs aiserver-llm-server 2>&1 | grep -i error
Insufficient GPU Memory#
Problem: The model does not fit in GPU memory.
Solution:
- Reduce LLM_GPU_MEMORY_UTILIZATION in .env
- Use the quantized version of the model (set LLM_QUANTIZATION=true)
- Use a smaller model
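To choose a value for LLM_GPU_MEMORY_UTILIZATION, it helps to see how much memory each GPU has and how much is already taken by other processes. The sketch below assumes the setting is a fraction of total GPU memory, which the name suggests but this guide does not state. gpu_budget is a hypothetical helper that reads the CSV printed by nvidia-smi.

```shell
#!/bin/sh
# Hypothetical helper: read "index, used MiB, total MiB" CSV lines on stdin and
# print, per GPU, the memory budget implied by a utilization fraction.
# Assumption: LLM_GPU_MEMORY_UTILIZATION is a fraction of total GPU memory.
gpu_budget() {
    awk -F', *' -v frac="$1" '{
        printf "GPU %s: budget %d MiB of %d MiB (%d MiB already in use)\n", $1, $3 * frac, $3, $2
    }'
}
```

Usage: `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits | gpu_budget 0.90`.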
Port Not Accessible Externally#
Problem: Cannot connect to the LLM server from another server.
Solution:
- Check the firewall:
sudo ufw status
- Check that the port is forwarded:
docker port aiserver-llm-server
- Check Docker network settings
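One thing to look for in the docker port output is whether the port is published on all interfaces or only on a specific address such as 127.0.0.1. The sketch below checks for that; published_externally is a hypothetical helper, and it assumes the usual `8000/tcp -> 0.0.0.0:3003` output format.

```shell
#!/bin/sh
# Hypothetical helper: read `docker port <container>` output on stdin and
# succeed only if some mapping listens on all interfaces (0.0.0.0 or [::]).
published_externally() {
    grep -Eq -e '-> (0\.0\.0\.0|\[::\]):[0-9]+$'
}
```

Usage: `docker port aiserver-llm-server | published_externally || echo "port 3003 is not published on all interfaces"`.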
Monitoring#
To monitor the operation of the LLM server:
# Container status
docker ps | grep llm-server
# Resource usage
docker stats aiserver-llm-server
# GPU usage
watch -n 1 nvidia-smi
# Real-time logs
docker logs -f aiserver-llm-server
# API check
curl http://localhost:3003/health
curl http://localhost:3003/v1/models
Performance Optimization#
To optimize the performance of the LLM server:
Configure GPU Memory:
# Use maximum available memory
LLM_GPU_MEMORY_UTILIZATION=0.90
Batching Configuration:
LLM_MAX_NUM_BATCHED_TOKENS=16384
LLM_MAX_NUM_SEQS=16
Use Quantization:
# For models that support quantization
LLM_QUANTIZATION=true
After completing all steps, you should have the LLM server running on a separate server, which can be used from the main server or other applications via the API on port 3003.