Running Two LLM Containers Simultaneously
This section describes how to run two LLM server containers at the same time on a single server with multiple GPUs, with each container assigned its own GPU.
When This Is Needed
Running two LLM containers is useful when:
- You have multiple GPUs and want to use them for different models
- You need to run different models at the same time (for example, one for chat, another for specialized tasks)
- You need to distribute the load between multiple models
Requirements
- A server with at least 2 NVIDIA GPUs
- Each GPU must have enough memory for the chosen model
- Docker and Docker Compose v2 installed (the commands below use the docker compose syntax)
- NVIDIA Container Toolkit installed
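To check the memory requirement, you can read each GPU's total memory with nvidia-smi's query mode. The sketch below is an illustrative helper and is not part of this project. check_gpu_memory is a hypothetical name, and you choose the minimum (in MiB) based on your model. It reads lines in the form printed by nvidia-smi --query-gpu=index,memory.total --format=csv,noheader,nounits.

```shell
# Hypothetical helper: flag GPUs whose total memory is below a chosen minimum.
# Input (stdin): "index, memory.total" lines in MiB, as printed by
#   nvidia-smi --query-gpu=index,memory.total --format=csv,noheader,nounits
# Argument: minimum memory per GPU in MiB (your own estimate for the model).
# Exit status: 0 if every GPU meets the minimum, 1 otherwise.
check_gpu_memory() {
  awk -F', *' -v min="$1" '
    {
      if ($2 + 0 < min + 0) { state = "below minimum"; bad = 1 } else { state = "ok" }
      printf "GPU %s: %s MiB (%s)\n", $1, $2, state
    }
    END { exit (bad ? 1 : 0) }'
}

# Usage:
#   nvidia-smi --query-gpu=index,memory.total --format=csv,noheader,nounits | check_gpu_memory 40000
```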
Setup
Step 1: Check Available GPUs
Make sure you have at least 2 GPUs:
nvidia-smi
Expected Result: You should see at least 2 GPUs in the list.
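You can also script the same check, for example at the start of a deployment script. This is a minimal sketch with hypothetical function names. It counts the "GPU N: ..." lines that nvidia-smi -L prints, one per device.

```shell
# Hypothetical helpers: count GPUs in "nvidia-smi -L" output and enforce a minimum.
# nvidia-smi -L prints one line per device, e.g. "GPU 0: <model name> (UUID: ...)".

# Print the number of "GPU <index>" lines read from stdin.
count_gpus() {
  grep -c '^GPU [0-9]'
}

# Fail (exit status 1) unless stdin lists at least $1 GPUs.
require_gpus() {
  n=$(count_gpus)
  if [ "$n" -ge "$1" ]; then
    echo "OK: $n GPU(s) found"
  else
    echo "ERROR: at least $1 GPUs required, found $n" >&2
    return 1
  fi
}

# Usage: nvidia-smi -L | require_gpus 2
```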
Step 2: Uncomment the Second LLM Container
Open docker-compose.yml and find the commented-out aiserver-llm-server2 service block (around lines 103-142 of that file).
Uncomment the entire block by removing the leading # (and the space after it, if any) from each line, so that the block's indentation matches the other services:
# Before:
# aiserver-llm-server2:
#   container_name: aiserver-llm-server2
#   image: aiserver-llm-server:latest
#   ...
# After:
aiserver-llm-server2:
  container_name: aiserver-llm-server2
  image: aiserver-llm-server:latest
  ...
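After saving the file, you can confirm that Compose now sees the second service. docker compose config --services prints one service name per line. The helper below is a sketch with a hypothetical name, and it checks whether a given name appears in that list.

```shell
# Hypothetical helper: succeed if the service name $1 appears as a whole line on stdin,
# e.g. in the output of: docker compose -f docker-compose.yml config --services
service_enabled() {
  if grep -qx -- "$1"; then
    echo "enabled: $1"
  else
    echo "not found: $1 (is the block still commented out?)" >&2
    return 1
  fi
}

# Usage:
#   docker compose -f docker-compose.yml config --services | service_enabled aiserver-llm-server2
```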
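Step 3: Configure Ports
Make sure the host ports do not conflict (mappings are written as host port:container port):
- aiserver-llm-server (first container): 3003:8000
- aiserver-llm-server2 (second container): a different host port, for example 3006:8000 or 3007:8000
Both containers can keep container port 8000. Only the host port must be unique.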
In the uncommented block, check the line:
ports:
  - 3006:8000 # or another free port
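You can catch a duplicated host port before starting anything by comparing the host part of each mapping. This is a sketch with a hypothetical function name. It assumes the short HOST:CONTAINER form used in this document, not the IP:HOST:CONTAINER form.

```shell
# Hypothetical helper: report host ports used by more than one HOST:CONTAINER mapping.
# Assumes the short "HOST:CONTAINER" form, e.g. 3003:8000 (not "IP:HOST:CONTAINER").
check_host_ports() {
  # Keep only the host part of each mapping and list the values that occur more than once.
  dups=$(for m in "$@"; do echo "${m%%:*}"; done | sort | uniq -d)
  if [ -n "$dups" ]; then
    echo "conflicting host ports: $dups" >&2
    return 1
  fi
  echo "host ports OK: $*"
}

# Usage: check_host_ports 3003:8000 3006:8000
```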
Step 4: Configure GPU for Each Container
Each container must be assigned its own GPU. Otherwise, both models can end up on the same device.
For the First Container (aiserver-llm-server)
The first container typically uses GPU 0 by default. Check the environment variables in the .env file or in docker-compose.yml itself:
environment:
  LLM_CUDA_VISIBLE_DEVICES: 0 # or leave it unset, in which case GPU 0 is used
For the Second Container (aiserver-llm-server2)
In the uncommented block, find the line:
environment:
  LLM_CUDA_VISIBLE_DEVICES: 1 # Uses GPU 1
Make sure the value is the index of the second GPU (usually 1).
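Once the containers are running, you can check which value each one actually received. docker inspect can print a container's environment as KEY=VALUE lines with the Go template {{range .Config.Env}}{{println .}}{{end}}. The helper below is a sketch with a hypothetical name that extracts one variable from such a listing. It shows only what was passed to the container, not how the server inside interprets it.

```shell
# Hypothetical helper: print the value of variable $1 from KEY=VALUE lines on stdin.
# Exit status 1 if the variable is not present.
env_value() {
  awk -v name="$1" '
    index($0, name "=") == 1 { print substr($0, length(name) + 2); found = 1; exit }
    END { exit (found ? 0 : 1) }'
}

# Usage:
#   docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' aiserver-llm-server2 \
#     | env_value LLM_CUDA_VISIBLE_DEVICES
```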
Step 5: Configure Models
Make sure each container points to the intended model:
First Container (aiserver-llm-server)
The first container uses the settings from the .env file or the default values. Check the variable:
LLM_COMPLETION_MODEL_NAME=/model-store/model-name-1
Second Container (aiserver-llm-server2)
In the uncommented block, find the line:
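environment:
  LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"
Change it if you need a different model. The /model-store path is inside the container. In the example configuration in this section, it is mounted from ./llm-server/models on the host.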
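The model path is resolved inside the container, so it helps to check that the corresponding host path exists before starting. The sketch below assumes the volume mapping from this document's example configuration (./llm-server/models on the host mounted at /model-store). The function names are hypothetical, and you should adjust the mapping if your volumes differ.

```shell
# Hypothetical helpers. Assumes the volume mapping "./llm-server/models:/model-store"
# from the example configuration; adjust MODELS_HOST_DIR if yours differs.
MODELS_HOST_DIR=./llm-server/models

# Translate a container path under /model-store into the matching host path.
model_host_path() {
  case $1 in
    /model-store/*) printf '%s/%s\n' "$MODELS_HOST_DIR" "${1#/model-store/}" ;;
    *) echo "not under /model-store: $1" >&2; return 1 ;;
  esac
}

# Succeed if the host path (directory or file) for a container-side model path exists.
check_model() {
  p=$(model_host_path "$1") || return 1
  if [ -e "$p" ]; then
    echo "found: $p"
  else
    echo "missing: $p" >&2
    return 1
  fi
}

# Usage: check_model /model-store/Qwen3-30B-A3B-AWQ
```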
Step 6: Check Configuration
Before running, check the configuration:
# Check the syntax of the docker-compose file
docker compose -f docker-compose.yml config
# Check that the ports are not occupied (no output means both are free; use ss -tuln if netstat is not installed)
netstat -tuln | grep -E ':(3003|3006)[[:space:]]'
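You can also run the port check one port at a time. The helper below is a sketch with hypothetical names. It reads a listening-socket table (from ss -ltn or netstat -tuln) on stdin and reports each requested port as free or in use by looking for a field that ends in :PORT. If the first container is already running, port 3003 will show as in use, which is expected.

```shell
# Hypothetical helpers: decide whether ports appear in a listening-socket table
# such as the output of "ss -ltn" or "netstat -tuln" (read from stdin).

# Exit status 0 if some field ends in ":PORT" (e.g. 0.0.0.0:3003 or [::]:3003).
port_in_use() {
  awk -v port="$1" '
    { for (i = 1; i <= NF; i++) if ($i ~ (":" port "$")) found = 1 }
    END { exit (found ? 0 : 1) }'
}

# Report each port given as an argument; exit status 1 if any is in use.
check_ports_free() {
  listeners=$(cat)
  rc=0
  for port in "$@"; do
    if printf '%s\n' "$listeners" | port_in_use "$port"; then
      echo "port $port: in use"
      rc=1
    else
      echo "port $port: free"
    fi
  done
  return $rc
}

# Usage: ss -ltn | check_ports_free 3003 3006
```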
Step 7: Start Containers
# Stop current containers (if running)
docker compose -f docker-compose.yml down
# Start all containers including the second LLM server
docker compose -f docker-compose.yml up -d
# Check that both containers are running
docker compose -f docker-compose.yml ps | grep llm-server
Expected Result: You should see two containers:
- aiserver-llm-server (port 3003)
- aiserver-llm-server2 (port 3006)
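Loading a large model can take a while, so a container may show as running before it answers requests. You can use a small retry loop to wait for readiness; the sketch below uses a hypothetical function name. The readiness probe in the usage comment is an assumption: it presumes the server answers HTTP on the mapped port at a path such as /v1/models, which this document does not confirm. Substitute whatever health check your server provides.

```shell
# Hypothetical helper: rerun a command once per second until it succeeds or roughly
# $1 seconds of waiting have passed (time spent inside the command is not counted).
# Exit status 1 on timeout.
wait_until() {
  timeout_s=$1
  shift
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "timed out after ${timeout_s}s: $*" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "ready: $*"
}

# Usage (the /v1/models path is an assumption about the server's HTTP API):
#   wait_until 600 curl -sf http://localhost:3003/v1/models
#   wait_until 600 curl -sf http://localhost:3006/v1/models
```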
Step 8: Check Operation
# Check logs of the first container
docker logs aiserver-llm-server
# Check logs of the second container
docker logs aiserver-llm-server2
# Check GPU usage
nvidia-smi
Expected Result:
- Both containers should start successfully
- In nvidia-smi, processes should be visible on different GPUs
- Logs should not contain critical errors
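To confirm that the two servers run on different devices, nvidia-smi can list compute processes together with the UUID of the GPU each one uses. This sketch uses a hypothetical function name and counts the distinct GPU UUIDs in that listing. With the setup described here, you would expect 2.

```shell
# Hypothetical helper: count distinct GPUs that have at least one compute process.
# Input (stdin): lines from
#   nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name --format=csv,noheader
# The first comma-separated field of each line is the GPU UUID.
count_busy_gpus() {
  cut -d, -f1 | sort -u | grep -c .
}

# Usage:
#   nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name --format=csv,noheader | count_busy_gpus
```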
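Setting Environment Variables
To change the settings of the second container, edit its environment block in docker-compose.yml, then run docker compose -f docker-compose.yml up -d again. Compose recreates any container whose configuration has changed: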
aiserver-llm-server2:
  environment:
    LLM_COMPLETION_MODEL_NAME: "/model-store/your-model"
    LLM_CUDA_VISIBLE_DEVICES: 1 # GPU number (0, 1, 2, etc.)
    LLM_TENSOR_PARALLEL_SIZE: "1"
    LLM_MAX_MODEL_LEN: "16000"
    LLM_GPU_MEMORY_UTILIZATION: "0.85"
    # ... other settings
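A typo in one of these values usually surfaces only when the container fails at startup. A quick check before recreating the container can save a restart cycle. The sketch below uses a hypothetical function name and validates LLM_GPU_MEMORY_UTILIZATION. It assumes the value is a fraction of GPU memory between 0 and 1, as the example values of 0.7 to 0.9 in this document suggest.

```shell
# Hypothetical helper: accept a plain decimal number with 0 < value <= 1
# (assumed valid range for LLM_GPU_MEMORY_UTILIZATION).
valid_gpu_fraction() {
  case $1 in
    '' | *[!0-9.]* | *.*.*)
      echo "invalid: '$1' (expected a number such as 0.85)" >&2
      return 1 ;;
  esac
  if ! awk -v f="$1" 'BEGIN { exit !(f + 0 > 0 && f + 0 <= 1) }'; then
    echo "out of range: $1 (expected 0 < value <= 1)" >&2
    return 1
  fi
  echo "ok: $1"
}

# Usage:
#   valid_gpu_fraction 0.85 && docker compose -f docker-compose.yml up -d aiserver-llm-server2
```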
Possible Issues
Container Does Not Start
Problem: The second container does not start or crashes with an error.
Solution:
- Check the logs: docker logs aiserver-llm-server2
- Make sure the GPU is available: nvidia-smi
- Check that the port is free: netstat -tuln | grep 3006
- Check that the model exists: ls -la llm-server/models/
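When reading long logs, it can help to filter for typical failure messages first. The sketch below uses a hypothetical function name and searches for a few common signatures. The patterns are illustrative guesses, not a list of messages this server is known to print. Extend them based on what you actually see.

```shell
# Hypothetical helper: print log lines that match common failure signatures.
# Exit status 1 if any such line was found, 0 otherwise.
# The patterns below are illustrative assumptions; adjust them to your server's messages.
scan_log() {
  if grep -iE 'out of memory|cuda error|traceback|address already in use|no such file or directory'; then
    return 1
  fi
  return 0
}

# Usage (docker logs writes to both stdout and stderr, so merge them):
#   docker logs aiserver-llm-server2 2>&1 | scan_log
```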
Port Conflict
Problem: Error "port is already allocated".
Solution:
- Change the port of the second container to a free one (for example, 3007:8000)
- Or stop the service occupying the port
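If you are unsure which port is free, you can pick the first unused one from a small range. This sketch uses a hypothetical name. It reads a listening-socket table such as ss -ltn on stdin, collects every port that appears as :PORT in it, and prints the first number in the range that is not taken.

```shell
# Hypothetical helper: print the first port in [$1, $2] that does not appear
# in a listening-socket table read from stdin (e.g. "ss -ltn" or "netstat -tuln").
# Exit status 1 if every port in the range is taken.
first_free_port() {
  awk -v lo="$1" -v hi="$2" '
    {
      # Record the numeric port at the end of any "address:port" field.
      for (i = 1; i <= NF; i++) {
        if (match($i, /:[0-9]+$/)) used[substr($i, RSTART + 1)] = 1
      }
    }
    END {
      for (p = lo + 0; p <= hi + 0; p++) {
        if (!(p in used)) { print p; exit 0 }
      }
      exit 1
    }'
}

# Usage: ss -ltn | first_free_port 3006 3010
# Then use the printed number as the host port in the mapping, e.g. PORT:8000.
```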
Insufficient GPU Memory
Problem: The model does not load, and the logs show out-of-memory errors.
Solution:
- Decrease LLM_GPU_MEMORY_UTILIZATION (for example, to 0.7)
- Use smaller models
- Free up GPU memory by stopping other processes
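Both Containers Use One GPU
Problem: Both containers use GPU 0 instead of different GPUs.
Solution:
- Make sure LLM_CUDA_VISIBLE_DEVICES is set correctly for each container. If the deploy section uses count: all, as in the example configuration, each container can see every GPU, so the separation depends entirely on this variable
- Check that the variable is not overridden in the .env file
- Recreate the containers after changing settings (docker compose -f docker-compose.yml up -d). A plain docker restart does not apply changes made to docker-compose.yml or .env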
Example Full Configuration
Example definitions of the two services in docker-compose.yml (both entries belong under the top-level services: key):
aiserver-llm-server:
  container_name: aiserver-llm-server
  image: aiserver-llm-server:latest
  restart: always
  env_file:
    - .env
  ports:
    - 3003:8000
  environment:
    LLM_CUDA_VISIBLE_DEVICES: 0 # GPU 0
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [ gpu ]
  volumes:
    - "./llm-server/models:/model-store"
  networks:
    - llm-net

aiserver-llm-server2:
  container_name: aiserver-llm-server2
  image: aiserver-llm-server:latest
  restart: always
  ports:
    - 3006:8000
  environment:
    LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"
    LLM_CUDA_VISIBLE_DEVICES: 1 # GPU 1
    LLM_TENSOR_PARALLEL_SIZE: "1"
    LLM_MAX_MODEL_LEN: "16000"
    LLM_GPU_MEMORY_UTILIZATION: "0.85"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [ gpu ]
  volumes:
    - "./llm-server/models:/model-store"
  networks:
    - llm-net
Additional Settings
Using Different Models
You can run different models in each container:
# First container - chat model
LLM_COMPLETION_MODEL_NAME: "/model-store/Llama-3-8B"
# Second container - code model
LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"
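When each container serves a different model, client scripts need to know which port to call for which job. Below is a minimal routing sketch. It assumes the host ports from the example configuration (3003 for the first container, 3006 for the second). The task names chat and code and the function name are hypothetical.

```shell
# Hypothetical routing helper: map a task name to the base URL of the container serving it.
# Ports follow the example configuration: 3003 = aiserver-llm-server, 3006 = aiserver-llm-server2.
endpoint_for() {
  case $1 in
    chat) echo "http://localhost:3003" ;;
    code) echo "http://localhost:3006" ;;
    *) echo "unknown task: $1" >&2; return 1 ;;
  esac
}

# Usage (the /v1/models path is an assumption about the server's HTTP API):
#   curl -s "$(endpoint_for code)/v1/models"
```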
Memory Configuration
If you have GPUs with different memory sizes, configure memory usage for each container:
# For GPU with less memory
LLM_GPU_MEMORY_UTILIZATION: "0.7"
# For GPU with more memory
LLM_GPU_MEMORY_UTILIZATION: "0.9"
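To see what these fractions mean in absolute terms, multiply each GPU's total memory by its setting. This assumes the value behaves like vLLM's gpu_memory_utilization, i.e. the share of total GPU memory the server may claim. That is an assumption, so check your server's documentation. For example, a 24576 MiB GPU at 0.7 gives about 17203 MiB. Below is a hypothetical helper for the arithmetic.

```shell
# Hypothetical helper: approximate memory budget in MiB = total MiB x fraction (rounded down).
# Assumes the utilization setting is a share of total GPU memory (vLLM-style semantics).
mem_budget() {
  awk -v total="$1" -v frac="$2" 'BEGIN { printf "%d\n", int(total * frac) }'
}

# Usage, with totals from: nvidia-smi --query-gpu=index,memory.total --format=csv,noheader,nounits
#   mem_budget 24576 0.7
#   mem_budget 81920 0.9
```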
Monitoring
To monitor the operation of both containers:
# Status of containers
docker compose -f docker-compose.yml ps
# Resource usage
docker stats aiserver-llm-server aiserver-llm-server2
# GPU usage
watch -n 1 nvidia-smi
Expected Result: Both containers should operate stably, using different GPUs.
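For a more compact view than the full nvidia-smi table, the query mode can print just utilization and memory per GPU. The formatter below is a sketch with a hypothetical name. It expects the four fields shown in the usage comment, in that order.

```shell
# Hypothetical helper: summarize per-GPU utilization and memory.
# Input (stdin): "index, utilization.gpu, memory.used, memory.total" lines, from
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits
format_gpu_usage() {
  awk -F', *' '{
    pct = ($4 + 0 > 0) ? int($3 * 100 / $4) : 0
    printf "GPU %s: util %s%%, mem %s/%s MiB (%d%%)\n", $1, $2, $3, $4, pct
  }'
}

# Usage (one snapshot; run it periodically for continuous monitoring):
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits | format_gpu_usage
```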