Running Two LLM Containers Simultaneously

This section describes the setup for running two language models simultaneously on one server with multiple GPUs.

When This Is Needed

Running two LLM containers is useful when:

You have multiple GPUs and want to use them for different models
You need to run different models at the same time (for example, one for chat, another for specialized tasks)
You need to distribute the load between multiple models

Requirements

A server with at least 2 NVIDIA GPUs
Each GPU must have enough memory for the chosen model
Docker and Docker Compose installed
NVIDIA Container Toolkit installed

Setup

Step 1: Check Available GPUs

Make sure you have at least 2 GPUs:

nvidia-smi

Expected Result: You should see at least 2 GPUs in the list.
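The check above can also be scripted. The following is a minimal sketch, assuming that nvidia-smi -L prints one line per GPU starting with "GPU <index>:". The function name count_gpus is hypothetical and not part of the project.

```shell
# count_gpus: hypothetical helper. Counts the GPU lines in `nvidia-smi -L`
# output read from stdin and prints the number.
count_gpus() {
  # grep -c exits non-zero when the count is 0, so mask the status
  grep -c '^GPU [0-9][0-9]*:' || true
}

# Usage on the server:
#   gpus=$(nvidia-smi -L | count_gpus)
#   [ "$gpus" -ge 2 ] || echo "Need at least 2 GPUs, found $gpus" >&2
```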

Step 2: Uncomment the Second LLM Container

Open the docker-compose.yml file and find the commented block aiserver-llm-server2 (around lines 103-142).

Uncomment the entire block by removing the # characters at the beginning of each line:

# Before:
# aiserver-llm-server2:
#   container_name: aiserver-llm-server2
#   image: aiserver-llm-server:latest
#   ...

# After:
aiserver-llm-server2:
  container_name: aiserver-llm-server2
  image: aiserver-llm-server:latest
  ...
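If you prefer not to edit the block by hand, the uncommenting can be sketched with sed. This is an illustration only: the helper name uncomment_range is hypothetical, and the line range 103-142 is the approximate one mentioned above, so confirm it in your own file before running anything.

```shell
# uncomment_range START END: hypothetical helper. Reads a file on stdin and,
# for lines START..END only, removes the first "#" and one following space,
# keeping the indentation that precedes it.
uncomment_range() {
  sed "${1},${2}s/^\([[:space:]]*\)#[[:space:]]\{0,1\}/\1/"
}

# Usage (keeps a backup of the original file):
#   cp docker-compose.yml docker-compose.yml.bak
#   uncomment_range 103 142 < docker-compose.yml.bak > docker-compose.yml
```

Lines outside the range are left untouched, so other comments in the file are preserved.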

Step 3: Configure Ports

Make sure the ports do not conflict:

aiserver-llm-server (first container): port 3003:8000
aiserver-llm-server2 (second container): the port should be different, for example 3006:8000 or 3007:8000

In the uncommented block, check the line:

ports:
  - 3006:8000  # or another free port

Step 4: Configure GPU for Each Container

Explicitly set which GPU each container uses, so that the two models do not compete for the same device.

For the First Container (aiserver-llm-server)

The first container typically uses GPU 0 by default. Check the environment variables in the .env file or in docker-compose.yml itself:

environment:
  LLM_CUDA_VISIBLE_DEVICES: 0  # or do not specify, then GPU 0 will be used

For the Second Container (aiserver-llm-server2)

In the uncommented block, find the line:

environment:
  LLM_CUDA_VISIBLE_DEVICES: 1  # Uses GPU 1

Make sure the value matches the index of the GPU that the second container should use, as reported by nvidia-smi (usually 1).

Step 5: Configure Models

Ensure that each model is configured correctly:

First Container (aiserver-llm-server)

The first container takes its settings from the .env file, falling back to default values. Check the variable:

LLM_COMPLETION_MODEL_NAME=/model-store/model-name-1

Second Container (aiserver-llm-server2)

In the uncommented block, find the line:

environment:
  LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"

Change it to the required model if a different one is needed.
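Before starting the containers, it can help to confirm that the configured model is actually present on the host. The sketch below assumes the volume mapping ./llm-server/models:/model-store used in this guide. The helper name host_model_path is hypothetical.

```shell
# host_model_path CONTAINER_PATH: hypothetical helper. Translates a path under
# /model-store (inside the container) to the matching host path, assuming the
# volume mapping ./llm-server/models:/model-store.
host_model_path() {
  printf './llm-server/models/%s\n' "${1#/model-store/}"
}

# Usage, from the directory that contains docker-compose.yml:
#   [ -e "$(host_model_path /model-store/Qwen3-30B-A3B-AWQ)" ] \
#     || echo "model not found on the host" >&2
```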

Step 6: Check Configuration

Before running, check the configuration:

# Check the syntax of the docker-compose file
docker compose -f docker-compose.yml config

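# Check that the ports are free: no output means nothing is listening on them
# (3003 will be listed if the first container is already running)
netstat -tuln | grep -E ':(3003|3006)[[:space:]]'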
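On systems where netstat is not installed, ss -tuln prints a similar listing. The sketch below wraps the port check in a small function that matches the port number exactly, so that 3003 does not also match 30030. The helper name port_listed is hypothetical.

```shell
# port_listed PORT: hypothetical helper. Reads `ss -tuln` or `netstat -tuln`
# output on stdin and succeeds if some address in the listing ends in :PORT.
port_listed() {
  grep -Eq ":${1}[[:space:]]"
}

# Usage:
#   for port in 3003 3006; do
#     if ss -tuln | port_listed "$port"; then
#       echo "port $port is in use"
#     fi
#   done
```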

Step 7: Start Containers

# Stop current containers (if running)
docker compose -f docker-compose.yml down

# Start all containers including the second LLM server
docker compose -f docker-compose.yml up -d

# Check that both containers are running
docker compose -f docker-compose.yml ps | grep llm-server

Expected Result: You should see two containers:

aiserver-llm-server (port 3003)
aiserver-llm-server2 (port 3006)
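The same check can be automated. The following sketch compares the names of the running containers with the two expected names and prints any that are missing. It matches whole names, so aiserver-llm-server is not confused with aiserver-llm-server2. The helper name missing_containers is hypothetical.

```shell
# missing_containers NAME...: hypothetical helper. Reads running container
# names (one per line) on stdin and prints each expected NAME that is absent.
missing_containers() (
  running=$(cat)
  for name in "$@"; do
    # -F: fixed string, -x: whole-line match, -q: quiet
    printf '%s\n' "$running" | grep -Fxq -- "$name" || printf '%s\n' "$name"
  done
)

# Usage:
#   docker ps --format '{{.Names}}' \
#     | missing_containers aiserver-llm-server aiserver-llm-server2
```

Empty output means both containers are running.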

Step 8: Check Operation

# Check logs of the first container
docker logs aiserver-llm-server

# Check logs of the second container
docker logs aiserver-llm-server2

# Check GPU usage
nvidia-smi

Expected Result:

Both containers should start successfully
In nvidia-smi, processes should be visible on different GPUs
Logs should not contain critical errors
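To confirm the GPU split without reading the full nvidia-smi table, you can count how many distinct GPUs currently host a compute process. This is a sketch: the helper name gpus_in_use is hypothetical, and processes unrelated to the two containers are counted as well.

```shell
# gpus_in_use: hypothetical helper. Reads "gpu_uuid, pid" CSV rows on stdin
# and prints how many distinct GPUs have at least one compute process.
gpus_in_use() {
  awk -F', *' 'NF > 0 && !seen[$1]++ { n++ } END { print n + 0 }'
}

# Usage:
#   nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader | gpus_in_use
```

With both models loaded on separate GPUs and nothing else running, the result should be 2.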

Setting Environment Variables

If you need to change the settings for the second container, edit the environment block in docker-compose.yml:

aiserver-llm-server2:
  environment:
    LLM_COMPLETION_MODEL_NAME: "/model-store/your-model"
    LLM_CUDA_VISIBLE_DEVICES: 1  # GPU number (0, 1, 2, etc.)
    LLM_TENSOR_PARALLEL_SIZE: "1"
    LLM_MAX_MODEL_LEN: "16000"
    LLM_GPU_MEMORY_UTILIZATION: "0.85"
    # ... other settings

Possible Issues

Container Does Not Start

Problem: The second container does not start or crashes with an error.

Solution:

Check logs: docker logs aiserver-llm-server2
Ensure GPU is available: nvidia-smi
Check that the port is free: netstat -tuln | grep 3006
Check that the model exists: ls -la llm-server/models/

Port Conflict

Problem: Error "port is already allocated".

Solution:

Change the port of the second container to a free one (for example, 3007:8000)
Or stop the service occupying the port

Insufficient GPU Memory

Problem: The model fails to load and the logs show memory errors.

Solution:

Decrease LLM_GPU_MEMORY_UTILIZATION (for example, to 0.7)
Use smaller models
Free up GPU memory by stopping other processes
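When diagnosing memory problems, it helps to see how much memory is already occupied on each GPU before the model loads. The sketch below converts nvidia-smi output into a used-memory percentage per GPU. The helper name gpu_memory_percent is hypothetical.

```shell
# gpu_memory_percent: hypothetical helper. Reads "index, used_MiB, total_MiB"
# CSV rows on stdin and prints "index used_percent" for each GPU
# (percentage truncated to an integer).
gpu_memory_percent() {
  awk -F', *' 'NF == 3 && $3 > 0 { printf "%s %d\n", $1, $2 * 100 / $3 }'
}

# Usage:
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_memory_percent
```

A GPU that is already heavily used before the container starts is a sign that another process should be stopped first.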

Both Containers Use One GPU

Problem: Both containers use GPU 0 instead of different GPUs.

Solution:

Ensure that LLM_CUDA_VISIBLE_DEVICES is set correctly for each container
Check that the variable is not overridden in the .env file
Restart the containers after changing settings
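To see which value each running container actually received, you can read the variable from inside the containers. This is a sketch under two assumptions: both containers are running, and the image provides the env command. The helper name container_env_value is hypothetical.

```shell
# container_env_value NAME: hypothetical helper. Reads KEY=VALUE lines on stdin
# (for example from `docker exec <container> env`) and prints the value of NAME.
container_env_value() {
  awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}

# Usage:
#   a=$(docker exec aiserver-llm-server env | container_env_value LLM_CUDA_VISIBLE_DEVICES)
#   b=$(docker exec aiserver-llm-server2 env | container_env_value LLM_CUDA_VISIBLE_DEVICES)
#   [ "$a" != "$b" ] || echo "both containers are set to GPU '$a'" >&2
```

If the two values are equal, the setting is probably being overridden, for example by the .env file.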

Example Full Configuration

Example setup of two containers in docker-compose.yml:

aiserver-llm-server:
  container_name: aiserver-llm-server
  image: aiserver-llm-server:latest
  restart: always
  env_file:
    - .env
  ports:
    - 3003:8000
  environment:
    LLM_CUDA_VISIBLE_DEVICES: 0  # GPU 0
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [ gpu ]
  volumes:
    - "./llm-server/models:/model-store"
  networks:
    - llm-net

aiserver-llm-server2:
  container_name: aiserver-llm-server2
  image: aiserver-llm-server:latest
  restart: always
  ports:
    - 3006:8000
  environment:
    LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"
    LLM_CUDA_VISIBLE_DEVICES: 1  # GPU 1
    LLM_TENSOR_PARALLEL_SIZE: "1"
    LLM_MAX_MODEL_LEN: "16000"
    LLM_GPU_MEMORY_UTILIZATION: "0.85"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [ gpu ]
  volumes:
    - "./llm-server/models:/model-store"
  networks:
    - llm-net

Additional Settings

Using Different Models

You can run different models in each container:

# First container - chat model
LLM_COMPLETION_MODEL_NAME: "/model-store/Llama-3-8B"

# Second container - model for specialized tasks
LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"

Memory Configuration

If you have GPUs with different memory sizes, configure memory usage for each container:

# For GPU with less memory
LLM_GPU_MEMORY_UTILIZATION: "0.7"

# For GPU with more memory
LLM_GPU_MEMORY_UTILIZATION: "0.9"

Monitoring

To monitor the operation of both containers:

# Status of containers
docker compose -f docker-compose.yml ps

# Resource usage
docker stats aiserver-llm-server aiserver-llm-server2

# GPU usage
watch -n 1 nvidia-smi

Expected Result: Both containers should operate stably, using different GPUs.