Running Two LLM Containers Simultaneously#

This section describes how to run two language models at the same time on a single multi-GPU server, with each model in its own container.

When This Is Needed#

Running two LLM containers is useful when:

  • You have multiple GPUs and want to use them for different models
  • You need to run different models at the same time (for example, one for chat, another for specialized tasks)
  • You need to distribute the load between multiple models

Requirements#

  • A server with at least 2 NVIDIA GPUs
  • Each GPU must have enough memory for the model assigned to it
  • Docker and Docker Compose installed
  • NVIDIA Container Toolkit installed

Setup#

Step 1: Check Available GPUs#

Make sure you have at least 2 GPUs:

nvidia-smi

Expected Result: You should see at least 2 GPUs in the list.
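
The same check can be scripted. The following is a minimal sketch, assuming that nvidia-smi -L prints one line per GPU starting with "GPU <index>"; the function name count_gpus is hypothetical and not part of the product.

```shell
# count_gpus (hypothetical helper): reads `nvidia-smi -L` output on stdin
# and prints how many lines describe a GPU.
count_gpus() {
  grep -c '^GPU [0-9]' || true
}

# Usage on the server:
#   gpus=$(nvidia-smi -L | count_gpus)
#   [ "$gpus" -ge 2 ] || echo "Need at least 2 GPUs, found $gpus" >&2
```

Reading from stdin keeps the helper usable on saved output as well as on a live server.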

Step 2: Uncomment the Second LLM Container#

Open the docker-compose.yml file and find the commented-out aiserver-llm-server2 block (around lines 103-142).

Uncomment the entire block by removing the leading # from each line. Keep the remaining indentation aligned with the other services, because YAML is indentation-sensitive:

# Before:
# aiserver-llm-server2:
#   container_name: aiserver-llm-server2
#   image: aiserver-llm-server:latest
#   ...

# After:
aiserver-llm-server2:
  container_name: aiserver-llm-server2
  image: aiserver-llm-server:latest
  ...

Step 3: Configure Ports#

Each container must publish a different host port. The mapping format is host:container, so only the first number has to differ:

  • aiserver-llm-server (first container): port 3003:8000
  • aiserver-llm-server2 (second container): the port should be different, for example 3006:8000 or 3007:8000

In the uncommented block, check the line:

ports:
  - 3006:8000  # or another free port
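
To pick a host port that is actually free, the candidates can be checked against the current listeners. This is a minimal sketch, assuming the ss utility is available on the server; the function names port_in_use and first_free_port are hypothetical.

```shell
# port_in_use (hypothetical helper): reads `ss -tuln` output on stdin and
# succeeds if a listed address uses the given port.
port_in_use() {
  grep -Eq ":$1[[:space:]]"
}

# first_free_port (hypothetical helper): takes the listener table as $1 and
# candidate host ports as the remaining arguments; prints the first free one.
first_free_port() {
  listeners=$1
  shift
  for port in "$@"; do
    if ! printf '%s\n' "$listeners" | port_in_use "$port"; then
      echo "$port"
      return 0
    fi
  done
  return 1
}

# Usage on the server:
#   first_free_port "$(ss -tuln)" 3006 3007
```

The printed port is the value to put on the left side of the ports entry for aiserver-llm-server2.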

Step 4: Configure GPU for Each Container#

Assign a specific GPU to each container so that the two models do not compete for the same device.

For the First Container (aiserver-llm-server)

The first container typically uses GPU 0 by default. Check the environment variables in the .env file or in docker-compose.yml itself:

environment:
  LLM_CUDA_VISIBLE_DEVICES: 0  # or do not specify, then GPU 0 will be used

For the Second Container (aiserver-llm-server2)

In the uncommented block, find the line:

environment:
  LLM_CUDA_VISIBLE_DEVICES: 1  # Uses GPU 1

Make sure the value matches the index of the second GPU as reported by nvidia-smi (usually 1).

Step 5: Configure Models#

Ensure that each model is configured correctly:

First Container (aiserver-llm-server)

The first container takes its settings from the .env file or falls back to default values. Check the variable:

LLM_COMPLETION_MODEL_NAME=/model-store/model-name-1

Second Container (aiserver-llm-server2)

In the uncommented block, find the line:

environment:
  LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"

Change the path if you need a different model.
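
Before starting the containers, it can help to confirm that the configured model path exists on the host. The sketch below assumes the volume mapping ./llm-server/models:/model-store used in docker-compose.yml; the function name host_model_path is hypothetical.

```shell
# host_model_path (hypothetical helper): maps a container path under
# /model-store to the host path under ./llm-server/models, following the
# volume mapping used in docker-compose.yml.
host_model_path() {
  case "$1" in
    /model-store/*) printf './llm-server/models/%s\n' "${1#/model-store/}" ;;
    *) return 1 ;;
  esac
}

# Usage on the server, from the directory that contains docker-compose.yml:
#   path=$(host_model_path "/model-store/Qwen3-30B-A3B-AWQ")
#   [ -e "$path" ] || echo "Model path not found: $path" >&2
```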

Step 6: Check Configuration#

Before running, check the configuration:

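# Check the syntax of the docker-compose file
docker compose -f docker-compose.yml config

# Check that the ports are not occupied (no output means both are free;
# a first container that is still running will show up on 3003)
netstat -tuln | grep -E ':(3003|3006)[[:space:]]'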

Step 7: Start Containers#

# Stop current containers (if running)
docker compose -f docker-compose.yml down

# Start all containers including the second LLM server
docker compose -f docker-compose.yml up -d

# Check that both containers are running
docker compose -f docker-compose.yml ps | grep llm-server

Expected Result: You should see two containers:

  • aiserver-llm-server (port 3003)
  • aiserver-llm-server2 (port 3006)
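
The container state can also be checked in a script. This is a minimal sketch built on docker inspect; the function names container_status and all_running are hypothetical. Because the services use restart: always, a container that keeps crashing may report "restarting" instead of "running".

```shell
# container_status (hypothetical helper): prints the state of one container,
# for example "running" or "restarting".
container_status() {
  docker inspect -f '{{.State.Status}}' "$1" 2>/dev/null
}

# all_running (hypothetical helper): succeeds only if every named container
# reports the "running" state.
all_running() {
  for name in "$@"; do
    [ "$(container_status "$name")" = "running" ] || return 1
  done
}

# Usage on the server:
#   all_running aiserver-llm-server aiserver-llm-server2 && echo "both running"
```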

Step 8: Check Operation#

# Check logs of the first container
docker logs aiserver-llm-server

# Check logs of the second container
docker logs aiserver-llm-server2

# Check GPU usage
nvidia-smi

Expected Result:

  • Both containers should start successfully
  • In nvidia-smi, processes should be visible on different GPUs
  • Logs should not contain critical errors
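
To confirm that the two models really run on different GPUs, the compute processes reported by nvidia-smi can be grouped by GPU. A minimal sketch, assuming the CSV query output has the GPU UUID in the first column; the function name gpus_in_use is hypothetical.

```shell
# gpus_in_use (hypothetical helper): reads CSV lines of the form
# "gpu_uuid, pid, used_memory" on stdin and prints the number of distinct GPUs.
gpus_in_use() {
  cut -d, -f1 | sort -u | grep -c . || true
}

# Usage on the server, once both models have finished loading:
#   nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory \
#     --format=csv,noheader | gpus_in_use
```

A result of 2 or more indicates that processes are spread across at least two GPUs; a result of 1 suggests both containers landed on the same device.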

Setting Environment Variables#

If you need to change the settings for the second container, edit the environment block in docker-compose.yml:

aiserver-llm-server2:
  environment:
    LLM_COMPLETION_MODEL_NAME: "/model-store/your-model"
    LLM_CUDA_VISIBLE_DEVICES: 1  # GPU number (0, 1, 2, etc.)
    LLM_TENSOR_PARALLEL_SIZE: "1"
    LLM_MAX_MODEL_LEN: "16000"
    LLM_GPU_MEMORY_UTILIZATION: "0.85"
    # ... other settings
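
Editing the environment block does not change a container that is already running; the container has to be recreated. A minimal sketch of applying the change, assuming docker-compose.yml is in the current directory; the function name apply_llm2_settings is hypothetical.

```shell
# apply_llm2_settings (hypothetical helper): validates the compose file, then
# recreates only the second LLM container so the new environment is applied.
apply_llm2_settings() {
  docker compose -f docker-compose.yml config --quiet &&
    docker compose -f docker-compose.yml up -d aiserver-llm-server2
}

# Usage on the server:
#   apply_llm2_settings
```

Naming the service in the up command leaves the first container untouched while the second one is recreated.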

Possible Issues#

Container Does Not Start#

Problem: The second container does not start or crashes with an error.

Solution:

  1. Check logs: docker logs aiserver-llm-server2
  2. Ensure the GPU is available: nvidia-smi
  3. Check that the port is free: netstat -tuln | grep 3006
  4. Check that the model exists: ls -la llm-server/models/
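
Long startup logs are easier to read when filtered for likely failure messages. The sketch below is illustrative only: the keyword list is an assumption and may need adjusting to the server's actual log format, and the function name log_errors is hypothetical.

```shell
# log_errors (hypothetical helper): reads container logs on stdin and prints
# lines that mention common failure keywords (case-insensitive).
log_errors() {
  grep -iE 'error|traceback|out of memory|no such file' || true
}

# Usage on the server:
#   docker logs --tail 200 aiserver-llm-server2 2>&1 | log_errors
```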

Port Conflict#

Problem: The container fails to start with the error "port is already allocated".

Solution:

  • Change the port of the second container to a free one (for example, 3007:8000)
  • Or stop the service occupying the port

Insufficient GPU Memory#

Problem: The model fails to load and the logs show out-of-memory errors.

Solution:

  • Decrease LLM_GPU_MEMORY_UTILIZATION (for example, to 0.7)
  • Use smaller models
  • Free up GPU memory by stopping other processes
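
To see how much memory is already taken on each GPU before tuning the settings, the per-GPU usage can be expressed as a fraction. A minimal sketch, assuming that LLM_GPU_MEMORY_UTILIZATION is interpreted as a fraction of total GPU memory (the source does not state this explicitly); the function name gpu_used_fraction is hypothetical.

```shell
# gpu_used_fraction (hypothetical helper): reads "index, used, total" CSV
# lines (values in MiB) on stdin and prints the used fraction for each GPU.
gpu_used_fraction() {
  awk -F', *' '{ printf "GPU %s: %.2f of memory in use\n", $1, $2 / $3 }'
}

# Usage on the server:
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_used_fraction
```

Under that assumption, a GPU that already shows a large fraction in use before the container starts leaves less room than the configured utilization expects.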

Both Containers Use One GPU#

Problem: Both containers use GPU 0 instead of different GPUs.

Solution:

  • Ensure that LLM_CUDA_VISIBLE_DEVICES is set correctly for each container
  • Check that the variable is not overridden in the .env file
  • Recreate the containers after changing settings (docker compose up -d); a plain restart does not apply new environment values
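
The value each container actually received can be read back from its environment. A minimal sketch using docker inspect; the function name container_gpu_setting is hypothetical.

```shell
# container_gpu_setting (hypothetical helper): reads "NAME=value" lines on
# stdin and prints the value of LLM_CUDA_VISIBLE_DEVICES, if present.
container_gpu_setting() {
  sed -n 's/^LLM_CUDA_VISIBLE_DEVICES=//p'
}

# Usage on the server:
#   for c in aiserver-llm-server aiserver-llm-server2; do
#     value=$(docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' "$c" \
#       | container_gpu_setting)
#     echo "$c: ${value:-not set}"
#   done
```

If both containers print the same value, or the second prints "not set", the setting in docker-compose.yml or .env is not reaching the container.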

Example Full Configuration#

Example setup of the two containers in docker-compose.yml (both entries go under the top-level services key):

aiserver-llm-server:
  container_name: aiserver-llm-server
  image: aiserver-llm-server:latest
  restart: always
  env_file:
    - .env
  ports:
    - 3003:8000
  environment:
    LLM_CUDA_VISIBLE_DEVICES: 0  # GPU 0
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [ gpu ]
  volumes:
    - "./llm-server/models:/model-store"
  networks:
    - llm-net

aiserver-llm-server2:
  container_name: aiserver-llm-server2
  image: aiserver-llm-server:latest
  restart: always
  ports:
    - 3006:8000
  environment:
    LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"
    LLM_CUDA_VISIBLE_DEVICES: 1  # GPU 1
    LLM_TENSOR_PARALLEL_SIZE: "1"
    LLM_MAX_MODEL_LEN: "16000"
    LLM_GPU_MEMORY_UTILIZATION: "0.85"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [ gpu ]
  volumes:
    - "./llm-server/models:/model-store"
  networks:
    - llm-net

Additional Settings#

Using Different Models#

You can run different models in each container:

# First container - chat model
LLM_COMPLETION_MODEL_NAME: "/model-store/Llama-3-8B"

# Second container - model for specialized tasks
LLM_COMPLETION_MODEL_NAME: "/model-store/Qwen3-30B-A3B-AWQ"

Memory Configuration#

If the GPUs have different memory sizes, set the memory utilization separately for each container:

# For the GPU with less memory
LLM_GPU_MEMORY_UTILIZATION: "0.7"

# For the GPU with more memory
LLM_GPU_MEMORY_UTILIZATION: "0.9"

Monitoring#

To monitor both containers:

# Status of containers
docker compose -f docker-compose.yml ps

# Resource usage
docker stats aiserver-llm-server aiserver-llm-server2

# GPU usage
watch -n 1 nvidia-smi

Expected Result: Both containers should operate stably, using different GPUs.