# Docker GPU Setup and CI/CD Workflows

## Overview
This guide covers the GPU-enabled Docker setup for OpenContracts and the GitHub Actions workflows for building and publishing container images.
## GPU Support

### Prerequisites
- NVIDIA GPU with CUDA support
- NVIDIA Container Toolkit installed on your host system
- Docker with GPU support enabled
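
Before configuring Docker, it is worth confirming that the host itself can see the GPU. A quick sanity check (assumes the NVIDIA driver is already installed on the host):

```bash
# Host-level sanity check: driver and Docker versions
nvidia-smi        # should list your GPU and the installed driver version
docker --version  # Docker 19.03+ supports the --gpus flag natively
```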
### Windows Setup (WSL2)
```bash
# Install WSL2
wsl --install

# Install NVIDIA drivers for WSL2
# Download from: https://developer.nvidia.com/cuda/wsl

# Verify GPU is accessible
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi
```
### Linux Setup
```bash
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Verify GPU is accessible
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi
```
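
Note that `apt-key` is deprecated on recent Debian and Ubuntu releases; if the repository setup above fails, consult NVIDIA's current keyring-based instructions. Recent toolkit versions also expect the Docker runtime to be registered explicitly. A follow-up worth trying if the verification step fails (assumes your toolkit version ships the `nvidia-ctk` helper):

```bash
# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```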
## Docker Images

### Base Images
All Django-based services now use the PyTorch CUDA base image:

- Base: `pytorch/pytorch:2.7.1-cuda12.6-cudnn9-runtime`
- Includes: PyTorch 2.7.1, CUDA 12.6, cuDNN 9
### Environment Variables
The following CUDA-specific environment variables are configured:

- `CUDA_MODULE_LOADING=LAZY`: improves startup time
- `TORCH_CUDA_ARCH_LIST`: supports GPU architectures 6.0-9.0
- `CUDA_VISIBLE_DEVICES=0`: defaults to the first GPU
- `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512`: memory optimization
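
These values are baked into the images, but they can be overridden per container. A minimal sketch of testing an override with a one-off `docker run` against the base image (the values shown are only examples):

```bash
# Override the CUDA settings for a single throwaway container
docker run --rm --gpus all \
  -e CUDA_MODULE_LOADING=LAZY \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 \
  pytorch/pytorch:2.7.1-cuda12.6-cudnn9-runtime \
  python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
```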
## GitHub Workflows
### 1. Tagged Release Workflow (`docker-build-release.yml`)

Automatically builds and publishes images when a new release is created.
Triggers on:

- Release publication

Images built:

- `ghcr.io/[owner]/opencontractserver_django:[version]`: used by the django, celeryworker, celerybeat, and flower services
- `ghcr.io/[owner]/opencontractserver_frontend:[version]`
- `ghcr.io/[owner]/opencontractserver_postgres:[version]`
- `ghcr.io/[owner]/opencontractserver_traefik:[version]`
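
Publishing a release is what fires this workflow. A minimal sketch using the GitHub CLI (assumes `gh` is installed and authenticated; the tag name is only an example):

```bash
# Tag the commit and publish a release, which triggers docker-build-release.yml
git tag v1.2.3
git push origin v1.2.3
gh release create v1.2.3 --generate-notes
```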
### 2. CUDA Build Workflow (`docker-build-cuda.yml`)

Builds GPU-enabled images with CUDA support.
Triggers on:

- Manual workflow dispatch
- Push to the main branch (when Dockerfiles change)

Images built:

- `ghcr.io/[owner]/opencontractserver_django:cuda-latest`: GPU-enabled image used by the django, celeryworker, celerybeat, and flower services
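
Because the workflow supports manual dispatch, it can also be started from the command line. A sketch, again assuming the `gh` CLI:

```bash
# Manually dispatch the CUDA build on the default branch
gh workflow run docker-build-cuda.yml
```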
## Using Pre-built Images

### Production Deployment with Pre-built Images
Use `production-ghcr.yml` to deploy with pre-built images:
```bash
# Using latest images
GITHUB_REPOSITORY_OWNER=yourusername docker-compose -f production-ghcr.yml up -d

# Using a specific version
GITHUB_REPOSITORY_OWNER=yourusername TAG=v1.2.3 docker-compose -f production-ghcr.yml up -d

# Using CUDA-enabled images
GITHUB_REPOSITORY_OWNER=yourusername TAG=cuda-latest docker-compose -f production-ghcr.yml up -d
```
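
If the GHCR packages are private, Docker must authenticate before it can pull them. A sketch (assumes a personal access token with the `read:packages` scope stored in `$CR_PAT`):

```bash
# Log Docker in to GitHub Container Registry before pulling
echo "$CR_PAT" | docker login ghcr.io -u yourusername --password-stdin
```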
### Local Development
For local development, continue using the standard compose files:
```bash
# Build locally
docker-compose -f local.yml build

# Run with GPU support
docker-compose -f local.yml up
```
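
If `local.yml` does not already reserve GPU devices, one way to grant them without editing it is a Compose override file. A minimal sketch using the standard Compose GPU reservation syntax (the override file name is hypothetical; requires docker-compose 1.28+ or Compose v2):

```bash
# Write a hypothetical override that grants the django service one GPU,
# then start the stack with both files layered together
cat > gpu.override.yml <<'EOF'
services:
  django:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
EOF
docker-compose -f local.yml -f gpu.override.yml up
```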
## Verifying GPU Support

### Check GPU availability in container
```bash
# Enter django container
docker-compose -f local.yml exec django bash

# Inside container, run Python
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU count: {torch.cuda.device_count()}')"
```
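
For a slightly deeper check, the device name can be printed as well; this runs inside the same container and uses only standard PyTorch calls:

```bash
# Print the GPU model if CUDA is usable, otherwise a fallback message
python -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no CUDA device visible')"
```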
### Monitor GPU usage
```bash
# On host system
nvidia-smi -l 1  # Updates every second
```
## Troubleshooting

### GPU not detected
- Verify the NVIDIA Container Toolkit installation:

  ```bash
  nvidia-container-cli info
  ```

- Check the Docker daemon configuration:

  ```bash
  docker info | grep nvidia
  ```

- Ensure the GPU is not being used by other processes:

  ```bash
  nvidia-smi
  ```
### Out of memory errors
Adjust `PYTORCH_CUDA_ALLOC_CONF`:

```yaml
environment:
  - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128  # Reduce if OOM
```
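
To judge whether the allocator setting is helping, memory usage can be inspected from inside the container with standard PyTorch calls:

```bash
# Report allocated vs. reserved CUDA memory (in bytes)
python -c "import torch; print('allocated:', torch.cuda.memory_allocated(), 'reserved:', torch.cuda.memory_reserved())"
```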
### Performance optimization

For multi-GPU systems, specify which GPUs to use:

```yaml
environment:
  - CUDA_VISIBLE_DEVICES=0,1  # Use first two GPUs
```
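
The indices refer to the enumeration order reported by the driver; to see which index maps to which physical card, list the GPUs on the host:

```bash
# List GPUs with their indices and UUIDs
nvidia-smi -L
```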
## CI/CD Best Practices
- Tag releases properly: Use semantic versioning (e.g., v1.2.3)
- Test locally first: Build and test images locally before pushing
- Monitor builds: Check GitHub Actions for build status
- Use specific tags: Avoid using `latest` in production
- Clean up old images: Periodically remove unused images from the registry