┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   CNB Registry  │    │   CNB Workspace  │    │   GPU Nodes     │
│                 │    │                  │    │                 │
│ • Image Storage │◄──►│ • Development    │◄──►│ • V100/A100/H20 │
│ • Version Mgmt  │    │ • Testing        │    │ • CUDA 11.7.1   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                       ┌────────▼────────┐
                       │  Kubernetes     │
                       │  Orchestration  │
                       └─────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌───────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  DiffDock API │    │  Monitoring     │    │  Load Balancer  │
│               │    │                 │    │                 │
│ • Gradio UI   │    │ • Prometheus    │    │ • Nginx         │
│ • Streamlit   │    │ • Grafana       │    │ • SSL Term      │
│ • REST API    │    │ • AlertManager  │    │ • Auth          │
└───────────────┘    └─────────────────┘    └─────────────────┘

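🚀 Quick Start

Prerequisites

A CNB platform account with the required permissions
Access to a GPU node (at least 1 GPU)
Access to the Docker registry
A configured kubectl (for Kubernetes deployment)

1. Clone the repository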


git clone https://cnb.cool/litleung/trainning/diffdock.git
cd diffdock

2. Configure the environment


# Copy the environment template
cp .env.example .env

# Edit the environment variables
vim .env

Required environment variables:


# CNB settings
CNB_REGISTRY=your-registry.cnb.cool
CNB_USERNAME=your-username
CNB_REGISTRY_TOKEN=your-token

# GPU settings
GPU_TYPE=standard  # standard/h20/a100
GPU_COUNT=1

# Service ports
GRADIO_PORT=7860
STREAMLIT_PORT=8501
API_PORT=8080
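
Before deploying, it can help to check that the edited .env file is complete. The following is a minimal sketch, not part of the repository: `parse_env` and `validate_env` are hypothetical helpers, the parser handles only simple KEY=VALUE lines, and the treatment of inline comments is an assumption that may differ from whatever tool actually loads the file.

```python
REQUIRED = ("CNB_REGISTRY", "CNB_USERNAME", "CNB_REGISTRY_TOKEN")
ALLOWED_GPU_TYPES = {"standard", "h20", "a100"}
INTEGER_KEYS = ("GPU_COUNT", "GRADIO_PORT", "STREAMLIT_PORT", "API_PORT")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and comment lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Assumption: " #" starts an inline comment, as in "standard  # standard/h20/a100".
        value = value.split(" #", 1)[0].strip().strip('"')
        values[key.strip()] = value
    return values


def validate_env(values):
    """Return a list of problems; an empty list means the file looks usable."""
    problems = [f"missing {name}" for name in REQUIRED if not values.get(name)]
    gpu_type = values.get("GPU_TYPE", "standard")
    if gpu_type not in ALLOWED_GPU_TYPES:
        problems.append(f"unknown GPU_TYPE {gpu_type!r}")
    for name in INTEGER_KEYS:
        value = values.get(name)
        if value is not None and not value.isdigit():
            problems.append(f"{name} must be an integer, got {value!r}")
    return problems
```

A typical use would be `validate_env(parse_env(open(".env").read()))`, printing each returned problem and aborting the deployment if the list is not empty.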

3. Build and deploy

Docker Compose deployment (recommended for development and testing)


# Deploy the development environment
./scripts/deploy.sh --environment dev --gpu-type standard --build-type compose

# Deploy the production environment
./scripts/deploy.sh --environment prod --gpu-type a100 --build-type compose

Kubernetes deployment (recommended for production)


# Create the namespace
kubectl create namespace diffdock

# Deploy to Kubernetes
./scripts/deploy.sh --environment prod --gpu-type a100 --build-type k8s

Build the image only


# Build without deploying
./scripts/deploy.sh --environment dev --gpu-type standard --build-type docker --skip-tests

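📦 Deployment Options in Detail

Option 1: Docker Compose (development/testing)

Use cases: local development, functional testing, small-scale demos
Advantages: simple to use, fast to start, low resource footprint
Resource requirements: a single GPU node is enough


# Start all services
docker-compose up -d

# Check service status
docker-compose ps

# Follow the logs
docker-compose logs -f diffdock-api

# Stop the services
docker-compose down

Service URLs:

Gradio Web UI: http://localhost:7860
Streamlit Dashboard: http://localhost:8501
REST API: http://localhost:8080
Grafana monitoring: http://localhost:3000 (default login: admin/admin123)
Prometheus: http://localhost:9090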

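Option 2: Kubernetes (production)

Use cases: production deployment, large-scale serving, high-availability requirements
Advantages: elastic scaling, high availability, mature monitoring and operations tooling
Resource requirements: a GPU cluster, at least 2 nodes recommended


# Deploy the application
kubectl apply -f k8s/

# Check deployment status
kubectl get pods -n diffdock
kubectl get svc -n diffdock
kubectl get ingress -n diffdock

# Scale out to 3 replicas
kubectl scale deployment diffdock-api --replicas=3 -n diffdock

# Follow the logs
kubectl logs -f deployment/diffdock-api -n diffdock

Service URLs (via Ingress):

Gradio Web UI: https://diffdock.your-domain.com
API docs: https://diffdock.your-domain.com/docs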

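⚙️ Configuration

Main configuration files

File path	Purpose	Description
`.cnb.yml`	CI/CD pipeline	CNB build, test, and deployment workflow
`src/diffdock/Dockerfile.optimized`	Container build	Multi-stage build with support for multiple GPU types
`src/diffdock/environment.optimized.yml`	Environment dependencies	Conda environment definition
`docker-compose.yml`	Local orchestration	Orchestration for the development and test environments
`k8s/deployment.yml`	Kubernetes deployment	Production deployment configuration
`config/app_config.yml`	Application configuration	Runtime parameters
`monitoring/prometheus.yml`	Monitoring configuration	Metrics collection and alerting

Environment variables

Key environment variables: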


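# GPU settings
CUDA_VISIBLE_DEVICES=0                    # Visible GPU device ID(s)
GPU_TYPE=standard                         # GPU type (affects the optimization strategy)

# Model settings
MODEL_CACHE_DIR=/app/models               # Model cache directory
DATA_DIR=/app/data                        # Data storage directory

# Performance tuning
OMP_NUM_THREADS=4                         # Number of OpenMP CPU threads
MKL_NUM_THREADS=4                         # Number of MKL threads
BATCH_SIZE=4                              # Inference batch size
SAMPLES_PER_COMPLEX=10                    # Samples generated per complex

# Logging
LOG_LEVEL=INFO                            # Log level
LOG_FORMAT=json                           # Log format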
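
How the application reads these variables is not shown here. As an illustration only, a service could collect them into one typed object at startup, so that bad values fail early. `RuntimeSettings` and `load_settings` are hypothetical names, and the defaults simply mirror the values listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeSettings:
    gpu_type: str
    model_cache_dir: str
    data_dir: str
    batch_size: int
    samples_per_complex: int
    log_level: str


def load_settings(env=None):
    """Build settings from a mapping (defaults to os.environ), validating integers."""
    env = os.environ if env is None else env
    batch_size = int(env.get("BATCH_SIZE", "4"))
    samples = int(env.get("SAMPLES_PER_COMPLEX", "10"))
    if batch_size < 1 or samples < 1:
        raise ValueError("BATCH_SIZE and SAMPLES_PER_COMPLEX must be positive")
    return RuntimeSettings(
        gpu_type=env.get("GPU_TYPE", "standard"),
        model_cache_dir=env.get("MODEL_CACHE_DIR", "/app/models"),
        data_dir=env.get("DATA_DIR", "/app/data"),
        batch_size=batch_size,
        samples_per_complex=samples,
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test without touching the process environment.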

🧪 Testing and Validation

Functional tests


# Run the unit tests
cd src/diffdock
python -m pytest tests/ -v

# Run an inference test
python inference.py \
  --protein examples/1a46_protein_processed.pdb \
  --ligand examples/1a46_ligand.sdf \
  --out_dir /tmp/test_output

# Verify the output
ls -la /tmp/test_output/
# Expected contents: predictions.zip, logs/, metrics.json

Performance benchmarks


# GPU performance test
python utils/benchmark.py \
  --protein examples/1a46_protein_processed.pdb \
  --ligand examples/1a46_ligand.sdf \
  --num_runs 10

# Expected performance (A100):
# • Single inference: ~10-30 s
# • GPU utilization: >80%
# • Memory usage: 8-16 GB
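
The internals of utils/benchmark.py are not shown in this README. As a rough sketch of what a `--num_runs` style measurement typically involves, the hypothetical `benchmark` helper below times a callable several times after a warm-up run and reports summary statistics. The callable stands in for one inference call.

```python
import statistics
import time


def benchmark(fn, num_runs=10, warmup=1):
    """Time `fn` over `num_runs` calls after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # warm-up: exclude one-off costs such as model loading
    durations = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "runs": num_runs,
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }
```

Reporting the median alongside the mean makes a single slow run (for example, a cold cache) easier to spot.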

API tests


# Health check
curl http://localhost:8080/health

# Inference request through the API
curl -X POST http://localhost:8080/api/v1/dock \
  -H "Content-Type: multipart/form-data" \
  -F "protein=@examples/1a46_protein_processed.pdb" \
  -F "ligand=@examples/1a46_ligand.sdf"
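
The same docking request can be built from Python with only the standard library. This is a sketch under assumptions: the endpoint and the `protein`/`ligand` field names are taken from the curl command above, while `encode_multipart` and `build_dock_request` are hypothetical helpers and the response format is not documented here.

```python
import os
import urllib.request
import uuid


def encode_multipart(files):
    """Encode {field: (filename, bytes)} as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for field, (filename, content) in files.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        )
        parts.append(header.encode() + content + b"\r\n")
    body = b"".join(parts) + f"--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"


def build_dock_request(protein_path, ligand_path, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for /api/v1/dock."""
    files = {}
    for field, path in (("protein", protein_path), ("ligand", ligand_path)):
        with open(path, "rb") as handle:
            files[field] = (os.path.basename(path), handle.read())
    body, content_type = encode_multipart(files)
    return urllib.request.Request(
        f"{base_url}/api/v1/dock",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Sending it is then a matter of `urllib.request.urlopen(request, timeout=...)`, with a timeout generous enough for a full inference run.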

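📊 Monitoring and Operations

Metrics

The system exposes the following key metrics:

GPU metrics: utilization, GPU memory usage, temperature
Application metrics: inference latency, throughput, error rate
System metrics: CPU, memory, disk, network
Business metrics: request count, success rate, queue length

Grafana dashboards

Open Grafana to view live monitoring:

System overview dashboard
GPU performance monitoring
Application performance metrics
Business data dashboard

Alerting

Preconfigured alert rules:

GPU utilization stays below 50% (for 5 minutes)
Inference latency exceeds 60 seconds
Error rate exceeds 5%
Disk usage exceeds 85%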
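
The rules themselves would normally be expressed in the Prometheus and AlertManager configuration, which is not reproduced here. Purely to make the thresholds concrete, the sketch below restates them as a Python check. `Metrics` and `firing_alerts` are hypothetical names, and the one-sample-per-minute layout for GPU utilization is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metrics:
    gpu_utilization_pct: List[float]  # assumed: one sample per minute, newest last
    inference_latency_s: float
    error_rate_pct: float
    disk_usage_pct: float


def firing_alerts(metrics):
    """Return the names of the alert rules that the given metrics would trigger."""
    alerts = []
    recent = metrics.gpu_utilization_pct[-5:]
    # "Stays below 50% for 5 minutes": all of the last five samples are under 50.
    if len(recent) == 5 and all(sample < 50 for sample in recent):
        alerts.append("gpu_underutilized")
    if metrics.inference_latency_s > 60:
        alerts.append("latency_high")
    if metrics.error_rate_pct > 5:
        alerts.append("error_rate_high")
    if metrics.disk_usage_pct > 85:
        alerts.append("disk_usage_high")
    return alerts
```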

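Log management

Logs are aggregated into the ELK Stack:

Access logs: HTTP request records
Application logs: business operation records
Error logs: exceptions and error messages
Performance logs: system performance metrics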

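🔧 Troubleshooting

Common issues

1. GPU unavailable

Symptom: the container fails to start and reports a CUDA error
Diagnostic steps:


# Check that Docker can reach the GPU
docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu22.04 nvidia-smi

# Check the driver version
nvidia-smi

# Check the container's GPU configuration
docker inspect <container-id> | grep -i gpu

Fixes:

Make sure nvidia-container-toolkit is installed
Restart the Docker service
Check GPU driver compatibility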

2. Out of memory

Symptom: the OOM killer terminates the process
Fixes:


# Reduce the batch size
export BATCH_SIZE=1

# Add swap space
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Raise the Docker memory limit
export MEM_LIMIT=32g

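3. Slow model loading

Symptom: the first startup takes too long
Fixes:

Pre-warm the container image (download the models in advance)
Store the model cache on SSD
Configure a CDN to speed up model downloads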

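4. Port conflicts

Symptom: a service fails to start because its port is already in use
Fixes:


# Check what is using the port
netstat -tulpn | grep :7860

# Change the port settings
export GRADIO_PORT=8860
export STREAMLIT_PORT=8601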
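
Instead of picking a replacement port by hand, a small script can probe for a free one. This is a minimal sketch using only the standard library. `port_in_use` and `first_free_port` are hypothetical helpers, and the check only detects listeners reachable on the given host.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def first_free_port(preferred, host="127.0.0.1", attempts=100):
    """Return the first port at or above `preferred` with no listener."""
    for port in range(preferred, preferred + attempts):
        if not port_in_use(port, host):
            return port
    raise RuntimeError(f"no free port in {preferred}-{preferred + attempts - 1}")
```

For example, `first_free_port(7860)` returns 7860 when nothing is listening there, and otherwise the next unused port, which can then be exported as GRADIO_PORT.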

Debugging tips


# Open a shell inside the container
docker exec -it diffdock-api bash

# Follow the live log
tail -f logs/diffdock.log

# Watch GPU usage
watch -n 1 nvidia-smi

# Test network connectivity
curl -v http://localhost:7860

# Profile the application
python -m cProfile -o profile.stats app/main.py

🔒 Security

HTTPS


# Generate a self-signed SSL certificate
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout ssl/diffdock.key -out ssl/diffdock.crt

# Configure Nginx to use SSL
cp nginx-ssl.conf nginx.conf
docker-compose restart nginx

API authentication


# Enable API key authentication
export ENABLE_AUTH=true
export API_KEYS="your-secret-key-here"

# Or use JWT authentication
export JWT_SECRET="your-jwt-secret"
export JWT_EXPIRATION=3600
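
How the service validates API keys is not described here. One common approach, sketched below as an illustration, is a constant-time comparison against the configured keys. The comma-separated format for API_KEYS and the helper names `parse_api_keys` and `is_authorized` are assumptions.

```python
import hmac


def parse_api_keys(raw):
    """Split an API_KEYS value into keys (assumed to be comma-separated)."""
    return [key.strip() for key in raw.split(",") if key.strip()]


def is_authorized(presented, valid_keys):
    """Check a presented key against all valid keys in constant time per key."""
    presented_bytes = presented.encode()
    # Compare against every key so timing does not reveal which one matched.
    matches = [
        hmac.compare_digest(presented_bytes, key.encode()) for key in valid_keys
    ]
    return any(matches)
```

`hmac.compare_digest` avoids the early exit of an ordinary string comparison, which could otherwise leak how many leading characters of a guess were correct.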

📈 Performance Optimization

GPU optimization


# A100 tuning
export GPU_TYPE=a100
export PRECISION=mixed_float16
export TF32_ENABLED=true

# Multi-GPU parallelism
export GPU_COUNT=4
export MULTI_GPU_ENABLED=true

Batching optimization


# Dynamic batching
export BATCH_SIZE=8
export MAX_BATCH_SIZE=16
export BATCH_TIMEOUT_MS=100

# Asynchronous inference
export ASYNC_INFERENCE=true
export QUEUE_SIZE=100
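
Dynamic batching usually means waiting briefly for more requests before running inference, bounded by a maximum batch size and a timeout. The sketch below shows that idea with a standard-library queue. `collect_batch` is a hypothetical helper, and its mapping to MAX_BATCH_SIZE and BATCH_TIMEOUT_MS is an assumption about how the service interprets those variables.

```python
import queue
import time


def collect_batch(requests, max_batch_size=16, timeout_ms=100):
    """Block for one request, then gather more until the batch is full or time is up."""
    batch = [requests.get()]  # wait indefinitely for the first request
    deadline = time.monotonic() + timeout_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # timeout expired with a partial batch
    return batch
```

A worker loop would call `collect_batch` on a `queue.Queue(maxsize=100)` repeatedly and pass each returned batch to the model. Under light load the timeout caps the extra latency at about 100 ms.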

Cache optimization


# Redis cache
export REDIS_HOST=redis
export REDIS_PORT=6379
export CACHE_TTL_HOURS=24

# Model cache
export MODEL_CACHE_ENABLED=true
export DISK_CACHE_DIR=/app/cache
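
What the service caches in Redis is not specified here. A plausible reading is that docking results are keyed by their inputs and expire after CACHE_TTL_HOURS. The in-memory sketch below illustrates that behavior without a Redis dependency. `cache_key` and `TTLCache` are hypothetical, and hashing the raw file contents is an assumption.

```python
import hashlib
import time


def cache_key(protein_bytes, ligand_bytes):
    """Derive a stable key from both inputs; length prefixes keep them unambiguous."""
    digest = hashlib.sha256()
    for blob in (protein_bytes, ligand_bytes):
        digest.update(len(blob).to_bytes(8, "big"))
        digest.update(blob)
    return digest.hexdigest()


class TTLCache:
    """Dictionary-backed cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the expired entry on access
            return None
        return value
```

With Redis, the same expiry would typically be delegated to the server by setting a time-to-live on each key instead of tracking timestamps in the application.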

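🚢 CI/CD Pipeline

Pipeline stages

Validate: verify the environment configuration
Build: build the Docker image
Test: functional and performance tests
Security: security scanning
Deploy-Staging: deploy to the staging environment
Deploy-Production: deploy to the production environment

Triggers

Push to main: automatically deploys to staging
Tag v*: automatically deploys to production
PR: runs the tests and lint checks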
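
The trigger rules amount to a small decision table. The sketch below restates them in Python for clarity only; the real logic lives in the pipeline configuration. The event names, the git-style ref strings, and the `pipeline_action` helper are all assumptions, not CNB's actual event model.

```python
import fnmatch


def pipeline_action(event, ref=""):
    """Map a repository event to the action named in the trigger rules."""
    if event == "pull_request":
        return "test-and-lint"
    if event == "push" and ref == "refs/heads/main":
        return "deploy-staging"
    # Tags such as v1.2.0 match the "v*" pattern from the trigger list.
    if event == "push" and fnmatch.fnmatchcase(ref, "refs/tags/v*"):
        return "deploy-production"
    return "none"
```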

Customizing the pipeline

Edit .cnb.yml to customize the pipeline:


stages:
  - name: custom-test
    runs_on: gpu
    steps:
      - checkout
      - run:
          name: Custom Tests
          command: ./scripts/custom_tests.sh

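🤝 Contributing

Setting up a development environment


# Fork the repository, then clone your fork
git clone https://cnb.cool/your-username/diffdock.git
cd diffdock

# Create the development environment
./scripts/deploy.sh --environment dev --build-type compose

# Install the development dependencies
pip install -r requirements-dev.txt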