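An intelligent document conversion service built on MinerU. It converts PDF and Office documents with high fidelity, integrates fully with S3, and processes tasks asynchronously.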
# Pull the image
docker pull docker.cnb.cool/l8ai/document/documentconvert:latest
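# Create the data directories (the docker run example below mounts /raid5/data/document-convert; adjust one to match the other)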
mkdir -p ./data/{database,logs,workspace}
# Run the container
docker run -d \
--name document-converter \
-p 33081:8000 \
--gpus all \
-v /raid5/data/document-convert/database:/app/database \
-v /raid5/data/document-convert/logs:/app/log_files \
-v /raid5/data/document-convert/workspace:/app/task_workspace \
-e S3_ENDPOINT=http://your-minio-server:9000 \
-e S3_ACCESS_KEY=your-access-key \
-e S3_SECRET_KEY=your-secret-key \
-e S3_REGION=us-east-1 \
-e DATABASE_TYPE=sqlite \
-e LOG_LEVEL=INFO \
-e MAX_CONCURRENT_TASKS=3 \
docker.cnb.cool/l8ai/document/documentconvert:latest
Create docker-compose.yml:
version: '3.8'
services:
  document-converter:
    image: docker.cnb.cool/l8ai/document/documentconvert:latest
    container_name: document-converter
    ports:
      - "8000:8000"
    volumes:
      - ./data/database:/app/database
      - ./data/logs:/app/log_files
      - ./data/workspace:/app/task_workspace
    environment:
      - S3_ENDPOINT=http://your-minio-server:9000
      - S3_ACCESS_KEY=your-access-key
      - S3_SECRET_KEY=your-secret-key
      - S3_REGION=us-east-1
      - DATABASE_TYPE=sqlite
      - LOG_LEVEL=INFO
      - MAX_CONCURRENT_TASKS=3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
Start the service:
docker-compose up -d
S3 settings:

| Variable | Description | Default | Example |
|---|---|---|---|
| S3_ENDPOINT | S3/MinIO service endpoint | - | http://minio:9000 |
| S3_ACCESS_KEY | S3 access key | - | minioadmin |
| S3_SECRET_KEY | S3 secret key | - | minioadmin |
| S3_REGION | S3 region | us-east-1 | us-east-1 |

Application settings:

| Variable | Description | Default | Example |
|---|---|---|---|
| DATABASE_TYPE | Database type | sqlite | sqlite/postgresql |
| DATABASE_URL | Database connection URL | sqlite:///./database/document_conversion.db | - |
| LOG_LEVEL | Log level | INFO | DEBUG/INFO/WARNING |
| MAX_CONCURRENT_TASKS | Maximum number of concurrent tasks | 3 | 1-10 |
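The tables above can be read as a small configuration schema. The following is a minimal sketch of how a client or wrapper script might load these variables with the documented defaults. The `Settings` class and `load_settings` function are hypothetical names used for illustration; they are not taken from the service's source code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    s3_endpoint: Optional[str]
    s3_access_key: Optional[str]
    s3_secret_key: Optional[str]
    s3_region: str
    database_type: str
    database_url: str
    log_level: str
    max_concurrent_tasks: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        # The S3 endpoint and keys have no default in the table, so they may be None.
        s3_endpoint=env.get("S3_ENDPOINT"),
        s3_access_key=env.get("S3_ACCESS_KEY"),
        s3_secret_key=env.get("S3_SECRET_KEY"),
        s3_region=env.get("S3_REGION", "us-east-1"),
        database_type=env.get("DATABASE_TYPE", "sqlite"),
        database_url=env.get(
            "DATABASE_URL", "sqlite:///./database/document_conversion.db"
        ),
        log_level=env.get("LOG_LEVEL", "INFO"),
        max_concurrent_tasks=int(env.get("MAX_CONCURRENT_TASKS", "3")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in isolation, for example `load_settings({"MAX_CONCURRENT_TASKS": "5"})`.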
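Once the service is running, open http://localhost:8000/docs for the complete API documentation (use port 33081 instead if you started the container with the docker run example above, which maps 33081:8000).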
PDF to Markdown
curl -X POST "http://localhost:8000/api/tasks/create" \
-F "task_type=pdf_to_markdown" \
-F "bucket_name=ai-file" \
-F "file_path=test/document.pdf" \
-F "platform=your-platform" \
-F "priority=high"
Office to PDF
curl -X POST "http://localhost:8000/api/tasks/create" \
-F "task_type=office_to_pdf" \
-F "bucket_name=documents" \
-F "file_path=reports/document.docx" \
-F "platform=your-platform" \
-F "priority=normal"
Example response:
{
"task_id": 123,
"message": "Document conversion task 123 created successfully",
"status": "pending"
}
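The same task-creation request can be issued from Python. The sketch below uses only the standard library and encodes the fields as multipart/form-data, which is what `curl -F` sends. The helper names `encode_multipart` and `create_task` are hypothetical, and the sketch assumes the endpoint returns a JSON body like the example response above.

```python
import json
import urllib.request
import uuid
from typing import Dict, Tuple


def encode_multipart(fields: Dict[str, str]) -> Tuple[bytes, str]:
    """Encode plain text fields as multipart/form-data, as curl -F does."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")  # blank line separates part headers from the value
        lines.append(value)
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def create_task(
    base_url: str,
    task_type: str,
    bucket_name: str,
    file_path: str,
    platform: str,
    priority: str = "normal",
) -> dict:
    """POST the form fields to /api/tasks/create and return the parsed JSON."""
    body, content_type = encode_multipart(
        {
            "task_type": task_type,
            "bucket_name": bucket_name,
            "file_path": file_path,
            "platform": platform,
            "priority": priority,
        }
    )
    request = urllib.request.Request(
        f"{base_url}/api/tasks/create",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `create_task("http://localhost:8000", "pdf_to_markdown", "ai-file", "test/document.pdf", "your-platform", "high")` mirrors the first curl command.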
curl "http://localhost:8000/api/tasks/123"
Example response:
{
"id": 123,
"task_type": "pdf_to_markdown",
"status": "completed",
"priority": "high",
"input_path": "/app/task_workspace/task_123/input/document.pdf",
"output_path": "/app/task_workspace/task_123/output/document.md",
"output_url": "s3://ai-file/test/document/markdown/document.md",
"s3_urls": [
"s3://ai-file/test/document/markdown/document.md",
"s3://ai-file/test/document/markdown/document.json",
"s3://ai-file/test/document/markdown/images/image1.jpg"
],
"file_size_bytes": 1048576,
"created_at": "2025-08-09T10:00:00",
"completed_at": "2025-08-09T10:02:30",
"task_processing_time": 150.5,
"result": {
"success": true,
"conversion_type": "pdf_to_markdown",
"upload_result": {
"success": true,
"total_files": 5,
"total_size": 2097152
}
}
}
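Because conversion is asynchronous, a client usually polls the task endpoint until the `status` field reaches a final value. The following is a minimal polling sketch. The function names are hypothetical, and `"failed"` is an assumed status name: only `"pending"` and `"completed"` appear in the examples above.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

# "completed" comes from the example response; "failed" is an assumption.
TERMINAL_STATUSES = {"completed", "failed"}


def fetch_task_http(base_url: str) -> Callable[[int], Dict[str, Any]]:
    """Build a fetch function that calls GET /api/tasks/{task_id}."""
    def fetch(task_id: int) -> Dict[str, Any]:
        with urllib.request.urlopen(f"{base_url}/api/tasks/{task_id}") as response:
            return json.load(response)
    return fetch


def wait_for_task(
    fetch_task: Callable[[int], Dict[str, Any]],
    task_id: int,
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the task reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        task = fetch_task(task_id)
        if task.get("status") in TERMINAL_STATUSES:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish in time")
        sleep(interval_seconds)
```

A typical call would be `wait_for_task(fetch_task_http("http://localhost:8000"), 123)`. The fetch function is injected so the loop can be exercised without a running server.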
# List all tasks
curl "http://localhost:8000/api/tasks"
# Filter by status
curl "http://localhost:8000/api/tasks?status=completed&limit=10"
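When the list endpoint is called from code, building the query string with `urllib.parse.urlencode` avoids quoting mistakes. This is a small sketch; `tasks_url` is a hypothetical helper, and it assumes `status` and `limit` are the only filters, since those are the only two shown above.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def tasks_url(
    base_url: str,
    status: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Build the /api/tasks URL, adding only the filters that were given."""
    params: Dict[str, object] = {}
    if status is not None:
        params["status"] = status
    if limit is not None:
        params["limit"] = limit
    query = urlencode(params)
    return f"{base_url}/api/tasks" + (f"?{query}" if query else "")
```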
# Retry a task
curl -X POST "http://localhost:8000/api/tasks/123/retry"
# Change the type of an existing task
curl -X PUT "http://localhost:8000/api/tasks/123/task-type" \
-H "Content-Type: application/json" \
-d '{"new_task_type": "pdf_to_markdown"}'
| Task type | Description | Input format | Output format |
|---|---|---|---|
| pdf_to_markdown | PDF to Markdown | .pdf | .md + .json + images |
| office_to_pdf | Office to PDF | .doc, .docx, .xls, .xlsx, .ppt, .pptx | .pdf |
| office_to_markdown | Office to Markdown | Office documents | .md + images |

| Priority | Description | Processing order |
|---|---|---|
| high | High priority | Processed first |
| normal | Normal priority | Processed in normal order |
| low | Low priority | Processed last |
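The ordering in the priority table can be illustrated with a small priority queue. This is only a sketch of the idea, not the service's actual scheduler: the `TaskQueue` class is hypothetical, and the first-in, first-out order among tasks of equal priority is an assumption.

```python
import heapq
import itertools
from typing import List, Tuple

# Lower rank is served first: high before normal before low.
PRIORITY_RANK = {"high": 0, "normal": 1, "low": 2}


class TaskQueue:
    """Hypothetical queue that serves tasks by priority, then by arrival order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, int]] = []
        self._counter = itertools.count()  # tie-breaker preserving arrival order

    def push(self, task_id: int, priority: str = "normal") -> None:
        rank = PRIORITY_RANK[priority]
        heapq.heappush(self._heap, (rank, next(self._counter), task_id))

    def pop(self) -> int:
        """Return the id of the next task to process."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

Pushing tasks 1 (low), 2 (normal), 3 (high), and 4 (normal) in that order and then popping four times yields 3, 2, 4, 1.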
The system follows these S3 path rules:

Input path: s3://{bucket_name}/{file_path}

Output path: s3://ai-file/{original_bucket}/{file_name_without_ext}/{conversion_type}/{output_files}

Example:

Input: s3://documents/reports/annual_report.pdf

Output: s3://ai-file/documents/annual_report/markdown/
├── annual_report.md
├── annual_report.json
└── images/
    ├── chart1.jpg
    └── table1.jpg
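The path rule above can be expressed as a short function, which is handy when a client needs to predict where results will land. This sketch follows the stated rule literally. The function names are hypothetical, and the conversion type (for example `markdown`) is passed in directly because the mapping from task type to that path segment is not spelled out here.

```python
from pathlib import PurePosixPath

OUTPUT_BUCKET = "ai-file"  # fixed output bucket from the rule above


def input_url(bucket_name: str, file_path: str) -> str:
    """Return the S3 URL of the source file."""
    return f"s3://{bucket_name}/{file_path}"


def output_prefix(bucket_name: str, file_path: str, conversion_type: str) -> str:
    """Return the S3 prefix under which the converted files are stored."""
    # The rule keeps only the file name without its extension, not the folders.
    stem = PurePosixPath(file_path).stem
    return f"s3://{OUTPUT_BUCKET}/{bucket_name}/{stem}/{conversion_type}/"
```

With the example above, `output_prefix("documents", "reports/annual_report.pdf", "markdown")` gives `s3://ai-file/documents/annual_report/markdown/`.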
- Application log: /app/log_files/app.log
- Task log: /app/log_files/task_{task_id}.log

curl "http://localhost:8000/health"
curl "http://localhost:8000/api/status"
git clone https://cnb.cool/l8ai/document/DocumentConvert.git
cd DocumentConvert
python -m venv venv
source venv/bin/activate # Linux/Mac
# or
venv\Scripts\activate # Windows
pip install -r requirements.txt
cp .env.example .env
# Edit the .env file to configure S3 and the other parameters
python main.py
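Insufficient GPU memory
- Lower the MAX_CONCURRENT_TASKS value
- Monitor GPU usage with nvidia-smi

S3 connection failure
- Check that S3_ENDPOINT is correct
- Verify S3_ACCESS_KEY and S3_SECRET_KEY

LibreOffice conversion failure

Slow task processing
- Adjust the MAX_CONCURRENT_TASKS value

# View the application log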
docker logs document-converter
# View the log of a specific task
docker exec document-converter cat /app/log_files/task_123.log
MIT License
Issues and pull requests are welcome!
If you have questions, contact technical support or open an issue.