VerbaAurea is an intelligent document preprocessing tool dedicated to transforming raw documents into "golden" knowledge, providing high-quality text data for knowledge base construction. It focuses on intelligent document segmentation, ensuring semantic integrity, and delivers premium material for knowledge base retrieval and large language model fine-tuning.
- Knowledge Base Construction: Provides text units of appropriate granularity for retrieval-based question answering systems
- Corpus Preparation: Prepares high-quality training data for large language model fine-tuning
- Document Indexing: Optimizes index units for document retrieval systems
- Content Management: Improves document organization in content management systems
```
VerbaAurea/
├── main.py                # Command-line main program
├── web_service.py         # Web service main program
├── config_manager.py      # Configuration management
├── document_processor.py  # Document processing core
├── pdf_processor.py       # PDF processing core
├── text_analysis.py       # Text analysis functionality
├── parallel_processor.py  # Parallel processing implementation
├── utils.py               # Utility functions
├── config.json            # Configuration file
├── requirements.txt       # Project dependencies
├── templates/             # Web interface templates
│   └── index.html         # Main page template
├── static/                # Static resources
│   ├── style.css          # Stylesheet
│   └── script.js          # Frontend scripts
├── uploads/               # Upload temporary directory
├── processed/             # Processing results directory
├── README.md              # Chinese documentation
├── README_EN.md           # English documentation
├── LICENSE                # Open source license
└── 启动Web服务.bat         # Windows quick-start script
```
```bash
git clone https://github.com/yourusername/VerbaAurea.git
cd VerbaAurea
pip install -r requirements.txt
```
```bash
python web_service.py
```
Or double-click the 启动Web服务.bat file on Windows systems
Open your browser and visit http://localhost:18080
Use the web interface for document processing:
```bash
python main.py
```
Select an operation from the menu:

1. Start processing documents
2. View the current configuration
3. Edit the configuration
4. Exit the program

Processed documents are saved to the processed folder by default, or to a custom output folder.
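Processed documents carry `<!--split-->` markers at the chosen split points (see the configuration section below). A downstream knowledge-base loader can split on that marker; a minimal sketch, with an illustrative sample text:

```python
MARKER = "<!--split-->"

def split_chunks(text: str, marker: str = MARKER) -> list[str]:
    """Split processed text on the marker, dropping empty chunks and whitespace."""
    return [chunk.strip() for chunk in text.split(marker) if chunk.strip()]

sample = "First paragraph.<!--split-->Second paragraph."
print(split_chunks(sample))  # → ['First paragraph.', 'Second paragraph.']
```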
You can customize the segmentation parameters through the interactive menu or by editing the config.json file directly:
- max_length: Maximum paragraph length. Caps the character count of each segmented text block. Too large may reduce retrieval efficiency; too small may disrupt semantic integrity.
- min_length: Minimum paragraph length. Prevents the generation of fragments that are too short; text blocks that are too short may lack sufficient context, affecting knowledge base quality.
- sentence_integrity_weight: Sentence-integrity weight. Higher values make the system more inclined to keep sentences complete, reducing the chance of splitting mid-sentence.
- debug_mode: Debug mode. When enabled, outputs detailed processing information, including split-point scoring and calculation processes. (Currently used mainly for algorithm-optimization research.)
- output_folder: Output folder name. Processed documents are saved in this folder, preserving the original directory structure.
- skip_existing: Whether to skip existing files.
- min_split_score: Minimum split score. Only positions scoring above this value are selected as split points; increasing it reduces the number of split points.
- heading_score_bonus: Heading bonus value. Splitting before or after headings is usually more reasonable; this parameter controls the priority of heading positions.
- sentence_end_score_bonus: Sentence-ending bonus value. Increasing it prioritizes splitting at sentence boundaries, improving semantic integrity.
- length_score_factor: Length scoring factor. Controls how much paragraph length affects scoring; larger values produce more uniform splits.
- search_window: Search window size. When snapping split points to sentence boundaries, the system searches for the nearest sentence boundary within this window.
- num_workers: Number of worker processes. Setting it to 0 automatically uses (CPU cores - 1) processes; adjust according to system resources.
- cache_size: Cache size, in number of entries. Stores text analysis results to avoid repeated computation and speed up processing.
- batch_size: Batch size. The number of files each worker process handles at once; larger values reduce process-switching overhead.

During processing, the tool inserts `<!--split-->` markers at the selected split positions.

Q: Why are some paragraphs too short or too long after splitting?
A: Try adjusting the max_length and min_length parameters in the configuration file to balance segmentation granularity.
Q: How to avoid sentences being split in the middle?
A: Increase the sentence_integrity_weight parameter. The default is 8.0; try 10.0 or higher.
Q: How to handle documents with special formatting?
A: For special formats, you can adapt to different document structures by adjusting the scoring parameters in the advanced settings.
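Putting the parameters above together, a config.json might look like the fragment below. The values are illustrative, not the shipped defaults (except sentence_integrity_weight, whose default is 8.0), and only the document_settings grouping is confirmed by the /api/config example later in this document; the other section names are assumptions:

```json
{
  "document_settings": {
    "max_length": 1000,
    "min_length": 100,
    "sentence_integrity_weight": 8.0
  },
  "processing_options": {
    "debug_mode": false,
    "output_folder": "processed",
    "skip_existing": true
  },
  "advanced_settings": {
    "min_split_score": 7,
    "heading_score_bonus": 10,
    "sentence_end_score_bonus": 6,
    "length_score_factor": 0.01,
    "search_window": 5
  },
  "performance_settings": {
    "num_workers": 0,
    "cache_size": 1024,
    "batch_size": 50
  }
}
```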
Below are the main API endpoints. All return JSON (except the download endpoint):
Health Check
```bash
curl -s http://localhost:18080/api/health
```
Get/Update Configuration
```bash
curl -s http://localhost:18080/api/config

curl -s -X POST http://localhost:18080/api/config \
  -H "Content-Type: application/json" \
  -d '{"document_settings": {"max_length": 1200, "min_length": 200}}'
```
Upload File (upload only, no immediate processing)
```bash
curl -s -F "file=@/path/to/file.docx" http://localhost:18080/api/upload
```
Start Batch Processing
```bash
curl -s -X POST http://localhost:18080/api/batch/process \
  -H "Content-Type: application/json" \
  -d '{"session_id": "<SESSION_ID>"}'
```
Get Batch Status
```bash
curl -s http://localhost:18080/api/batch/status/<SESSION_ID>
```
Batch Result Download (ZIP)
```bash
curl -L -o result.zip http://localhost:18080/api/batch/download/<SESSION_ID>
```
Remove File From Batch
```bash
curl -s -X POST http://localhost:18080/api/batch/remove-file \
  -H "Content-Type: application/json" \
  -d '{"session_id": "<SESSION_ID>", "file_id": "<FILE_ID>"}'
```
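The batch workflow (upload, process, poll status, download) can also be scripted. Below is a stdlib-only sketch; the class name and helper structure are illustrative, and only the endpoint paths come from the list above. File upload is multipart, so it is easiest done with curl as shown earlier:

```python
import json
import urllib.request


class VerbaAureaClient:
    """Hypothetical helper around the batch HTTP endpoints listed above."""

    def __init__(self, base_url: str = "http://localhost:18080"):
        self.base_url = base_url.rstrip("/")

    def _url(self, path: str) -> str:
        return self.base_url + path

    def _post_json(self, path: str, payload: dict) -> dict:
        # Encode the payload as JSON and POST it with the right content type
        req = urllib.request.Request(
            self._url(path),
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def start_batch(self, session_id: str) -> dict:
        # POST /api/batch/process
        return self._post_json("/api/batch/process", {"session_id": session_id})

    def batch_status(self, session_id: str) -> dict:
        # GET /api/batch/status/<SESSION_ID>
        url = self._url(f"/api/batch/status/{session_id}")
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    def download_batch(self, session_id: str, dest: str = "result.zip") -> str:
        # GET /api/batch/download/<SESSION_ID>, saved to a local ZIP file
        url = self._url(f"/api/batch/download/{session_id}")
        urllib.request.urlretrieve(url, dest)
        return dest
```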
Single File Download (legacy compatibility)
Contributions to the VerbaAurea project are welcome! You can participate in the following ways:
This project is licensed under CC BY 4.0.