Public

WeChat Login

Code Issues Pull requests Events Packages Insights

main

VerbaAurea/README_EN.md

SquatatHome<jhyuan.cs@gmail.com>

Add PDF Support and Improve Logging System #8 (#8)

faf1cd93

0 commits

PreviewCode viewBlame

📚 VerbaAurea 🌟

中文中文 | 英文 English

VerbaAurea is an intelligent document preprocessing tool dedicated to transforming raw documents into "golden" knowledge, providing high-quality text data for knowledge base construction. It focuses on intelligent document segmentation, ensuring semantic integrity, and delivers premium material for knowledge base retrieval and large language model fine-tuning.

Project Features

Intelligent Document Segmentation - Precise segmentation based on sentence boundaries and semantic integrity, supporting DOCX and PDF format documents
Multi-dimensional Scoring System - Considers titles, sentence integrity, paragraph length, and other factors to determine optimal split points
Semantic Integrity Protection - Prioritizes the completeness of sentences and semantic units, avoiding breaks in the middle of sentences
Web Interface Support - Provides modern web interface with drag-and-drop upload and batch processing
Batch Processing Capability - Supports processing multiple documents simultaneously with unified download
Configurable Design - Flexibly adjust segmentation strategies through configuration files or web interface
Multi-language Support - Employs different sentence segmentation strategies for Chinese and English texts
Format Preservation - Maintains the original document's formatting information, including styles, fonts, and tables

Application Scenarios

Knowledge Base Construction - Provides text units of appropriate granularity for retrieval-based question answering systems
Corpus Preparation - Prepares high-quality training data for large language model fine-tuning
Document Indexing - Optimizes index units for document retrieval systems
Content Management - Improves document organization in content management systems

Project Structure


├── main.py                 # Command-line main program
├── web_service.py          # Web service main program
├── config_manager.py       # Configuration management
├── document_processor.py   # Document processing core
├── pdf_processor.py        # Pdf processing core
├── text_analysis.py        # Text analysis functionality
├── parallel_processor.py   # Parallel processing implementation
├── utils.py                # Utility functions
├── config.json            # Configuration file
├── requirements.txt       # Project dependencies
├── templates/             # Web interface templates
│   └── index.html         # Main page template
├── static/                # Static resources
│   ├── style.css          # Stylesheet
│   └── script.js          # Frontend scripts
├── uploads/               # Upload temporary directory
├── processed/             # Processing results directory
├── README.md              # Chinese documentation
├── README_EN.md           # English documentation
├── LICENSE                # Open source license
└── 启动Web服务.bat        # Windows quick start script

Core Functions

Sentence Boundary Detection - Precisely identifies sentence boundaries by combining rules and NLP techniques
Split Point Scoring System - Multi-dimensional scoring to select optimal split points
Semantic Block Analysis - Analyzes document structure, preserving semantic connections between paragraphs
Adaptive Length Control - Automatically adjusts text fragment length based on configuration
Format Preservation Processing - Maintains the original document format while splitting

Installation Guide

Requirements

Python 3.6 or higher
Supports Windows, macOS, and Linux systems

Installation Steps

Clone the project locally


git clone https://github.com/yourusername/VerbaAurea.git
cd VerbaAurea

Install dependencies


pip install -r requirements.txt

User Guide

Web Interface Usage (Recommended)

Start the web service


python web_service.py

Or double-click the 启动Web服务.bat file on Windows systems

Open your browser and visit http://localhost:18080
Use the web interface for document processing:
- Upload Files: Drag DOCX files to the upload area or click to select files
- Batch Processing: Support uploading multiple files simultaneously
- File Management: Preview uploaded files and remove unwanted files
- Start Processing: Click "Start Processing" button to process all files
- Real-time Progress: View processing progress and status
- Download Results: Download ZIP package after processing completion

Command Line Usage

Place Word documents that need processing in the script directory or subdirectories
Run the main script


python main.py

Select operations according to the menu:
- Select 1 to start processing documents
- Select 2 to view current configuration
- Select 3 to edit configuration
- Select 4 to exit the program
Processed documents will be saved in the processed (default) or custom output folder

Configuration Instructions

You can customize segmentation parameters by editing through the menu or directly modifying the config.json file:

Document Settings

max_length: Maximum paragraph length. Controls the maximum character count of each segmented text block. Too large may reduce retrieval efficiency, too small may disrupt semantic integrity.
min_length: Minimum paragraph length. Prevents the generation of fragments that are too short. Text blocks that are too short may lack sufficient context, affecting knowledge base quality.
sentence_integrity_weight: Sentence integrity weight. Higher values make the system more inclined to keep sentences complete, reducing the possibility of splitting in the middle of sentences.

Processing Options

debug_mode: Debug mode. When enabled, outputs detailed processing information, including split point scoring and calculation processes. (This setting is currently mainly used for algorithm optimization research).
output_folder: Output folder name. Processed documents will be saved in this folder, maintaining the original directory structure.
skip_existing: Whether to skip existing files

Advanced Settings

min_split_score: Minimum split score. Only positions with scores higher than this value will be selected as split points. Increasing this value can reduce the number of split points.
heading_score_bonus: Title bonus value. Splitting before and after titles is usually more reasonable; this parameter controls the priority of title positions.
sentence_end_score_bonus: Sentence ending bonus value. Increasing this value prioritizes splitting at sentence boundaries, improving document semantic integrity.
length_score_factor: Length scoring factor. Controls the impact of paragraph length on scoring; larger values produce more uniform splits.
search_window: Search window size. When adjusting split points to sentence boundaries, the system searches for the nearest sentence boundary within this window range.

Performance Settings

num_workers: Number of worker processes. Setting to 0 will automatically use (CPU cores - 1) processes. Can be adjusted according to system resources.
cache_size: Cache size. Used to store text analysis results to avoid repetitive calculations and improve processing speed. Unit is number of entries.
batch_size: Batch size. The number of files processed by each worker process at once; larger values can reduce process switching overhead.

Best Practices

Set reasonable length ranges - Set appropriate maximum and minimum paragraph lengths based on knowledge base, fine-tuning, or application requirements
Adjust sentence integrity weight - If sentences are being split, increase this weight

How It Works

Document Parsing - Parse documents, extract text, style, and structural information
Paragraph Analysis - Analyze characteristics of each paragraph, such as length, whether it's a heading, whether it ends with a period, etc.
Score Calculation - Calculate comprehensive scores for each potential split point
Split Point Selection - Select optimal split points based on scores and configuration
Sentence Boundary Correction - Adjust split point positions to occur at sentence boundaries
Split Marker Insertion - Insert  markers at selected positions
Format Preservation - Preserve the original document's formatting and save as a new document

Web Interface Features

Modern Interface - Responsive design supporting desktop and mobile devices
Drag & Drop Upload - Support dragging files to upload area
Batch Processing - Process multiple documents at once for improved efficiency
Real-time Progress - Display processing progress and current file
File Management - Preview and manage file list after upload
Configuration Adjustment - Online adjustment of processing parameters
Result Download - Unified packaging and download after processing completion

Development Plan

✅ Web interface support
✅ Batch document processing
✅ Drag & drop upload functionality
🔄 Add support for more document formats
🔄 Enhance semantic analysis capabilities using more advanced NLP models
🔄 Add document preview functionality

Frequently Asked Questions

Q: Why are some paragraphs too short or too long after splitting?

A: Try adjusting the max_length and min_length parameters in the configuration file to balance segmentation granularity.

Q: How to avoid sentences being split in the middle?

A: Increase the sentence_integrity_weight parameter value; the default value is 8.0, you can try setting it to 10.0 or higher.

Q: How to handle documents with special formatting?

A: For special formats, you can adapt to different document structures by adjusting the scoring parameters in the advanced settings.

API Endpoints

Below are the main API endpoints. All return JSON (except the download endpoint):

Health Check
- Method: GET
- Path: /api/health
- Example:
```
curl -s http://localhost:18080/api/health
```

Get/Update Configuration

Method: GET / POST
Path: /api/config

Example (get config):


curl -s http://localhost:18080/api/config

Example (update config):


curl -s -X POST http://localhost:18080/api/config \
  -H "Content-Type: application/json" \
  -d '{"document_settings": {"max_length": 1200, "min_length": 200}}'

Upload File (upload only, no immediate processing)
- Method: POST (multipart/form-data)
- Path: /api/upload
- Params: file (required), session_id (optional, auto-created if omitted)
- Returns: { success, session_id, file_id, original_filename, file_size, message }
- Example:
```
curl -s -F "file=@/path/to/file.docx" http://localhost:18080/api/upload
```
Start Batch Processing
- Method: POST (application/json)
- Path: /api/batch/process
- Body: { "session_id": "<SESSION_ID>" }
- Returns: { success, session_id, processed_count, failed_count, download_url, message }
- Example:
```
curl -s -X POST http://localhost:18080/api/batch/process \
  -H "Content-Type: application/json" \
  -d '{"session_id": "<SESSION_ID>"}'
```
Get Batch Status
- Method: GET
- Path: /api/batch/status/<session_id>
- Returns: { status, progress, processed_count, total_count, current_file, files[], download_url }
- Example:
```
curl -s http://localhost:18080/api/batch/status/<SESSION_ID>
```
Batch Result Download (ZIP)
- Method: GET
- Path: /api/batch/download/<session_id>
- Returns: ZIP file (Content-Disposition attachment)
- Example:
```
curl -L -o result.zip http://localhost:18080/api/batch/download/<SESSION_ID>
```
Remove File From Batch
- Method: POST (application/json)
- Path: /api/batch/remove-file
- Body: { "session_id": "<SESSION_ID>", "file_id": "<FILE_ID>" }
- Returns: { success, message }
- Example:
```
curl -s -X POST http://localhost:18080/api/batch/remove-file \
  -H "Content-Type: application/json" \
  -d '{"session_id": "<SESSION_ID>", "file_id": "<FILE_ID>"}'
```
Single File Download (legacy compatibility)
- Method: GET
- Path: /api/download/<file_id>
- Note: Kept only for legacy compatibility; the new flow uses batch download

Contribution Guidelines

Contributions to the VerbaAurea project are welcome! You can participate in the following ways:

Report bugs or suggest features
Submit Pull Requests to improve the code
Improve documentation and usage examples
Share your experiences and cases using VerbaAurea

Star History

This project is licensed under CC BY 4.0.

35/F,Tencent Building,Kejizhongyi Avenue,Nanshan District,Shenzhen

京ICP备11018762号-111