logo
1
0
WeChat Login
drunkpig<60862764+drunkpig@users.noreply.github.com>
Realese 0.8.0 (#586)

Installation

MinerU

git clone https://github.com/opendatalab/MinerU.git cd MinerU conda create -n MinerU python=3.10 conda activate MinerU pip install .[full] --extra-index-url https://wheels.myhloli.com

Third-party software

# install pip install llama-index-vector-stores-elasticsearch==0.2.0 pip install llama-index-embeddings-dashscope==0.2.0 pip install llama-index-core==0.10.68 pip install einops==0.7.0 pip install transformers-stream-generator==0.0.5 pip install accelerate==0.33.0 # uninstall pip uninstall transformer-engine

Environment Configuration

export DASHSCOPE_API_KEY={some_key} export ES_USER={some_es_user} export ES_PASSWORD={some_es_password} export ES_URL=http://{es_url}:9200

For instructions on obtaining a DASHSCOPE_API_KEY, refer to documentation

Usage

Data Ingestion

python data_ingestion.py -p some.pdf # load data from pdf or python data_ingestion.py -p /opt/data/some_pdf_directory/ # load data from multiples pdf which under the directory of {some_pdf_directory}

Query

python query.py --question '{the_question_you_want_to_ask}'

Example

# Start the es service docker compose up -d or docker-compose up -d # Set environment variables export ES_USER=elastic export ES_PASSWORD=llama_index export ES_URL=http://127.0.0.1:9200 export DASHSCOPE_API_KEY={some_key} # Ingest data python data_ingestion.py example/data/declaration_of_the_rights_of_man_1789.pdf # Ask a question python query.py -q 'how about the rights of men' ## outputs Please answer the question based on the content within ```: ``` I. Men are born, and always continue, free and equal in respect of their rights. Civil distinctions, therefore, can be founded only on public utility. ``` My question is:how about the rights of men。 question: how about the rights of men answer: The statement implies that men are born free and equal in terms of their rights. Civil distinctions should only be based on public utility. However, it does not specify what those rights are. It is up to society and individual countries to determine and protect the specific rights of their citizens.

Development

MinerU provides a RAG integration interface, allowing users to specify a single input pdf file or a directory. MinerU will automatically parse the input files and return an iterable interface for retrieving the data.

API Interface

from magic_pdf.integrations.rag.type import Node class RagPageReader: def get_rel_map(self) -> list[ElementRelation]: # Retrieve the relationships between nodes pass ... class RagDocumentReader: ... class DataReader: def __init__(self, path_or_directory: str, method: str, output_dir: str): pass def get_documents_count(self) -> int: """Get the number of pdf documents""" pass def get_document_result(self, idx: int) -> RagDocumentReader | None: """Retrieve the parsed content of a specific pdf""" pass def get_document_filename(self, idx: int) -> Path: """Retrieve the path of a specific pdf""" pass

Type Definitions

class Node(BaseModel): category_type: CategoryType = Field(description='Category') # Category text: str | None = Field(description='Text content', default=None) image_path: str | None = Field(description='Path to image or table (table may be stored as an image)', default=None) anno_id: int = Field(description='Unique ID', default=-1) latex: str | None = Field(description='LaTeX output for equations or tables', default=None) html: str | None = Field(description='HTML output for tables', default=None)

Tables can be stored in one of three formats: image, LaTeX, or HTML. anno_id is a globally unique ID for each Node. It can be used later to match this Node with other Nodes. The relationships between nodes can be retrieved using the get_rel_map method. Users can use anno_id to link nodes and construct a RAG index that includes node relationships.

Node Relationship Matrix

image_bodytable_body
image_captionsibling
table_captionsibling
table_footnotesibling