
A powerful tool for creating fine-tuning datasets for Large Language Models
Easy Dataset is an application specifically designed for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.
With Easy Dataset, you can transform domain knowledge into structured datasets. It works with any LLM API that follows the OpenAI format, which keeps the fine-tuning process simple and efficient.

https://github.com/user-attachments/assets/6ddb1225-3d1b-4695-90cd-aa4cb01376a8
| Windows | MacOS | Linux |
| --- | --- | --- |
| Setup.exe | Intel, M | AppImage |
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
npm install
npm run build
npm run start
Open http://localhost:1717 in your browser.

git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
Modify the docker-compose.yml file:

services:
  easy-dataset:
    image: ghcr.io/conardli/easy-dataset
    container_name: easy-dataset
    ports:
      - '1717:1717'
    volumes:
      - ${LOCAL_DB_PATH}:/app/local-db
      - ${LOCAL_PRISMA_PATH}:/app/prisma
    restart: unless-stopped
Note: Replace `${LOCAL_DB_PATH}` and `${LOCAL_PRISMA_PATH}` with the actual paths where you want to store the local database. It is recommended to use the `local-db` and `prisma` folders in the current code repository directory, so that the database paths match the ones used when starting via NPM.
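One way to supply these two paths is a `.env` file next to `docker-compose.yml`, which Docker Compose reads when it substitutes `${...}` variables. The following is a minimal sketch, assuming the compose file uses the `LOCAL_DB_PATH` and `LOCAL_PRISMA_PATH` variables shown above and that you run it from the repository root. `REPO_DIR` and `ENV_FILE` are illustrative names, not something the project defines.

```shell
#!/bin/sh
# Assumption: run from the easy-dataset repository root, or set REPO_DIR first.
REPO_DIR="${REPO_DIR:-$PWD}"
ENV_FILE="$REPO_DIR/.env"

# Create the host folders that will back the two container volumes.
mkdir -p "$REPO_DIR/local-db" "$REPO_DIR/prisma"

# Append each variable only if it is not already defined, so an existing
# .env file is left intact and re-running the script adds nothing new.
touch "$ENV_FILE"
grep -q '^LOCAL_DB_PATH=' "$ENV_FILE" || \
  echo "LOCAL_DB_PATH=$REPO_DIR/local-db" >> "$ENV_FILE"
grep -q '^LOCAL_PRISMA_PATH=' "$ENV_FILE" || \
  echo "LOCAL_PRISMA_PATH=$REPO_DIR/prisma" >> "$ENV_FILE"

# Show the values Docker Compose will pick up.
grep '^LOCAL_' "$ENV_FILE"
```

Exporting the same two variables in your shell before starting the container should work as well; the `.env` file just keeps the setting alongside the compose file.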
docker-compose up -d
Open http://localhost:1717 in your browser.

If you want to build the image yourself, use the Dockerfile in the project root directory:
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
docker build -t easy-dataset .
docker run -d \
  -p 1717:1717 \
  -v {YOUR_LOCAL_DB_PATH}:/app/local-db \
  -v {LOCAL_PRISMA_PATH}:/app/prisma \
  --name easy-dataset \
  easy-dataset
Note: Replace `{YOUR_LOCAL_DB_PATH}` and `{LOCAL_PRISMA_PATH}` with the actual paths where you want to store the local database. It is recommended to use the `local-db` and `prisma` folders in the current code repository directory, so that the database paths match the ones used when starting via NPM.
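With `docker run -v`, the host side of a bind mount has to be an absolute path; a bare name such as `local-db` is treated as a named volume instead. The sketch below fills in the two placeholders with absolute paths under the current directory. It assumes you run it from the repository root and that the image was built as `easy-dataset` as shown above. `DB_DIR`, `PRISMA_DIR`, `DRY_RUN`, and the `run` helper are illustrative names, not part of the project.

```shell
#!/bin/sh
# Assumption: the current directory is the easy-dataset repository root.
DB_DIR="$(pwd)/local-db"
PRISMA_DIR="$(pwd)/prisma"
mkdir -p "$DB_DIR" "$PRISMA_DIR"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

run docker run -d \
  -p 1717:1717 \
  -v "$DB_DIR:/app/local-db" \
  -v "$PRISMA_DIR:/app/prisma" \
  --name easy-dataset \
  easy-dataset
```

Running it as is only prints the resolved command, which makes it easy to check the paths before starting the container with `DRY_RUN=0`.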
Open http://localhost:1717 in your browser.
We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:
1. Create a new branch (`git checkout -b feature/amazing-feature`)
2. Commit your changes (`git commit -m 'Add some amazing feature'`)
3. Push to the branch (`git push origin feature/amazing-feature`)

Please make sure tests are updated as appropriate and that your changes follow the existing coding style.
https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men
This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.
If this work is helpful, please kindly cite as:
@misc{miao2025easydataset,
  title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
  author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
  year={2025},
  eprint={2507.04009},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04009}
}