Interactive visualization and exploration of scientific papers from the Aella open science dataset.
This project is a collaboration between Inference.net and LAION. LAION curated the original dataset of roughly 100M scraped scientific and research articles, and Inference.net fine-tuned a custom model to extract structured summaries from them. This repo contains a visual explorer for a small subset of the extracted dataset.
This will install both backend and frontend dependencies.
Quick Start
1. Get the Database
Download the database from R2:
task db:setup
This will download the SQLite database to backend/data/db.sqlite.
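To confirm the download worked, a quick sanity check can open the file and list its tables. This is a minimal sketch using only Python's standard library. The helper name `list_tables` is hypothetical and not part of this repo, and the sketch makes no assumptions about the actual schema.

```python
import sqlite3
from pathlib import Path

# Path taken from the step above.
DB_PATH = Path("backend/data/db.sqlite")


def list_tables(db_path):
    """Return the names of all tables in the SQLite file, sorted."""
    # Open read-only so a wrong path raises an error instead of
    # silently creating an empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    if DB_PATH.exists():
        print(list_tables(DB_PATH))
    else:
        print(f"{DB_PATH} not found; run `task db:setup` first")
```

An empty list or an error here most likely means the download did not complete.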
2. Run the Application
Run the backend and frontend in separate terminals:
Backend (Terminal 1):
task backend:dev
Frontend (Terminal 2):
task frontend:dev
The application will be available at:
Frontend: http://localhost:5173
API: http://localhost:8787
API Docs: http://localhost:8787/docs
Data Pipeline
The code for the data pipeline used to construct this dataset is not yet open source, mostly because it was set up for a one-time process and is not production-ready.
However, the general process was:
Initial data extraction and filtering
Ran a pipeline to generate the summaries
Excluded specific non-scientific content and failed summaries
Compiled results for further processing
Semantic Embedding
Generates 768-dimensional embeddings using SPECTER2 (allenai/specter2_base)
Processes papers in batches with GPU acceleration support
Stores embeddings as binary blobs for similarity search
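Since the pipeline code is not published, the following is only an illustrative sketch of how a 768-dimensional vector could be packed into a binary blob and searched by brute-force cosine similarity. It assumes float32 values in native byte order and a hypothetical `embeddings(id, embedding)` table. The real storage layout and search method may differ.

```python
import math
import sqlite3
from array import array

DIM = 768  # embedding size stated above


def to_blob(vector):
    """Pack a sequence of floats into float32 bytes (native byte order)."""
    return array("f", vector).tobytes()


def from_blob(blob):
    """Unpack float32 bytes back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(conn, query, k=5):
    """Brute-force search: score every stored vector against the query.

    The table and column names are assumptions made for this sketch.
    """
    scored = [
        (cosine(query, from_blob(blob)), paper_id)
        for paper_id, blob in conn.execute("SELECT id, embedding FROM embeddings")
    ]
    return sorted(scored, reverse=True)[:k]
```

Each blob is `DIM * 4` bytes. A linear scan like this is fine for a small subset, though a larger collection would usually call for a vector index.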
Visualization & Clustering
Reduces embeddings to 2D coordinates using UMAP with cosine distance
Applies K-Means clustering with automatic optimization (20-60 clusters via silhouette scores)
Generates initial cluster labels using TF-IDF analysis of titles and fields
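The labeling step can be illustrated with a small standard-library sketch. It treats all text in a cluster as one document and ranks terms by a smoothed TF-IDF score. The function names, the tokenizer, and the exact weighting are assumptions for illustration, not the pipeline's actual code. The UMAP and K-Means steps are not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of 3+ letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def label_clusters(texts_by_cluster, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each cluster.

    texts_by_cluster maps a cluster id to a list of strings
    (for example paper titles and fields).
    """
    # Term counts per cluster, treating each cluster as one document.
    counts = {
        cid: Counter(tok for text in texts for tok in tokenize(text))
        for cid, texts in texts_by_cluster.items()
    }
    n_clusters = len(counts)
    # Document frequency: in how many clusters each term appears.
    df = Counter(term for c in counts.values() for term in c)

    labels = {}
    for cid, c in counts.items():
        total = sum(c.values()) or 1
        scores = {
            # Smoothed IDF keeps terms shared by every cluster above zero.
            term: (freq / total) * (math.log((1 + n_clusters) / (1 + df[term])) + 1)
            for term, freq in c.items()
        }
        # Sort by score descending, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        labels[cid] = ranked[:top_n]
    return labels
```

This sketch does no stop-word filtering, so common words can surface in small clusters. A real labeling pass would likely filter them out.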
https://inference.net/
https://laion.inference.net/embeddings
LAION research paper dataset visual explorer 🔬 🧑‍🔬 👩‍🔬
https://github.com/context-labs/aella-data-explorer.git