MedPsy-4B is a state-of-the-art, text-only medical and healthcare language model purpose-built for edge deployment. Built on top of Qwen3-4B-Thinking-2507 and post-trained with a multi-stage pipeline (supervised fine-tuning + reinforcement learning) on curated medical data, it surpasses models nearly 7x its size on medical benchmarks.
| Developed by | Tether AI Research |
| Model type | Text-only causal language model (decoder-only transformer) |
| Base model | Qwen3-4B-Thinking-2507 |
| Language | English |
| License | Apache 2.0 |
| Technical report | MedPsy Technical Report |
| Collection | MedPsy on Hugging Face |
| All MedPsy variants | MedPsy-4B · MedPsy-1.7B · MedPsy-4B-GGUF · MedPsy-1.7B-GGUF |
| Benchmark | MedPsy-4B | MedGemma-27B-text-it | Qwen3-4B-Thinking-2507 | MedGemma-1.5-4B-it |
|---|---|---|---|---|
| **Closed-Ended Medical Benchmarks** | | | | |
| Average | 70.54 | 69.95 | 63.10 | 51.20 |
| MMLU (Health) | 89.70 | 90.48 | 85.92 | 67.69 |
| AfriMedQA | 71.50 | 73.07 | 64.12 | 54.38 |
| MMLU-Pro Health | 70.45 | 72.94 | 67.73 | 47.31 |
| MedMCQA | 72.15 | 72.77 | 61.78 | 50.08 |
| MedQA (USMLE) | 84.39 | 83.29 | 70.91 | 64.39 |
| MedXpertQA | 30.61 | 25.18 | 16.69 | 15.80 |
| PubMedQA | 75.00 | 71.93 | 74.53 | 58.73 |
| **HealthBench** | | | | |
| Overall | 74.00 | 65.00 | 63.00 | 54.00 |
| Expertise-Tailored Communication | 79.33 | 73.00 | 71.00 | 62.67 |
| Response Depth | 63.67 | 61.33 | 58.00 | 48.67 |
| Context Seeking | 71.67 | 58.67 | 57.67 | 46.00 |
| Emergency Referrals | 81.67 | 73.00 | 74.00 | 64.00 |
| Global Health | 73.67 | 61.00 | 59.00 | 47.67 |
| Health Data Tasks | 60.67 | 56.67 | 54.67 | 44.67 |
| Responding Under Uncertainty | 76.33 | 66.33 | 64.33 | 58.33 |
| **HealthBench Hard** | | | | |
| Overall | 58.00 | 42.00 | 42.67 | 29.67 |
| Expertise-Tailored Communication | 55.33 | 44.67 | 45.00 | 31.67 |
| Response Depth | 47.67 | 38.67 | 38.67 | 29.00 |
| Context Seeking | 63.33 | 42.00 | 43.00 | 28.00 |
| Emergency Referrals | 62.33 | 39.67 | 47.33 | 29.00 |
| Global Health | 60.00 | 42.67 | 43.33 | 29.00 |
| Health Data Tasks | 46.67 | 39.33 | 39.67 | 23.67 |
| Responding Under Uncertainty | 61.00 | 42.67 | 42.00 | 35.00 |
* MMLU (Health): averaged accuracy across 6 sub-domains: anatomy, clinical_knowledge, college_biology, college_medicine, medical_genetics, professional_medicine.
* HealthBench evaluated using CompassJudger-2-32B-Instruct as judge.
* All results are averaged over 3 runs with generation parameters: temperature=0.6, top_k=20, top_p=0.95, max_output_tokens=16384.
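The closed-ended average reported above is the unweighted mean of the seven per-benchmark scores. A quick sanity check in Python, with the scores copied from the MedPsy-4B column:

```python
# MedPsy-4B per-benchmark accuracies from the table above
scores = {
    "MMLU (Health)": 89.70,
    "AfriMedQA": 71.50,
    "MMLU-Pro Health": 70.45,
    "MedMCQA": 72.15,
    "MedQA (USMLE)": 84.39,
    "MedXpertQA": 30.61,
    "PubMedQA": 75.00,
}

# Unweighted mean across the seven benchmarks
average = round(sum(scores.values()) / len(scores), 2)
print(average)  # 70.54
```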
Beyond raw accuracy, MedPsy-4B achieves a 3.2x reduction in average response length compared to its backbone model (Qwen3-4B-Thinking-2507). This means faster inference, lower compute costs, and reduced latency, all critical for real-time clinical decision support on edge devices.
| Metric | Qwen3-4B-Thinking-2507 | MedPsy-4B |
|---|---|---|
| Avg. response length (tokens) | 2,953 | 909 |
| Δ Reduction | | 3.2x fewer tokens |
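The headline 3.2x figure follows directly from the two averages in the table:

```python
# Average response lengths (tokens) from the table above
qwen_tokens, medpsy_tokens = 2953, 909

# Ratio of backbone length to MedPsy-4B length
reduction = round(qwen_tokens / medpsy_tokens, 1)
print(f"{reduction}x fewer tokens")  # 3.2x fewer tokens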
The chart below shows per-benchmark response lengths. The largest reductions appear on reasoning-intensive tasks (MedXpertQA, MedQA-USMLE, MMLU-Pro Health), where the base model's extended thinking produces substantially longer outputs without accuracy gains over our post-trained model.
*Average response length (tokens) per benchmark. Lower is better. MedPsy-4B consistently produces shorter responses than Qwen3-4B-Thinking-2507 while achieving higher overall accuracy.*
| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | 4B |
| Hidden size | 2,560 |
| FFN hidden size | 9,728 |
| Layers | 36 |
| Attention heads | 32 |
| KV groups (GQA) | 8 |
| Vocab size | 151,936 |
| Max position embeddings | 262,144 |
| Precision | bfloat16 |
| Position embedding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
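These dimensions make the on-device memory footprint easy to estimate. The sketch below computes the approximate per-token KV-cache size implied by the table, assuming bfloat16 (2 bytes per value) and a head dimension of hidden_size / attention_heads = 80; the head dimension is an assumption, not stated in the table, and Qwen3 checkpoints may configure it explicitly.

```python
# Architecture values from the table above
layers = 36
kv_heads = 8          # GQA: 8 key/value groups shared by 32 query heads
hidden_size = 2560
attn_heads = 32
bytes_per_value = 2   # bfloat16

# Assumed head dimension (hidden_size / attention_heads; not stated in the table)
head_dim = hidden_size // attn_heads  # 80

# K and V caches per token, summed across all layers
kv_cache_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(f"{kv_cache_per_token / 1024:.0f} KiB per token")  # 90 KiB per token
```

Under these assumptions, a full 16,384-token output (the max_output_tokens used in evaluation) corresponds to roughly 1.4 GiB of KV cache, which is why the shorter responses matter on edge hardware.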
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "qvac/MedPsy-4B"

# Load tokenizer and model (weights in native precision, auto device placement)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "What are the common symptoms and first-line treatments for community-acquired pneumonia?"}
]

# Build the chat-formatted prompt and generate
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
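Because the backbone is a thinking model, raw completions may begin with a chain-of-thought segment terminated by a `</think>` tag (the opening tag is typically injected by the chat template rather than generated). A minimal stdlib sketch for separating the reasoning from the final answer, assuming that tag convention:

```python
def split_thinking(response: str, tag: str = "</think>") -> tuple[str, str]:
    """Split a raw completion into (thinking, answer) on the closing tag.

    If the tag is absent, the whole response is treated as the answer.
    """
    thinking, sep, answer = response.rpartition(tag)
    if not sep:  # tag not found: rpartition puts the full string in `answer`
        return "", response.strip()
    return thinking.strip(), answer.strip()


raw = "The patient likely has CAP...</think>Common symptoms include fever and cough."
thinking, answer = split_thinking(raw)
print(answer)  # Common symptoms include fever and cough.
```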
The model was post-trained on the Qwen3-4B-Thinking-2507 backbone through a multi-stage pipeline: supervised fine-tuning on curated medical data, followed by reinforcement learning.
For full methodology details, see the MedPsy Technical Report.
MedPsy-4B is an open language model intended as a starting point for developers and researchers building downstream healthcare applications involving medical text. Developers are expected to validate, adapt, and make meaningful modifications to the model for their specific use cases.
Appropriate use cases include:
All such use cases should be accompanied by appropriate disclaimers.
WARNING
This model is NOT a substitute for professional medical judgment and the model outputs are NOT a substitute for proper clinical diagnosis. Always consult with a certified physician. Despite strong benchmark performance, MedPsy-4B is a compact 4B-parameter language model that will make errors. Medical AI systems can produce outputs that appear confident and authoritative while being factually incorrect, incomplete, or clinically inappropriate.
Known limitations include:
When integrating this model into any application:
The model was evaluated on medical safety dimensions through the HealthBench evaluation framework, which assesses Emergency Referrals, Responding Under Uncertainty, and Context Seeking, all critical safety dimensions for medical AI. However, no dedicated red-teaming or adversarial safety testing has been conducted on this model to date. Developers deploying this model in production should conduct their own safety evaluations appropriate to their use case.
@article{medpsy2026,
  title={MedPsy: State-of-the-Art Medical and Healthcare Language Models for Edge Devices},
  author={Vitabile, Davide and Buffa, Alexandro and Nambiar, Akshay and Nazir, Amril},
  year={2026},
  url={https://huggingface.co/blog/qvac/medpsy},
  institution={Tether AI Research}
}
We will take appropriate actions in response to notices of copyright infringement. If you believe your work has been used or copied in a manner that infringes upon your intellectual property rights, please email data-apps@tether.io identifying and describing both the copyrighted work and alleged infringing content.
This model, which was trained as described in the MedPsy Technical Report, is licensed by Tether Data, S.A. de C.V. under the Apache 2.0 license for research and educational purposes. As described above, this model is a version of Qwen3-4B-Thinking-2507, which is also under the Apache 2.0 license.
As described above, a subset of the Genesis I and Genesis II datasets was used with the Baichuan-M3-235B model (itself available under the Apache 2.0 license) to generate synthetic data for training this model. Both the Genesis I and Genesis II datasets are made available under the CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0) license.