TLDR:
- Tether's 1.7B model beat Google's MedGemma-4B by more than 11 points despite being less than half its size.
- The 4B model cuts token output by roughly 3.2x, reducing compute costs and enabling faster on-device responses.
- QVAC MedPsy runs fully local in GGUF format, keeping sensitive patient data off remote cloud servers.
- Models were tested across eight benchmarks, including clinical exams, expert reasoning, and real-world scenarios.
Tether’s AI Research Group has released QVAC MedPsy, a new line of medical language models built to run on smartphones and edge devices.
The models are designed for privacy-first deployment, keeping sensitive health data local. Early benchmark results show the smaller models outperforming much larger competitors.
This marks a shift in how medical AI systems can be structured and deployed.
Compact Models Delivering Strong Benchmark Results
QVAC MedPsy comes in two versions: a 1.7-billion-parameter model and a 4-billion-parameter model. Both were tested across eight medical benchmark suites covering clinical knowledge, expert reasoning, and real-world scenarios. The results were notably competitive against models many times their size.
The 1.7-billion-parameter model scored 62.62 across seven closed-ended benchmarks. That score beat Google's MedGemma-4B by more than 11 points, despite the model being less than half MedGemma's size. On HealthBench Hard, the same model also outperformed MedGemma-27B, which is nearly sixteen times larger.
The 4-billion-parameter version scored 70.54 on those same seven benchmarks. It exceeded MedGemma-27B-text and other models nearly seven times its size. Performance held strong across the HealthBench, HealthBench Hard, and MedXpertQA evaluations.
Tether’s CEO Paolo Ardoino addressed the efficiency directly. “Our 4 billion model exceeded results from models nearly seven times its size, while using up to three times fewer tokens per response,” he said.
Efficiency and Local Deployment Drive the Release
Token efficiency is one of the most practical outcomes of this release. The 4-billion-parameter model generates responses averaging around 909 tokens, while comparable systems use roughly 2,953 tokens per response, a roughly 3.2x reduction in output length.
The 1.7-billion-parameter model averages about 1,110 tokens per response, versus 1,901 for similar systems. Shorter outputs mean faster response times and lower compute costs, both of which affect adoption in real-world healthcare settings.
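As a sanity check on the figures above, the quoted reduction factors can be reproduced with simple arithmetic. A minimal sketch, using only the per-response token averages reported in the text:

```python
# Reproduce the token-efficiency ratios from the reported averages.

def reduction_factor(baseline_tokens: float, model_tokens: float) -> float:
    """How many times fewer tokens the model uses per response."""
    return baseline_tokens / model_tokens

factor_4b = reduction_factor(2953, 909)     # 4B model vs. comparable systems
factor_1_7b = reduction_factor(1901, 1110)  # 1.7B model vs. comparable systems

print(f"4B model:   {factor_4b:.1f}x fewer output tokens")   # ~3.2x
print(f"1.7B model: {factor_1_7b:.1f}x fewer output tokens")  # ~1.7x
```

The 3.2x figure quoted for the 4B model matches 2,953 / 909 ≈ 3.25 after rounding.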
Both models are available in quantized GGUF format for local deployment. The Q4_K_M versions are approximately 1.2 GB and 2.6 GB, respectively. These sizes make the models practical for mobile devices and on-site hospital systems.
The performance gains come from a staged post-training process that combines broad medical supervision, clinical reasoning data, and reinforcement learning on harder cases. No additional model scaling was required to reach these results.
Medical AI has long depended on cloud infrastructure to process sensitive data remotely. QVAC MedPsy changes that by making strong performance available entirely on-device.
For healthcare providers operating under strict privacy rules, this opens new deployment options where cloud access is limited or restricted.