Show HN: NCompass Technologies – yet another AI Inference API, but hear us out
Cost-effective AI model inference at scale
When a large number of concurrent requests hit state-of-the-art serving systems such as vLLM, response times on a single GPU rise steeply. We've built custom AI inference serving software that maintains high quality of service on fewer GPUs: our hardware-aware request scheduler and Kubernetes autoscaler together let us hit the same quality-of-service targets with 50% fewer GPUs than alternatives.
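The post doesn't describe the scheduler's internals, but here is a minimal sketch of what hardware-aware dispatch plus autoscaling could look like, assuming per-GPU queue depth and KV-cache pressure as the load signals: route each request to the GPU with the most headroom, and ask the autoscaler for another replica only when every GPU is saturated. All class names, signals, and thresholds are hypothetical illustrations, not NCompass's actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class GPUWorker:
    name: str
    queue_depth: int = 0        # requests currently queued on this GPU
    kv_cache_util: float = 0.0  # fraction of KV-cache memory in use

class HardwareAwareScheduler:
    """Sketch of hardware-aware dispatch: send each request to the GPU
    with the most headroom, and flag the autoscaler when all GPUs are
    saturated. Thresholds are illustrative, not NCompass's values."""

    def __init__(self, workers, max_queue_depth=8, max_kv_util=0.85):
        self.workers = workers
        self.max_queue_depth = max_queue_depth
        self.max_kv_util = max_kv_util

    def _load(self, w: GPUWorker) -> float:
        # Blend queue depth and memory pressure into a single load score.
        return (w.queue_depth / self.max_queue_depth
                + w.kv_cache_util / self.max_kv_util)

    def dispatch(self, request_id: str):
        best = min(self.workers, key=self._load)
        if (best.queue_depth >= self.max_queue_depth
                or best.kv_cache_util >= self.max_kv_util):
            # Every GPU is at its limit: stop admitting work here and
            # signal the autoscaler to add a replica.
            print(f"{request_id}: all GPUs saturated, requesting scale-up")
            return None
        best.queue_depth += 1
        print(f"{request_id} -> {best.name}")
        return best.name

if __name__ == "__main__":
    gpus = [GPUWorker("gpu-0"), GPUWorker("gpu-1")]
    sched = HardwareAwareScheduler(gpus)
    for i in range(12):
        for g in gpus:  # simulate rising KV-cache pressure under load
            g.kv_cache_util = min(1.0, g.kv_cache_util + random.uniform(0.0, 0.15))
        sched.dispatch(f"req-{i}")
```

The key design point this sketch illustrates is that the scheduler and the autoscaler share one saturation signal, so scale-up happens before latency collapses rather than after.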