Show HN: NCompass Technologies – yet another AI Inference API, but hear us out
Cost-effective AI model inference at scale
When a large number of concurrent requests hit state-of-the-art serving systems such as vLLM, response times on a single GPU rise steeply. We've built custom AI inference serving software that maintains high quality of service on fewer GPUs: our hardware-aware request scheduler and Kubernetes autoscaler together let us hit the same quality-of-service targets with 50% fewer GPUs than alternatives.
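The post doesn't describe the scheduler's internals, but here is a minimal sketch of what hardware-aware dispatch plus autoscaling could look like, assuming per-GPU queue depth and KV-cache pressure as the load signals: route each request to the GPU with the most headroom, and ask the autoscaler for another replica only when every GPU is saturated. All class names, signals, and thresholds are hypothetical illustrations, not NCompass's actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class GPUWorker:
    name: str
    queue_depth: int = 0        # requests currently queued on this GPU
    kv_cache_util: float = 0.0  # fraction of KV-cache memory in use

class HardwareAwareScheduler:
    """Sketch of hardware-aware dispatch: send each request to the GPU
    with the most headroom, and flag the autoscaler when all GPUs are
    saturated. Thresholds are illustrative, not NCompass's values."""

    def __init__(self, workers, max_queue_depth=8, max_kv_util=0.85):
        self.workers = workers
        self.max_queue_depth = max_queue_depth
        self.max_kv_util = max_kv_util

    def _load(self, w: GPUWorker) -> float:
        # Blend queue depth and memory pressure into a single load score.
        return (w.queue_depth / self.max_queue_depth
                + w.kv_cache_util / self.max_kv_util)

    def dispatch(self, request_id: str):
        best = min(self.workers, key=self._load)
        if (best.queue_depth >= self.max_queue_depth
                or best.kv_cache_util >= self.max_kv_util):
            # Every GPU is at its limit: stop admitting work here and
            # signal the autoscaler to add a replica.
            print(f"{request_id}: all GPUs saturated, requesting scale-up")
            return None
        best.queue_depth += 1
        print(f"{request_id} -> {best.name}")
        return best.name

if __name__ == "__main__":
    gpus = [GPUWorker("gpu-0"), GPUWorker("gpu-1")]
    sched = HardwareAwareScheduler(gpus)
    for i in range(12):
        for g in gpus:  # simulate rising KV-cache pressure under load
            g.kv_cache_util = min(1.0, g.kv_cache_util + random.uniform(0.0, 0.15))
        sched.dispatch(f"req-{i}")
```

The key design point this sketch illustrates is that the scheduler and the autoscaler share one saturation signal, so scale-up happens before latency collapses rather than after.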