How streaming LLM APIs work
I decided to have a poke around and see if I could figure out how the HTTP streaming APIs from the various hosted LLM providers actually worked. Here are my notes so far.
All three of the APIs I investigated worked roughly the same way: they return data with a `content-type: text/event-stream` header, which matches the server-sent events mechanism, then stream blocks separated by `\r\n\r\n`. The `"stream_options": {"include_usage": true}` bit requests that the final message in the stream include details of how many input and output tokens were charged while processing the prompt.

Google Gemini returns much larger token chunks, so I had to prompt "Tell me a very long joke" to get back a streaming response that included multiple parts.
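To make the mechanics concrete, here's a minimal sketch of consuming one of these streams using Python's `requests` library against OpenAI's chat completions endpoint. The `OPENAI_API_KEY` environment variable and the exact printing logic are assumptions for illustration, not part of my original notes:

```python
# A rough sketch of reading an OpenAI-style server-sent events stream.
# Assumes an OPENAI_API_KEY environment variable is set.
import json
import os

import requests

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Tell me a very long joke"}],
        "stream": True,
        # Ask for a final event carrying input/output token counts
        "stream_options": {"include_usage": True},
    },
    stream=True,
)
response.raise_for_status()

for line in response.iter_lines():
    # Each SSE event arrives as a "data: {...}" line; events are
    # separated by blank lines (\r\n\r\n on the wire), which this
    # startswith check skips over
    if not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":  # sentinel marking the end of the stream
        break
    event = json.loads(payload)
    if event.get("choices"):
        # Each chunk carries a delta with the next fragment of text
        print(event["choices"][0]["delta"].get("content") or "", end="", flush=True)
    if event.get("usage"):
        # With include_usage, the final chunk reports the token counts
        print("\n", event["usage"])
```

Nothing here depends on an SDK: the stream is just HTTP lines you can parse by hand, which is what makes it easy to poke at with curl or a few lines of code.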