LLM Streaming
@humanspeak/svelte-markdown handles real-time streaming from Large Language Models (LLMs) like ChatGPT, Claude, Gemini, and other AI assistants out of the box. As tokens arrive via Server-Sent Events (SSE) or WebSocket connections, simply append them to the source prop and the rendered markdown updates instantly.
How It Works
LLM APIs stream responses token-by-token. Each token is a small chunk of text — sometimes a word, sometimes a partial word, sometimes punctuation or whitespace. The typical integration pattern is:
- The LLM API sends tokens via Server-Sent Events (SSE) or a streaming HTTP response.
- Your application accumulates tokens into a growing markdown string.
- SvelteMarkdown re-parses and re-renders the full source on each update.
- Svelte's fine-grained reactivity ensures only the changed DOM nodes are updated.
Because the component is reactive by default, there is no special “streaming mode” to enable. It just works.
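For example, the same accumulate-and-append pattern works over a WebSocket connection. A minimal sketch (the `wss://example.com/ws` endpoint is hypothetical, and the server is assumed to send plain-text token chunks):

```svelte
<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'

    let source = $state('')

    // Hypothetical endpoint; assumes each message is a plain-text token chunk
    const socket = new WebSocket('wss://example.com/ws')
    socket.onmessage = (event) => {
        source += event.data
    }
</script>

<SvelteMarkdown {source} />
```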
Basic Usage
With the Anthropic SDK (Claude)
```svelte
<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'
    import Anthropic from '@anthropic-ai/sdk'

    let source = $state('')

    async function streamResponse(prompt) {
        const client = new Anthropic()
        const stream = client.messages.stream({
            model: 'claude-sonnet-4-20250514',
            max_tokens: 1024,
            messages: [{ role: 'user', content: prompt }]
        })

        for await (const event of stream) {
            if (
                event.type === 'content_block_delta' &&
                event.delta.type === 'text_delta'
            ) {
                source += event.delta.text
            }
        }
    }
</script>

<SvelteMarkdown {source} />
```

With the OpenAI SDK (ChatGPT)
```svelte
<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'
    import OpenAI from 'openai'

    let source = $state('')

    async function streamResponse(prompt) {
        const client = new OpenAI()
        const stream = await client.chat.completions.create({
            model: 'gpt-4o',
            messages: [{ role: 'user', content: prompt }],
            stream: true
        })

        for await (const chunk of stream) {
            const delta = chunk.choices[0]?.delta?.content
            if (delta) {
                source += delta
            }
        }
    }
</script>

<SvelteMarkdown {source} />
```

With fetch and Server-Sent Events
```svelte
<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'

    let source = $state('')

    async function streamFromAPI(prompt) {
        const response = await fetch('/api/chat', {
            method: 'POST',
            body: JSON.stringify({ prompt }),
            headers: { 'Content-Type': 'application/json' }
        })

        if (!response.ok) throw new Error(`HTTP ${response.status}`)

        const reader = response.body.getReader()
        const decoder = new TextDecoder()

        try {
            while (true) {
                const { done, value } = await reader.read()
                if (done) break
                source += decoder.decode(value, { stream: true })
            }
        } finally {
            reader.releaseLock()
        }
    }
</script>

<SvelteMarkdown {source} />
```

Performance Characteristics
We measured render performance across different streaming speeds and chunking strategies using the interactive streaming demo:
| Streaming Speed | Chunk Mode | Avg Render | Peak Render | Dropped Frames |
|---|---|---|---|---|
| 30 words/sec | Word | ~3ms | ~11ms | 0 |
| 100 chars/sec | Character | ~4ms | ~21ms | 0 |
| 50 words/sec | Word | ~3ms | ~12ms | 0 |
All render times stay well under the 16.7ms frame budget (60fps), meaning the browser has time to paint every frame without jank. Even at 100 characters per second in character mode (a worst-case scenario far beyond real LLM speeds), average render time remains under 5ms.
How Render Time Scales
Render time grows linearly with document length because the full markdown source is re-parsed on each update. For a typical LLM response (~2,000 characters), the overhead is negligible:
- 0-500 chars: <1ms per render
- 500-1,000 chars: ~2-3ms per render
- 1,000-2,000 chars: ~5-7ms per render
- 2,000+ chars: ~7-10ms per render
For very long documents (10,000+ characters), consider the optimization strategies below.
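To check this scaling against your own content, you can time repeated parses with a small harness. This is a rough, framework-free sketch: `parse` is any function that processes a markdown string, standing in for the component's internal parser, and `measureParseScaling` is a hypothetical helper, not part of this package.

```javascript
// Rough harness: time how long a single parse takes as the source grows.
// `parse` is a stand-in for whatever does the markdown tokenization.
function measureParseScaling(parse, lengths) {
    return lengths.map((len) => {
        // Build a source string of roughly the requested length
        const source = '**word** '.repeat(Math.ceil(len / 9)).slice(0, len)
        const start = performance.now()
        parse(source)
        return { len, ms: performance.now() - start }
    })
}
```

Running it with lengths like `[500, 1000, 2000, 5000]` should show the roughly linear growth described above.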
Best Practices
1. Prefer Word-Level Chunking
If you control the chunking strategy (e.g., in a custom SSE endpoint), emit tokens at word boundaries rather than individual characters. This reduces the total number of re-renders while producing the same visual result:
```javascript
// Server-side: buffer tokens and emit at word boundaries.
// Assumes this runs inside a ReadableStream's start(controller)
// callback, with `encoder = new TextEncoder()` in scope.
let buffer = ''
for await (const token of llmStream) {
    buffer += token
    if (buffer.endsWith(' ') || buffer.endsWith('\n')) {
        controller.enqueue(encoder.encode(buffer))
        buffer = ''
    }
}
// Flush whatever remains once the stream ends
if (buffer) controller.enqueue(encoder.encode(buffer))
```

2. Use Token Caching for Chat History
When displaying a conversation with multiple messages, previously completed messages are re-rendered on every update. Enable token caching so that completed messages skip re-parsing:
```svelte
<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'

    let messages = $state([])
</script>

{#each messages as message}
    <!-- Completed messages hit the token cache automatically -->
    <SvelteMarkdown source={message.content} />
{/each}
```

3. Debounce for Extremely Fast Streams
If your LLM stream is unusually fast (100+ tokens/second) and you notice frame drops, you can batch updates using requestAnimationFrame:
```javascript
let pending = ''
let rafScheduled = false

function onToken(token) {
    pending += token
    if (!rafScheduled) {
        rafScheduled = true
        requestAnimationFrame(() => {
            source += pending
            pending = ''
            rafScheduled = false
        })
    }
}
```

This coalesces multiple tokens into a single render per frame, reducing total renders from ~100/sec to ~60/sec while maintaining smooth visual output.
Estimating Streaming Costs
When building production LLM streaming UIs, understanding token costs is as important as render performance. Each streamed token has a price that varies by model, provider, and whether it’s an input or output token. ModelPricing.ai provides a pricing estimation API that covers all major LLM providers — useful for displaying real-time cost tracking alongside your streamed responses, setting usage budgets, or building cost-aware model selection into your application.
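As a sketch, a minimal client-side tracker might look like this. The per-million-token rates below are illustrative placeholders, not real prices; in practice you would fetch current rates from a pricing API such as ModelPricing.ai, and token counts would come from your provider's usage events.

```javascript
// Minimal cost tracker; rates are placeholder USD prices per million tokens
function createCostTracker({ inputRate = 3.0, outputRate = 15.0 } = {}) {
    let inputTokens = 0
    let outputTokens = 0
    return {
        addInput(count) { inputTokens += count },
        addOutput(count) { outputTokens += count },
        // Running spend estimate, suitable for display next to a streamed response
        get totalCost() {
            return (inputTokens * inputRate + outputTokens * outputRate) / 1_000_000
        }
    }
}
```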
Try It Live
Experiment with different streaming speeds, jitter, and chunk modes in the interactive LLM streaming demo.
Related
- Token Caching — cache parsed tokens for instant re-renders
- Getting Started — installation and basic usage
- SvelteMarkdown API — full prop reference
- ModelPricing.ai — LLM pricing estimation API for cost tracking