LLM Streaming

@humanspeak/svelte-markdown handles real-time streaming from Large Language Models (LLMs) like ChatGPT, Claude, Gemini, and other AI assistants out of the box. As tokens arrive via Server-Sent Events (SSE) or WebSocket connections, simply append them to the source prop and the rendered markdown updates instantly.

How It Works

LLM APIs stream responses token-by-token. Each token is a small chunk of text — sometimes a word, sometimes a partial word, sometimes punctuation or whitespace. The typical integration pattern is:

  1. The LLM API sends tokens via Server-Sent Events (SSE) or a streaming HTTP response.
  2. Your application accumulates tokens into a growing markdown string.
  3. SvelteMarkdown re-parses and re-renders the full source on each update.
  4. Svelte’s fine-grained reactivity ensures only the changed DOM nodes are updated.

Because the component is reactive by default, there is no special “streaming mode” to enable. It just works.
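Framework aside, steps 1–3 reduce to plain string accumulation. A minimal sketch with a simulated token stream (the async generator below is a stand-in for a real SSE or WebSocket connection):

```javascript
// Simulated LLM stream: in practice these tokens arrive over SSE or WebSocket.
async function* fakeTokenStream() {
    for (const token of ['# Hello', ' ', 'from', ' ', 'an', ' ', '**LLM**']) {
        yield token
    }
}

async function accumulate() {
    let source = ''
    for await (const token of fakeTokenStream()) {
        // In a Svelte component this reassignment is what triggers the re-render.
        source += token
    }
    return source
}

accumulate().then((source) => console.log(source)) // → # Hello from an **LLM**
```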

Basic Usage

With the Anthropic SDK (Claude)

<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'
    import Anthropic from '@anthropic-ai/sdk'

    let source = $state('')

    async function streamResponse(prompt) {
        const client = new Anthropic()

        const stream = client.messages.stream({
            model: 'claude-sonnet-4-20250514',
            max_tokens: 1024,
            messages: [{ role: 'user', content: prompt }]
        })

        for await (const event of stream) {
            if (
                event.type === 'content_block_delta' &&
                event.delta.type === 'text_delta'
            ) {
                source += event.delta.text
            }
        }
    }
</script>

<SvelteMarkdown {source} />

With the OpenAI SDK (ChatGPT)

<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'
    import OpenAI from 'openai'

    let source = $state('')

    async function streamResponse(prompt) {
        const client = new OpenAI()

        const stream = await client.chat.completions.create({
            model: 'gpt-4o',
            messages: [{ role: 'user', content: prompt }],
            stream: true
        })

        for await (const chunk of stream) {
            const delta = chunk.choices[0]?.delta?.content
            if (delta) {
                source += delta
            }
        }
    }
</script>

<SvelteMarkdown {source} />

With fetch and a Streaming HTTP Response

<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'

    let source = $state('')

    async function streamFromAPI(prompt) {
        const response = await fetch('/api/chat', {
            method: 'POST',
            body: JSON.stringify({ prompt }),
            headers: { 'Content-Type': 'application/json' }
        })

        if (!response.ok) throw new Error(`HTTP ${response.status}`)

        const reader = response.body.getReader()
        const decoder = new TextDecoder()

        try {
            while (true) {
                const { done, value } = await reader.read()
                if (done) break
                source += decoder.decode(value, { stream: true })
            }
        } finally {
            reader.releaseLock()
        }
    }
</script>

<SvelteMarkdown {source} />

Performance Characteristics

We measured render performance across different streaming speeds and chunking strategies using the interactive streaming demo:

Streaming Speed | Chunk Mode | Avg Render | Peak Render | Dropped Frames
--------------- | ---------- | ---------- | ----------- | --------------
30 words/sec    | Word       | ~3ms       | ~11ms       | 0
100 chars/sec   | Character  | ~4ms       | ~21ms       | 0
50 words/sec    | Word       | ~3ms       | ~12ms       | 0

All render times stay well under the 16.7ms frame budget (60fps), meaning the browser has time to paint every frame without jank. Even at 100 characters per second in character mode (a worst-case scenario far beyond real LLM speeds), average render time remains under 5ms.

How Render Time Scales

Render time grows linearly with document length because the full markdown source is re-parsed on each update. For a typical LLM response (~2,000 characters), the overhead is negligible:

  • 0-500 chars: <1ms per render
  • 500-1,000 chars: ~2-3ms per render
  • 1,000-2,000 chars: ~5-7ms per render
  • 2,000+ chars: ~7-10ms per render

For very long documents (10,000+ characters), consider the optimization strategies below.

Best Practices

1. Prefer Word-Level Chunking

If you control the chunking strategy (e.g., in a custom SSE endpoint), emit tokens at word boundaries rather than individual characters. This reduces the total number of re-renders while producing the same visual result:

// Server-side: buffer tokens and emit at word boundaries.
// Runs inside a ReadableStream start(controller) callback, where
// `encoder` is a TextEncoder and `llmStream` is the upstream token iterator.
let buffer = ''
for await (const token of llmStream) {
    buffer += token
    if (buffer.endsWith(' ') || buffer.endsWith('\n')) {
        controller.enqueue(encoder.encode(buffer))
        buffer = ''
    }
}
if (buffer) controller.enqueue(encoder.encode(buffer))

2. Use Token Caching for Chat History

When displaying a conversation with multiple messages, previously completed messages are re-rendered on every update. Enable token caching so that completed messages skip re-parsing:

<script>
    import SvelteMarkdown from '@humanspeak/svelte-markdown'

    let messages = $state([])
</script>

{#each messages as message}
    <!-- Completed messages hit the token cache automatically -->
    <SvelteMarkdown source={message.content} />
{/each}

3. Debounce for Extremely Fast Streams

If your LLM stream is unusually fast (100+ tokens/second) and you notice frame drops, you can batch updates using requestAnimationFrame:

let pending = ''
let rafScheduled = false

function onToken(token) {
    pending += token
    if (!rafScheduled) {
        rafScheduled = true
        requestAnimationFrame(() => {
            source += pending
            pending = ''
            rafScheduled = false
        })
    }
}

This coalesces multiple tokens into a single render per frame, reducing total renders from ~100/sec to ~60/sec while maintaining smooth visual output.
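The coalescing behavior can be sanity-checked outside the browser by stubbing requestAnimationFrame with a manual frame queue (a sketch for illustration; in the browser the real requestAnimationFrame does this scheduling):

```javascript
// Manual frame queue standing in for the browser's requestAnimationFrame.
const frameQueue = []
const requestAnimationFrame = (cb) => frameQueue.push(cb)
const runFrame = () => frameQueue.splice(0).forEach((cb) => cb())

let source = ''
let pending = ''
let rafScheduled = false
let renders = 0

function onToken(token) {
    pending += token
    if (!rafScheduled) {
        rafScheduled = true
        requestAnimationFrame(() => {
            source += pending
            pending = ''
            rafScheduled = false
            renders++ // one source update (one render) per frame
        })
    }
}

// Four tokens arrive within a single frame...
for (const token of ['Hello', ',', ' ', 'world']) onToken(token)
runFrame()
console.log(renders, JSON.stringify(source)) // → 1 "Hello, world"
```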

Estimating Streaming Costs

When building production LLM streaming UIs, understanding token costs is as important as render performance. Each streamed token has a price that varies by model, provider, and whether it’s an input or output token. ModelPricing.ai provides a pricing estimation API that covers all major LLM providers — useful for displaying real-time cost tracking alongside your streamed responses, setting usage budgets, or building cost-aware model selection into your application.
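As a rough sketch of running cost tracking alongside a stream — note the per-million-token prices below are illustrative placeholders, not real ModelPricing.ai data, and the ~4-characters-per-token heuristic is a crude stand-in for a real tokenizer:

```javascript
// Illustrative prices in USD per million tokens — placeholders, not live data.
const PRICING = {
    'example-model': { inputPerMTok: 3.0, outputPerMTok: 15.0 }
}

function estimateCost(model, inputTokens, outputTokens) {
    const price = PRICING[model]
    if (!price) throw new Error(`No pricing for ${model}`)
    return (
        (inputTokens / 1_000_000) * price.inputPerMTok +
        (outputTokens / 1_000_000) * price.outputPerMTok
    )
}

// Rough token count for the streamed source so far (~4 chars per token).
function estimateStreamedTokens(source) {
    return Math.ceil(source.length / 4)
}

const cost = estimateCost('example-model', 1_000, 2_000)
console.log(`$${cost.toFixed(4)}`) // → $0.0330
```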

Try It Live

Experiment with different streaming speeds, jitter, and chunk modes in the interactive LLM streaming demo.

Related