Chat with Llama and other LLMs that run locally in your browser using WebLLM.
Author

Nima Sarang

Published

October 6, 2024

1 Introduction

Recently, I came across a few cool projects that let you run LLMs (large language models) in the browser and use the GPU for fast inference! This is exciting for two main reasons. First, you can run LLMs locally without sending your data to any external server: you get custom prompts, inference configured the way you want, and your privacy stays intact. Second, you don’t have to deal with setting up Python environments, installing dependencies, or configuring the GPU; the browser takes care of all that, which is incredibly convenient. With the release of the smaller Llama 3.2 models, like the 1B-parameter version, it’s now possible to generate quality text directly in your browser, perhaps even matching ChatGPT 3.5 on some tasks.

The chat UI below lets you download a number of models hosted on HuggingFace, including Llama and Qwen variants. Once you download a model, you can disconnect and continue chatting offline. I’ve added a bunch of customizations to play around with, but the UI is meant mainly as a demonstration of what a web-based chat interface can do.

2 Chat UI

Your browser must support WebGPU for a fast inference experience. You can see the list of browsers that support it here. I personally tested it on Apple M1-M3 and Nvidia GPUs, and it worked quite well. If WebGPU is not supported, the engine will fall back to WebAssembly, which is slower but still functional.

Note

Based on your browser’s information, you have the following backend ☞
For more info, visit https://webgpureport.org/.

If you see “WebGPU not supported” in the status above but you’re certain your browser supports it, you can leave a comment below and I’ll try to help you troubleshoot the issue.
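If you’re curious, a backend check like the one above boils down to a couple of WebGPU calls. The sketch below is not necessarily the exact check this page runs, just the general idea:

// Rough sketch: report WebGPU if an adapter is available, otherwise fall back to WebAssembly.
const detectBackend = async () => {
    if (!("gpu" in navigator)) return "WebAssembly (WebGPU not supported)";
    const adapter = await navigator.gpu.requestAdapter();
    return adapter ? "WebGPU" : "WebAssembly (no suitable GPU adapter)";
};

detectBackend().then((backend) => console.log(`Detected backend: ${backend}`));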

Model Settings

The first download may take a little while. Subsequent loads will read from the cache.

My Chat

3 How It Works

The code above runs on WebLLM, a project that enables running LLMs in the browser without requiring a server. It’s built on top of MLC-LLM (Machine Learning Compilation for LLM), a framework that optimizes models for efficient execution on a variety of hardware platforms. MLC-LLM is itself an application of Apache TVM, a machine-learning compiler stack that takes pre-trained models and compiles them into deployable modules that can be embedded and run almost anywhere. On top of TVM, MLC-LLM adds LLM-specific graph transformations and optimized kernels for common operations.

One of the major advantages of WebLLM is that it can use WebGPU for accelerated inference. WebGPU is a web standard, supported by most modern browsers, that gives developers low-level access to the GPU for general-purpose computing tasks. It’s a relatively new standard and is still being rolled out across browsers.
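WebLLM handles all of the WebGPU plumbing for you, but for reference, this is roughly what getting hold of a GPU device looks like with the raw API. It’s only a sketch using the standard WebGPU calls, and it assumes an ES-module context where top-level await is allowed:

// Request a physical adapter, then a logical device: the WebGPU entry points.
const adapter = await navigator.gpu?.requestAdapter();
if (!adapter) {
    throw new Error("WebGPU is not available in this browser.");
}
const device = await adapter.requestDevice();
// Device limits such as maxBufferSize hint at how large the model's buffers can be.
console.log("maxBufferSize:", device.limits.maxBufferSize);

With WebLLM, you don’t touch any of this directly; the snippet below is all it takes to get an engine up and running: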

// Import WebLLM
import * as webllm from 'https://esm.run/@mlc-ai/web-llm';

// Initialize the engine
const engine = new webllm.MLCEngine();
// Optional chat options (temperature, etc.); left empty here
const config = {};
// Download and initialize the model, or load it from the cache if available
await engine.reload('Llama-3.2-1B-Instruct-q4f32_1-MLC', config);
// The list of available prebuilt models can be printed with:
console.log(webllm.prebuiltAppConfig.model_list);
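Since the first download can take a while, it’s worth surfacing progress to the user. WebLLM lets you pass an init-progress callback when constructing the engine; the sketch below assumes the same model ID as above:

// Report model download/initialization progress (it also fires when loading from cache).
const engine = new webllm.MLCEngine({
    initProgressCallback: (report) => {
        // `report.progress` is a fraction in [0, 1]; `report.text` is a human-readable status.
        console.log(`${Math.round(report.progress * 100)}% - ${report.text}`);
    },
});
await engine.reload('Llama-3.2-1B-Instruct-q4f32_1-MLC');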



With the engine initialized, I use the following to generate responses:

/**
 * Generate a response using the engine.
 * @param {Array} messages - Conversation history. Each message is an object
 *     with keys `role` and `content`:
 *     - `role`: `user`, `assistant`, or `system`. The prompt is encoded as
 *       the first message, with the role `system`.
 *     - `content`: the message content.
 * @param {Function} onUpdate - Callback function to update the UI with the partial message
 * @param {Function} onFinish - Callback function to handle the final message
 * @param {Function} onError - Callback function to handle errors
 */
const streamingGenerating = async (messages, onUpdate, onFinish, onError) => {
    try {
        let curMessage = "";
        let usage;
        // The model configuration such as temperature, max_tokens, etc.
        const config = modelConfig.getConfig();
        const completion = await engine.chat.completions.create({
            stream: true,
            messages,
            ...config,
            stream_options: { include_usage: true },
        });
        // Stream the completion
        for await (const chunk of completion) {
            const curDelta = chunk.choices[0]?.delta.content;
            if (curDelta) {
                curMessage += curDelta;
            }
            if (chunk.usage) {
                usage = chunk.usage;
            }
            // Update the UI
            onUpdate(curMessage);
        }
        // Get the final message
        const finalMessage = await engine.getMessage();
        onFinish(finalMessage, usage);
    } catch (err) {
        onError(err);
    }
};
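For completeness, here’s a hypothetical way to call it; the #chat-output element and the message contents are placeholders for illustration, and modelConfig is this demo’s own settings store referenced inside the function:

// Hypothetical usage of streamingGenerating with a short conversation and simple callbacks.
const messages = [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain WebGPU in one sentence." },
];

await streamingGenerating(
    messages,
    (partial) => {                            // onUpdate: render the partial response
        document.querySelector("#chat-output").textContent = partial;
    },
    (finalMessage, usage) => {                // onFinish: final text plus token usage stats
        console.log("Done:", finalMessage, usage);
    },
    (err) => console.error("Generation failed:", err),  // onError
);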

You can learn more about the other options and how to use them in the documentation.

4 Final Thoughts

As of writing this, WebLLM is the fastest library I’ve found for running chat LLMs on the web. For other tasks like speech-to-text, image generation, etc., you can look into the combination of Transformers.js and ONNX Runtime Web, or just ONNX Runtime Web if you’re looking for a more general-purpose solution. If you have any questions or feedback, feel free to leave a comment below.

5 Change Log

  • 2024/10/12: Added “How It Works” section.
  • 2024/10/04: Added Markdown support for the output messages.
  • 2024/10/02: Added download size estimation for the models from the HF repo.
  • 2024/10/01: Model configuration settings added to the UI with tooltips.
  • 2024/09/29: Initial release.

Citation

BibTeX citation:
@online{sarang2024,
  author = {Sarang, Nima},
  title = {Running {Llama} 3 in the {Browser!}},
  date = {2024-10-06},
  url = {https://www.nimasarang.com/blog/2024-09-29-llama-in-browser/},
  langid = {en}
}
For attribution, please cite this work as:
Sarang, Nima. 2024. “Running Llama 3 in the Browser!” October 6, 2024. https://www.nimasarang.com/blog/2024-09-29-llama-in-browser/.