1 Introduction
Recently, I came across a few cool projects that let you run LLMs (large language models) in the browser and use the GPU for fast inference! This is exciting for two main reasons. First, it's great to run LLMs locally without sending your data to any external server: you can use custom prompts, configure inference the way you want, and keep your data private. Second, you don't have to deal with setting up Python environments, installing dependencies, or configuring the GPU; the browser takes care of all that. It's incredibly convenient. With the release of the smaller Llama 3.2 models, like the 1B-parameter version, it's now possible to generate quality text directly in your browser, perhaps even matching GPT-3.5 on some tasks.
The chat UI below lets you download a number of models hosted on HuggingFace, including Llama and Qwen variants. Once a model is downloaded, you can disconnect and continue chatting offline. I've added a handful of customizations to play around with, but the UI is mainly a demonstration of what a web-based chat interface can do.
2 Chat UI
Your browser must support WebGPU for a fast inference experience. You can see the list of browsers that support it here. I personally tested it on Apple M1-M3 and Nvidia GPUs, and it worked quite well. If WebGPU is not supported, the engine will fall back to WebAssembly, which is slower but still functional.
Based on your browser’s information, you have the following backend ☞
For more info, visit https://webgpureport.org/.
If you see “WebGPU not supported” in the status above but you’re certain your browser supports it, you can leave a comment below and I’ll try to help you troubleshoot the issue.
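You can also check support yourself from the browser's developer console. The snippet below is just a quick sanity check using the standard navigator.gpu interface; it's not part of the chat UI's code:
// Check whether the browser exposes WebGPU and can provide a GPU adapter
const checkWebGPU = async () => {
  if (!navigator.gpu) {
    console.log('WebGPU is not supported in this browser.');
    return false;
  }
  // requestAdapter() can still return null, e.g. when no suitable GPU is found
  const adapter = await navigator.gpu.requestAdapter();
  console.log(adapter ? 'WebGPU is available.' : 'WebGPU is exposed, but no adapter was found.');
  return Boolean(adapter);
};
checkWebGPU();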
3 How It Works
The chat UI above runs on WebLLM, a project that enables running LLMs in the browser without requiring a server. It's built on top of MLC-LLM (Machine Learning Compilation for LLMs), a framework that optimizes models for efficient execution on various hardware platforms. MLC-LLM itself is an application of Apache TVM, a machine learning compiler stack that takes pre-trained models and compiles them into deployable modules that can be embedded and run anywhere. MLC-LLM specializes in LLM-specific graph transformations and optimized kernels for common operations.
One of the major advantages of WebLLM is that it can use WebGPU for accelerated inference. WebGPU is a web standard, supported by most modern browsers, that gives developers low-level access to the GPU for general-purpose computation. It's still relatively new, and browser adoption is ongoing.
// Import WebLLM
import * as webllm from 'https://esm.run/@mlc-ai/web-llm';
// Initialize the engine
const engine = new webllm.MLCEngine();
// Optional chat options (example values): temperature, top_p, etc.
const config = { temperature: 0.7, top_p: 0.95 };
// Download and initialize the model, or load it from the cache if available
await engine.reload('Llama-3.2-1B-Instruct-q4f32_1-MLC', config);
// The list of available models can be found in:
console.log(webllm.prebuiltAppConfig.model_list);
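Since the model weights are a sizeable download, it helps to show progress in the UI. As a small aside (not part of the snippet above), WebLLM lets you pass an initProgressCallback when creating the engine; if I'm reading the API correctly, it reports a status string and a progress fraction while the model downloads or loads from the cache:
// Report loading progress, e.g. to a status element in the page
const initProgressCallback = (report) => {
  // `report.text` is a human-readable status string, `report.progress` is in [0, 1]
  console.log(`${Math.round(report.progress * 100)}% - ${report.text}`);
};
// CreateMLCEngine is a convenience factory that creates the engine and loads the model
const engineWithProgress = await webllm.CreateMLCEngine(
  'Llama-3.2-1B-Instruct-q4f32_1-MLC',
  { initProgressCallback },
);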
With the engine initialized, I use the following to generate responses:
/**
 * Generate a response using the engine.
 * @param {Array} messages - Conversation history. Each message is an object
 *   with keys `role` and `content`:
 *   - `role`: `user`, `assistant`, or `system`. The prompt is encoded as the
 *     first message with the role `system`.
 *   - `content`: the message content.
 * @param {Function} onUpdate - Callback to update the UI with the partially generated message
 * @param {Function} onFinish - Callback to handle the final message
 * @param {Function} onError - Callback to handle errors
 */
const streamingGenerating = async (messages, onUpdate, onFinish, onError) => {
  try {
    let curMessage = "";
    let usage;
    // The model configuration such as temperature, max_tokens, etc.
    const config = modelConfig.getConfig();
    const completion = await engine.chat.completions.create({
      stream: true,
      messages,
      ...config,
      stream_options: { include_usage: true },
    });
    // Stream the completion
    for await (const chunk of completion) {
      const curDelta = chunk.choices[0]?.delta.content;
      if (curDelta) {
        curMessage += curDelta;
      }
      if (chunk.usage) {
        usage = chunk.usage;
      }
      // Update the UI
      onUpdate(curMessage);
    }
    // Get the final message
    const finalMessage = await engine.getMessage();
    onFinish(finalMessage, usage);
  } catch (err) {
    onError(err);
  }
};
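For reference, here's roughly how you would call it. The messages array and the callback bodies below are simplified placeholders, not the actual UI code:
// Example conversation history: a system prompt followed by a user message
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Explain WebGPU in one sentence.' },
];
streamingGenerating(
  messages,
  (partial) => console.log('partial:', partial),              // onUpdate
  (finalMessage, usage) => console.log(finalMessage, usage),  // onFinish
  (err) => console.error(err),                                // onError
);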
You can learn more about the other API options and how to use them in the WebLLM documentation.
4 Final Thoughts
As of this writing, WebLLM is the fastest library I know of for running chat LLMs on the web. For other tasks like speech-to-text, image generation, etc., you can look into combining Transformers.js with ONNX Runtime Web, or just ONNX Runtime Web if you're looking for a more general-purpose solution. If you have any questions or feedback, feel free to leave a comment below.
5 Change Log
- 2024/10/12: Added “How It Works” section.
- 2024/10/04: Added Markdown support for the output messages.
- 2024/10/02: Added download size estimation for the models from the HF repo.
- 2024/10/01: Added model configuration settings to the UI, with tooltips.
- 2024/09/29: Initial release.
Citation
@online{sarang2024,
author = {Sarang, Nima},
title = {Running {Llama} 3 in the {Browser!}},
date = {2024-10-06},
url = {https://www.nimasarang.com/blog/2024-09-29-llama-in-browser/},
langid = {en}
}