
How to Experience RWKV

This section shows how to experience the RWKV model, either through online services or local deployment.

Tips

Before trying the RWKV model, it is recommended to read the following chapters:

  • RWKV Decoding Parameters
  • Prompting Format Guidelines

RWKV Online DEMO

If you simply want to try RWKV Goose, check out the following public demos.

Hugging Face (Completion Mode)

  • HF-Gradio-1: RWKV-7-World-2.9B
  • HF-Gradio-2: The latest RWKV-7 G1 model

Warning

The above public demos only support Completion mode and do not support direct dialogue.

When using a public demo, it is recommended to write prompts in one of RWKV's two standard formats:

Chat format:

User: Please translate the following Swedish sentence "hur lång tid tog det att bygga twin towers" into Chinese.

Assistant:

Instruction format:

Instruction: Please translate the following Swedish into Chinese.

Input: hur lång tid tog det att bygga twin towers

Response:
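
If you are scripting against a completion endpoint, both formats are easy to build programmatically. Below is a minimal Python sketch; the `build_chat_prompt` and `build_instruction_prompt` helpers are illustrative, not part of any RWKV library:

```python
# Hypothetical helpers for RWKV's two standard prompt formats.
# RWKV is sensitive to prompt formatting; see "Prompting Format Guidelines".

def build_chat_prompt(user_message: str) -> str:
    """Chat format: a "User:" turn followed by an empty "Assistant:" turn to complete."""
    return f"User: {user_message}\n\nAssistant:"

def build_instruction_prompt(instruction: str, input_text: str) -> str:
    """Instruction format: Instruction / Input / Response blocks."""
    return f"Instruction: {instruction}\n\nInput: {input_text}\n\nResponse:"

print(build_chat_prompt(
    'Please translate the following Swedish sentence "hur lång tid tog det att bygga twin towers" into Chinese.'
))
print(build_instruction_prompt(
    "Please translate the following Swedish into Chinese.",
    "hur lång tid tog det att bygga twin towers",
))
```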

Hugging Face (Chat Mode)

We also provide an online chat demo for the RWKV-7 G1 series reasoning models:

  • RWKV-7 G1 Series Reasoning Models

In this demo, you can try the fully trained RWKV-7 G1 0.1B/0.4B models and switch to other G1 models that are still in training, such as the G1 1.5B/2.9B models.

Info

This beautifully designed RWKV chat interface is open-sourced by RWKV community member @Leon, with the repository available at: web-rwkv-realweb.

RWKV APP Demo

The RWKV community has developed Android and iOS apps, so everyone can now experience locally deployed RWKV-7 series models on their mobile phones!

Once a model is loaded, it runs entirely locally on your device.

  • Android download link: https://www.pgyer.com/rwkvchat
  • iOS download link: https://testflight.apple.com/join/DaMqCNKh

Below is the actual performance on a phone equipped with the Qualcomm Snapdragon 8 Gen 2 chip:

App-demo

RWKV7-G1 0.4B can consistently achieve an output speed of 30 tokens per second on this phone!

Info

These two apps are open-sourced by RWKV community members @MollySophia and @HaloWang; repository: rwkv_mobile_flutter.

RWKV WebGPU Demo

The community-developed RWKV WebGPU Demo is based on the web-rwkv inference engine.

No download is required: the RWKV WebGPU Demo runs the RWKV-7 model locally in your browser, supporting chat, solving the 15 puzzle, music composition, and visualizing state changes!

Once loaded, the model runs offline in the browser, with no server communication.

Chat Function

In the chat interface, select an RWKV-7-World model (0.1B or 0.4B), then click the Load Model button to download and run the model for dialogue.

You can also drag an RWKV-7-World model (.st format) from a local directory into the gray box to run it, skipping the download.

WebGPU Demo Chat Function

Solving 15 Puzzles

Info

The 15 puzzle (also called the sliding puzzle or 15-number puzzle) is a classic sliding-block game: the numbers 1-15 sit in a 4x4 grid with one empty space, and the player slides tiles into the empty space to arrange the numbers in order.
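
As a toy illustration (not code from the demo), the puzzle can be modeled in Python as a flat 16-cell list, where a legal move swaps the blank with an adjacent tile:

```python
import random

# Toy 15-puzzle model: a flat list of 16 cells, 0 marks the blank.
def neighbors(i: int) -> list[int]:
    """Indices orthogonally adjacent to cell i in a 4x4 grid."""
    r, c = divmod(i, 4)
    return [n * 4 + m for n, m in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if 0 <= n < 4 and 0 <= m < 4]

board = list(range(1, 16)) + [0]   # solved position
for _ in range(50):                # shuffle with random legal moves
    blank = board.index(0)
    swap = random.choice(neighbors(blank))
    board[blank], board[swap] = board[swap], board[blank]

print([board[i:i + 4] for i in range(0, 16, 4)])  # print as 4 rows
```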

In the Demo's 15 puzzle interface, click the New Game button to set up a new 15 puzzle.

Click the start button, and the WebGPU Demo will run the RWKV-puzzle15 model to automatically solve the current puzzle, displaying the model's CoT reasoning process on the left.

web-rwkv-15-puzzle-demo

Music Composition Function

In the Demo's Music interface, you can use the RWKV ABC model to compose music. The steps are as follows:

  1. Click the Load Model button to download the composition model.
  2. Click the prompt drop-down box to select an ABC format prompt.
  3. Click the Generate Music button to start composing.

web-rwkv-music-demo

State Replay

In the Demo's State Replay interface, you can watch the hidden-state evolution of RWKV as an RNN.

Tips

The State Replay function requires starting an RWKV model in the chat interface in advance.

The image below shows the hidden state evolution of the RWKV-7-World-0.1B model after inputting "你好" (Hello).

RWKV-7-World-0.1B uses an L12-D768 design, so State Replay shows the state evolution of all 12 layers, with each layer divided into 12 small 64×64 visualization grids (one per head).

Explanation of the small grid colors:

  • Dark blue: lower (more negative) values.
  • Yellow: higher (more positive) values.
  • Gray or black: values close to 0.

web-rwkv-state-replay-demo
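
To make the layout concrete, here is a minimal sketch assuming the L12-D768, 64-dimension-head layout described above, with random data standing in for a real state. It renders one layer's 12 head matrices using a colormap that maps low values to dark blue and high values to yellow:

```python
import numpy as np
import matplotlib.pyplot as plt

# L12-D768 with 64-dim heads: 768 / 64 = 12 heads per layer,
# and each head carries a 64x64 recurrent state matrix.
n_layer, d_model, head_size = 12, 768, 64
n_head = d_model // head_size  # 12

# Random data as a stand-in for a real RWKV-7 hidden state.
state = np.random.randn(n_layer, n_head, head_size, head_size)

fig, axes = plt.subplots(1, n_head, figsize=(24, 2))
for h, ax in enumerate(axes):
    # viridis: dark blue = low/negative, yellow = high/positive,
    # matching the color scheme described above.
    ax.imshow(state[0, h], cmap="viridis")
    ax.set_title(f"head {h}", fontsize=8)
    ax.set_axis_off()
plt.show()
```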

Local Deployment of RWKV Model

If you wish to deploy and use the RWKV model locally on your own device, it is recommended to use the following tools:

RWKV Runner

  • RWKV Runner Project
  • Installer download (be sure to read the installer README instructions)

RWKV Runner Demo

Windows setup guide video

AI00 RWKV Server

Ai00 Server is an RWKV language model inference API server based on the web-rwkv inference engine. It is open-source software under the MIT license, developed by the Ai00-x development group led by RWKV community members @cryscan and @顾真牛.

Ai00 Server uses Vulkan as its inference backend and supports parallel and concurrent batched inference; it runs on any GPU that supports Vulkan, which covers most NVIDIA, AMD, and Intel graphics cards (including integrated graphics).

While maintaining high compatibility, Ai00 Server does not require bulky runtimes such as PyTorch or CUDA. It is compact, works out of the box, and supports INT8/NF4 quantization, allowing it to run fast on most personal computers.

For specific usage of Ai00, please refer to the Ai00 Usage Tutorial.
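
Since Ai00 exposes an OpenAI-compatible HTTP API, a chat request can look roughly like the sketch below. The port (65530 is Ai00's documented default) and the endpoint path are assumptions here; verify both against the Ai00 Usage Tutorial and your own config:

```python
import requests

# Hedged sketch of a call to Ai00's OpenAI-compatible API.
# URL and payload fields are assumptions; check the Ai00 docs.
resp = requests.post(
    "http://localhost:65530/api/oai/chat/completions",  # assumed endpoint
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 200,
        "temperature": 1.0,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```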

ChatRWKV

ChatRWKV is the official RWKV chatbot project, but it has no graphical interface; some command-line knowledge may be needed to use it.

ChatRWKV GitHub Repository

VRAM Requirements for Local Deployment

It is recommended to deploy and run the RWKV model locally at FP16 precision. If your VRAM or memory is insufficient, you can run the model with INT8 or NF4 quantization to reduce VRAM and memory requirements.

Tips

In terms of response quality, for models with the same parameter count FP16 performs best, INT8 is comparable to FP16, and NF4 is noticeably worse than INT8.

Parameter count matters more than quantization: a 7B model quantized to INT8 generates better results than a 3B model at FP16.
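
With the RWKV pip package, the precision/quantization trade-off is expressed through its strategy string. Below is a minimal sketch under that assumption; the model path is a placeholder, and the exact strategy syntax is covered in the RWKV pip Usage Guide:

```python
from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Placeholder: point this at a downloaded RWKV checkpoint
# (the path is typically given without the .pth extension).
MODEL_PATH = "/path/to/RWKV-model"

# Example strategy strings:
#   "cuda fp16"                      - full FP16 on GPU (best quality, most VRAM)
#   "cuda fp16i8"                    - INT8-quantize all layers to save VRAM
#   "cuda fp16i8 *10 -> cuda fp16"   - INT8 for the first 10 layers, FP16 for the rest
model = RWKV(model=MODEL_PATH, strategy="cuda fp16i8")

pipeline = PIPELINE(model, "rwkv_vocab_v20230424")  # World-model vocabulary
args = PIPELINE_ARGS(temperature=1.0, top_p=0.7)
print(pipeline.generate("User: Hello!\n\nAssistant:", token_count=100, args=args))
```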

Below are the VRAM requirements and generation speed for locally deploying and running the RWKV model:

Tips

Test Environment:

  • CPU: i7-10870H
  • GPU: RTX 3080 Laptop, 16GB VRAM
  • Memory: 32GB

VRAM Requirements

Below are the VRAM/memory requirements for different inference backends and corresponding quantization methods (all layers quantized by default):

| Inference Backend | 1B6 Model | 3B Model | 7B Model | 14B Model |
|---|---|---|---|---|
| CPU-FP32 | 6.6GB memory | 12GB memory | 21GB memory | OOM (not recommended) |
| rwkv.cpp-FP16 | 3.5GB memory | 7.6GB memory | 15.7GB memory | 30GB memory |
| rwkv.cpp-Q5_1 | 2GB memory | 3.7GB memory | 7.2GB memory | 12.4GB memory |
| CUDA-FP16 | 3.2GB VRAM | 6.2GB VRAM | 14.3GB VRAM | about 28.5GB VRAM |
| CUDA-INT8 | 1.9GB VRAM | 3.4GB VRAM | 7.7GB VRAM | 15GB VRAM |
| webgpu-FP16 | 3.2GB VRAM | 6.5GB VRAM | 14.4GB VRAM | about 29GB VRAM |
| webgpu-INT8 | 2GB VRAM | 4.4GB VRAM | 8.2GB VRAM | 16GB VRAM (41 layers quantized; about 14.8GB with 60 layers) |
| webgpu-NF4 | 1.3GB VRAM | 2.6GB VRAM | 5.2GB VRAM | 15.1GB VRAM (41 layers quantized; about 10.4GB with 60 layers) |
| webgpu(python)-FP16 | 3GB VRAM | 6.3GB VRAM | 14GB VRAM | about 28GB VRAM |
| webgpu(python)-INT8 | 1.9GB VRAM | 4.2GB VRAM | 7.7GB VRAM | 15GB VRAM (41 layers quantized) |
| webgpu(python)-NF4 | 1.2GB VRAM | 2.5GB VRAM | 4.8GB VRAM | 14.3GB VRAM (41 layers quantized) |

Generation Speed

Generation speed (unit: TPS, roughly the number of words generated per second) for different inference backends/quantization methods (all layers quantized by default):

| Inference Backend | 1B6 Model | 3B Model | 7B Model | 14B Model |
|---|---|---|---|---|
| CPU-FP32 | 4.36 | 2.3 | very slow | OOM (not recommended) |
| rwkv.cpp-FP16 | 8.6 | 4.5 | 2 | 1 |
| rwkv.cpp-Q5_1 | 14 | 8 | 3.7 | 2.1 |
| CUDA-FP16 | 25 | 18 | 15 | |
| CUDA-INT8 | 22 | 16 | 18 | 7.4 |
| webgpu-FP16 | 45 | 38 | 21 | OOM, unable to test |
| webgpu-INT8 | 60 | 44 | 30 | 18 |
| webgpu-NF4 | 60 | 47 | 34 | 20 |
| webgpu(python)-FP16 | 40 | 29 | 17 | OOM, unable to test |
| webgpu(python)-INT8 | 45 | 35 | 23 | 15 |
| webgpu(python)-NF4 | 43 | 32 | 21 | 18 |

Info

  • CUDA and CPU backends come from the RWKV pip package
  • rwkv.cpp comes from the rwkv.cpp project
  • webgpu comes from the web-rwkv project, a Rust inference framework based on WebGPU
  • webgpu(python) comes from web-rwkv-py, Python bindings for the web-rwkv project

The figures above are only introductory performance references for RWKV inference. Actual performance varies with the number of quantized layers, other configuration options, and the GPU architecture.
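
As a rough sanity check on the VRAM table, weight storage dominates memory use, so a back-of-envelope estimate is parameter count × bytes per parameter plus a small overhead. The 10% overhead factor below is an assumption for illustration, not a measured value:

```python
def vram_gb(n_params_billion: float, bytes_per_param: float, overhead: float = 1.1) -> float:
    """Back-of-envelope VRAM estimate: weights dominate; ~10% overhead assumed."""
    return n_params_billion * bytes_per_param * overhead

print(vram_gb(7, 2))  # FP16, 2 bytes/param: ~15.4 GB, close to the 14.3-15.7 GB above
print(vram_gb(7, 1))  # INT8, 1 byte/param:  ~7.7 GB, matching the table
```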
