RWKV Architecture History

Tips

If you are not familiar with machine learning, we highly recommend Andrej Karpathy's series on neural networks, which provides a good foundation in neural network architecture concepts.

This page covers the RWKV architecture: the origin of its name, its main features, its development history, and the models released for each architecture version.

RWKV is a variant of the RNN. Therefore, we first introduce what the RNN architecture is and what its hidden state is.

RNN Architecture and Hidden State

The Recurrent Neural Network (RNN) is a neural network model widely used in deep learning.

During the operation of the RNN network, a hidden state (state) is maintained, which can be regarded as the "mental state" of the RNN model. When humans think, the brain retains the "key information" most relevant to the current event. As the content of thinking changes, the "key information" in our minds is constantly updated. Similarly, the RNN network continuously updates its hidden state through specific functions.

Classic RNN Diagram

As shown in the figure, the RNN processes each input token in sequence and predicts the next possible token (if needed) based on the "current hidden state". After processing a token, the RNN feeds the result back into the network, thereby "updating its hidden state"; it then uses the "updated state" to predict the next token. This cycle continues until the task is complete.
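As a minimal, self-contained sketch of this loop (illustrative only, not RWKV code; the toy vocabulary size, dimensions, and example token ids below are arbitrary):

import torch

# Toy RNN: update a hidden state token by token and predict the next token from it.
vocab_size, d_model = 16, 8
embed = torch.nn.Embedding(vocab_size, d_model)
cell = torch.nn.RNNCell(d_model, d_model)      # the "update the hidden state" function
head = torch.nn.Linear(d_model, vocab_size)    # predicts the next token from the state

hidden = torch.zeros(1, d_model)               # initial hidden ("mental") state
for token_id in [3, 7, 2, 9]:                  # arbitrary example token ids
    x = embed(torch.tensor([token_id]))
    hidden = cell(x, hidden)                   # feed the result back: update the state
    next_token = head(hidden).argmax(dim=-1)   # predict the next token from the state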

Info

As a variant of the RNN, RWKV supports fine-tuning of the hidden state (state tuning). After being tuned with a specific "mental state", RWKV performs better on various downstream tasks.

What does RWKV stand for

The name of the RWKV architecture comes from the four main parameters used in the time-mixing and channel-mixing blocks:

  • R: The Receptance vector controls the receptance of past information.
  • W: The Weight signifies the positional weight decay vector, a trainable parameter within the model.
  • K: The Key vector functions similarly to K in Attention.
  • V: The Value vector functions similarly to V in Attention.

Development History of the RWKV Architecture

In 2020, BlinkDL began researching Transformers and immediately discovered two obvious improvement directions: introducing explicit decay and Token-shift (or short convolution). After testing on https://github.com/BlinkDL/minGPT-tuned, he found that these techniques significantly improved the performance of Transformers.

Later, he noticed Apple's Attention Free Transformer (AFT) paper and tested it, finding that these two techniques also brought significant performance improvements to AFT.

RWKV-V1

In August 2021, the first version of the RWKV architecture, RWKV-V1, was released in the RWKV-LM repository. The first commit of RWKV-V1 was made on August 9, 2021.

RWKV-V1 uses long convolution instead of the Attention mechanism, and its architecture consists of alternating Time-mix and Channel-mix layers. Channel-mix is a variant of the Transformer's GeGLU layer; Time-mix is a significant improvement over AFT:

$$
\begin{aligned}
\text{Time-mix:}\quad & \mathrm{TM}_{t,c} = \mathrm{sigmoid}(R_{t,c}) \cdot \sum_{u} W_{t,u,c} \cdot \mathrm{softmax}_t(K_{u,c}) \cdot V_{u,c} \\
\text{Channel-mix:}\quad & \mathrm{CM}_{t,c} = \mathrm{sigmoid}(R_{t,c}) \cdot \sum_{d} W_{c,d} \cdot \mathrm{gelu}(K_{t,d}) \cdot V_{t,d}
\end{aligned}
$$

where R, K, V are generated by linear transformation of the input, and W is the convolution kernel in the long convolution.
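As a rough sketch of the Channel-mix formula above (illustrative shapes and random weights only; Wr, Wk, Wv and W here are hypothetical stand-ins for the learned matrices, not the repository implementation):

import torch
import torch.nn.functional as F

D = 8                                              # channel dimension
Wr, Wk, Wv = (torch.randn(D, D) * 0.1 for _ in range(3))
W = torch.randn(D, D) * 0.1                        # the mixing matrix W_{c,d}

x = torch.randn(3, D)                              # (T, D) input
R, K, V = x @ Wr, x @ Wk, x @ Wv                   # R, K, V: linear transforms of the input
cm = torch.sigmoid(R) * ((F.gelu(K) * V) @ W.T)    # CM_{t,c} = sigmoid(R_{t,c}) * sum_d W_{c,d} * gelu(K_{t,d}) * V_{t,d}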

Tips

The structure design of Time-mix and Channel-mix is based on the four main parameters R, W, K, V, which is the origin of the name RWKV.

Warning

RWKV-V1 is closer to a Linear Transformer than to an RNN, as each time step depends on all previous inputs.

RWKV-V2-RNN

The RWKV-V2 version was the first RNN model of RWKV, with the pseudo-code diagram as follows:

RWKV-v2-RNN-Architecture

Supplementary explanation of the code diagram:

  • a and b are the EMA (exponential moving average) of kv and k
  • c and d are a and b plus the pure self-attention effect (self-attention at the same position)
  • c / d is the memory mechanism: if a character has a strong k in a certain channel and W is close to 1, the character will be remembered by the subsequent text
  • T, K, V, R, W, X, P are trainable parameters
  • W is initialized with precomputed values, similar to Alibi (Attention with Linear Biases), but the values are trainable rather than fixed and can differ per channel. This design significantly enhances the model's expressive power.
  • RWKV-V2 also invented a headQK mechanism, allowing the model to directly copy or avoid generating certain characters from the previous text. This mechanism is important for ICL (in-context learning) and was rediscovered by other researchers years later. However, to push the limits of the pure RNN model, the main RWKV models have never included this feature.
# headQK mechanism (not used in the main RWKV models): a small attention head
# over the previous T positions that lets the model copy tokens from the context.
q = self.head_q(x)[:, :T, :]
k = self.head_k(x)[:, :T, :]
c = (q @ k.transpose(-2, -1)) * (1.0 / 256)
c = c.masked_fill(self.copy_mask[:T, :T] == 0, 0)
c = c @ F.one_hot(idx, num_classes=self.config.vocab_size).float()
x = self.head(x) + c

Tips

Through exponential decay, RWKV-V2 achieved RNN-style inference: the model's current generation depends on the last output and current input, while having a fixed-size hidden state (a and b in RWKV-V2).

A simplified expression of RWKV-V2's self-attention layer:

$$x_{t+1} = \sigma(R x_t)\cdot\frac{\exp(K x_t)\cdot(V x_t) + \exp(W)\cdot a_t}{\exp(K x_t) + \exp(W)\cdot b_t}$$
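A minimal sketch of one RNN-style step consistent with the simplified expression above (illustrative only; the decay placement, random initialization, and shapes are assumptions, not the repository code):

import torch

D = 8
R, K, V = (torch.randn(D, D) * 0.1 for _ in range(3))  # per-step projections (random stand-ins)
W = -torch.rand(D)                                      # negative, so exp(W) < 1 acts as a per-channel decay

def rwkv_v2_step(x, a, b):
    r, k, v = R @ x, K @ x, V @ x
    out = torch.sigmoid(r) * (torch.exp(k) * v + torch.exp(W) * a) \
                           / (torch.exp(k) + torch.exp(W) * b)
    a = torch.exp(W) * a + torch.exp(k) * v             # a: EMA of exp(k) * v
    b = torch.exp(W) * b + torch.exp(k)                 # b: EMA of exp(k)
    return out, a, b

a = torch.zeros(D)
b = torch.zeros(D)
y, a, b = rwkv_v2_step(torch.randn(D), a, b)            # fixed-size state (a, b) carried forward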

RWKV-V3

RWKV-V3 is a short-term transitional version that uses a more comprehensive token-shift than RWKV-V2 (different trainable TimeMix factors for R / K / V in the SA and FF layers respectively):

# xx is the input shifted by one token; r/k/v each mix the current and
# previous token with their own trainable factor.
xx = self.time_shift(x)
xk = x * self.time_mix_k + xx * (1 - self.time_mix_k)
xv = x * self.time_mix_v + xx * (1 - self.time_mix_v)
xr = x * self.time_mix_r + xx * (1 - self.time_mix_r)

Diagram of RWKV-V3's token-shift improvement compared to V2:

RWKV-V3 token-shift improvement diagram compared to V2

In addition, this version uses preLN instead of postLN (more stable and converges faster):

# pre-LN: LayerNorm is applied before the attention and FFN sub-blocks
if self.layer_id == 0:
    x = self.ln0(x)
x = x + self.att(self.ln1(x))
x = x + self.ffn(self.ln2(x))

RWKV-V4

RWKV-V4 is the first official version of the RWKV architecture, codenamed "Dove". RWKV-V4 solved the numerical stability issues of the RWKV-V3 architecture, and the first paper of the RWKV project, "RWKV: Reinventing RNNs for the Transformer Era", was released at the same time.

The RWKV-V4 architecture paper was co-authored by RWKV author BlinkDL and the RWKV community, first published on May 22, 2023. In October of the same year, the RWKV-V4 architecture paper was accepted by EMNLP 2023.

Model architecture diagram from the RWKV-V4 paper

Info

The above figure is an overview of the RWKV-V4 model architecture from the paper, where:

  • Left: RWKV-V4's time-mixing and channel-mixing modules
  • Right: RWKV-V4's language modeling process

Token Shift Concept

The RWKV-V4 paper formally proposed the concept of "Token Shift": mixing each token with the previous token, a process similar to a one-dimensional causal convolution of size 2. Token Shift allows the model to control how much new and old information is allocated to each head's receptance, key, value, and gate vectors (i.e., r, k, v, and g).
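As an illustration (a sketch, not the exact RWKV implementation), the shift by one token can be written as a zero-pad along the time dimension, matching a size-2 causal convolution; x, mu, and the shapes below are hypothetical:

import torch

x = torch.randn(2, 5, 8)                        # (batch, seq_len, dim)
xx = torch.nn.functional.pad(x, (0, 0, 1, -1))  # x shifted right by one token; position 0 sees zeros
mu = torch.rand(8)                              # per-channel mixing factor (illustrative)
mixed = x * mu + xx * (1 - mu)                  # lerp between the current and previous token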

The architecture diagram of RWKV-V4 language modeling demonstrates the Token Shift process of RWKV-V4 and the state-update process of RWKV:

RWKV-V4 language modeling architecture

Note

A major feature of the RWKV-V4 version is that the state (the RNN's hidden state) is very small, because BlinkDL focused on achieving the best performance with the smallest state.

With D as the main dimension of the model and L as the number of layers of the model, the state size of the RWKV-4 model is 5DL, including 2DL for mixing and 3DL for WKV.
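As a quick worked example of this formula (the values D = 4096 and L = 32 are purely illustrative):

# RWKV-4 state-size arithmetic with illustrative values D = 4096, L = 32.
D, L = 4096, 32
state_size = 5 * D * L      # 2*D*L for token-shift mixing + 3*D*L for WKV
print(state_size)           # 655,360 values, about 1.25 MiB in fp16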

RWKV-V4 Model Release

The research on RWKV-V4 (architecture iteration, model training, etc.) spanned 2022 and 2023. Four major types of models were released:

  • RWKV-4-Pile: Pre-trained on the 331B tokens Pile dataset, including models with 169m, 430m, 1B5, 3B, 7B, and 14B parameters
  • RWKV-4-Raven: Instruction fine-tuned model of RWKV-4-Pile, fine-tuned using open-source datasets such as Alpaca, CodeAlpaca, Guanaco, GPT4All, ShareGPT, including models with 1B5, 3B, 7B, and 14B parameters.
  • RWKV-4-World: Multilingual model trained on the RWKV World v1 dataset (over 100 world languages, training data includes Pile), including base models with 169m, 430m, 1B5, 3B, 7B parameters, and some Chinese fine-tuned models.
  • RWKV-4-Music: Music-composition models in MIDI and ABC formats, trained on bread-midi-dataset and irishman music score data. The ABC model has 82m parameters, and the MIDI models have 120m and 560m parameters.

RWKV-V4 Model Comparison Table:

| Model Name | Description | Training Data (Data Volume) | Vocabulary (Vocabulary Size) |
|---|---|---|---|
| RWKV-4-Pile | Pre-trained language model based on the Pile dataset | EleutherAI/pile (331B tokens) | 20B_tokenizer (50,277) |
| RWKV-4-Raven | Instruction fine-tuned language model of RWKV-4-Pile | Alpaca, CodeAlpaca, Guanaco, GPT4All, ShareGPT, etc. | 20B_tokenizer (50,277) |
| RWKV-4-World | Pre-trained language model based on a dataset of over 100 world languages | RWKV World v1 (590B tokens) | rwkv_vocab_v20230424 (65,536) |
| RWKV-4-Music (MIDI) | Music (composition) model trained on MIDI music score data | bread-midi-dataset | MIDI-LLM-tokenizer (20,095) |
| RWKV-4-Music (ABC) | Music (composition) model trained on ABC music score data | irishman | - |

In addition, RWKV-V4 was the first to train a 14B parameter model. Moreover, starting with RWKV-4-World, the World series dataset (containing more than 100 world languages) and the corresponding multilingual vocabulary rwkv_vocab_v20230424 were officially adopted.

Info

The World series dataset and rwkv_vocab_v20230424 vocabulary are used in subsequent RWKV architectures such as RWKV-V5 and RWKV-V6.

The World dataset continues to add new training data:

  • World v1 ≈ 0.59T tokens
  • World v2 ≈ 1.1T tokens
  • World v2.1 ≈ 1.42T tokens, total training data for v2.1 models ≈ 2.5T tokens
  • World v3 ≈ 3T tokens, total training data for v3 models ≈ 5.5T tokens

RWKV-V5

RWKV-V5 is an improved version of the RWKV-V4 architecture, codename "Eagle".

The RWKV-V5 and RWKV-V6 architectures were released in the same paper, "Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence".

The paper was co-authored by RWKV author BlinkDL and the RWKV community, first published on April 9, 2024. In October of the same year, the RWKV 5/6 architecture paper was accepted by the international conference COLM 2024.

RWKV 5/6 Architecture Overview

Info

The above figure is an overview of the RWKV 5/6 architecture from the paper, where:

  • Left: time-mixing and channel-mixing blocks
  • Top right: RWKV time-mixing block as RNN cell
  • Bottom middle: token-shift module in FeedForward module and Eagle time-mixing
  • Bottom right: token-shift module in Finch time-mixing
  • Dashed arrows (left, top right): indicate a connection in Finch, but not in Eagle

RWKV-V5 Architecture Optimization Details

Compared to RWKV-V4, the most important change in RWKV-V5 is the introduction of multi-headed, matrix-valued states, as described in the paper.

In RWKV-V4's time mixing computation, the parameters u, w, k, v are all vectors with dimension D, and the head size is 1, resulting in a state vector with dimension D:

| t | State $s_t \in \mathbb{R}^D$ | Output $y_t \in \mathbb{R}^D$ |
|---|---|---|
| 0 | $s_0 = \frac{u \odot k_0 \odot v_0}{u \odot k_0}$ | $y_0 = \sigma(r_0) \odot s_0$ |
| 1 | $s_1 = \frac{u \odot k_1 \odot v_1 + k_0 \odot v_0}{u \odot k_1 + k_0}$ | $y_1 = \sigma(r_1) \odot s_1$ |
| 2 | $s_2 = \frac{u \odot k_2 \odot v_2 + k_1 \odot v_1 + w \odot k_0 \odot v_0}{u \odot k_2 + k_1 + w \odot k_0}$ | $y_2 = \sigma(r_2) \odot s_2$ |
| 3 | $s_3 = \frac{u \odot k_3 \odot v_3 + k_2 \odot v_2 + w \odot k_1 \odot v_1 + w^2 \odot k_0 \odot v_0}{u \odot k_3 + k_2 + w \odot k_1 + w^2 \odot k_0}$ | $y_3 = \sigma(r_3) \odot s_3$ |

RWKV-V5 splits u, w, k, v into groups of vectors with dimension 64, where each group's k and v are multiplied through outer product to form a 64×64 matrix, which becomes a head of the state. The head size is fixed at 64.

The time-mixing steps for each head in RWKV-V5:

| t | State $s_t \in \mathbb{R}^{64 \times 64}$ | Output $y_t \in \mathbb{R}^{64}$ |
|---|---|---|
| 0 | $s_0 = \mathrm{diag}(u) \cdot k_0^\top \cdot v_0$ | $y_0 = r_0 \cdot s_0$ |
| 1 | $s_1 = \mathrm{diag}(u) \cdot k_1^\top \cdot v_1 + k_0^\top \cdot v_0$ | $y_1 = r_1 \cdot s_1$ |
| 2 | $s_2 = \mathrm{diag}(u) \cdot k_2^\top \cdot v_2 + k_1^\top \cdot v_1 + \mathrm{diag}(w) \cdot k_0^\top \cdot v_0$ | $y_2 = r_2 \cdot s_2$ |
| 3 | $s_3 = \mathrm{diag}(u) \cdot k_3^\top \cdot v_3 + k_2^\top \cdot v_2 + \mathrm{diag}(w) \cdot k_1^\top \cdot v_1 + \mathrm{diag}(w^2) \cdot k_0^\top \cdot v_0$ | $y_3 = r_3 \cdot s_3$ |

RWKV-V5 forward propagation (inference process) time-mixing calculation formula:

$$
\begin{aligned}
\square_t &= \mathrm{lerp}_\square(x_t, x_{t-1})\,W_\square, \quad \square \in \{r, k, v, g\} \\
w &= \exp(-\exp(\omega)) \\
wkv_t &= \mathrm{diag}(u)\cdot k_t^\top\cdot v_t + \sum_{i=1}^{t-1} \mathrm{diag}(w)^{t-1-i}\cdot k_i^\top\cdot v_i \;\in \mathbb{R}^{(D/h)\times(D/h)} \\
o_t &= \mathrm{concat}\big(\mathrm{SiLU}(g_t)\odot\mathrm{LayerNorm}(r_t\cdot wkv_t)\big)\,W_o \;\in \mathbb{R}^{D}
\end{aligned}
$$

By transforming RWKV-V4's vectors into matrices, RWKV-V5's state changes from vector-valued to matrix-valued, with each head holding a 64×64 state. Assuming the model dimension is 512, there are 512/64 = 8 heads (eight heads × 64 dimensions); this is the "multi-headed" aspect of RWKV-V5.
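A minimal sketch of the per-head time-mixing recurrence described above (illustrative only, not the official kernel; r, k, v are assumed to be (T, N) tensors for a single head with head size N = 64, and u, w are per-channel vectors):

import torch

def rwkv5_head(r, k, v, u, w):
    T, N = k.shape
    state = torch.zeros(N, N)                          # matrix-valued state of this head
    out = torch.zeros(T, N)
    for t in range(T):
        kv = torch.outer(k[t], v[t])                   # k_t^T . v_t, a rank-1 (N, N) update
        out[t] = r[t] @ (torch.diag(u) @ kv + state)   # y_t = r_t . s_t, with the u "bonus" on the current token
        state = torch.diag(w) @ state + kv             # decay the old state, then add the current k_t^T . v_t
    return out

N = 64
y = rwkv5_head(*(torch.randn(10, N) for _ in range(3)), u=torch.rand(N), w=torch.rand(N))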

Therefore, we can summarize the optimization details of RWKV-V5 as: RWKV-V5 eliminates the normalization term (the denominator in the RWKV-V4 formula) and introduces matrix-valued states instead of the previous vector-valued states.

In this way, RWKV-V5 cleverly expands the scale of the state, giving the RWKV model better memory and model capacity.

Tips

With D as the main dimension of the model and L as the number of layers of the model, the state size of the RWKV 5/6 model is 66DL, including 2DL for mixing and 64DL for WKV.

RWKV-V5 Iteration Process

In fact, the research of the RWKV-V5 architecture was not achieved overnight. From its proposal in July 2023 to its finalization in September 2023, the optimization process of RWKV-V5 can be divided into three sub-versions: RWKV-5.0, 5.1, and 5.2.

RWKV-5.0

RWKV-5.0 redesigned the wkv module of the RWKV-V4 architecture, where the k and v parameters were transformed from vectors with a dimension of D to matrices with a dimension of 64 * 64, so the wkv state of RWKV-5.0 is 32 times larger than that of RWKV-V4.

The RWKV-5.0 architecture only released a 0.1B model: RWKV-5-World-0.1B-v1-20230803-ctx4096, which was still trained based on the World-v1 dataset from the RWKV-V4 period.

RWKV-5.1

RWKV-5.1 introduced a Time-mixing gating mechanism on the basis of RWKV-5.0: an additional matrix Wg and a gating activation function SiLU were added to the time-mixing module.

The RWKV-5.1 architecture only released music models: RWKV-5-ABC-82M-v1-20230901-ctx1024 and RWKV-5-MIDI-560M-v1-20230902-ctx4096.

RWKV-5.2

RWKV-5.2 introduced diagonal decay matrices on the basis of RWKV-5.1, that is, the u and w vector parameters were diagonalized.

The RWKV-5.2 architecture released the following models, trained on the World-v2 dataset: 0.4B, 1B5, 3B, 7B, and 3B (ctx16k).

RWKV-V5 Subversion Comparison Table

| Architecture Version | Optimization Details | Released Models | Dataset |
|---|---|---|---|
| RWKV-5.0 | Redesigned the wkv module based on RWKV-V4: the k and v parameters were transformed from vectors of dimension D into 64×64 matrices, so the wkv state of RWKV-5.0 is 32 times larger than that of RWKV-V4 | RWKV-5-World-0.1B-v1 | World-v1 |
| RWKV-5.1 | Introduced a Time-mixing gating mechanism on the basis of RWKV-5.0: an additional matrix Wg and a gating activation function SiLU were added to the time-mixing module | RWKV-5-music series, including ABC-82M and MIDI-560M | irishman, bread-midi-dataset |
| RWKV-5.2 | Introduced diagonal decay matrices on the basis of RWKV-5.1: the u and w vector parameters were diagonalized | RWKV-5-World-V2.1 series, including 0.4B, 1B5, 3B, 7B, and 3B (ctx16k) | World-v2 |

Caution

RWKV-V5 series models are all outdated; it is recommended to use RWKV-V6 models instead.

RWKV-V6

The codename of RWKV-V6 is "Finch". This version was developed in October 2023 and, as of November 2024, is the current stable architecture.

RWKV-V6 Complete Architecture Diagram

RWKV-V6 Architecture Optimization Details

RWKV-V6 introduced a dynamic recurrence mechanism based on LoRA, further optimizing the Token Shift and time-mixing processes.

RWKV-V5's Token Shift is similar to RWKV-V4's: a very simple, data-independent linear interpolation (lerp), in which the parameter μ alone decides how much of the current token and the previous token is mixed into the model input.

In RWKV-V5's Token Shift, the linear interpolation formula between $x_t$ and $x_{t-1}$ is as follows:

$$\mathrm{lerp}_\square(a, b) = a + (b - a)\odot\mu_x$$

RWKV-V6 borrows from low-rank adaptation (LoRA), replacing the static parameter μ with a dynamic one: trainable vectors of dimension D are introduced for $\mu_x$ and each $\lambda_\square$, and new trainable weight matrices $A_\square \in \mathbb{R}^{D \times 32}$ and $B_\square \in \mathbb{R}^{32 \times D}$ are introduced.

In RWKV-V6's Token Shift, the linear interpolation formula between $x_t$ and $x_{t-1}$ is as follows:

$$
\begin{aligned}
\mathrm{lora}_\square(x) &= \lambda_\square + \tanh(x A_\square)\,B_\square \\
\mathrm{ddlerp}_\square(a, b) &= a + (b - a)\odot\mathrm{lora}_\square\big(a + (b - a)\odot\mu_x\big)
\end{aligned}
$$

Compared to RWKV-V4/RWKV-V5, RWKV-V6's new Token Shift enhances data dependency and effectively expands the model's capabilities: the amount of new and old information allocated to each channel depends on the input at both the current and the previous time step.
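A minimal sketch of lora and ddlerp as defined above (illustrative tensor names and random initializations; the low-rank dimension 32 follows the formulas above, everything else is an assumption):

import torch

D, RANK = 64, 32
lam = torch.zeros(D)                   # lambda_□
A = torch.randn(D, RANK) * 0.01        # A_□ in R^{D x 32}
B = torch.randn(RANK, D) * 0.01        # B_□ in R^{32 x D}
mu_x = torch.rand(D)                   # mu_x

def lora(x):
    return lam + torch.tanh(x @ A) @ B                  # lora_□(x) = lambda_□ + tanh(x A_□) B_□

def ddlerp(a, b):
    return a + (b - a) * lora(a + (b - a) * mu_x)       # data-dependent linear interpolation

x_t, x_prev = torch.randn(D), torch.randn(D)
mixed = ddlerp(x_t, x_prev)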

Tips

In general, this dynamic recurrence mechanism allows "important information" to effectively mark itself for use in subsequent tasks, while "unimportant information" can mark itself so that the model reduces or entirely avoids passing it into the subsequent data flow, reserving space for more important data.

In addition, if certain information is not useful for a specific task, the dynamic recurrence mechanism can filter it out in advance.

RWKV-V6's dynamic Time Mixing calculation formula:

$$
\begin{aligned}
\square_t &= \mathrm{ddlerp}_\square(x_t, x_{t-1})\,W_\square, \quad \square \in \{r, k, v, g\} \\
d_t &= \mathrm{lora}_d(\mathrm{ddlerp}_d(x_t, x_{t-1})) \\
w_t &= \exp(-\exp(d_t)) \\
wkv_t &= \mathrm{diag}(u)\cdot k_t^\top\cdot v_t + \sum_{i=1}^{t-1} \mathrm{diag}\!\Big(\bigodot_{j=i+1}^{t-1} w_j\Big)\cdot k_i^\top\cdot v_i \;\in \mathbb{R}^{(D/h)\times(D/h)} \\
o_t &= \mathrm{concat}\big(\mathrm{SiLU}(g_t)\odot\mathrm{LayerNorm}(r_t\cdot wkv_t)\big)\,W_o \;\in \mathbb{R}^{D}
\end{aligned}
$$

The attention calculation of wkvt can be written recursively as:

$$
\begin{aligned}
wkv' &= s + \mathrm{diag}(u)\cdot k^\top\cdot v \\
s' &= \mathrm{diag}(w)\cdot s + k^\top\cdot v
\end{aligned}
$$

Unlike RWKV-V5, $w_t$ in RWKV-V6 is not static throughout the sequence. This is the core change to decay in RWKV-V6: each channel of $w_t$ can change independently in a data-dependent way, whereas previously it was a fixed learned vector.
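A minimal sketch of one recurrent step of this wkv computation for a single head (illustrative only; N is the head size, and the time-varying decay w_t is just a random vector here rather than the LoRA output):

import torch

N = 64
s = torch.zeros(N, N)                       # running state
u, w_t = torch.rand(N), torch.rand(N)       # "bonus" u and time-varying decay w_t
r_t, k_t, v_t = (torch.randn(N) for _ in range(3))

kv = torch.outer(k_t, v_t)                  # k_t^T . v_t
wkv = s + torch.diag(u) @ kv                # wkv' = s + diag(u) . k^T . v
y_t = r_t @ wkv                             # per-head output (before gating and LayerNorm)
s = torch.diag(w_t) @ s + kv                # s' = diag(w_t) . s + k^T . v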

Warning

The new LoRA mechanism above is used to obtain the mixing vector. Note that the LoRA process itself uses the Eagle-style (RWKV-V5) Token Shift value as input (the $\mu_x$ in the bottom left of the RWKV-V6 complete architecture diagram) rather than just the latest token. The new time-varying decay $w_t$ also applies LoRA once more.

Intuitively, this is a second-order variant of Token Shift, allowing each channel of wt to change based on the mix of the current and previous tokens, and the mix itself is determined by the embeddings of the two tokens.

RWKV-V6 Model Release

The RWKV-V6 architecture itself does not have sub-versions, but different types of models have been released, including base models pre-trained with different training sets, as well as Chinese novel and Japanese fine-tuned models. The following is an overview of models based on the RWKV-V6 architecture:

| Model Category | Model Status | Model Description | Training Data |
|---|---|---|---|
| RWKV-6-World-v2 | Outdated | Multilingual pre-trained model based on the World-v2 dataset | World-v2 |
| RWKV-6-World-v2.1 | Stable | Multilingual pre-trained model based on the World-v2.1 dataset | World-v2.1 |
| RWKV-6-World-v3 | In Training | Multilingual pre-trained model based on the World-v3 dataset | World-v3 |
| RWKV-6-ChnNovel | Stable | Chinese novel fine-tuned model, fine-tuned with Chinese novel + instruction data based on RWKV-6-World-v2.1 | World-v2.1, Chinese novel data |
| RWKV-6-Jpn | Stable | Japanese fine-tuned model, fine-tuned with Japanese + instruction data based on RWKV-6-World-v2.1 | World-v2 |

RWKV-V6 State Tuning

Tips

In addition to fine-tuning the complete model weights, the RWKV community developed state tuning during the iteration of the RWKV-V6 architecture. This is a novel fine-tuning method that fine-tunes the initial state of RWKV, which amounts to a highly efficient form of prompt tuning. It is excellent for alignment because of its strong transferability. For detailed methods of state tuning, please refer to RWKV-PEFT - State Tuning.

The fine-tuned state file can be merged into the base model or used as an "enhancement plugin for the RWKV model": that is, initializing the model's state before loading the base RWKV model to affect the model's response style, response format, etc.

RWKV officially released different types of state files:

| State Type | State Description | Applicable RWKV Model |
|---|---|---|
| chn-single-round | Enhanced Chinese single-round dialogue, more in line with human language habits, with rich Emoji expressions | RWKV base model |
| eng-single-round | Enhanced English single-round dialogue, more in line with human language habits, with rich Emoji expressions | RWKV base model |
| chn-小说扩写-single-round | Chinese single-round dialogue, expands a novel based on user input (it is recommended to use the RWKV-V6-ChnNovel model and state instead) | RWKV base model |
| chn-打油诗-single-round | Chinese single-round dialogue, creates doggerel based on user input | RWKV base model |
| chn-文言文-single-round | Chinese single-round dialogue, with a response style biased towards classical Chinese | RWKV base model |
| chn-文言文和古典名著-single-round | Chinese single-round dialogue, with a response style biased towards classical Chinese and classical literature | RWKV base model |
| OnlyForChnNovel_小说扩写 | Used to expand Chinese novels, suitable for RWKV-V6-ChnNovel models of the same size | RWKV-V6-ChnNovel |

RWKV-V7

The codename of the RWKV-7 architecture is "Goose". RWKV-7 goes beyond the attention/linear-attention paradigm: its "state evolution" is very flexible, allowing it to solve problems that attention cannot solve at the same computational cost. At the same time, RWKV-7 surpasses the TC0 constraint.

Research on RWKV-7 began in September 2024, and the training code of its preview version, RWKV-7 "Goose" x070.rc2-2409-2r7a-b0b4a, was first released in a commit to the RWKV-LM repository.

Tips

The paper on the RWKV-V7 architecture, "RWKV-7 "Goose" with Expressive Dynamic State Evolution", was officially released on March 18, 2025.

Paper link: https://arxiv.org/abs/2503.14456

RWKV-7 Architecture Diagram

RWKV-7 Architecture Optimization Details

Tips

RWKV-V7 adopts Dynamic State Evolution. Through a series of innovations (such as the Generalized Delta Rule), RWKV-7 comprehensively surpasses the Transformer and the previous RWKV-6 architecture in terms of computational efficiency, task performance, and model expressiveness.

Despite being trained on far less data than open-source models such as Qwen2.5 and Llama3.2, the RWKV-7-World model reaches state-of-the-art (SoTA) language-modeling performance among all open-source models at the 3-billion-parameter scale.

By introducing the Generalized Delta Rule, RWKV-7 can solve the S5 state-tracking problem (of complexity NC1) using only 2 layers and can recognize all regular languages with 4 layers. Its expressiveness significantly exceeds the TC0 limitation of Transformers.

In general, traditional attention mechanisms (such as the QKV-softmax-attention of Transformers) store multiple k,v (key and value vector pairs) and use q (query vector) to match the key to get the corresponding value output.

RWKV-7 does not directly store k,v pairs but dynamically updates the state through computation, learning the relationship between key and value from the context, and then using the updated state to process new input q (which is r in RWKV) to get the output.

Specifically, the RWKV-7 model maintains an internal model $v \approx k S^\top$. It needs to fit a simple objective: for two given vector sequences $\{k_t\}$ and $\{v_t\}$, transform each $k_i$ into $v_i$ through the state $S$, so that the output is as close as possible to the target $v$.

To achieve this goal, RWKV-7 automatically simulates dynamic gradient descent on the L2 loss $\mathcal{L} = \frac{1}{2}\lVert v - kS^\top\rVert^2$ during inference, continuously training the internal model $v \approx kS^\top$.

Gradient formula for state:

$$\frac{\partial \mathcal{L}}{\partial S} = S k^\top k - v^\top k$$

State gradient descent formula with weight decay wt and learning rate ηt:

$$S_t = S_{t-1}\cdot\mathrm{diag}(w_t) - \big(S_{t-1} k_t^\top k_t - v_t^\top k_t\big)\,\mathrm{diag}(\eta_t)$$

Equivalent to:

$$S_t = S_{t-1}\big(\mathrm{diag}(w_t) - k_t^\top k_t\,\mathrm{diag}(\eta_t)\big) + v_t^\top k_t\,\mathrm{diag}(\eta_t)$$

Generalized formula for RWKV-7:

$$S_t = S_{t-1}\big(\mathrm{diag}(w_t) + a_t^\top b_t\big) + v_t^\top k_t$$

Where reasonable initial values are chosen as:

$$a = -k, \quad b = k\cdot\eta, \quad v = v, \quad k = k\cdot\eta$$
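As a quick numerical check of the state-gradient formula $\frac{\partial \mathcal{L}}{\partial S} = S k^\top k - v^\top k$ above (a sketch using torch.autograd; k and v are treated as 1×N row vectors and S as an N×N matrix, all randomly initialized):

import torch

N = 4
k, v = torch.randn(1, N), torch.randn(1, N)
S = torch.randn(N, N, requires_grad=True)

loss = 0.5 * ((v - k @ S.T) ** 2).sum()     # L = 1/2 * ||v - k.S^T||^2
loss.backward()

manual = S @ k.T @ k - v.T @ k              # S.k^T.k - v^T.k
print(torch.allclose(S.grad, manual))       # expected: True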

Comparison of time step formulas and state update mechanisms between RWKV-7 and historical versions (RWKV5/6):

| Model Version | Time Step Formula | State Update Mechanism |
|---|---|---|
| RWKV-5 Eagle | $S_t = \mathrm{diag}(w_0, \dots, w_n)\, S_{t-1}$ | Trainable State Decay |
| RWKV-6 Finch | $S_t = \mathrm{diag}(w_{t,0}, \dots, w_{t,n})\, S_{t-1}$ | Dynamic State Decay |
| RWKV-7 Goose | $S_t = \big(w_{t,i,j}\big)_{0 \le i,j \le n}\, S_{t-1}$ | Dynamic State Evolution |

Core Mechanism Code

def ref_fwd(r, w, k, v, a, b):
    # Reference (non-optimized) forward pass of the RWKV-7 time-mixing core.
    # Assumes B (batch), T (sequence length), H (heads), N (head size),
    # C = H * N (model dimension) and DEVICE are defined by the caller.
    r = r.view(B, T, H, N)
    k = k.view(B, T, H, N)
    v = v.view(B, T, H, N)
    a = a.view(B, T, H, N)
    b = b.view(B, T, H, N)
    w = torch.exp(-torch.exp(w.view(B, T, H, N)))     # decay in (0, 1)
    out = torch.zeros((B, T, H, N), device=DEVICE)
    state = torch.zeros((B, H, N, N), device=DEVICE)  # matrix-valued state per head

    for t in range(T):
        kk = k[:, t, :]
        rr = r[:, t, :]
        vv = v[:, t, :]
        aa = a[:, t, :]
        bb = b[:, t, :]
        # state evolution: S_t = S_{t-1} (diag(w_t) + a_t^T b_t) + v_t^T k_t
        sab = torch.einsum('bhik,bhk,bhj->bhij', state, aa, bb)
        state = state * w[:, t, :, None, :] + sab + torch.einsum('bhj,bhi->bhij', kk, vv)
        # read out with the receptance: out_i = sum_j r_j * S_{i,j}
        out[:, t, :] = torch.einsum('bhj,bhij->bhi', rr, state)

    return out.view((B, T, C))

RWKV Architecture Features

The characteristics of the RWKV large model architecture include:

  • Efficient and stable inference speed
  • Low and fixed memory usage (supports running on CPU)
  • Capable of handling infinite context, very suitable for long text processing and multi-round dialogue applications
  • Hardware-friendly, only performs matrix and vector multiplication operations, no KV cache

The RWKV architecture consists of a series of stacked residual blocks, each made of time-mixing and channel-mixing sub-blocks with a recurrent structure. The recurrence is achieved by linear interpolation between the current input and the input of the previous time step (the RWKV-4 architecture paper calls this process token shift). RWKV-6 optimized token shift by borrowing LoRA techniques, turning the simple linear interpolation (lerp) of RWKV-4/5 into a data-dependent, dynamic linear interpolation (ddlerp).

Comparing the inference complexity of different models: a Transformer has O(T^2) time complexity and O(T^2) space complexity, so inference becomes slower and memory consumption grows as the sequence gets longer. RWKV has O(T) time complexity and O(1) space complexity. Through optimization of the computation process, RWKV achieves constant inference speed, greatly reducing the time consumed during inference.

In addition, the RWKV architecture design significantly reduces memory usage, allowing the model to run efficiently on standard CPUs or consumer GPUs rather than requiring expensive, high-end computing hardware. This progress means large-scale deep learning models are no longer limited to specific hardware platforms, broadening their range of applications.

RWKV Architecture References

For core concepts of the RWKV architecture, please refer to RWKV in 150 lines of code.

For the architectural design of RWKV-7, you can refer to the blog: RWKV-7 as a meta-context learner.

Alternatively, you can learn by reading RWKV papers:

  • RWKV-4 Architecture Paper | arXiv (2305.13048)
  • RWKV 5/6 Architecture Paper | arXiv (2404.05892)
  • RWKV-7 Architecture Paper | arXiv (2503.14456)

If you have mastered the basics, you can start studying the training and CUDA code of RWKV in the RWKV main repository.

Overview paper on the development of the RWKV architecture: Overview of the Development of the RWKV Architecture | arXiv (2411.02795)

A collection of papers and resources related to a survey of RWKV: A Survey of RWKV | arXiv (2412.14847)
