RWKV-Evals

This page presents RWKV's performance on a range of large language model benchmarks, including Uncheatable Eval, MMLU, RULER, and LongBench.

Uncheatable Eval Test

Tips

Uncheatable Eval is an "uncheatable evaluation" that uses real-time data, such as the latest academic papers and news articles, to assess the true modeling capabilities and generalization abilities of open-source large language models.

The result of the Uncheatable Eval test is the compression ratio; therefore, a lower score indicates better model performance.
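
To make the metric concrete, the sketch below scores a model by its average cross-entropy on a fresh piece of text, normalized per character. This is only a simplified illustration of a compression-style score, not the official Uncheatable Eval pipeline; the model name and file path are placeholders.

```python
# Simplified sketch of a compression-style score: average cross-entropy of a
# causal LM on a fixed text, normalized per character. The official Uncheatable
# Eval pipeline differs in details (data sources, tokenization, normalization);
# the model name and file path below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some/causal-lm"  # placeholder; any Hugging Face causal LM works in principle
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = open("fresh_article.txt").read()  # e.g. an article newer than the model's training data

ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    # labels=input_ids makes the model return the mean cross-entropy over predicted tokens
    loss = model(input_ids=ids, labels=ids).loss

total_nats = loss.item() * (ids.shape[1] - 1)  # sum of per-token losses
score = total_nats / len(text)                 # nats per character (lower is better)
print(f"compression-style score: {score:.3f}")
```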

Below is a comparison of Uncheatable Eval scores between RWKV and other models:

14B Parameter Models

| Name | Params (B) | Average (lower=better) | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RWKV7-g0b-13.3b | 13.269 | 6.843 | 9.848 | 8.202 | 7.636 | 7.108 | 7.380 | 4.026 | 3.892 |
| Qwen3-14B-Base | 14.768 | 6.845 | 10.569 | 8.445 | 7.942 | 7.001 | 7.210 | 3.439 | 3.312 |
| gemma-3-12b-pt | 12.187 | 6.945 | 10.540 | 7.914 | 7.607 | 7.286 | 7.387 | 3.883 | 3.997 |
| Qwen2.5-14B | 14.770 | 6.951 | 10.558 | 8.317 | 7.944 | 7.224 | 7.392 | 3.625 | 3.599 |
| Mistral-Nemo-Base-2407 | 12.248 | 6.970 | 10.165 | 8.118 | 7.642 | 7.287 | 7.455 | 4.079 | 4.042 |
| Motif-2-12.7B-Base | 12.704 | 7.099 | 10.628 | 8.328 | 7.897 | 7.134 | 7.404 | 4.189 | 4.114 |
| Llama-2-13b-hf | 13.016 | 7.540 | 10.655 | 8.307 | 7.901 | 7.993 | 8.122 | 4.795 | 5.009 |

7B Parameter Models

| Name | Params (B) | Average (lower=better) | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B-Base | 8.191 | 7.091 | 10.890 | 8.718 | 8.255 | 7.207 | 7.465 | 3.617 | 3.482 |
| Meta-Llama-3-8B | 8.030 | 7.162 | 10.619 | 8.295 | 7.785 | 7.536 | 7.541 | 4.174 | 4.181 |
| RWKV7-g0a3-7.2b-20251029-ctx8192 | 7.199 | 7.222 | 10.164 | 8.480 | 7.996 | 7.440 | 7.747 | 4.378 | 4.347 |
| Qwen2.5-7B | 7.616 | 7.323 | 11.079 | 8.729 | 8.449 | 7.539 | 7.792 | 3.868 | 3.806 |
| Falcon-H1-7B-Base | 7.586 | 7.339 | 10.958 | 8.576 | 8.225 | 7.403 | 7.569 | 4.251 | 4.392 |
| Mistral-7B-v0.1 | 7.242 | 7.406 | 10.662 | 8.306 | 7.976 | 7.745 | 7.903 | 4.612 | 4.635 |
| Hunyuan-7B-Pretrain | 7.505 | 7.541 | 11.509 | 8.987 | 8.499 | 7.653 | 8.108 | 4.201 | 3.829 |
| falcon-mamba-7b | 7.273 | 7.548 | 10.760 | 8.958 | 8.589 | 7.674 | 7.737 | 4.437 | 4.680 |
| Zamba2-7B | 7.357 | 7.582 | 10.702 | 8.627 | 8.074 | 7.843 | 8.124 | 4.833 | 4.869 |
| Minitron-8B-Base | 8.272 | 7.582 | 10.835 | 8.654 | 8.284 | 7.856 | 8.230 | 4.508 | 4.708 |
| Olmo-3-1025-7B | 7.298 | 7.595 | 11.101 | 8.784 | 8.522 | 7.490 | 7.947 | 4.930 | 4.394 |
| RWKV-x060-World-7B-v3-20241112-ctx4096 | 7.636 | 7.633 | 10.629 | 8.753 | 8.288 | 7.936 | 8.109 | 4.786 | 4.929 |

3B Parameter Models

| Name | Params (B) | Average (lower=better) | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RWKV7-g1a4-2.9b-20251118-ctx8192 | 2.948 | 7.486 | 10.481 | 8.800 | 8.310 | 7.712 | 8.072 | 4.553 | 4.474 |
| Llama-3.2-3B | 3.213 | 7.643 | 11.219 | 8.701 | 8.365 | 7.928 | 8.065 | 4.661 | 4.562 |
| Qwen2.5-3B | 3.086 | 7.722 | 11.575 | 9.139 | 8.895 | 7.911 | 8.220 | 4.203 | 4.113 |
| SmolLM3-3B-Base | 3.075 | 7.784 | 11.187 | 8.905 | 8.611 | 8.097 | 8.631 | 4.513 | 4.546 |
| RWKV-x070-World-2.9B-v3-20250211-ctx4096 | 2.948 | 7.800 | 10.812 | 8.909 | 8.501 | 8.049 | 8.307 | 4.955 | 5.066 |
| stablelm-3b-4e1t | 2.795 | 7.907 | 11.211 | 8.815 | 8.434 | 8.299 | 8.476 | 4.906 | 5.207 |
| Falcon-H1-3B-Base | 3.149 | 7.936 | 11.685 | 9.158 | 8.910 | 7.891 | 8.161 | 4.832 | 4.917 |
| recurrentgemma-2b | 2.683 | 8.052 | 11.632 | 8.951 | 8.835 | 8.401 | 8.488 | 4.897 | 5.157 |
| RWKV-x060-World-3B-v2.1-20240417-ctx4096 | 3.100 | 8.147 | 11.005 | 9.161 | 8.815 | 8.451 | 8.559 | 5.479 | 5.561 |
| mamba2attn-2.7b | 2.698 | 8.204 | 11.436 | 9.246 | 8.947 | 8.474 | 8.236 | 5.336 | 5.751 |

1.5B Parameter Models

| Name | Params (B) | Average (lower=better) | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B-Base | 1.721 | 7.965 | 12.016 | 9.743 | 9.352 | 7.936 | 8.350 | 4.260 | 4.095 |
| RWKV7-g1b-1.5b-20251015-ctx8192 | 1.527 | 7.969 | 10.972 | 9.250 | 8.843 | 8.110 | 8.537 | 5.041 | 5.027 |
| Qwen2.5-1.5B | 1.544 | 8.124 | 12.114 | 9.562 | 9.393 | 8.270 | 8.646 | 4.502 | 4.384 |
| RWKV-x070-World-1.5B-v3-20250127-ctx4096 | 1.527 | 8.231 | 11.273 | 9.320 | 8.965 | 8.431 | 8.758 | 5.385 | 5.483 |
| SmolLM2-1.7B | 1.711 | 8.298 | 11.536 | 9.373 | 9.351 | 8.547 | 9.047 | 5.080 | 5.152 |
| Llama-3.2-1B | 1.236 | 8.306 | 12.036 | 9.331 | 9.097 | 8.556 | 8.755 | 5.267 | 5.101 |
| Index-1.9B | 2.173 | 8.340 | 11.831 | 9.493 | 9.069 | 8.497 | 8.561 | 5.380 | 5.547 |
| stablelm-2-1_6b | 1.645 | 8.396 | 11.761 | 9.237 | 8.943 | 8.762 | 9.088 | 5.558 | 5.425 |
| Falcon-H1-1.5B-Deep-Base | 1.555 | 8.505 | 12.144 | 9.666 | 9.482 | 8.407 | 8.968 | 5.497 | 5.368 |
| RWKV-x060-World-1B6-v2.1-20240328-ctx4096 | 1.600 | 8.564 | 11.434 | 9.555 | 9.276 | 8.822 | 8.990 | 5.906 | 5.968 |
| Falcon-H1-1.5B-Base | 1.555 | 8.639 | 12.287 | 9.796 | 9.645 | 8.507 | 9.089 | 5.635 | 5.514 |
| mamba2-1.3b | 1.344 | 8.699 | 11.944 | 9.710 | 9.463 | 8.925 | 8.714 | 5.851 | 6.286 |
| RWKV-5-World-1B5-v2-20231025-ctx4096 | 1.578 | 8.715 | 11.595 | 9.731 | 9.451 | 8.977 | 9.103 | 6.039 | 6.110 |
| mamba-1.4b-hf | 1.372 | 8.806 | 12.026 | 9.783 | 9.552 | 9.081 | 8.836 | 5.958 | 6.408 |

MMLU Test

Tips

MMLU (Massive Multitask Language Understanding) is a widely recognized benchmark of the general knowledge attained by AI models. It spans 57 subjects, ranging from elementary-level knowledge to advanced professional topics such as law, physics, history, and computer science.

| Model | MMLU | MMLU COT |
| --- | --- | --- |
| RWKV7-g0b-13.3b | 0.765 | 0.827 |
| RWKV7-g0a3-7.2b | 0.651 | 0.723 |
| RWKV7-g1a4-2.9b | 0.613 | 0.675 |
| RWKV7-g1b-1.5b | 0.505 | 0.542 |
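
For reference, a common way to obtain MMLU-style accuracy for a base model is to format each question with its four options and compare the probability the model assigns to each answer letter. The sketch below illustrates this log-likelihood approach with a placeholder model; the actual harness and prompts behind the scores above may differ.

```python
# Minimal sketch of multiple-choice scoring in the MMLU style: format the
# question with options A-D and pick the answer letter to which the model
# assigns the highest next-token score. The exact prompts and evaluation
# harness used for the table above may differ; the model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some/causal-lm"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

question = "What is the capital of France?"
options = {"A": "Berlin", "B": "Madrid", "C": "Paris", "D": "Rome"}

prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids=ids).logits[0, -1]  # next-token distribution after "Answer:"

# Compare the scores of each option letter (" A", " B", ...) as the next token.
scores = {}
for letter in options:
    letter_id = tok(" " + letter, add_special_tokens=False).input_ids[0]
    scores[letter] = logits[letter_id].item()

print(max(scores, key=scores.get))  # predicted answer letter
```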

MMLU Pro Test

Tips

MMLU-Pro is a more robust and challenging massive multi-task understanding benchmark, designed to test large language models' capabilities more rigorously. It contains 12K complex questions across various disciplines.

| Model | MMLU-PRO | MMLU-PRO COT |
| --- | --- | --- |
| RWKV7-g0b-13.3b | 0.502 | 0.612 |
| RWKV7-g0a3-7.2b | 0.359 | 0.521 |
| RWKV7-g1a4-2.9b | 0.324 | 0.43 |
| RWKV7-g1b-1.5b | 0.222 | 0.292 |

GSM8K Test

Tips

GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade-school math word problems. It was created to support question answering on basic math problems that require multi-step reasoning.

| Model | GSM8K |
| --- | --- |
| RWKV7-g0b-13.3b | 0.923 |
| RWKV7-g0a3-7.2b | 0.839 |
| RWKV7-g1a4-2.9b | 0.773 |
| RWKV7-g1b-1.5b | 0.585 |
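
GSM8K reference solutions end with a line of the form `#### <number>`, and accuracy is typically computed by comparing the final number extracted from the model's generation against that value. The sketch below is a simplified illustration of this extraction and matching step, not the exact script used for the table above.

```python
# Sketch of GSM8K-style answer checking: extract the final number from the
# model's generation and compare it with the "#### <number>" reference answer.
# The regex here is a simple illustration, not the exact evaluation script.
import re

def extract_final_number(text: str) -> str | None:
    # take the last integer/decimal appearing in the text, ignoring commas
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

reference = "Natalia sold 48/2 = 24 clips in May.\n#### 72"
model_output = "She sold 48 + 24 = 72 clips in total, so the answer is 72."

gold = extract_final_number(reference.split("####")[-1])
pred = extract_final_number(model_output)
print(pred == gold)  # True -> counts as a correct sample
```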

MATH500 Test

Tips

MATH-500 is an authoritative benchmark for measuring the mathematical reasoning capabilities of AI models. It contains 500 challenging mathematical problems covering algebra, geometry, calculus, probability, statistics, and other fields.

| Model | MATH500 |
| --- | --- |
| RWKV7-g0b-13.3b | 0.768 |
| RWKV7-g0a3-7.2b | 0.678 |
| RWKV7-g1a4-2.9b | 0.482 |
| RWKV7-g1b-1.5b | 0.298 |

IFEval

Tips

IFEval (Instruction-Following Evaluation) is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions", such as "write in more than 400 words" and "mention the keyword 'AI' at least 3 times".
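
Because these instructions are verifiable, each one can be checked by a small deterministic function. The toy checkers below illustrate the two instructions quoted above; IFEval's official implementation covers many more instruction types and is stricter about parsing.

```python
# Toy checkers for the two verifiable instructions mentioned above. IFEval's
# official implementation has many more instruction types and stricter parsing;
# this is only meant to show why such instructions are machine-checkable.
def check_min_words(response: str, min_words: int = 400) -> bool:
    return len(response.split()) >= min_words

def check_keyword_count(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    return response.lower().count(keyword.lower()) >= min_count

response = "AI systems ... (model output here)"
followed = check_min_words(response) and check_keyword_count(response)
print(followed)  # strict prompt-level scoring requires every instruction in the prompt to pass
```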

| Model | IFEval (strict prompt-level) |
| --- | --- |
| RWKV7-g0b-13.3b | 0.689 |
| RWKV7-g0a3-7.2b | 0.58 |
| RWKV7-g1a3-2.9b | 0.51 |
| RWKV7-g1b-1.5b | 0.421 |

RULER Test

Tips

RULER is a newer LLM benchmark that refines and extends the NIAH (Needle In A Haystack) test. It includes four types of tasks: Enhanced Retrieval (extended NIAH), Multi-hop Tracing, Information Aggregation (CWE, FWE), and Question Answering with distracting information.

Enhanced Needle In A Haystack (NIAH)

RULER includes the Enhanced Needle In A Haystack (NIAH) test, divided into four sub-tasks to evaluate the model's retrieval capabilities:

| Sub-task | Brief Description |
| --- | --- |
| Single NIAH (S-NIAH) | Tests the model's ability to handle a single input and a single target output. |
| Multi-keys NIAH (MK-NIAH) | Tests the model's ability to handle multiple key-value pairs, where each key is associated with a single output. |
| Multi-values NIAH (MV-NIAH) | Tests the model's ability to handle multiple key-value pairs, where each key is associated with multiple values or outputs. |
| Multi-queries NIAH (MQ-NIAH) | Tests the model's ability to synthesize and generate corresponding results under multiple query conditions. |
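
The sketch below shows how a single-needle (S-NIAH-style) test case can be constructed: one key-value "needle" is hidden inside a long filler haystack and the model is asked to retrieve the value. RULER's actual generator controls haystack content, needle format, and insertion depth far more carefully; the wording here is illustrative only.

```python
# Hedged sketch of an S-NIAH-style test case: hide one key-value "needle"
# inside a long filler haystack and ask the model to retrieve the value.
# The key phrase, filler sentence, and prompt wording are illustrative only.
import random

def make_single_niah(haystack_words: int = 3000) -> tuple[str, str]:
    key = "the special magic number"
    value = str(random.randint(100000, 999999))
    needle = f"One of the facts is: {key} is {value}."
    filler = ["The grass is green and the sky is blue."] * (haystack_words // 8)
    # insert the needle at a random depth in the haystack
    filler.insert(random.randint(0, len(filler)), needle)
    prompt = " ".join(filler) + f"\n\nQuestion: What is {key}? Answer:"
    return prompt, value

prompt, expected = make_single_niah()
# feed `prompt` to the model and check whether `expected` appears in its answer
```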

Single NIAH test results are as follows:

| Model | NIAH_single_1 | NIAH_single_2 | NIAH_single_3 |
| --- | --- | --- | --- |
| RWKV-6-7B-v2.1-4k | 100 | 98.67 | 95 |
| Llama2-7B-4k | 97.6 | 100 | 96.8 |
| Mamba-2.8B-4k | 100 | 19 | 1 |
| Mamba-1.4B-4k | 94 | 21 | 5 |
| RWKV-6-3B-v2.1-4k | 100 | 88 | 79 |
| RWKV-6-1.6B-v2.1-4k | 98 | 53 | 55 |

NIAH-Multi-keys test results are as follows:

| Model | NIAH_multikey_1 | NIAH_multikey_2 | NIAH_multikey_3 |
| --- | --- | --- | --- |
| RWKV-6-7B-v2.1-4k | 48.33 | 7.67 | 1.33 |
| Llama2-7B-4k | 100 | 84.4 | 60 |
| Mamba-2.8B-4k | 7 | 0 | 1 |
| Mamba-1.4B-4k | 8 | 0 | 0 |
| RWKV-6-3B-v2.1-4k | 36 | 1 | 0 |
| RWKV-6-1.6B-v2.1-4k | 25 | 1 | 0 |

Multi-values and Multi-queries NIAH test results are as follows:

| Model | NIAH_multivalue | NIAH_multiquery |
| --- | --- | --- |
| RWKV-6-7B-v2.1-4k | 80.42 | 83.67 |
| Llama2-7B-4k | 94 | 96.7 |
| Mamba-2.8B-4k | 0.75 | 1.25 |
| Mamba-1.4B-4k | 5.25 | 4.75 |
| RWKV-6-3B-v2.1-4k | 38.5 | 40.75 |
| RWKV-6-1.6B-v2.1-4k | 25 | 20.75 |

Variable Tracking (VT)

Tips

Multi-hop Tracing - Variable Tracking: This task checks whether the model can identify and track entities (variables) linked by multi-hop reference relationships within a long context. For example, given the assignments X1 = V, then X2 = X1, X3 = X2, and so on, the model must finally return all variable names that point to the value V.
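
A minimal sketch of such a variable chain is shown below: every variable ultimately resolves to the same value V, and the model must list all of them. RULER additionally buries these assignments inside long filler text and adds unrelated chains; this only shows the core construction.

```python
# Minimal sketch of a variable-tracking chain like the one described above:
# X1 = V, X2 = X1, ..., so every variable in the chain ultimately resolves to V.
# The statement wording and chain length are illustrative only.
import random

def make_vt_chain(num_hops: int = 4) -> tuple[list[str], list[str], int]:
    value = random.randint(10000, 99999)
    names = [f"X{i}" for i in range(1, num_hops + 2)]
    statements = [f"VAR {names[0]} = {value}"]
    statements += [f"VAR {names[i]} = VAR {names[i - 1]}" for i in range(1, len(names))]
    return statements, names, value  # the model must list all names given the value

statements, expected_vars, value = make_vt_chain()
print(statements)     # e.g. ['VAR X1 = 54321', 'VAR X2 = VAR X1', ...]
print(expected_vars)  # all variable names that point to `value`
```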

| Model | Multi-hop Tracing |
| --- | --- |
| RWKV-6-7B-v2.1-4k | 7.53 |
| Llama2-7B-4k | 63.12 |
| Mamba-2.8B-4k | 45 |
| Mamba-1.4B-4k | 23.4 |
| RWKV-6-3B-v2.1-4k | 11.8 |
| RWKV-6-1.6B-v2.1-4k | 1.4 |

Information Aggregation (CWE, FWE)

Tips

Information Aggregation (CWE, FWE): This task involves Common Words Extraction (CWE) and Frequent Words Extraction (FWE), used to test the model's ability to aggregate common information across long contexts.

| Model | Common Words Extraction (CWE) | Frequent Words Extraction (FWE) |
| --- | --- | --- |
| RWKV-6-7B-v2.1-4k | 38.6 | 78.33 |
| Llama2-7B-4k | 73.04 | 78.8 |
| Mamba-2.8B-4k | 2 | 53 |
| Mamba-1.4B-4k | 15.5 | 57.33 |
| RWKV-6-3B-v2.1-4k | 30.3 | 62.67 |
| RWKV-6-1.6B-v2.1-4k | 11 | 46.33 |

Question Answering (QA)

Tips

Question Answering (QA): This task adds distracting information to the input of existing short-context QA datasets to evaluate QA capabilities under various context sizes.

| Model | qa_1 | qa_2 |
| --- | --- | --- |
| RWKV-6-7B-v2.1-4k | 45 | 37 |
| Llama2-7B-4k | 59.2 | 42 |
| Mamba-2.8B-4k | 23 | 18 |
| Mamba-1.4B-4k | 24 | 23 |
| RWKV-6-3B-v2.1-4k | 35 | 25 |
| RWKV-6-1.6B-v2.1-4k | 35 | 28 |

Data Source

RULER Data Source: https://github.com/Ojiyumm/RULER_RWKV

LongBench Test

Tips

LongBench is a benchmark for evaluating the long-text understanding capabilities of large language models.

LongBench consists of six major categories and twenty-one different bilingual (Chinese-English) tasks, covering critical long-text application scenarios such as Single-Document QA, Multi-Document QA, Summarization, Few-shot Learning, Synthetic Tasks, and Code Completion.

Below is a comparison of LongBench scores for RWKV and other models, with data tables presented by the six categories:

Single-Document QA

Single-Document QA includes the following four test tasks:

| Task | Task Description |
| --- | --- |
| NarrativeQA | QA based on stories or scripts, including understanding of characters, plots, themes, etc. |
| Qasper | QA based on single papers; questions are asked by NLP readers and answered by NLP practitioners. |
| MultiFieldQA-en | Answering English questions based on a single document from relatively diverse fields. |
| MultiFieldQA-zh | Answering Chinese questions based on a single document from relatively diverse fields. |

Single-Document QA Test Results:

| Model | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
| --- | --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 |
| Llama2-7B-chat-4k | 18.7 | 19.2 | 36.8 | 11.9 |
| LongChat-v1.5-7B-32k | 16.9 | 27.7 | 41.4 | 29.1 |
| XGen-7B-8k | 18.0 | 18.1 | 37.7 | 14.8 |
| InternLM-7B-8k | 12.1 | 16.7 | 23.4 | 33.6 |
| ChatGLM2-6B-32k | 21.1 | 31.5 | 46.2 | 51.6 |
| Vicuna-v1.5-7B-16k | 19.4 | 26.1 | 38.5 | 43.0 |
| ChatGLM3-6B-32k | 26.0 | 43.3 | 51.7 | 62.3 |
| Mamba_1B4 | 2.23 | 4.44 | 11.33 | 13.03 |
| Mamba_2B8 | 2.32 | 4.89 | 8.15 | 6.83 |
| Llama2-7B | 18.7 | 19.2 | 11.90 | 36.8 |
| Mistral-7B | 12.79 | 8.9 | 30.55 | 17.91 |
| RWKV-6-World-1B6-v2.1 | 4.53 | 19.79 | 22.99 | 18.57 |
| RWKV-6-World-3B-v2.1 | 2.87 | 14.2 | 18.78 | 21.49 |
| RWKV-6-World-7b-v2.1-4k | 20.75 | 40.2 | 36.01 | 50.19 |

Multi-Document QA

Multi-Document QA includes the following four test tasks:

| Task | Task Description |
| --- | --- |
| HotpotQA | Answering questions based on HotpotQA documents; involves many 2-hop questions written by native speakers based on two related paragraphs. |
| 2WikiMultihopQA | Answering questions based on 2WikiMultihopQA data; composed of up to 5-hop questions synthesized via manually designed templates. |
| MuSiQue | Answering questions based on MuSiQue data; composed of simple questions requiring up to 4-hop reasoning. |
| DuReader | Answering questions based on the Chinese DuReader dataset, containing 200k questions and 1M documents from Baidu Search and Baidu Zhidao. |

Multi-Document QA Test Results:

| Model | HotpotQA | 2WikiMQA | Musique | DuReader (zh) |
| --- | --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 |
| Llama2-7B-chat-4k | 25.4 | 32.8 | 9.4 | 5.2 |
| LongChat-v1.5-7B-32k | 31.5 | 20.6 | 9.7 | 19.5 |
| XGen-7B-8k | 29.7 | 21.1 | 10.3 | 11.0 |
| InternLM-7B-8k | 28.7 | 22.8 | 9.0 | 11.1 |
| ChatGLM2-6B-32k | 45.1 | 34.0 | 21.9 | 37.6 |
| Vicuna-v1.5-7B-16k | 25.3 | 20.8 | 9.8 | 19.3 |
| ChatGLM3-6B-32k | 54.4 | 44.9 | 40.4 | 44.78 |
| Mamba_1B4 | 5.73 | 8.77 | 3.3 | 11.95 |
| Mamba_2B8 | 5.49 | 8.45 | 3.45 | 13.96 |
| Llama2-7B | 25.4 | 32.8 | 9.4 | 5.2 |
| Mistral-7B | 9.39 | 11.17 | 4.58 | 11.68 |
| RWKV-6-World-1B6-v2.1 | 8.72 | 11.86 | 3.96 | 11.40 |
| RWKV-6-World-3B-v2.1 | 6.79 | 9.64 | 4.13 | 17.41 |
| RWKV-6-World-7b-v2.1-4k | 22.74 | 16.3 | 10.49 | 28.01 |

Summarization

The Summarization category involves the following four test tasks:

| Task | Task Description |
| --- | --- |
| GovReport | Summarization task requiring summaries of government work reports. |
| QMSum | Summarization task requiring summaries of meeting minutes based on user queries. |
| MultiNews | Multi-document summarization task requiring summaries based on multiple news articles. |
| VCSUM | Summarization task requiring summaries of Chinese meeting minutes. |

Summarization Test Results:

| Model | GovReport | QMSum | MultiNews | VCSUM (zh) |
| --- | --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 26.7 | 16.0 |
| Llama2-7B-chat-4k | 27.3 | 20.8 | 25.8 | 0.2 |
| LongChat-v1.5-7B-32k | 30.8 | 22.7 | 26.4 | 9.9 |
| XGen-7B-8k | 27.3 | 20.5 | 26.2 | 2.2 |
| InternLM-7B-8k | 9.7 | 15.9 | 22.8 | 12.4 |
| ChatGLM2-6B-32k | 32.4 | 24.0 | 26.5 | 16.2 |
| Vicuna-v1.5-7B-16k | 27.9 | 22.8 | 27.2 | 15.1 |
| ChatGLM3-6B-32k | 36.8 | 23.9 | 27.9 | 17.8 |
| Mamba_1B4 | 9.34 | 10.85 | 15.86 | 6.33 |
| Mamba_2B8 | 10.41 | 11.42 | 18.94 | 6.1 |
| Llama2-7B | 27.3 | 20.8 | 25.8 | 0.2 |
| Mistral-7B | 28.84 | 20.32 | 22.79 | 4.1 |
| RWKV-6-World-1B6-v2.1 | 17.51 | 20.36 | 21.52 | 10.71 |
| RWKV-6-World-3B-v2.1 | 19.21 | 21 | 21.76 | 10.18 |
| RWKV-6-World-7b-v2.1-4k | 31.64 | 21.31 | 26.06 | 15.19 |

Few-shot Learning

Few-shot Learning includes the following four test tasks:

| Task | Task Description |
| --- | --- |
| TREC | Classification task requiring question classification, containing 50 categories in total. |
| TriviaQA | Single-document QA task, providing several few-shot examples. |
| SAMSum | Dialogue summarization task, providing several few-shot examples. |
| LSHT | Chinese classification task requiring news classification, containing 24 categories in total. |

Few-shot Learning Test Results:

| Model | TREC | TriviaQA | SAMSum | LSHT (zh) |
| --- | --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 68.0 | 91.4 | 41.7 | 29.2 |
| Llama2-7B-chat-4k | 61.5 | 77.8 | 40.7 | 19.8 |
| LongChat-v1.5-7B-32k | 63.5 | 82.3 | 34.2 | 23.2 |
| XGen-7B-8k | 65.5 | 77.8 | 25.3 | 20.5 |
| InternLM-7B-8k | 52.0 | 77.8 | 21.2 | 15.2 |
| ChatGLM2-6B-32k | 62.5 | 78.7 | 36.3 | 27.7 |
| Vicuna-v1.5-7B-16k | 71.5 | 86.2 | 40.8 | 28.8 |
| ChatGLM3-6B-32k | 79.0 | 87.1 | 38.2 | 42.0 |
| Mamba_1B4 | 45.5 | 37.33 | 12.56 | 8.5 |
| Mamba_2B8 | 21.5 | 34.62 | 9.3 | 5 |
| Llama2-7B | 61.5 | 77.8 | 40.7 | 19.8 |
| Mistral-7B | 70.0 | 89.26 | 43.74 | 25.5 |
| RWKV-6-World-1B6-v2.1 | 39.5 | 47.64 | 13.58 | 18.8 |
| RWKV-6-World-3B-v2.1 | 51.5 | 57.15 | 17.95 | 15.2 |
| RWKV-6-World-7b-v2.1-4k | 55.5 | 86.89 | 44.25 | 30.2 |

Synthetic Tasks

Synthetic Tasks include the following three test tasks:

| Task | Task Description |
| --- | --- |
| PassageCount | Determine the total number of unique paragraphs among the given paragraphs. |
| PassageRetrieval-en | Given 30 English Wikipedia paragraphs, determine which paragraph the given summary belongs to. |
| PassageRetrieval-zh | Given several Chinese paragraphs from the C4 dataset, determine which paragraph the given summary belongs to. |

Synthetic Tasks Test Results:

| Model | PassageCount | PassageRetrieval-en | PassageRetrieval-zh |
| --- | --- | --- | --- |
| GPT-3.5-Turbo-16k | 4.5 | 71.0 | 77.5 |
| Llama2-7B-chat-4k | 2.1 | 9.8 | 0.5 |
| LongChat-v1.5-7B-32k | 1.0 | 30.5 | 7.6 |
| XGen-7B-8k | 2.1 | 8.5 | 3.5 |
| InternLM-7B-8k | 3.0 | 6.0 | 0.9 |
| ChatGLM2-6B-32k | 1.5 | 77.0 | 64.5 |
| Vicuna-v1.5-7B-16k | 6.5 | 4.5 | 5.0 |
| ChatGLM3-6B-32k | 2.0 | 99.0 | 94.0 |
| Mamba_1B4 | 0.45 | 3.32 | 3.81 |
| Mamba_2B8 | 0.74 | 1.83 | 3.37 |
| Llama2-7B | 2.1 | 9.8 | 0.5 |
| Mistral-7B | 1.05 | 12.5 | 16.75 |
| RWKV-6-World-1B6-v2.1 | 0 | 4.25 | 4.16 |
| RWKV-6-World-3B-v2.1 | 0 | 3.83 | 4.12 |
| RWKV-6-World-7b-v2.1-4k | 5 | 34.5 | 54.22 |

Code Completion

Code Completion includes the following two test tasks:

| Task | Task Description |
| --- | --- |
| LCC | Given a long piece of code, predict the next line of code. |
| RepoBench-P | Given code from multiple files in a GitHub repository (including inter-file dependencies), predict the next line of code. |

Code Completion Test Results:

| Model | LCC | RepoBench-P |
| --- | --- | --- |
| GPT-3.5-Turbo-16k | 54.7 | 53.6 |
| Llama2-7B-chat-4k | 52.4 | 43.8 |
| LongChat-v1.5-7B-32k | 53.0 | 55.3 |
| XGen-7B-8k | 38.6 | 38.6 |
| InternLM-7B-8k | 44.1 | 28.8 |
| ChatGLM2-6B-32k | 55.6 | 49.9 |
| Vicuna-v1.5-7B-16k | 51.0 | 43.5 |
| ChatGLM3-6B-32k | 57.66 | 54.76 |
| Mamba_1B4 | 44.33 | 41.86 |
| Mamba_2B8 | 39.53 | 24.38 |
| Llama2-7B | 52.4 | 43.8 |
| Mistral-7B | 70.64 | 59.7 |
| RWKV-6-World-1B6-v2.1 | 39.5 | 40.44 |
| RWKV-6-World-3B-v2.1 | 40.01 | 41.35 |
| RWKV-6-World-7b-v2.1-4k | 73.84 | 54.1 |

Comprehensive Score Comparison of RWKV, Mamba, Llama2, and Mistral

| Model | Single-Doc QA | Few-shot Learning | Summarization | Multi-Doc QA | Code Completion | Synthetic Tasks |
| --- | --- | --- | --- | --- | --- | --- |
| RWKV-6-World-1B6-v2.1 | 16.470 | 29.868 | 17.525 | 8.985 | 39.970 | 2.803 |
| RWKV-6-World-3B-v2.1 | 14.335 | 35.443 | 18.038 | 9.493 | 40.680 | 2.650 |
| RWKV-6-World-7b-v2.1-4k | 36.788 | 54.203 | 23.550 | 19.385 | 63.970 | 31.240 |
| Mamba_1B4 | 7.758 | 25.973 | 10.595 | 7.438 | 43.095 | 2.527 |
| Mamba_2B8 | 5.548 | 17.605 | 11.718 | 7.838 | 31.955 | 1.980 |
| Llama2-7B | 21.650 | 49.950 | 18.525 | 18.200 | 48.100 | 4.133 |
| Mistral-7B | 17.538 | 52.833 | 19.013 | 9.205 | 65.17 | 10.100 |

Data Source

Evaluation Data Source: https://github.com/Ojiyumm/LongBench_RWKV
