
RWKV Language Model

RWKV (pronounced as RWaKuV) is an RNN with GPT-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable).

RWKV is an open-source, non-profit group under the Linux Foundation, supported by our sponsors.

It combines the best of RNNs and transformers: great performance, fast inference, fast training, low VRAM usage, "infinite" context length, and free sentence embeddings. Moreover, it is 100% attention-free.
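
The "infinite" context and low-VRAM claims come from RWKV's RNN formulation: at inference time the model carries a small, fixed-size state forward instead of a key-value cache that grows with the prompt. Below is a minimal sketch of this stateful inference loop using the community `rwkv` pip package; the checkpoint path and strategy string are placeholders, and the tokenizer name assumes a World-series model.

```python
# Minimal sketch of RNN-style (stateful) inference with the `rwkv` pip package.
# The checkpoint path is a placeholder; check the rwkv package docs for the
# expected path format. The vocab name assumes a World-series model.
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model="path/to/RWKV-World-checkpoint", strategy="cpu fp32")
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

state = None  # fixed-size recurrent state, in place of a growing KV cache
for token in pipeline.encode("The quick brown fox"):
    # Each step costs the same regardless of how much context came before,
    # which is where the "infinite" ctxlen and VRAM savings come from.
    logits, state = model.forward([token], state)

# `logits` scores the next token; `state` summarises the entire context so far.
```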

RWKV architecture paper

RWKV: Reinventing RNNs for the Transformer Era - https://arxiv.org/abs/2305.13048

Current Version Status

| Version | v4 - Raven | v4 - Dove | v5 - Eagle | v6 - Finch |
| --- | --- | --- | --- | --- |
| Paper | 🎓 Paper accepted @ EMNLP 2023 | (no architecture change) | 🔧 stable (current version) | 🧪 prototype |
| Overall status | 🌚 EOL - recommended to use v5 World instead | 🌚 EOL - recommended to use v5 World instead | ✅ General Availability | 🧪 Early Training |
| 0.4B model | Fully Trained: rwkv-pile-430m | Fully Trained | ✅ Fully Trained | 🧪 Early Training |
| 1.5B model | Fully Trained: rwkv-raven-1b5 | Fully Trained | ✅ Fully Trained | 🧪 Early Training |
| 3B model | Fully Trained: rwkv-raven-3b | Fully Trained | ✅ Fully Trained | 🧪 Early Training |
| 7B model | Fully Trained: rwkv-raven-7b | Fully Trained | ✅ Fully Trained | ... |
| 14B model / 7B 2T model | Fully Trained: rwkv-raven-14b | not planned | scheduled | ... |
| 8x7B MoE model | not planned | not planned | scheduled | ... |

TL;DR vs existing transformer models

Good

  • Lower resource usage (VRAM, CPU, GPU, etc.) when running and training.
  • 10x to 100x lower compute requirements than transformers at large context sizes.
  • Scales linearly with context length, while transformers scale quadratically (see the toy comparison after this list).
  • Performs just as well in terms of answer quality and capability.
  • RWKV models are generally better trained in other languages (e.g. Chinese, Japanese, etc.) than most existing OSS models.
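
To make the scaling claim concrete, here is a toy back-of-envelope comparison (not a benchmark) of the sequence-mixing step only, under the usual simplification that self-attention does O(T²) work while a recurrent mix does O(T):

```python
# Toy illustration (not a benchmark): cost of the sequence-mixing step only,
# in arbitrary "work units", assuming O(T^2) attention vs O(T) recurrence.
def attention_cost(tokens: int) -> int:
    return tokens * tokens

def rwkv_mixing_cost(tokens: int) -> int:
    return tokens

for ctx in (1_024, 8_192, 65_536):
    ratio = attention_cost(ctx) / rwkv_mixing_cost(ctx)
    print(f"ctx={ctx:>6}: attention / RWKV mixing work ~ {ratio:,.0f}x")

# End-to-end gains are far smaller than these raw ratios, since feed-forward
# layers cost the same in both architectures; that is why the practical claim
# is "10x to 100x" rather than thousands of times.
```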

Bad

  • Is sensitive to prompt formatting; you may need to change how you prompt the model (see the example after this list).
  • Is weaker at tasks that require lookback, so reorder your prompt accordingly.
    • (e.g. instead of saying "For the document above, do X", which requires a lookback, say "For the document below, do X" instead)
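
As a concrete illustration of the lookback point above, the snippet below contrasts the two prompt orderings; the document text and summarisation task are placeholders.

```python
# Placeholder document and task, illustrating the recommended prompt order.
document = "...long document text..."

# Weaker for RWKV: the instruction arrives last and refers back to the text,
# forcing the model to have already memorised everything relevant.
prompt_lookback = f"{document}\n\nFor the document above, write a summary."

# Better for RWKV: the instruction comes first and points forward, so the
# model knows what to extract while it reads the document.
prompt_forward = f"For the document below, write a summary.\n\n{document}"
```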

Who sponsors the compute for RWKV?

RWKV is made possible, as an open-source project, thanks to the large amount of GPU compute and researcher time contributed by:

Without their invaluable support, we would not have been able to develop the core RWKV foundation models that you see today.


In addition, we would like to thank

for helping with GPU time on smaller experiments, finetunes, and various models, especially those models that never got publicly released due to failed runs.

Quick RWKV community terminology

  • RWKV - The model architecture itself; code can be found at https://github.com/BlinkDL/RWKV-LM
  • RWKV World - New base model trained on a larger, more diverse mix of datasets, including samples from over 100 languages. Partially instruction trained.
  • Raven - Official finetuned version of the base model, with instruction training.
  • Base model / Pile Plus model - The RWKV base model is currently trained on "The Pile" with an additional mix of other datasets. This model is not instruction trained.
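
To make the terminology concrete, the sketch below maps each family onto BlinkDL's public Hugging Face repositories and counts the checkpoints they contain. It assumes the `huggingface_hub` package; the repo ids are the v4-era public repos, so check Hugging Face for the current list (newer v5/v6 repos exist as well).

```python
# Sketch: enumerate checkpoints for each RWKV model family on Hugging Face.
# Assumes `pip install huggingface_hub`; repo ids are BlinkDL's public repos.
from huggingface_hub import list_repo_files

MODEL_FAMILIES = {
    "RWKV World (multilingual base, partially instruction trained)": "BlinkDL/rwkv-4-world",
    "Raven (instruction-tuned finetune)": "BlinkDL/rwkv-4-raven",
    "Pile base model (not instruction trained)": "BlinkDL/rwkv-4-pile-430m",
}

for label, repo_id in MODEL_FAMILIES.items():
    checkpoints = [f for f in list_repo_files(repo_id) if f.endswith(".pth")]
    print(f"{label}\n  {repo_id}: {len(checkpoints)} checkpoint file(s)")
```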

Which RWKV models should I be using?