Preparing the Training Dataset

Follow the steps below to prepare a training dataset for an RWKV model.

Organizing jsonl Data

First, decide what knowledge you want the RWKV model to learn. This can be material from a specific field, such as legal Q&A or financial knowledge, or text for a specific type of task, such as summarization or role-playing.

In short, collect data that matches your task requirements and organize it into a file in the jsonl format.

The jsonl formats below serve as a reference for different content types and task types:

Single-round Question and Answer

Single-round question and answer is usually used to train downstream tasks such as chatbots. The data format is:

{"text": "User: Question\n\nAssistant: Answer"}

An example of a single-round question and answer:

{"text": "User: What is water?\n\nAssistant: Water is a colorless, odorless, and tasteless liquid, and it is one of the most common substances on Earth."}

In this example, the content after User: is the input a human gives to the model, and the content after Assistant: is the answer the model is expected to give.
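Writing the escaped \n\n by hand is error-prone, so one option is to generate each line programmatically. The following is a minimal Python sketch, not part of any RWKV tooling; the function name single_round_record is hypothetical. It relies on json.dumps to escape newlines and quotation marks.

```python
import json


def single_round_record(question: str, answer: str) -> str:
    """Build one jsonl line in the single-round question and answer format."""
    # A blank line (two newline characters) separates the User and Assistant turns.
    text = f"User: {question}\n\nAssistant: {answer}"
    # ensure_ascii=False keeps non-ASCII characters readable in the output file.
    return json.dumps({"text": text}, ensure_ascii=False)


line = single_round_record(
    "What is water?",
    "Water is a colorless, odorless, and tasteless liquid.",
)
print(line)
```

Each returned string is one complete line of the jsonl file; append a newline character after it when writing to disk.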

Tips

In addition to User and Assistant, you can also add the System role to provide background settings or strengthen the model's recognition of the Assistant: role.

{"text": "System: You are an excellent tour guide proficient in the history and geography of Guangdong.\n\nUser: Tour guide, what is the capital city of Guangdong?\n\nAssistant: The capital city of Guangdong is Guangzhou, which has a very long history."}
{"text": "System: It is the period of the Three Kingdoms, and the world is in chaos with many warlords vying for power. You are a soldier fighting against Zhang Fei.\n\nUser: Boy, take this sword from Zhang Fei!\n\nAssistant: Please spare my life, Brother Zhang Fei!"}

The System role is also applicable to the multi-round conversation data below.

Multi-round Conversation

Multi-round conversation data is suitable for task scenarios of continuous conversation and context understanding, such as customer service robots and role-playing.

The format of multi-round conversation data is:

{"text": "User: Question 1\n\nAssistant: Answer 1\n\nUser: Question 2\n\nAssistant: Answer 2"}

An example of a multi-round conversation:

{"text": "User: Good evening!\nNice to meet you!\n\nAssistant: Good evening!\nNice to meet you too!\n\nUser: I'm ten years old this year.\nHow old are you?\n\nAssistant: I'm five years old this year."}

Warning

Note that each User: and Assistant: turn must be separated from the next by \n\n, whereas line breaks within the content of a turn must be written as a single \n.
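The separator rule above can be enforced in code. The sketch below is only an illustration under assumptions: the helper names clean_turn and conversation_record are hypothetical, and collapsing blank lines inside a turn is one possible way to keep \n\n reserved for role changes.

```python
import json
import re


def clean_turn(content: str) -> str:
    """Collapse blank lines inside one turn into single line breaks."""
    content = content.replace("\r\n", "\n").strip()
    # Any run of newlines (optionally with whitespace between) becomes one newline.
    return re.sub(r"\n\s*\n", "\n", content)


def conversation_record(turns, system=None) -> str:
    """turns is a list of (role, content) pairs, with role "User" or "Assistant"."""
    parts = []
    if system is not None:
        parts.append(f"System: {clean_turn(system)}")
    for role, content in turns:
        parts.append(f"{role}: {clean_turn(content)}")
    # Roles are joined with a blank line, as the format requires.
    return json.dumps({"text": "\n\n".join(parts)}, ensure_ascii=False)


record = conversation_record(
    [
        ("User", "Good evening!\n\nNice to meet you!"),
        ("Assistant", "Good evening!\nNice to meet you too!"),
    ]
)
print(record)
```

The stray blank line in the first User turn is reduced to a single line break, so the only double line break left in the record is the one between the two roles.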

Instruction Question and Answer

Instruction-based question and answer data is suitable for tasks such as information extraction, material summarization, and meeting minutes, and it is also the recommended format for instruction tuning.

{"text": "Instruction: Instruction\n\nInput: Content\n\nResponse: Answer"}

Here, Instruction is the instruction given to the model, Input is the content the instruction should be applied to, and Response is the answer given by the model.

Tips

Insert a single half-width (ASCII) space between Instruction:, Input:, Response: and the text that follows each of them.

In addition, Instruction:, Input:, and Response: must be separated from one another by \n\n, whereas line breaks within the content of a field must be written as a single \n.

An example of an instruction-based question and answer:

{ "text": "Instruction: Please determine which category the following sentence belongs to. The categories include culture, entertainment, sports, finance, real estate, automobiles, education, technology, military, tourism, international affairs, securities, agriculture, e-sports, and people's livelihood. Please directly output the category without any additional content.\n\nInput: The RWKV large model officially launched the seventh-generation architecture RWKV-7.\n\nResponse: Technology"}

The content of Instruction: and Input: will be concatenated and used as the input to the model, and the content of Response: is the answer given by the model.

In this example, the model will receive the following input:

Please determine which category the following sentence belongs to. The categories include culture, entertainment, sports, finance, real estate, automobiles, education, technology, military, tourism, international affairs, securities, agriculture, e-sports, and people's livelihood. Please directly output the category without any additional content.
The RWKV large model officially launched the seventh-generation architecture RWKV-7.

The model will give the following output:

Technology
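As a rough illustration of the format described above, the following Python sketch assembles an instruction record from its three fields. The function name instruction_record is hypothetical, and the sample strings are shortened versions of the example above.

```python
import json


def instruction_record(instruction: str, input_text: str, response: str) -> str:
    """Build one jsonl line in the Instruction / Input / Response format."""
    # One ASCII space follows each label; the fields are separated by a blank line.
    text = (
        f"Instruction: {instruction}\n\n"
        f"Input: {input_text}\n\n"
        f"Response: {response}"
    )
    return json.dumps({"text": text}, ensure_ascii=False)


line = instruction_record(
    "Please determine which category the following sentence belongs to.",
    "The RWKV large model officially launched the seventh-generation architecture RWKV-7.",
    "Technology",
)
print(line)
```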

Long Text Data (Articles/Novels)

Long text data such as articles and novels is usually used to train continuous long text generation tasks such as text continuation and text expansion.

For long text content such as whole novels and extremely long articles, the data format is:

{"text": "Turn the content of each article into one line of JSONL, even if it is a novel with one million words."}

For short content with titles such as news and announcements, the data format is:

{"text": "《Title》\nBody content"}

For the task of continuing a single paragraph of a novel or article, the data format is:

{"text": "User: The beginning of a paragraph of about 100 words\n\nAssistant: The subsequent text of the paragraph"}

For the task of expanding a novel outline into a full chapter, the data format is:

{"text": "User: The outline of the chapter\n\nAssistant: The complete content of the chapter"}
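The long-text formats above could be produced along the following lines. This is a sketch under assumptions: the function names article_record and continuation_record are hypothetical, and splitting on whitespace to count "about 100 words" only makes sense for space-delimited languages such as English.

```python
import json


def article_record(body: str, title=None) -> str:
    """One article or novel per jsonl line, optionally prefixed with a 《Title》 line."""
    text = body.strip()
    if title:
        text = f"《{title}》\n{text}"
    # json.dumps escapes every line break, so the record stays on a single line.
    return json.dumps({"text": text}, ensure_ascii=False)


def continuation_record(paragraph: str, prefix_words: int = 100) -> str:
    """Split one paragraph into a beginning (User) and its continuation (Assistant)."""
    words = paragraph.split()
    if len(words) <= prefix_words:
        raise ValueError("paragraph is too short to split")
    beginning = " ".join(words[:prefix_words])
    rest = " ".join(words[prefix_words:])
    return json.dumps(
        {"text": f"User: {beginning}\n\nAssistant: {rest}"}, ensure_ascii=False
    )


print(article_record("Body content", title="Title"))
print(continuation_record("one two three four five six", prefix_words=3))
```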

More Details of the Training Data

How Much Data is Needed

There is no strict requirement for the amount of fine-tuning data. It can be several hundred samples, several thousand samples, or more.

Generally, more data leads to better fine-tuning results, but producing high-quality data takes considerable time and labor, so the amount should be adjusted to the actual situation:

Adjust according to the complexity of the fine-tuning task: Simple classification or summarization tasks may need only a few hundred samples, while complex role-playing or text generation tasks require several thousand samples or more.
Quality is more important than quantity: High-quality data should accurately reflect the characteristics of the target task and cover diverse scenarios and expressions.

Adjusting the training data is an iterative process: the steps in the flowchart below are usually repeated until a satisfactory fine-tuning effect is obtained.

[Figure: ft-dataset-mermaid (flowchart of the dataset iteration steps)]

Duplicating and Shuffling the Data

When training data is scarce, duplicating it helps reinforce the model's understanding and memory of specific knowledge, while shuffling the order of the samples reduces the risk of overfitting.

On Linux or macOS, use the following commands to repeat and shuffle the data file:

# Repeat the data.jsonl file three times and output all lines to the repeated-data.jsonl file
# This command can also be used to merge multiple jsonl data files
awk 'NR > 1 && NF == 0 {next} {print}' data.jsonl data.jsonl data.jsonl > repeated-data.jsonl
# Shuffle the lines of the repeated-data.jsonl file and output the shuffled result to the shuffled-data.jsonl file
shuf repeated-data.jsonl > shuffled-data.jsonl

Tips

data.jsonl needs to be replaced with the name of the jsonl data file you have prepared.

Tips

On Windows systems, you can use WSL (Windows Subsystem for Linux) or install the Cygwin tool to use the awk and shuf commands.
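Where awk and shuf are not available, a short Python script can do roughly the same job. The sketch below is an alternative illustration rather than the documented workflow: the function names are hypothetical, the file names match the commands above, and unlike the awk filter it drops every blank line, including a blank first line.

```python
import random


def repeat_and_shuffle(lines, repeats=3, seed=None):
    """Drop blank lines, repeat the remaining lines, and shuffle the result."""
    kept = [line for line in lines if line.strip()]
    repeated = kept * repeats
    # A dedicated Random instance makes the shuffle reproducible when seed is set.
    random.Random(seed).shuffle(repeated)
    return repeated


def process_file(src="data.jsonl", dst="shuffled-data.jsonl", repeats=3, seed=None):
    """Read src, repeat and shuffle its lines, and write them to dst."""
    with open(src, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    with open(dst, "w", encoding="utf-8") as f:
        for line in repeat_and_shuffle(lines, repeats, seed):
            f.write(line + "\n")


if __name__ == "__main__":
    print(repeat_and_shuffle(["line A", "", "line B"], repeats=3, seed=0))
```

Calling process_file() with its defaults reads data.jsonl and writes shuffled-data.jsonl in the current directory.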

Adding Conventional Data

It is recommended to add some conventional data, meaning general samples outside your narrow target format, to the fine-tuning dataset. This helps improve the model's generalization ability and reduces the risk of overfitting.

Suppose we are fine-tuning a model for answering elementary math problems, and the data samples are similar to this:

{"text": "User: 1 + 1 = ?\n\nAssistant: 2"}
{"text": "User: 1 + 2 = ?\n\nAssistant: 3"}
{"text": "User: 1 + 3 = ?\n\nAssistant: 4"}

In this case, we can add other phrasings of math problems, as well as conventional conversation data from non-math fields, to the dataset, for example:

{"text": "User: 1 + 1 = ?\n\nAssistant: 2"}
{"text": "User: I have 5 apples and give 2 to Xiaoming. How many are left?\n\nAssistant: 3 apples"}
{"text": "User: What is the sum of 8 and 7?\n\nAssistant: 15"}
{"text": "User: The area of a rectangle is 20 square meters and the width is 4 meters. So what is its length?\n\nAssistant: The length is 5 meters."}
{"text": "User: What's the weather like today?\n\nAssistant: It's sunny today, and it's suitable for going out and having fun."}
{"text": "User: 1 + 2 = ?\n\nAssistant: 3"}
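Mixing the two kinds of data can also be scripted. The following is a minimal sketch under assumptions: the function name mix_datasets and the general_fraction parameter are hypothetical, and the default of 0.2 is an arbitrary illustrative value, not a recommendation from this guide.

```python
import random


def mix_datasets(task_lines, general_lines, general_fraction=0.2, seed=None):
    """Add sampled general lines so they form about general_fraction of the result."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in the range [0, 1)")
    rng = random.Random(seed)
    # Solve n / (len(task_lines) + n) = general_fraction for n.
    n_general = round(len(task_lines) * general_fraction / (1 - general_fraction))
    # Never ask for more general samples than are available.
    n_general = min(n_general, len(general_lines))
    mixed = list(task_lines) + rng.sample(list(general_lines), n_general)
    rng.shuffle(mixed)
    return mixed


task = [f"math sample {i}" for i in range(8)]
general = [f"general sample {i}" for i in range(10)]
mixed = mix_datasets(task, general, general_fraction=0.2, seed=0)
print(len(mixed))
```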

Converting jsonl Files to binidx Files

After obtaining the training data in the jsonl format, we need to use the json2binidx tool to convert the jsonl file into a binidx file that is more suitable for RWKV training.
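Before running the conversion, it can be worth checking that every line of the jsonl file is valid. The sketch below is an optional sanity check and not part of json2binidx; the function name validate_jsonl_lines is hypothetical, and it assumes each record should be a JSON object with a non-empty string field named text, as in the formats above.

```python
import json


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            problems.append((number, 'missing string field "text"'))
        elif not record["text"].strip():
            problems.append((number, 'empty "text" field'))
    return problems


sample = ['{"text": "User: 1 + 1 = ?\\n\\nAssistant: 2"}', '{"text": 1}', "not json"]
for number, problem in validate_jsonl_lines(sample):
    print(f"line {number}: {problem}")
```

To check a real file, pass an open file object, which yields its lines one by one.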

The following prompt means that the conversion is complete:

[Image: json2binidx-trans-data (conversion completion message)]

You should be able to find the converted bin/idx files in the data folder:

[Image: binidx-dataset (converted bin/idx files in the data folder)]

The preparation is now complete. Next, select a fine-tuning method and read the corresponding documentation in the following articles.