Interface and File Formats Guide
Overview
A run requires:
- a prompt file
- either a reward file or an environment file
Prompt files contain the prompts the model will respond to.
Reward files contain a function to evaluate the model's responses, returning a reward value. Higher rewards correspond to desired behaviors.
Environment files contain an environment class with a setup function and a step function.
Reward Function vs. Environment: Choose a reward function if you only need single-step interactions. Choose an environment class if you need multi-step interactions with tool calling or feedback.
Prompt File Requirements
Prompt files must be in JSONL (JSON Lines) format, with each line containing a complete, valid JSON object. The file extension should be .jsonl.
Each line in your prompt file must include:
- prompt: An array of message objects, where each message has:
  - role: The role of the speaker (e.g., "system", "user", "assistant")
  - content: The message text
You can include additional fields that your reward function might need for evaluation. Common examples include:
- expected_result: The expected answer or output
- difficulty: A difficulty level indicator
- category: A categorization tag
- Any custom metadata your reward function requires
Example Prompt File
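Below is a minimal sketch of a prompt file with two lines. The questions and the expected_result and difficulty values are purely illustrative; any extra fields you include are passed through to your reward function or environment.

```jsonl
{"prompt": [{"role": "system", "content": "You are a concise math assistant."}, {"role": "user", "content": "What is 6 * 7?"}], "expected_result": "42", "difficulty": "easy"}
{"prompt": [{"role": "user", "content": "Name the capital of France."}], "expected_result": "Paris", "difficulty": "easy"}
```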
Best Practices
- Include a diverse distribution of prompts that represent your use case
- Generally, aim for at least 100 distinct examples for effective training
- Consider including difficulty levels or categories if your task has natural groupings
Reward Functions
Requirements: The reward function file must be a Python file and contain a reward function whose signature matches one of the following:
- def reward_fn(completion: str, **kwargs) -> float
- def reward_fn(completion: str, **kwargs) -> tuple[float, Dict[str, float | str]]
Inputs:
- completion: A string, the model's response.
- **kwargs: Contains all fields from the corresponding JSONL line in the prompt file. For example, if your prompt line includes {"prompt": [...], "expected_result": 42, "difficulty": "hard"}, then your reward function will receive expected_result=42 and difficulty="hard" as keyword arguments. This allows you to access any metadata needed for evaluation.
Outputs: There are two valid output formats:
- returning a numerical score, the reward
- returning a tuple where the first element is the numerical reward and the second element is a dictionary of information to be tracked
  - Each key in the info dict should be a string and the corresponding value either a float or a string
  - If the value is a float, the average of this metric at each step will be displayed on a graph
  - If the value is a string, 5 randomly sampled examples (along with the prompt) will be displayed in a table at each step
Global variables can be declared outside of the reward function and used inside it. If they are stateful, make sure to write thread-safe code.
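As a concrete illustration, here is a minimal sketch of a reward function using the tuple output format. It assumes the prompt file provides an expected_result field, as in the earlier example; the scoring rule itself is just one possible design.

```python
from typing import Dict, Tuple


def reward_fn(completion: str, **kwargs) -> Tuple[float, Dict[str, float | str]]:
    # expected_result comes from the corresponding prompt file line via **kwargs.
    expected = str(kwargs.get("expected_result", "")).strip()
    answer = completion.strip()

    # Exact match earns full reward; containment earns partial credit so that
    # scores are varied rather than all-or-nothing.
    if answer == expected:
        reward = 1.0
    elif expected and expected in answer:
        reward = 0.5
    else:
        reward = 0.0

    info: Dict[str, float | str] = {
        "exact_match": float(answer == expected),  # float: averaged and graphed
        "answer_preview": answer[:200],            # string: sampled into a table
    }
    return reward, info
```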
Reward Design Principles
- Varied Scores: Ensure the reward function produces a range of values. If all responses get the same reward, the model cannot learn to distinguish good from bad.
- Clear Signal: Higher rewards should consistently indicate better performance
- Robust Evaluation: Handle edge cases and unexpected inputs gracefully
- Low Hackability: Make it difficult for the model to cheat the reward function
Environment Classes
Requirements: The environment class file must be a Python file and contain an environment class that conforms to the following interface.
setup
Inputs: kwargs contains all fields from the corresponding line in the prompt file (as in the reward function formulation).
step
Inputs: action is a string, the model's response (like completion in the reward function formulation).
Outputs: step returns a dict containing the following:
- observation: the environment's response to the action. For example, in a game of 20 questions, the observation might be yes or no in response to the question posed in the action.
- reward: the numerical reward for that step
- done: True if the episode is complete, otherwise False
- reward_info_dict: a dictionary of information to be tracked (same as the optional dictionary returned in the reward function formulation)
  - Each key in the info dict should be a string and the corresponding value either a float or a string
  - If the value is a float, the average of this metric at each step will be displayed on a graph
  - If the value is a string, 5 randomly sampled examples (along with the prompt) will be displayed in a table at each step
For each prompt, an environment will be initialized. Then step will be called repeatedly with the model's response to the initial prompt and any conversation history (previous responses and observations).
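To make the interface concrete, here is a minimal sketch of an environment class for a guess-the-number game. The class name, the target field, and the 10-step limit are assumptions for illustration; only the setup and step interface above is required.

```python
class Environment:
    def setup(self, **kwargs):
        # kwargs contains all fields from the corresponding prompt file line;
        # target is a hypothetical field holding the number to guess.
        self.target = int(kwargs.get("target", 42))
        self.steps_taken = 0
        self.max_steps = 10  # cap trajectory length (see Best Practices below)

    def step(self, action: str) -> dict:
        self.steps_taken += 1
        out_of_steps = self.steps_taken >= self.max_steps

        try:
            guess = int(action.strip())
        except ValueError:
            return {
                "observation": "Please respond with a single integer.",
                "reward": -0.1,
                "done": out_of_steps,
                "reward_info_dict": {"parse_error": 1.0},
            }

        if guess == self.target:
            observation, reward, done = "Correct!", 1.0, True
        else:
            hint = "higher" if guess < self.target else "lower"
            observation, reward, done = f"Go {hint}.", 0.0, out_of_steps

        return {
            "observation": observation,
            "reward": reward,
            "done": done,
            "reward_info_dict": {"steps_taken": float(self.steps_taken)},
        }
```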
Best Practices
- To prevent excessively long trajectories, the environment should limit the number of steps
- The reward function design principles also apply to the environment rewards.
Optimizing the Reward Function/Environment Class
- RunRL parallelizes reward computation (and init/steps for environments), so don't worry about making things asynchronous.
- However, declaring expensive or reused resources (e.g. API clients) as global variables outside the reward function/environment class can greatly speed up the computation.
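For example, a reward function that calls an external scoring service could construct the client once at module level rather than on every call. The endpoint and response shape below are hypothetical.

```python
import httpx

# Created once when the file is loaded, then reused by every reward call.
_CLIENT = httpx.Client(timeout=30.0)


def reward_fn(completion: str, **kwargs) -> float:
    # Hypothetical scoring endpoint; replace with your own service.
    response = _CLIENT.post(
        "https://example.com/score",
        json={"completion": completion},
    )
    response.raise_for_status()
    return float(response.json()["score"])
```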