RunRL

Interface and File Formats Guide

Overview

A run requires:

  • a prompt file
  • either a reward file or an environment file

Prompt files contain the prompts the model will respond to.

Reward files contain a function to evaluate the model's responses, returning a reward value. Higher rewards correspond to desired behaviors.

Environment files contain an environment class with a setup function and a step function.

Reward Function vs. Environment: Choose a reward function if you only need single-step interactions. Choose an environment class if you need multi-step interactions with tool calling or feedback.

Prompt Files Requirements

Prompt files must be in JSONL (JSON Lines) format, with each line containing a complete, valid JSON object. The file extension should be .jsonl. Each line in your prompt file must include:

  • prompt: An array of message objects, where each message has:
    • role: The role of the speaker (e.g., "system", "user", "assistant")
    • content: The message text

You can include additional fields that your reward function might need for evaluation. Common examples include:

  • expected_result: The expected answer or output
  • difficulty: A difficulty level indicator
  • category: A categorization tag
  • Any custom metadata your reward function requires

Example Prompt File

{"prompt":[{"role":"system","content":"You are an expert at mathematics."},{"role":"user","content":"What is the value of this expression: { ((54 - 140 * 118 + 130) + 197) + 46 }? Think for as long as you want, and then give your answer inside of <answer></answer> XML tags."}],"expected_result":-16093}
{"prompt":[{"role":"system","content":"You are an expert at mathematics."},{"role":"user","content":"What is the value of this expression: { (56 + 141 + 74) + 135 }? Think for as long as you want, and then give your answer inside of <answer></answer> XML tags."}],"expected_result":406}
{"prompt":[{"role":"system","content":"You are an expert at mathematics."},{"role":"user","content":"What is the value of this expression: { (194 + 92 + 120) + 26 - 22 }? Think for as long as you want, and then give your answer inside of <answer></answer> XML tags."}],"expected_result":410}

Best Practices

  • Include a diverse distribution of prompts that represent your use case
    • Generally, aim for at least 100 distinct examples for effective training
  • Consider including difficulty levels or categories if your task has natural groupings

Reward Functions

Requirements: The reward function file must be a Python file and contain a reward function whose signature matches one of the following:

  • def reward_fn(completion: str, **kwargs) -> float

    or

  • def reward_fn(completion: str, **kwargs) -> tuple[float, Dict[str, float | str]]

Inputs:

  • completion: A string - the model's response.
  • **kwargs: Contains all fields from the corresponding JSONL line in the prompt file. For example, if your prompt line includes {"prompt": [...], "expected_result": 42, "difficulty": "hard"}, then your reward function will receive expected_result=42 and difficulty="hard" as keyword arguments. This allows you to access any metadata needed for evaluation.

Outputs: There are two valid output formats:

  • Return a numerical score (the reward)
  • Return a tuple whose first element is the numerical reward and whose second element is a dictionary of information to be tracked
    • Each key in the info dict should be a string and the corresponding value either a float or a string
    • If the value is a float, the average of this metric for each step will be displayed on a graph
    • If the value is a string, for each step 5 randomly sampled examples (along with the prompt) will be displayed in a table

Global variables can be declared and used outside of the reward function. If they are stateful, make sure to write thread-safe code.
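For example, here is a minimal sketch of a reward function for the math prompts shown earlier. It assumes the final answer appears inside <answer></answer> tags and that each prompt line carries an expected_result field; the parsing details and the tracked metrics ("parsed", "sample") are illustrative, not part of the interface.

import re
from typing import Dict, Tuple, Union

def reward_fn(completion: str, **kwargs) -> Tuple[float, Dict[str, Union[float, str]]]:
    expected = kwargs.get("expected_result")
    # Extract the last <answer>...</answer> block from the completion.
    matches = re.findall(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if expected is None or not matches:
        # No parseable answer: zero reward, but log what the model produced.
        return 0.0, {"parsed": 0.0, "sample": completion[-200:]}
    try:
        answer = float(matches[-1].strip())
    except ValueError:
        return 0.0, {"parsed": 0.0, "sample": completion[-200:]}
    reward = 1.0 if answer == float(expected) else 0.0
    # "parsed" is a float, so its average is plotted per step; "sample" is a
    # string, so randomly sampled values are shown in a table.
    return reward, {"parsed": 1.0, "sample": completion[-200:]}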

Reward Design Principles

  • Varied Scores: Ensure the reward function produces a range of values. If all responses get the same reward, the model cannot learn to distinguish good from bad.
  • Clear Signal: Higher rewards should consistently indicate better performance
  • Robust Evaluation: Handle edge cases and unexpected inputs gracefully
  • Low Hackability: Make it difficult for the model to cheat the reward function

Environment Classes

Requirements: The environment class file must be a Python file and contain an environment class that conforms to the following interface.

from typing import Dict

class CustomEnv:
    def setup(self, **kwargs) -> None:
        """Initialize the environment."""
        pass
    
    def step(self, action: str) -> Dict:
        """
        Process an action and return the result.
        
        Returns:
            Dict with keys:
                - 'observation': str, the next observation
                - 'reward': float, the reward for this step
                - 'done': bool, whether the episode is complete
                - 'reward_info_dict': Optional[Dict[str, Union[str, float]]], additional information to be plotted
        """
        pass

setup Inputs: kwargs contains all fields from the corresponding line in the prompt file (as in the reward function formulation).

step Inputs: action is a string, the model's response (like completion in the reward function formulation).

step Outputs: step returns a dict containing the following:

  • observation: the environment's response to the action. For example, in a game of 20 questions, the observation might be yes or no in response to the question posed in the action.
  • reward: the numerical reward for that step
  • done: True if the episode is complete, otherwise False
  • reward_info_dict: a dictionary of information to be tracked (same as optional dictionary returned in the reward function formulation)
    • Each key in the info dict should be a string and the corresponding value either a float or a string
    • If the value is a float, the average of this metric for each step will be displayed on a graph
    • If the value is a string, for each step 5 randomly sampled examples (along with the prompt) will be displayed in a table

For each prompt, an environment will be initialized. Then step will be called repeatedly with the model's response to the initial prompt and any conversation history (previous responses and observations).
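As a concrete illustration, here is a minimal sketch of an environment for a number-guessing task that follows this interface. The task itself, the hypothetical target field in the prompt file, and the 10-step cap are assumptions made for the example.

from typing import Dict

class GuessingEnv:
    def setup(self, **kwargs) -> None:
        # kwargs carries all fields from the prompt line; here we assume
        # a hypothetical "target" field holding the number to guess.
        self.target = int(kwargs.get("target", 0))
        self.steps_taken = 0
        self.max_steps = 10  # cap trajectory length (see Best Practices)

    def step(self, action: str) -> Dict:
        self.steps_taken += 1
        try:
            guess = int(action.strip())
        except ValueError:
            return {
                "observation": "Please reply with a single integer.",
                "reward": 0.0,
                "done": self.steps_taken >= self.max_steps,
                "reward_info_dict": {"valid_guess": 0.0},
            }
        if guess == self.target:
            return {
                "observation": "Correct!",
                "reward": 1.0,
                "done": True,
                "reward_info_dict": {"valid_guess": 1.0},
            }
        hint = "higher" if guess < self.target else "lower"
        return {
            "observation": f"Try {hint}.",
            "reward": 0.0,
            "done": self.steps_taken >= self.max_steps,
            "reward_info_dict": {"valid_guess": 1.0},
        }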

Best Practices

  • To prevent excessively long trajectories, the environment should limit the number of steps
  • The reward function design principles also apply to the environment rewards.

Optimizing the Reward Function/Environment Class

  • RunRL parallelizes reward computation (and init/steps for environments), so don't worry about making things asynchronous.
  • However, declaring expensive or reused resources (e.g. API clients) as global variables outside the reward function/environment class can greatly speed up the computation.
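For instance, a shared HTTP session (or any other expensive client) can be created once at module level and reused by every call. This is a sketch only; the scoring endpoint below is a placeholder, not part of RunRL.

import requests

# Created once at import time and reused across all reward calls.
_session = requests.Session()

def reward_fn(completion: str, **kwargs) -> float:
    # Reusing the shared session avoids re-creating a connection on every
    # call; keep any shared mutable state thread-safe.
    resp = _session.post("https://example.com/score", json={"text": completion})
    return float(resp.json().get("score", 0.0))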
