Python File Formats

This guide provides Python-specific implementation details for reward functions and environment classes used with RunRL.

Reward Functions

Function Signature

def reward_fn(completion, **kwargs):
    """
    Evaluate the model response and return a reward score.

    Args:
        completion: The model's response, formatted as a list of messages,
            e.g. [{"role": "assistant", "content": "This is an example response."}]
        kwargs: Additional arguments from any keys other than "prompt" in the prompts.jsonl file

    Returns:
        float: Reward score (higher is better)
    """
    # This requires every line of the prompts.jsonl file to include an "expected_result" field
    expected = kwargs['expected_result']
    response = completion[0]["content"]

    # Example: Check if the response contains expected answer
    if expected and str(expected) in response:
        return 1.0

    return 0.5

Parameters

  • completion: The model's response, formatted as a list of messages (e.g., [{"role": "assistant", "content": "..."}])
  • **kwargs: Additional arguments from your prompt JSONL file entries (any field other than "prompt"). For example, if your JSONL includes an expected_result field, you can access it with kwargs['expected_result']
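
For illustration, here is a conceptual sketch (not the platform's actual loading code) of how the extra fields on a prompts.jsonl line become keyword arguments; the "prompt" value and the expected_result field name are placeholders:

import json

# Hypothetical prompts.jsonl line ("prompt" shown schematically)
line = '{"prompt": "...", "expected_result": "4"}'

row = json.loads(line)
extra_fields = {k: v for k, v in row.items() if k != "prompt"}
print(extra_fields)  # {'expected_result': '4'} -- passed to the reward function as **kwargs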

Return Value

The function must return a single float value representing the reward score for the completion. Higher values indicate better performance.
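
A quick way to check this locally is to call the function directly with a hand-written completion (a sketch, assuming the reward_fn defined above):

completion = [{"role": "assistant", "content": "The answer is 4."}]

score = reward_fn(completion, expected_result="4")
print(type(score), score)  # <class 'float'> 1.0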

Example: Math Problem Evaluator

This example reward function evaluates math problems, rewarding correct answers wrapped in <answer> tags and penalizing each digit used in the response:

def extract_answer(text):
    # Extracts the answer from between <answer> tags
    if "<answer>" in text and "</answer>" in text:
        return text.split("<answer>")[1].split("</answer>")[0]
    else:
        return ""

def response_penalty(text):
    # Counts the digits used in the response
    return sum(1 for c in text if c.isdigit())

def reward_fn(completion, **kwargs):
    # Get expected result from kwargs
    expected_result = kwargs.get('expected_result')
    if expected_result is None:
        raise ValueError("expected_result is required for this reward function")

    # Get model response from completion
    model_response = completion[0]['content']

    # Extract answer from response
    extracted_answer = extract_answer(model_response)

    # Calculate correctness reward (10.0 for correct, 0.0 for incorrect)
    response_correctness_reward = 10.0 if extracted_answer == str(expected_result) else 0.0

    # Calculate penalty for using digits
    digit_penalty = response_penalty(model_response)

    # Combine reward and penalty
    reward = response_correctness_reward - digit_penalty

    return reward

Best Practices

  • Document your reward criteria clearly
  • Handle missing or invalid inputs gracefully
  • Use helper functions for complex logic
  • Consider both positive rewards for desired behaviors and penalties for undesired behaviors
  • Ensure your reward function produces varied scores - if the model receives only minimum or maximum rewards, it cannot learn to distinguish between better and worse responses
  • Test your reward function with sample completions before using it for training
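
As a concrete check for the last two points, a minimal local test (a sketch, assuming the math evaluator reward_fn defined above) can run a few hand-written completions through the function and confirm that the scores are not all identical:

sample_completions = [
    [{"role": "assistant", "content": "The answer is <answer>forty-two</answer>"}],
    [{"role": "assistant", "content": "I think it is <answer>42</answer>"}],
    [{"role": "assistant", "content": "First, 6 * 7 = 42, so <answer>42</answer>"}],
]

scores = [reward_fn(c, expected_result=42) for c in sample_completions]
print(scores)  # [0.0, 8.0, 4.0] with the helper functions above

# A flat score distribution gives the model nothing to learn from
assert len(set(scores)) > 1, "all samples scored identically"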

Environment Classes

Class Structure

class CustomEnv:
    def setup(self, **kwargs):
        # Initialize your environment state here
        # kwargs contains any parameters from the JSONL file
        self.max_steps = kwargs.get("max_steps", 10)
        self.current_step = 0
        self.state = None

    def step(self, action: str):
        # Process the action and return the result
        self.current_step += 1

        # Your environment logic here
        # action is the model's output as a string

        done = self.current_step >= self.max_steps

        return {
            "observation": "Environment response to action",
            "reward": 0.0,  # Float reward value
            "done": done    # Boolean indicating episode end
        }

Methods

setup(**kwargs)

Called once at the beginning of each episode to initialize the environment state.

  • **kwargs: Parameters from your JSONL file that configure the environment (e.g., max_steps, difficulty settings, initial conditions)
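
Because these parameters arrive as keyword arguments, the environment can be exercised directly outside of training (a sketch, using the CustomEnv class above with hypothetical values):

# Hypothetical configuration from a JSONL line: {"max_steps": 5}
env = CustomEnv()
env.setup(max_steps=5)

print(env.max_steps, env.current_step)  # 5 0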

step(action: str)

Called for each interaction with the environment.

  • action: The model's output as a string
  • Returns a dictionary with:
    • observation: String describing the new state or feedback to the model
    • reward: Float value indicating the reward for this step
    • done: Boolean indicating whether the episode should end

Example: Math Problem Environment

class MathProblemEnv:
    def setup(self, **kwargs):
        self.problem = kwargs.get("problem", "2 + 2")
        self.expected_answer = kwargs.get("expected_answer", "4")
        self.max_attempts = kwargs.get("max_attempts", 3)
        self.current_attempt = 0
        self.solved = False

    def step(self, action: str):
        self.current_attempt += 1

        # Check if the action contains the correct answer
        if str(self.expected_answer) in action:
            self.solved = True
            return {
                "observation": "Correct! You solved the problem.",
                "reward": 10.0,
                "done": True
            }

        # Check if max attempts reached
        if self.current_attempt >= self.max_attempts:
            return {
                "observation": f"Maximum attempts reached. The answer was {self.expected_answer}.",
                "reward": 0.0,
                "done": True
            }

        # Provide hint for next attempt
        remaining = self.max_attempts - self.current_attempt
        return {
            "observation": f"Incorrect. You have {remaining} attempt(s) remaining. Try again.",
            "reward": -1.0,
            "done": False
        }
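
Before training, you can drive the environment locally with hand-written actions to check the episode flow (a sketch; during training the actions come from the model):

env = MathProblemEnv()
env.setup(problem="6 * 7", expected_answer="42", max_attempts=3)

for action in ["Maybe it is 40?", "It should be 6 * 7 = 42."]:
    result = env.step(action)
    print(result["observation"], result["reward"], result["done"])
    if result["done"]:
        break

# Incorrect. You have 2 attempt(s) remaining. Try again. -1.0 False
# Correct! You solved the problem. 10.0 True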

Best Practices

  • Initialize all necessary state variables in setup()
  • Keep track of step counts and episode termination conditions
  • Provide informative observations that help guide the model
  • Design reward structures that encourage desired behaviors
  • Ensure your environment produces a range of reward values across episodes - consistent all-zero or all-maximum rewards prevent the model from learning effectively
  • Handle edge cases and invalid actions gracefully
  • Test your environment thoroughly before training
