Python File Formats

This guide provides Python-specific implementation details for reward functions and environment classes used with RunRL.

Reward Functions

Function Signature

def reward_fn(completion, **kwargs):
    """
    Evaluate the model response and return a reward score.

    Args:
        completion: The model's response, formatted as a list of messages,
            e.g. [{"role": "assistant", "content": "This is an example response."}]
        kwargs: Additional arguments from any keys other than "prompt" in the prompts.jsonl file

    Returns:
        float: Reward score (higher is better)
    """
    # This requires every line of the prompts.jsonl file to include an "expected_result" field
    expected = kwargs['expected_result']
    response = completion[0]["content"]

    # Example: Check if the response contains expected answer
    if expected and str(expected) in response:
        return 1.0

    return 0.5

Parameters

  • completion: The model's response, formatted as a list of messages (e.g., [{"role": "assistant", "content": "..."}])
  • **kwargs: Additional arguments from your prompt JSONL file entries (any field other than "prompt"). For example, if your JSONL includes an expected_result field, you can access it with kwargs['expected_result']
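
For illustration, here is a conceptual sketch (not the platform's actual loading code) of how the extra fields on a prompts.jsonl line become keyword arguments; the "prompt" value and the expected_result field name are placeholders:

import json

# Hypothetical prompts.jsonl line ("prompt" shown schematically)
line = '{"prompt": "...", "expected_result": "4"}'

row = json.loads(line)
extra_fields = {k: v for k, v in row.items() if k != "prompt"}
print(extra_fields)  # {'expected_result': '4'} -- passed to the reward function as **kwargs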

Return Value

The function must return a single float value representing the reward score for the completion. Higher values indicate better performance.
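
A quick way to check this locally is to call the function directly with a hand-written completion (a sketch, assuming the reward_fn defined above):

completion = [{"role": "assistant", "content": "The answer is 4."}]

score = reward_fn(completion, expected_result="4")
print(type(score), score)  # <class 'float'> 1.0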

Example: Math Problem Evaluator

This example reward function evaluates math problems, rewarding correct answers wrapped in <answer> tags and penalizing each digit used in the response:

def extract_answer(text):
    # Extracts the answer from between <answer> tags
    if "<answer>" in text and "</answer>" in text:
        return text.split("<answer>")[1].split("</answer>")[0]
    else:
        return ""

def response_penalty(text):
    # Counts the digits used in the response
    return sum(1 for c in text if c.isdigit())

def reward_fn(completion, **kwargs):
    # Get expected result from kwargs
    expected_result = kwargs.get('expected_result')
    if expected_result is None:
        raise ValueError("expected_result is required for this reward function")

    # Get model response from completion
    model_response = completion[0]['content']

    # Extract answer from response
    extracted_answer = extract_answer(model_response)

    # Calculate correctness reward (10.0 for correct, 0.0 for incorrect)
    response_correctness_reward = 10.0 if extracted_answer == str(expected_result) else 0.0

    # Calculate penalty for using digits
    digit_penalty = response_penalty(model_response)

    # Combine reward and penalty
    reward = response_correctness_reward - digit_penalty

    return reward

Best Practices

  • Document your reward criteria clearly
  • Handle missing or invalid inputs gracefully
  • Use helper functions for complex logic
  • Consider both positive rewards for desired behaviors and penalties for undesired behaviors
  • Ensure your reward function produces varied scores - if the model receives only minimum or maximum rewards, it cannot learn to distinguish between better and worse responses
  • Test your reward function with sample completions before using it for training
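
As a concrete check for the last two points, a minimal local test (a sketch, assuming the math evaluator reward_fn defined above) can run a few hand-written completions through the function and confirm that the scores are not all identical:

sample_completions = [
    [{"role": "assistant", "content": "The answer is <answer>forty-two</answer>"}],
    [{"role": "assistant", "content": "I think it is <answer>42</answer>"}],
    [{"role": "assistant", "content": "First, 6 * 7 = 42, so <answer>42</answer>"}],
]

scores = [reward_fn(c, expected_result=42) for c in sample_completions]
print(scores)  # [0.0, 8.0, 4.0] with the helper functions above

# A flat score distribution gives the model nothing to learn from
assert len(set(scores)) > 1, "all samples scored identically"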

Environment Classes

Class Structure

class CustomEnv:
    def setup(self, **kwargs):
        # Initialize your environment state here
        # kwargs contains any parameters from the JSONL file
        self.max_steps = kwargs.get("max_steps", 10)
        self.current_step = 0
        self.state = None

    def step(self, action: str):
        # Process the action and return the result
        self.current_step += 1

        # Your environment logic here
        # action is the model's output as a string

        done = self.current_step >= self.max_steps

        return {
            "observation": "Environment response to action",
            "reward": 0.0,  # Float reward value
            "done": done    # Boolean indicating episode end
        }

Methods

setup(**kwargs)

Called once at the beginning of each episode to initialize the environment state.

  • **kwargs: Parameters from your JSONL file that configure the environment (e.g., max_steps, difficulty settings, initial conditions)
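
Because these parameters arrive as keyword arguments, the environment can be exercised directly outside of training (a sketch, using the CustomEnv class above with hypothetical values):

# Hypothetical configuration from a JSONL line: {"max_steps": 5}
env = CustomEnv()
env.setup(max_steps=5)

print(env.max_steps, env.current_step)  # 5 0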

step(action: str)

Called for each interaction with the environment.

  • action: The model's output as a string
  • Returns a dictionary with:
    • observation: String describing the new state or feedback to the model
    • reward: Float value indicating the reward for this step
    • done: Boolean indicating whether the episode should end

Example: Math Problem Environment

class MathProblemEnv:
    def setup(self, **kwargs):
        self.problem = kwargs.get("problem", "2 + 2")
        self.expected_answer = kwargs.get("expected_answer", "4")
        self.max_attempts = kwargs.get("max_attempts", 3)
        self.current_attempt = 0
        self.solved = False

    def step(self, action: str):
        self.current_attempt += 1

        # Check if the action contains the correct answer
        if str(self.expected_answer) in action:
            self.solved = True
            return {
                "observation": "Correct! You solved the problem.",
                "reward": 10.0,
                "done": True
            }

        # Check if max attempts reached
        if self.current_attempt >= self.max_attempts:
            return {
                "observation": f"Maximum attempts reached. The answer was {self.expected_answer}.",
                "reward": 0.0,
                "done": True
            }

        # Provide hint for next attempt
        remaining = self.max_attempts - self.current_attempt
        return {
            "observation": f"Incorrect. You have {remaining} attempt(s) remaining. Try again.",
            "reward": -1.0,
            "done": False
        }
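
Before training, you can drive the environment locally with hand-written actions to check the episode flow (a sketch; during training the actions come from the model):

env = MathProblemEnv()
env.setup(problem="6 * 7", expected_answer="42", max_attempts=3)

for action in ["Maybe it is 40?", "It should be 6 * 7 = 42."]:
    result = env.step(action)
    print(result["observation"], result["reward"], result["done"])
    if result["done"]:
        break

# Incorrect. You have 2 attempt(s) remaining. Try again. -1.0 False
# Correct! You solved the problem. 10.0 True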

Best Practices

  • Initialize all necessary state variables in setup()
  • Keep track of step counts and episode termination conditions
  • Provide informative observations that help guide the model
  • Design reward structures that encourage desired behaviors
  • Ensure your environment produces a range of reward values across episodes - consistent all-zero or all-maximum rewards prevent the model from learning effectively
  • Handle edge cases and invalid actions gracefully
  • Test your environment thoroughly before training
