Python File Formats
This guide provides Python-specific implementation details for reward functions and environment classes used with RunRL.
Reward Functions
Function Signature
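As a sketch, a reward file defines a single top-level Python function that takes the completion plus any extra JSONL fields and returns a float (the function name reward_function is illustrative; the exact name required may differ):

```python
def reward_function(completion: list[dict], **kwargs) -> float:
    """Score a single completion; higher values indicate better performance."""
    return 0.0
```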
Parameters
- completion: The model's response, formatted as a list of messages (e.g., [{"role": "assistant", "content": "..."}])
- **kwargs: Additional arguments from your prompt JSONL file entries (any field other than "prompt"). For example, if your JSONL includes an expected_result field, you can access it with kwargs['expected_result']
Return Value
The function must return a single float value representing the reward score for the completion. Higher values indicate better performance.
Example: Math Problem Evaluator
This example reward function evaluates math problems, rewarding correct answers and penalizing the use of numerals in the thinking process:
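A minimal sketch of such an evaluator, assuming the JSONL entry supplies an expected_result field and that the model wraps its reasoning in <think> tags (both conventions are assumptions for illustration):

```python
import re

def reward_function(completion, **kwargs):
    """Reward a correct final answer and penalize numerals in the thinking section."""
    # The completion is a list of messages; take the assistant's text.
    text = completion[-1]["content"] if completion else ""

    # Assumed convention: reasoning is wrapped in <think>...</think> tags.
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    thinking = think_match.group(1) if think_match else ""
    answer_part = text[think_match.end():] if think_match else text

    reward = 0.0

    # Reward a correct answer, using the expected_result field from the JSONL entry.
    expected = str(kwargs.get("expected_result", "")).strip()
    if expected and expected in answer_part:
        reward += 1.0

    # Penalize each numeral used in the thinking process.
    digit_count = len(re.findall(r"\d", thinking))
    reward -= 0.1 * digit_count

    return reward
```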
Best Practices
- Document your reward criteria clearly
- Handle missing or invalid inputs gracefully
- Use helper functions for complex logic
- Consider both positive rewards for desired behaviors and penalties for undesired behaviors
- Ensure your reward function produces varied scores - if the model receives only minimum or maximum rewards, it cannot learn to distinguish between better and worse responses
- Test your reward function with sample completions before using it for training, as sketched below
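For example, a quick manual check might look like this (reusing the reward_function sketch above; the sample values are made up):

```python
# Sanity-check the reward function on a hand-written completion before training.
sample_completion = [
    {"role": "assistant", "content": "<think>add the two numbers</think> The answer is 42."}
]

score = reward_function(sample_completion, expected_result="42")
print(score)  # Expect a positive reward for a correct answer.
```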
Environment Classes
Class Structure
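As a sketch of the expected shape, an environment class implements the setup and step methods described below (the class name Environment is illustrative):

```python
class Environment:
    """Skeleton of an environment class; only setup and step are taken from the docs."""

    def setup(self, **kwargs):
        """Initialize episode state from the JSONL entry's fields."""
        self.max_steps = kwargs.get("max_steps", 10)
        self.current_step = 0

    def step(self, action: str) -> dict:
        """Process one model action and return observation, reward, and done."""
        self.current_step += 1
        return {
            "observation": "Describe the new state or feedback here.",
            "reward": 0.0,
            "done": self.current_step >= self.max_steps,
        }
```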
Methods
setup(**kwargs)
Called once at the beginning of each episode to initialize the environment state.
- **kwargs: Parameters from your JSONL file that configure the environment (e.g., max_steps, difficulty settings, initial conditions)
step(action: str)
Called for each interaction with the environment.
- action: The model's output as a string
- Returns a dictionary with:
  - observation: String describing the new state or feedback to the model
  - reward: Float value indicating the reward for this step
  - done: Boolean indicating whether the episode should end
Example: Math Problem Environment
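A sketch of such an environment, assuming the JSONL entry provides problem, answer, and max_steps fields (the field names are assumptions):

```python
class MathProblemEnvironment:
    """Simple environment that asks a math question and checks the model's answers."""

    def setup(self, **kwargs):
        # Configure the episode from the JSONL entry.
        self.problem = kwargs.get("problem", "2 + 2")
        self.answer = str(kwargs.get("answer", "4"))
        self.max_steps = kwargs.get("max_steps", 3)
        self.current_step = 0

    def step(self, action: str) -> dict:
        self.current_step += 1

        # Correct answer: reward and end the episode.
        if self.answer in action:
            return {"observation": "Correct!", "reward": 1.0, "done": True}

        # Out of attempts: end the episode without reward.
        if self.current_step >= self.max_steps:
            return {
                "observation": f"Out of attempts. The answer was {self.answer}.",
                "reward": 0.0,
                "done": True,
            }

        # Otherwise, give feedback and continue.
        return {
            "observation": f"Not quite. Try again: {self.problem}",
            "reward": 0.0,
            "done": False,
        }
```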
Best Practices
- Initialize all necessary state variables in setup()
- Keep track of step counts and episode termination conditions
- Provide informative observations that help guide the model
- Design reward structures that encourage desired behaviors
- Ensure your environment produces a range of reward values across episodes - consistent all-zero or all-maximum rewards prevent the model from learning effectively
- Handle edge cases and invalid actions gracefully
- Test your environment thoroughly before training; a simple manual loop is sketched below
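A rough manual loop for exercising the environment before training (this reuses the MathProblemEnvironment sketch above; the sample values are made up):

```python
# Drive the environment by hand to check observations, rewards, and termination.
env = MathProblemEnvironment()
env.setup(problem="What is 3 + 5?", answer="8", max_steps=3)

for attempt in ["7", "8"]:
    result = env.step(attempt)
    print(result["observation"], result["reward"], result["done"])
    if result["done"]:
        break
```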