Reward design for RL trading
Reward design is the core lever in reinforcement learning trading. Learn how to avoid reward hacking and align rewards with risk and constraints.
Reward is the strategy
In RL trading, the reward function is how you encode what “good” behavior means.
Your agent will optimize what you specify—sometimes aggressively. A reward that only reflects raw PnL can create unintended behaviors: excessive turnover, risk concentration, or fragile strategies that fail outside the training regime.
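To make the turnover problem concrete, here is a minimal NumPy sketch (the function and position trajectories are illustrative, not from any particular system): a raw-PnL reward assigns identical cumulative reward to a buy-and-hold policy and a churning policy, because trading activity never enters the reward.

```python
import numpy as np

def raw_pnl_reward(position, price_change):
    # Reward = raw mark-to-market PnL only; trading costs are invisible,
    # so the agent is never charged for churning its position.
    return position * price_change

price_changes = np.array([0.5, 0.0, 0.5])
steady = np.array([1.0, 1.0, 1.0])    # holds one unit throughout
churny = np.array([1.0, -1.0, 1.0])   # flips sign every step

# Identical cumulative reward despite very different trading activity:
reward_steady = float(np.sum(raw_pnl_reward(steady, price_changes)))  # 1.0
reward_churny = float(np.sum(raw_pnl_reward(churny, price_changes)))  # 1.0

def turnover(pos):
    # Sum of absolute position changes, starting from flat.
    return float(np.sum(np.abs(np.diff(np.concatenate([[0.0], pos])))))

# turnover(steady) = 1.0, turnover(churny) = 5.0
```

Under this reward, the churning policy is indistinguishable from the steady one, even though it would pay roughly five times the transaction costs in live trading.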
Common failure mode: reward hacking
If there’s a loophole—an edge case in accounting, unrealistic execution assumptions, or a metric that can be gamed—an agent can find it. That’s why reward design and evaluation belong together.
Practical guidelines
- Include costs — turnover without costs is usually unrealistic.
- Penalize constraint violations — make risk limits explicit.
- Prefer simple shaping — add complexity only when it measurably improves out-of-sample stability.
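The three guidelines can be combined into a single one-step reward. The sketch below is one possible shaping, with all parameter names and default values chosen for illustration:

```python
def shaped_reward(new_pos, prev_pos, price_change,
                  cost_rate=0.001, pos_limit=10.0, penalty=0.1):
    """One-step reward: PnL minus turnover cost minus a constraint penalty.

    Parameter names and defaults are illustrative assumptions.
    """
    pnl = new_pos * price_change                  # mark-to-market PnL
    cost = cost_rate * abs(new_pos - prev_pos)    # charge for turnover
    breach = max(0.0, abs(new_pos) - pos_limit)   # distance past the risk limit
    return pnl - cost - penalty * breach
```

The linear breach penalty keeps the shaping simple, in line with the last guideline; a harder constraint (e.g. clipping positions or terminating the episode) is a reasonable alternative when limits must never be exceeded.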