Reinforcement learning for trading
A practical guide to reinforcement learning trading: how agents learn from rewards, why trading fits sequential decision-making, and pitfalls to avoid.
The core loop
RL is about decisions over time: act, observe outcome, update behaviour to maximize reward.
Applied to trading, an agent takes actions (trade, hold, size), observes the new market state, and receives a reward that you define. During training (in backtests), it updates its policy to improve expected long-run reward under your constraints.
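The loop above can be sketched in a few lines. This is a toy, illustrative example, not Kabu's implementation: a single-state value table, a two-action space (stay flat or hold one unit), and a reward equal to next-step P&L, with each action tried once before exploiting.

```python
# Toy sketch of the RL loop: act, observe reward, update behaviour.
# The price series, action set, and update rule are illustrative
# assumptions, not a real trading environment.

prices = [100, 101, 103, 104, 106, 105, 108]
ACTIONS = ["hold", "long"]           # stay flat, or hold one unit
alpha = 0.5                          # learning rate
Q = {a: 0.0 for a in ACTIONS}        # single-state value table for brevity

for t in range(len(prices) - 1):
    if t < len(ACTIONS):
        action = ACTIONS[t]              # explore: try each action once
    else:
        action = max(Q, key=Q.get)       # exploit: pick the best estimate
    # observe outcome: reward is the next-step P&L of the chosen position
    reward = (prices[t + 1] - prices[t]) if action == "long" else 0.0
    # update behaviour: move the value estimate toward the observed reward
    Q[action] += alpha * (reward - Q[action])

print(Q)  # "long" accumulates positive value on this mostly rising series
```

A real agent would condition on market state and use a proper RL algorithm; the point here is only the act/observe/update cycle.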
Why trading fits
Trading is sequential, its outcomes are delayed, and its actions are constrained, so RL can fit.
- Delayed outcomes — entries, exits, and risk controls can pay off many steps later.
- State + constraints — positions, inventory, limits, and fees matter to decisions.
- Reward is customizable — you can shape the objective toward what you care about (return, risk, drawdown, turnover).
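The last point is worth making concrete. One common way to shape the objective is to subtract penalties from raw P&L; the function below is a minimal sketch, and the penalty terms and weights are illustrative assumptions, not a recommended configuration.

```python
# Hedged sketch of reward shaping: per-step P&L minus penalties
# for drawdown and turnover. Weights are illustrative only.

def shaped_reward(pnl, drawdown, turnover,
                  dd_weight=0.5, turnover_weight=0.1):
    """Reward the agent actually optimizes: raw P&L,
    discounted for risk taken and churn generated."""
    return pnl - dd_weight * drawdown - turnover_weight * turnover

# A profitable but churny, deep-drawdown step can score worse
# than a modestly profitable clean step:
print(round(shaped_reward(pnl=2.0, drawdown=3.0, turnover=4.0), 4))  # 0.1
print(round(shaped_reward(pnl=1.0, drawdown=0.0, turnover=0.5), 4))  # 0.95
```

Whatever you put in this function is what the agent will pursue, which is exactly why the reward-hacking pitfall below matters.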
Pitfalls to avoid
- Reward hacking — agents optimize what you specify, not what you intended.
- Leakage — future information in features or labeling can make backtests meaningless.
- Non-stationarity — markets change; performance can decay quickly without monitoring.
- Evaluation shortcuts — a single backtest is not evidence. You need walk-forward and stress tests.
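Walk-forward evaluation, mentioned in the last point, can be sketched as rolling train/test splits where the test window always follows its training window. The window sizes below are illustrative assumptions.

```python
# Hedged sketch of walk-forward splits: train on a rolling window,
# then evaluate on the period immediately after it.

def walk_forward_splits(n, train_size, test_size):
    """Yield (train_indices, test_indices) pairs where the test
    period always comes strictly after its training period."""
    start = 0
    while start + train_size + test_size <= n:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += test_size  # roll forward by one test period

for train, test in walk_forward_splits(n=10, train_size=4, test_size=2):
    print(train, "->", test)
```

This ordering is what prevents the leakage pitfall above: the agent is never evaluated on data its training window could have seen.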
How Kabu approaches RL trading
Kabu is designed around training RL agents in backtests and deploying a fixed policy live. That separation is intentional: live trading should be predictable and auditable; learning happens during training runs.
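The train/deploy separation can be sketched as a policy that accepts updates during training and rejects them once frozen. The class and method names here are illustrative assumptions, not Kabu's actual API.

```python
# Hedged sketch of "learn in backtests, deploy a fixed policy":
# updates are allowed only until the policy is frozen for live use.

class Policy:
    def __init__(self):
        self.weights = {"hold": 0.0, "long": 0.0}
        self.frozen = False

    def act(self, state):
        # Deterministic action from current weights (state is ignored
        # in this toy single-state example).
        return max(self.weights, key=self.weights.get)

    def update(self, action, reward, lr=0.5):
        if self.frozen:
            raise RuntimeError("policy is frozen for live trading")
        self.weights[action] += lr * (reward - self.weights[action])

policy = Policy()
policy.update("long", 2.0)     # learning happens only in training runs
policy.frozen = True           # deploy a fixed, auditable policy

print(policy.act(state=None))  # -> long
try:
    policy.update("long", 1.0)
except RuntimeError as err:
    print(err)                 # live updates are rejected
```

A frozen policy makes live behaviour reproducible: the same state always yields the same action, which is what makes it auditable.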