Reinforcement learning for trading
A practical guide to reinforcement learning trading: how agents learn from rewards, why trading fits sequential decision-making, and pitfalls to avoid.
The core loop
RL is about decisions over time: act, observe outcome, update behaviour to maximize reward.
Applied to trading, an agent takes actions (trade, hold, size), observes the new market state, and receives a reward that you define. During training (in backtests), it updates its policy to improve expected long-run reward under your constraints.
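The loop above can be sketched in a few lines. This is a toy, illustrative example, not Kabu's implementation: a single-state value table, a two-action space (stay flat or hold one unit), and a reward equal to next-step P&L, with each action tried once before exploiting.

```python
# Toy sketch of the RL loop: act, observe reward, update behaviour.
# The price series, action set, and update rule are illustrative
# assumptions, not a real trading environment.

prices = [100, 101, 103, 104, 106, 105, 108]
ACTIONS = ["hold", "long"]           # stay flat, or hold one unit
alpha = 0.5                          # learning rate
Q = {a: 0.0 for a in ACTIONS}        # single-state value table for brevity

for t in range(len(prices) - 1):
    if t < len(ACTIONS):
        action = ACTIONS[t]              # explore: try each action once
    else:
        action = max(Q, key=Q.get)       # exploit: pick the best estimate
    # observe outcome: reward is the next-step P&L of the chosen position
    reward = (prices[t + 1] - prices[t]) if action == "long" else 0.0
    # update behaviour: move the value estimate toward the observed reward
    Q[action] += alpha * (reward - Q[action])

print(Q)  # "long" accumulates positive value on this mostly rising series
```

A real agent would condition on market state and use a proper RL algorithm; the point here is only the act/observe/update cycle.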
Why trading fits
Trading is sequential, its outcomes are delayed, and its actions are constrained, so RL can fit.
- Delayed outcomes — entries, exits, and risk controls can pay off many steps later.
- State + constraints — positions, inventory, limits, and fees matter to decisions.
- Reward is customizable — you can shape the objective toward what you care about (return, risk, drawdown, turnover).
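The last point is worth making concrete. One common way to shape the objective is to subtract penalties from raw P&L; the function below is a minimal sketch, and the penalty terms and weights are illustrative assumptions, not a recommended configuration.

```python
# Hedged sketch of reward shaping: per-step P&L minus penalties
# for drawdown and turnover. Weights are illustrative only.

def shaped_reward(pnl, drawdown, turnover,
                  dd_weight=0.5, turnover_weight=0.1):
    """Reward the agent actually optimizes: raw P&L,
    discounted for risk taken and churn generated."""
    return pnl - dd_weight * drawdown - turnover_weight * turnover

# A profitable but churny, deep-drawdown step can score worse
# than a modestly profitable clean step:
print(round(shaped_reward(pnl=2.0, drawdown=3.0, turnover=4.0), 4))  # 0.1
print(round(shaped_reward(pnl=1.0, drawdown=0.0, turnover=0.5), 4))  # 0.95
```

Whatever you put in this function is what the agent will pursue, which is exactly why the reward-hacking pitfall below matters.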
Pitfalls to avoid
- Reward hacking — agents optimize what you specify, not what you intended.
- Leakage — future information in features or labeling can make backtests meaningless.
- Non-stationarity — markets change; performance can decay quickly without monitoring.
- Evaluation shortcuts — a single backtest is not evidence. You need walk-forward and stress tests.
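Walk-forward evaluation, mentioned in the last point, can be sketched as rolling train/test splits where the test window always follows its training window. The window sizes below are illustrative assumptions.

```python
# Hedged sketch of walk-forward splits: train on a rolling window,
# then evaluate on the period immediately after it.

def walk_forward_splits(n, train_size, test_size):
    """Yield (train_indices, test_indices) pairs where the test
    period always comes strictly after its training period."""
    start = 0
    while start + train_size + test_size <= n:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += test_size  # roll forward by one test period

for train, test in walk_forward_splits(n=10, train_size=4, test_size=2):
    print(train, "->", test)
```

This ordering is what prevents the leakage pitfall above: the agent is never evaluated on data its training window could have seen.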
How Kabu approaches RL trading
Kabu is designed around training RL agents in backtests and deploying a fixed policy live. That separation is intentional: live trading should be predictable and auditable; learning happens during training runs.
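The train/deploy separation can be sketched as a policy that accepts updates during training and rejects them once frozen. The class and method names here are illustrative assumptions, not Kabu's actual API.

```python
# Hedged sketch of "learn in backtests, deploy a fixed policy":
# updates are allowed only until the policy is frozen for live use.

class Policy:
    def __init__(self):
        self.weights = {"hold": 0.0, "long": 0.0}
        self.frozen = False

    def act(self, state):
        # Deterministic action from current weights (state is ignored
        # in this toy single-state example).
        return max(self.weights, key=self.weights.get)

    def update(self, action, reward, lr=0.5):
        if self.frozen:
            raise RuntimeError("policy is frozen for live trading")
        self.weights[action] += lr * (reward - self.weights[action])

policy = Policy()
policy.update("long", 2.0)     # learning happens only in training runs
policy.frozen = True           # deploy a fixed, auditable policy

print(policy.act(state=None))  # -> long
try:
    policy.update("long", 1.0)
except RuntimeError as err:
    print(err)                 # live updates are rejected
```

A frozen policy makes live behaviour reproducible: the same state always yields the same action, which is what makes it auditable.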