Reinforcement learning for trading

A practical guide to reinforcement learning for trading: how agents learn from rewards, why trading fits sequential decision-making, and pitfalls to avoid.

The core loop

RL is about decisions over time: act, observe outcome, update behaviour to maximize reward.

When reinforcement learning is applied to trading, an agent takes actions (trade/hold/size), observes the new market state, and receives a reward that you define. During training (in backtests), it updates its policy to improve expected long-run reward under your constraints.
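The act/observe/update loop can be sketched with tabular Q-learning on a toy synthetic return series. Everything here is an illustrative assumption: the two-state market summary (sign of the last return), the flat/long action set, the fee, and the learning constants are chosen for readability, not performance.

```python
import random

random.seed(0)

# Synthetic per-step returns standing in for market data (assumption).
returns = [random.gauss(0.0, 0.01) for _ in range(5000)]

ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1  # learning rate, discount, exploration rate
FEE = 0.0005                        # cost charged whenever the position changes

# State: sign of the previous return. Actions: 0 = flat, 1 = long.
Q = {(s, a): 0.0 for s in (-1, 1) for a in (0, 1)}

def act(state):
    """Epsilon-greedy action selection over the Q-table."""
    if random.random() < EPS:
        return random.choice((0, 1))
    return max((0, 1), key=lambda a: Q[(state, a)])

pos = 0
for t in range(1, len(returns)):
    state = 1 if returns[t - 1] >= 0 else -1
    action = act(state)
    # Reward you define: position P&L minus a fee on position changes.
    reward = action * returns[t] - FEE * abs(action - pos)
    next_state = 1 if returns[t] >= 0 else -1
    best_next = max(Q[(next_state, a)] for a in (0, 1))
    # Q-learning update: move the estimate toward reward + discounted future value.
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    pos = action
```

A real agent would use a far richer state (positions, inventory, features) and a function approximator, but the loop shape is the same: act, observe, update.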

Why trading fits

Trading is sequential, delayed, and constrained, which matches the structure RL is built to handle.

  • Delayed outcomes — entries, exits, and risk controls can pay off many steps later.
  • State + constraints — positions, inventory, limits, and fees matter to decisions.
  • Reward is customizable — you can shape the objective toward what you care about (return, risk, drawdown, turnover).
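Because the reward is yours to define, the bullets above can be folded into a single shaped objective. A minimal sketch, assuming illustrative weights (the fee and drawdown penalty below are placeholders, not recommended values):

```python
def shaped_reward(step_return, position, prev_position,
                  equity, peak_equity,
                  fee=0.0005, dd_weight=0.1):
    """One-step reward combining return, turnover cost, and a drawdown penalty.

    All parameter values are illustrative assumptions.
    """
    pnl = position * step_return                     # what you earn
    turnover_cost = fee * abs(position - prev_position)  # what trading costs
    drawdown = max(0.0, (peak_equity - equity) / peak_equity)
    return pnl - turnover_cost - dd_weight * drawdown    # what you optimize
```

Raising `dd_weight` pushes the agent toward smoother equity curves at the cost of raw return; raising `fee` discourages churn. The shape of this function is a design decision, and the next section shows why it deserves care.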

Pitfalls to avoid

  • Reward hacking — agents optimize what you specify, not what you intended.
  • Leakage — future information in features or labeling can make backtests meaningless.
  • Non-stationarity — markets change; performance can decay quickly without monitoring.
  • Evaluation shortcuts — a single backtest is not evidence. You need walk-forward and stress tests.
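The last two pitfalls point at the same remedy: evaluate on rolling, strictly forward-in-time windows. A small sketch of walk-forward splitting (names and window sizes are illustrative assumptions):

```python
def walk_forward_splits(n, train_size, test_size, step=None):
    """Yield (train, test) index ranges that roll forward through time.

    Each test window starts where its training window ends, so the agent
    is never evaluated on data it could have seen during training.
    """
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step

# Example: 1000 bars, train on 500, test on the next 100, roll by 100.
splits = list(walk_forward_splits(1000, 500, 100))
```

Each split trains a fresh policy and tests it on unseen forward data; the distribution of per-window results, not any single backtest, is the evidence. Non-stationarity shows up here too: if later windows degrade, the strategy is decaying.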

How Kabu approaches RL trading

Kabu is designed around training RL agents in backtests and deploying a fixed policy live. That separation is intentional: live trading should be predictable and auditable; learning happens during training runs.
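The train-then-freeze pattern described above can be sketched as extracting a fixed greedy policy from a trained value table. This is a generic illustration of the pattern, not Kabu's actual API, and the Q-values below are made-up numbers:

```python
def freeze_policy(Q, actions=(0, 1)):
    """Collapse a trained Q-table into a fixed state-to-action lookup.

    The live side only reads this dict: no exploration, no updates,
    so behaviour is deterministic and auditable.
    """
    return {s: max(actions, key=lambda a: Q[(s, a)])
            for s in {s for (s, _) in Q}}

# A toy Q-table produced by a training run (values are illustrative).
Q = {(-1, 0): 0.02, (-1, 1): -0.01, (1, 0): 0.00, (1, 1): 0.03}
policy = freeze_policy(Q)  # {-1: 0, 1: 1}: flat after down moves, long after up moves
```

Keeping learning out of the live loop means a deployed policy can be versioned, replayed, and audited exactly; retraining happens offline and ships a new frozen artifact.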

Next: evaluating RL trading strategies

© Kabu. All rights reserved.
