Terms you’ll see

What we mean when we say state, action, reward, policy—and how that maps to the product.

State, action, reward

The loop the agent runs in: see the current situation, choose an action, get feedback.

  • State — What the agent “sees” at each step: e.g. market data, your positions, time. You configure what goes into the state when you set up an experiment.
  • Action — What the agent can do. On Kabu that might be buy/sell/hold, or position sizes, or allocation across symbols—depending on the experiment type you choose.
  • Reward — The score you define (e.g. PnL, risk-adjusted return). During training the agent is optimised to maximise this over time.
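The three pieces above fit together in a simple loop: observe the state, pick an action, receive a reward. Here is a minimal sketch of that loop; the environment, its state layout, and the PnL-style reward are illustrative toy choices, not Kabu's actual API.

```python
import random

class TradingEnv:
    """Toy environment: the state is just (price, position).
    On Kabu you configure what goes into the state yourself."""

    def reset(self):
        self.price, self.position = 100.0, 0
        return (self.price, self.position)

    def step(self, action):
        # Action: -1 = sell, 0 = hold, +1 = buy.
        prev_price = self.price
        self.price += random.uniform(-1, 1)  # random-walk stand-in for market data
        self.position += action
        # Reward: mark-to-market PnL for this step (one possible reward choice).
        reward = self.position * (self.price - prev_price)
        return (self.price, self.position), reward

env = TradingEnv()
state = env.reset()
total_reward = 0.0
for _ in range(10):
    action = random.choice([-1, 0, 1])  # a trained agent would use its policy here
    state, reward = env.step(action)
    total_reward += reward
```

During training, the agent's job is to replace the random `choice` with a policy that makes `total_reward` grow.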

Training

Policy

The “policy” is the strategy the agent follows—what action to take in each state.

When you create an experiment on Kabu, you pick one of the algorithms we support (PPO, SAC, or DQN). Each is a different way to learn a policy from experience, and "training an agent" means optimising that policy so its actions lead to better reward over time. You don't need to know the internals: choose the algorithm that fits your strategy (the product describes when each is a good fit), set your environment and reward, and the platform trains the policy and stores checkpoints.
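In code terms, a policy is just a mapping from state to action. A trained policy from PPO, SAC, or DQN encodes that mapping in a neural network; the hand-written threshold rule below is only a toy to make the idea concrete, not something Kabu produces.

```python
def policy(state):
    """A policy maps each state to an action (-1 sell, 0 hold, +1 buy)."""
    price, position = state
    if price < 99 and position < 1:
        return +1  # price looks cheap: buy
    if price > 101 and position > -1:
        return -1  # price looks expensive: sell
    return 0       # otherwise hold
```

Training replaces rules like these with learned parameters, but the interface stays the same: state in, action out.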



Training vs running live

During training the agent tries different things; when you deploy, it follows what it learned.

To get better, the agent has to explore—try actions that might not look best yet. While it’s training, that exploration is built in. When you promote a model to live, you’re typically using the trained policy in a more deterministic way: it follows the behaviour it learned, with little or no random exploration. So: training = learn by trying; live = act on what was learned.
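The contrast between exploring during training and acting deterministically when live can be sketched with epsilon-greedy action selection, one common exploration scheme (DQN uses a variant of it). The Q-values here are hypothetical numbers, not Kabu's internals.

```python
import random

# Hypothetical learned value estimates for three actions in some state.
q_values = {"buy": 0.8, "hold": 0.5, "sell": 0.1}

def act(q, epsilon):
    """With probability epsilon, try a random action (exploration);
    otherwise pick the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(list(q))
    return max(q, key=q.get)

training_action = act(q_values, epsilon=0.2)  # sometimes explores a non-best action
live_action = act(q_values, epsilon=0.0)      # deterministic: always the learned best
```

Setting `epsilon` to zero is the "little or no random exploration" mode described above: the deployed policy simply acts on what it learned.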


© Kabu. All rights reserved.

Built with 🩵 by Zima Blue