Where it comes from
The idea behind RL: learning from experience and feedback, with examples and a comparison to other kinds of AI.
Learning like a child
Humans don’t get a manual of correct actions—they learn by trying, observing, and getting feedback from the world.
A baby doesn’t learn to walk by being told the exact sequence of muscle movements. They try, fall, get up, and over time the outcomes that work (staying upright, moving forward) get reinforced while the ones that don’t (falling, bumping into things) steer them away. The same goes for learning to talk, avoid a hot stove, or ride a bike: the environment and the consequences of actions provide the signal. There’s no teacher handing out a list of right answers—just experience and feedback. Reinforcement learning is the same idea applied to software: an agent tries things, gets a reward (or penalty) from the environment, and over time learns which actions work best.
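That loop (try, observe a reward, adjust) can be sketched in a few lines of Python. The two-action "environment" and its reward values below are invented for illustration; real environments (a game, a market) are far richer, but the learning pattern is the same:

```python
import random

random.seed(0)  # make the toy run repeatable

def environment(action):
    """A made-up environment: action 1 pays more on average, plus noise."""
    return random.gauss(1.0 if action == 1 else 0.2, 0.5)

estimates = [0.0, 0.0]  # the agent's running guess of each action's value
counts = [0, 0]

for step in range(2000):
    # Mostly exploit the action that looks best so far; sometimes explore.
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = max(range(2), key=lambda a: estimates[a])
    reward = environment(action)  # feedback from the environment
    counts[action] += 1
    # Nudge the estimate toward the observed reward (incremental average).
    estimates[action] += (reward - estimates[action]) / counts[action]

print("preferred action:", max(range(2), key=lambda a: estimates[a]))
```

No list of correct actions appears anywhere in this sketch; the agent ends up preferring action 1 purely because the environment rewarded it more.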
Behaviour and reward
The core idea—that behaviour can be shaped by what gets rewarded—has a long history in psychology.
In psychology, “reinforcement” has long described the finding that behaviour that is rewarded tends to be repeated, while behaviour that isn’t (or is punished) tends to fade. The same idea shows up in how we build agents: they don’t get a list of right answers; they get a reward signal and many chances to try, and over time they favour the actions that work better. That’s the thread running from the early ideas to the tools we use today.
Where RL has proved very effective
Famous examples where agents trained with RL reached superhuman or human-level performance.
Go. AlphaGo and its successors learned to play Go at a level that beat world champions. The original AlphaGo bootstrapped from records of human games, but later versions (AlphaGo Zero) had only the rules and the outcome (win or lose) as their “teacher”: the agent played huge numbers of games against itself and learned from the results. No one programmed in opening theory or endgame patterns; the agent discovered effective play through trial and error and the reward of winning.
Dota 2 and other video games. RL agents such as OpenAI Five have been trained to play complex team games like Dota 2 at a high level, again with a simple reward (e.g. winning the match) and massive amounts of practice. The same idea has been applied to other games (e.g. Atari, StarCraft): define the goal, let the agent explore, and it learns from experience.
These examples show that when the goal is clear and the environment can be simulated or repeated many times, RL can reach impressive performance. Trading is different in the details, but the same principle applies: you define the reward (e.g. returns or risk-adjusted performance), the agent trains in a backtest, and it learns which actions lead to better outcomes.
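To make “you define the reward” concrete, here is one hypothetical way to score a backtest: a simple Sharpe-style ratio of average return to volatility. The function name and the sample return series are invented for this sketch; this is one illustrative choice, not Kabu’s actual reward formula:

```python
import statistics

def risk_adjusted_reward(returns, eps=1e-9):
    """Score a list of per-period backtest returns: mean return divided by
    volatility, so steadier performance is rewarded over wild swings.
    (Illustrative only; not Kabu's actual formula.)"""
    mean = statistics.fmean(returns)
    vol = statistics.pstdev(returns)
    return mean / (vol + eps)

# Two made-up return streams with the same average return: the steadier
# one earns the higher reward because its volatility is lower.
steady = [0.01, 0.012, 0.009, 0.011]
choppy = [0.05, -0.03, 0.06, -0.038]
print(risk_adjusted_reward(steady) > risk_adjusted_reward(choppy))  # True
```

During training, an agent that receives this kind of score as its reward is pushed toward actions that produce consistent returns, not just occasional big wins.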
How RL compares to other AI
Different tools for different problems—knowing the difference helps you choose the right one.
Supervised learning needs labelled data: you show the system many examples of “correct” answers (e.g. “this image is a cat”, “this email is spam”). It’s great when you have or can create those labels. Unsupervised learning looks for structure in data without labels (e.g. clustering, patterns). Reinforcement learning is different: you define a reward and possibly the rules of the environment, but you don’t provide a list of correct actions. The agent learns from trial and error which actions maximise reward over time.
RL is a good fit when the problem is sequential (each action affects what happens next), the outcome can be scored, but you don’t have, or don’t want to hand-label, “right” actions. Trading fits that pattern: you can define a reward (e.g. returns), and the agent learns from experience in a backtest. That’s why we use RL for Kabu rather than a purely supervised or rule-based approach.
Why we mention it
So you know the ideas behind Kabu aren’t made up—they’re grounded in research and real-world results.
Kabu doesn’t teach you the theory. We give you a simple way to configure an agent, a reward, and an environment, then train and deploy. This page is just to say: the terms you see (“agent”, “reward”, “policy”) and the approach we use come from that tradition. You can use the product without any of this; it’s here if you’re curious.