You might have read about reinforcement learning while browsing stories about AlphaGo – the algorithm that taught itself to play the game of Go and beat an expert human player – and found the technology fascinating.
However, since the subject is inherently complex and doesn’t seem that promising from a business point of view, you might not have thought it worth exploring deeply.
Well, it turns out RL’s lack of practical benefits is a misconception; there are actually quite a few ways companies can use it right now.
In this post, we’ll list possible reinforcement learning applications and explain without technical jargon how RL works in general.
Supervised Learning, Unsupervised Learning, and Reinforcement Learning
So, in conventional supervised learning, as per our recent post, we have input/output (x/y) pairs (i.e., labeled data) that we use to train machines. Knowing the result for every input, we let the algorithm determine a function that maps Xs to Ys, and we keep correcting the model every time it makes a prediction or classification mistake (by running backpropagation and tweaking the function). We continue this kind of training until the results the algorithm produces are satisfactory.
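To make the supervised loop concrete, here’s a minimal sketch in Python: fitting a one-parameter function y = w·x to labeled pairs by repeatedly correcting the model against its prediction error. The data and learning rate are made up for illustration.

```python
# A minimal supervised-learning loop: fit y = w * x on labeled (x, y)
# pairs by nudging w against the prediction error (the "tweaking" above).
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data; true relation y = 2x
w = 0.0
for _ in range(100):
    for x, y in pairs:
        error = w * x - y        # how wrong the current function is
        w -= 0.05 * error * x    # gradient step on the squared error
print(round(w, 3))  # 2.0 – the mapping has been learned from labels
```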
In conventional unsupervised learning, we have data without labels and we introduce the dataset to our algorithm hoping that it’ll unveil some hidden structure within it.
Reinforcement learning solves a different kind of problem. In RL, there’s an agent that interacts with a certain environment, thus changing its state, and receives rewards (or penalties) for its input. Its goal is to find patterns of actions, by trying them all and comparing the results, that yield the most reward points.
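The agent–environment loop can be sketched in a few lines of Python. The environment below is a made-up toy (not a real RL library API): the agent walks on a number line, its actions change the state, and a reward arrives only when it reaches the goal.

```python
import random

# Toy environment: the agent starts at position 0 on a number line and
# earns a reward for reaching position +3 within 10 steps.
class LineWorld:
    def __init__(self):
        self.state = 0

    def step(self, action):  # action is -1 (left) or +1 (right)
        self.state += action          # the action changes the state...
        done = self.state == 3
        reward = 1 if done else 0     # ...and may yield a reward
        return self.state, reward, done

env = LineWorld()
total_reward = 0
for _ in range(10):
    action = random.choice([-1, 1])  # an untrained agent acts at random
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
```

A random agent rarely collects the reward; learning means finding the pattern of actions (here, always stepping right) that reliably does.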
One of the key features of RL is that the agent’s actions might not affect the immediate state of the environment but impact the subsequent ones. So, sometimes, the machine doesn’t learn whether a certain action is effective until much later in the episode.
Besides that, there’s the so-called exploitation/exploration trade-off dilemma.
Aiming to maximize the numerical reward, the agent has to lean toward actions that it knows lead to positive results and avoid the ones that don’t. This is called exploitation of the agent’s knowledge.
However, to find out which actions are correct in the first place, it must try them out and run the risk of getting a penalty. This is known as exploration.
Balancing exploitation and exploration is one of the key challenges in Reinforcement Learning and an issue that doesn’t arise at all in pure forms of supervised and unsupervised learning.
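One common (though not the only) way to balance the two is the epsilon-greedy rule: explore at random a small fraction of the time, exploit the best-known action otherwise. A sketch, with made-up value estimates:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, try a random action (exploration);
    otherwise pick the action with the best estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the agent purely exploits its current estimates:
print(epsilon_greedy([0.2, 0.9, 0.1], epsilon=0.0))  # 1 (the highest-valued action)
```

Tuning epsilon trades off how often the agent risks a penalty to learn something new.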
Apart from the agent and the environment, there are also these four elements in every RL system:
Policy. The way the agent acts given a certain state of the environment; a policy can be defined by a simple function or involve extensive computations. Think of it as the machine’s set of stimulus-response rules or associations.
Reward signals define whether a policy should be changed or not. The agent’s sole purpose, as we’ve mentioned, is to maximize the numerical reward so, based on this signal, it can draw conclusions as to which actions are good or bad.
Value functions play a crucial role in shaping the agent’s behavior too but, unlike reward signals, which assess actions in the immediate sense, they estimate whether a state is good in the long run, taking the following states into account.
Finally, models mimic the environment the agent is in and thus allow it to make inferences about the environment’s future behavior. RL methods that use models for planning are called model-based; those that rely entirely on trial and error are known as model-free.
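To show how reward signals and value estimates interact, here is one classic model-free rule, the tabular Q-learning update: the value of a state–action pair is nudged toward the immediate reward plus the discounted value of the best next action. The tiny two-state table is made up for illustration.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move the value estimate for
    (state, action) toward reward + discounted best next-state value."""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])

# A hypothetical two-state world with two actions per state:
q = {0: [0.0, 0.0], 1: [0.0, 0.0]}
q_update(q, state=0, action=1, reward=1.0, next_state=1)
print(q[0][1])  # 0.1 = 0.1 * (1.0 + 0.9 * 0.0 - 0.0)
```

The reward signal grades the single step; the table of values, built up over many such updates, is what captures the long-run picture.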
OK, how does RL actually work?
Let’s take the game of Pong as an example (vintage Atari games are often used to explain the inner workings of reinforcement learning) and imagine we’re trying to teach an agent how to play it.
In the supervised learning setting, the first thing we’d do is record gaming sessions of a human player and create a labeled dataset in which we’d log each frame shown on the screen (input) as well as each action of the gamer (output).
We’d then feed these input frames to our algorithm and have it predict the correct action (pressing up or pressing down) for each situation, correctness being defined by our recorded outputs. We’d use backpropagation to tweak the function until the machine gets the predictions right.
Despite the high level of accuracy we could achieve with it, there are some major disadvantages to this approach. Firstly, we must have a labeled dataset to do any sort of supervised learning, and obtaining the data (and assigning labels) might turn out to be quite a costly and time-consuming process. Also, by applying this kind of training, we’re giving the machine no chance of ever beating the human player; we’re essentially just teaching it to imitate them.
In RL, however, there are no such limits.
We start off the same way, i.e., by running the input frames through our algorithm and letting it come up with random actions. We don’t have target labels for each situation here, so we don’t tell the agent when it’s supposed to press up and when to press down. We let it explore the environment on its own.
The only feedback we provide comes from the scoreboard. Each time the model manages to score a point, it gets a +1 reward, and each time it loses a point, it gets a -1 penalty. Based on this, it will iteratively update its policy so that actions that bring rewards become more probable and those that lead to penalties are filtered out.
We need a bit of patience here: at first, the untrained agent will lose the game constantly. As it continues to explore, however, it will at some point stumble upon a winning sequence of actions by sheer luck and update its policy accordingly.
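The update rule just described can be sketched with a toy stand-in for Pong. Everything here is hypothetical – the policy is a single probability of pressing “up”, and the “scoreboard” simply rewards episodes where “up” was pressed most of the time – but the REINFORCE-style logic (make rewarded action sequences more probable, penalized ones less so) is the real idea.

```python
import random

random.seed(0)  # reproducibility for this toy run

p_up = 0.5           # the whole "policy": probability of pressing up
learning_rate = 0.05

def play_episode(p):
    actions = ["up" if random.random() < p else "down" for _ in range(5)]
    reward = 1 if actions.count("up") > 2 else -1  # stand-in scoreboard
    return actions, reward

for _ in range(200):
    actions, reward = play_episode(p_up)
    # Nudge the policy: rewarded episodes push their actions' probability
    # up, penalized episodes push it down.
    for a in actions:
        direction = 1.0 if a == "up" else -1.0
        p_up += learning_rate * reward * direction
    p_up = min(max(p_up, 0.01), 0.99)  # keep it a valid probability

print(p_up)  # 0.99 – the agent has learned the behavior the reward favors
```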
Problems of Reinforcement Learning
Not everything is peachy in the land of RL. Even the scenario you’ve just read – the agent becoming good at an Atari game – can be quite problematic.
Suppose the algorithm has been playing Pong against a human for some time and has been bouncing the ball back and forth quite skillfully. But then it slips towards the end of the episode and loses a point. The reward for the whole sequence will be negative (-1), so the model will assume that every action taken was wrong, which isn’t so.
This is called the Credit Assignment Problem, and it stems from the fact that our agent doesn’t get feedback immediately after every action. In Pong, it only sees the result of an episode after it’s over, on the scoreboard. So, it has to somehow establish which actions caused the eventual result.
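One standard way to spread a delayed reward back over the actions that preceded it is to compute discounted returns: each step is credited with the discounted sum of all future rewards. A minimal sketch:

```python
def discounted_returns(rewards, gamma=0.99):
    """Credit each step with the discounted sum of the rewards that
    follow it, so earlier actions share in a delayed payoff."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Only the final action is rewarded, yet earlier actions get partial credit
# (a large gamma like 0.5 here just makes the decay easy to see):
print(discounted_returns([0, 0, 0, 1], gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]
```

This doesn’t solve credit assignment by itself – actions close to the reward may still get more credit than they deserve – but it gives the agent a workable signal to learn from.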
Due to this sparse reward setting, reinforcement learning algorithms are typically very sample inefficient: they require a lot of training data before they become effective.
Also, in some cases, when the sequence of actions needed to get a reward is too long and complicated, the sparse reward system will fail completely: the agent, unable to earn a reward by taking random steps, will never learn the correct behavior.
To fight this, RL experts design reward functions manually so that they can guide the agent’s policy toward the reward. Typically, these functions hand out a series of mini-rewards along the route to the big payoff, providing the agent with the hints it needs. The process of crafting such a function is known as reward shaping.
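A shaped reward function might look like the following sketch: a hypothetical navigation task where, on top of the big payoff for reaching the goal, every step that reduces the distance to it earns a small bonus. The function and its constants are made up for illustration.

```python
def shaped_reward(prev_distance, distance, reached_goal):
    """Hypothetical shaped reward: the big payoff for reaching the goal,
    plus a mini reward for any step that moves the agent closer to it."""
    reward = 10.0 if reached_goal else 0.0
    reward += 0.1 * (prev_distance - distance)  # progress bonus
    return reward

# Moving from distance 5 to distance 4 already earns a small hint:
print(shaped_reward(5.0, 4.0, reached_goal=False))  # 0.1
```

The design is delicate: a badly shaped reward can teach the agent to farm the mini-rewards instead of pursuing the actual goal.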
Reinforcement Learning Use Cases
Robotics. RL can be used for high-dimensional control problems as well as various industrial applications. Google, for example, reportedly cut the energy used to cool its data centers by about 40% after implementing DeepMind’s technology. There are innovative startups in the space (Bonsai, etc.) promoting RL for efficient machine and equipment tuning.
Text mining. The researchers from Salesforce, a renowned cloud computing company, used RL along with an advanced contextual text generation model to develop a system that’s able to produce highly readable summaries of long texts. According to them, one can train their algorithm on different types of material (news articles, blogs, etc.).
Trade execution. Major companies in the financial industry have been using ML algorithms to enhance trading for a while, and some of them, such as JPMorgan, have already thrown their hats into the RL ring too. The company announced in 2017 that it would start using a robot for the execution of large trading orders. Its model, trained on billions of historical transactions, would allow it to execute orders promptly and at optimal prices, offloading huge stakes without creating market swings.
Healthcare. Recent papers suggest multiple applications for RL in the healthcare industry. Among them are medication dosing, optimization of treatment policies for those suffering from chronic diseases, clinical trials, etc.
RL holds promise for companies, that’s a given, yet it’s important not to buy into the hype surrounding the technology and to realistically assess its strengths, weaknesses, and the benefits it can bring to your business. We suggest starting with a few simple use cases to test out how RL works.
If you’d like to learn more about what reinforcement learning is and how it can help your company, contact our expert to get a free consultation.