RLette — Casino Roulette through Reinforcement Learning
By Francesco Colonnese @fcolo_, Andrei Rekesh @AndreiRekesh
Reinforcement learning is one of the most intriguing fields of machine learning at the moment. It deals primarily with AI “agents” that learn to choose from a finite list of actions in some environment in order to maximize some sort of reward.
For example, the agent could be a character in a video game level choosing from a set of moves, and the reward could be defined by whether the character has died (negative reward) or progressed within the level (positive reward).
As you can imagine, these situations can become almost arbitrarily complex, and designing a framework for assigning actions and positive and negative rewards would take a lot of computing power and thought. Instead, why not introduce the concept in a simple setting?
This is where roulette comes in.
We chose Casino Roulette because of its simplicity and ease of implementation. At its simplest, the game consists of 36 equally likely outcomes numbered 1 through 36. It is possible to bet on individual numbers or on sets of them: even, odd, first third, second third, third third, first half, or second half.
A simple environment containing the possible moves already exists, thanks to OpenAI. However, we decided to put our own spin on things by removing the option to bet on individual numbers and allowing bets on multiple subsets at the same time.
These decisions were represented as a list of 7 elements:
[even, odd, first third, second third, third third, first half, second half]
Each of these options prompts a binary decision — either to bet on that outcome, or to not bet on that outcome. If we think of this as a list of length 7 representing the decision we make for each of the 7 potential outcomes that we want to bet on, we can assign 1 to represent betting on that particular outcome and 0 to represent skipping that outcome. This would result in 2⁷ = 128 possible betting combinations (set of actions). However, as it is not possible to bet on nothing, the situation represented by [0,0,0,0,0,0,0] would instead be “leaving the table”, refusing to play any longer. This total set of possible actions is known as the action space.
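To make the encoding concrete, here is a minimal sketch of how the 7-element list could be mapped to and from a single integer between 0 and 127. The function names and the bit order (first list element as the most significant bit) are our own assumptions for illustration, not necessarily what the original project used.

```python
# Order of the seven bets, copied from the list in the text.
BETS = ["even", "odd", "first third", "second third",
        "third third", "first half", "second half"]

def decode_action(index):
    """Turn an integer in [0, 127] into a list of seven 0/1 decisions."""
    if not 0 <= index < 2 ** len(BETS):
        raise ValueError("action index out of range")
    # Most significant bit first, so index 64 means "bet on even only".
    return [(index >> (len(BETS) - 1 - i)) & 1 for i in range(len(BETS))]

def encode_action(bits):
    """Inverse of decode_action: seven 0/1 decisions to an integer."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index

# The all-zero list is the "leaving the table" action.
LEAVE_TABLE = encode_action([0, 0, 0, 0, 0, 0, 0])
```

With this layout, the action space is just the integers 0 through 127, which is convenient for indexing into a table of values.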
Assigning rewards becomes simple: generate a random number between 1 and 36, and check which of your bets came true! For each bet made, add one to total reward if it was spot on, and subtract one if it was wrong.
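The reward rule above could be sketched as follows. This assumes a uniform spin over 1 to 36 and the bet order from the 7-element list; the helper names (winning_bets, reward, spin) are hypothetical and only meant to illustrate the rule.

```python
import random

def winning_bets(number):
    """Return seven 0/1 flags saying which bets win for a spin result in 1..36."""
    return [
        int(number % 2 == 0),        # even
        int(number % 2 == 1),        # odd
        int(1 <= number <= 12),      # first third
        int(13 <= number <= 24),     # second third
        int(25 <= number <= 36),     # third third
        int(1 <= number <= 18),      # first half
        int(19 <= number <= 36),     # second half
    ]

def reward(action_bits, number):
    """+1 for every placed bet that wins, -1 for every placed bet that loses."""
    wins = winning_bets(number)
    return sum((1 if won else -1) for bet, won in zip(action_bits, wins) if bet)

def spin(rng=random):
    """Draw the spin result uniformly from 1 to 36 inclusive."""
    return rng.randint(1, 36)
```

For instance, betting on even and the first third pays +2 if the ball lands on 8 and -2 if it lands on 13, while the all-zero action always scores 0.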
The solution: Q Learning
But how, realistically, would an AI-driven agent make the best possible decision in every situation within the constraints of their environment?
The answer would be to simply choose the action that would yield the highest reward in the context of their state. In more complex environments, the state context is extremely important; consider, for instance, a character in a game trying to decide whether or not to jump. If they were in front of a short wall, jumping would allow them to clear it and progress in the game. However, if they were in front of a cliff, they would die. The set of total states can also be called the observation space, as the context is often determined by what the agent can “observe” as its setting.
Based on these two spaces, we can now create a two-way table of states and actions: imagine every state being its own row, and each row being a vector with one entry for every action that the agent can take in that state.
This is called a Q-Table. We would “fill in” this Q-Table with the expected reward for each action in a specific state.
Diving into the math behind Q-Learning
But how is expected reward calculated? The agent shouldn’t know all of this immediately — to truly learn, it should explore the action space, experience each reward (or lack thereof), and update its own knowledge of that action-state pair in the table.
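A popular way to calculate reward is given by
$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)$$
where Q(s, a) is the current value for taking action a in state s, r is the reward received, s′ is the next state, α is the learning rate (the rate at which the value will be “tweaked” with each new experience), and γ is how heavily future rewards are weighted in considering overall reward (remember, something that may look rewarding in the short term may penalize an agent in the long run). Both α and γ are between 0 and 1. This update rule is based on the Bellman Equation, or at least a simplified version of it.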
In English, this equation states that the reward value for a state-action pair updates itself each time it is explored, by combining the old value for that pair with the reward experienced by taking the action and the best possible reward available in the next state. The equation can be split into three terms: the old reward, the reward calculated from taking the action, and the maximum reward of the next state the agent would find itself in after taking the action.
All three of these values are weighted by α , γ, or both according to the equation.
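As a rough sketch, the three-term update can be written in a few lines. We assume the standard tabular Q-learning form, where the new value is (1 − α) times the old value plus α times (reward + γ times the best value in the next state); the function name q_update and the list-of-lists table layout are illustrative choices, not taken from the original code.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    q_table is a list of rows (one per state); each row is a list holding
    one estimated value per action.
    """
    old_value = q_table[state][action]            # term 1: the old estimate
    best_next = max(q_table[next_state])          # term 3: best value in next state
    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * best_next)
    q_table[state][action] = new_value
    return new_value

# Tiny illustration with 2 states and 2 actions.
q = [[0.0, 0.0], [0.0, 2.0]]
q_update(q, state=0, action=1, reward=1.0, next_state=1, alpha=0.5, gamma=0.9)
# q[0][1] is now roughly 0.5 * 0 + 0.5 * (1 + 0.9 * 2) = 1.4
```

Here the old value is weighted by 1 − α, the immediate reward by α, and the best next-state value by both α and γ.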
The next problem, though, is allowing the agent to explore all possible options instead of finding a single favorable one, and never exploring anything else again. To remedy this, we can introduce a random chance for the agent to explore a random action instead of choosing the best one every time.
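One common way to implement this random chance is usually called an epsilon-greedy policy. The sketch below is a minimal version under our own assumptions: the name epsilon_greedy, the default exploration rate of 0.1, and the random tie-breaking are illustrative and not details from the original project.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))   # explore
    best = max(action_values)
    # Break ties at random so equal estimates do not always favor index 0.
    return rng.choice([i for i, v in enumerate(action_values) if v == best])
```

Setting epsilon to 0 makes the agent purely greedy, while setting it to 1 makes every choice random; values in between trade off exploring new actions against exploiting what the agent already knows.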
In the case of Roulette, the observation space is always the same (namely, there is no situation that makes a bet better or worse), so our Q-Table looks more like a Q-Vector of sorts, consisting of one state with 128 possible actions. Training our agent simply involved running episodes in which the agent kept making bets until it “walked away from the table” or reached a 500-action limit. This was repeated thousands of times. A very helpful walkthrough of a different application of Q-learning with code exists here.
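Putting the pieces together, a single-state training loop might look roughly like the sketch below. Everything here is an assumption made for illustration: the hyperparameters, the bit layout of the action integer (action 0 means leaving the table), the +1/−1 reward rule on a uniform 1 to 36 spin, and the function names play and train. It is not the code we actually ran.

```python
import random

N_ACTIONS = 128          # 2**7 bet combinations; action 0 means "leave the table"
MAX_STEPS = 500          # per-episode action limit from the text
# Winning numbers for each bet, in the order used by the 7-element list.
WINNERS = [
    set(range(2, 37, 2)), set(range(1, 37, 2)),                  # even, odd
    set(range(1, 13)), set(range(13, 25)), set(range(25, 37)),   # thirds
    set(range(1, 19)), set(range(19, 37)),                       # halves
]

def play(action, rng):
    """Spin once and return the total reward for the bets encoded in `action`."""
    number = rng.randint(1, 36)
    total = 0
    for i, winners in enumerate(WINNERS):
        if (action >> (6 - i)) & 1:            # bit i set: this bet was placed
            total += 1 if number in winners else -1
    return total

def train(episodes=2000, alpha=0.05, gamma=0.9, epsilon=0.1, seed=0):
    """Run epsilon-greedy Q-learning on the one-state roulette problem."""
    rng = random.Random(seed)
    q = [0.0] * N_ACTIONS                      # single state, so a single row
    for _ in range(episodes):
        for _ in range(MAX_STEPS):
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                action = max(range(N_ACTIONS), key=q.__getitem__)
            if action == 0:                    # leaving ends the episode
                q[0] += alpha * (0.0 - q[0])   # terminal: reward 0, no future value
                break
            r = play(action, rng)
            # The next state is the same single state, so bootstrap from max(q).
            q[action] += alpha * (r + gamma * max(q) - q[action])
    return q
```

Calling train() returns the 128-entry Q-Vector; sorting its indices by value shows which bet combinations the agent currently rates highest.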
Experimenting with other agents
Finally, we tasked our members with researching other reinforcement learning agents in the Stable Baselines library, and eventually created a joint environment where 5 agents trained by our members would compete for the highest reward in Casino Roulette.
Specifically, we tested PPO, DQN, A2C, TRPO, and ACER, with ACER turning out to be the fittest agent. This exercise was meant to get our members used to the research process and to show them how cutting-edge results evolve from a simple starting point, in this case Q-Learning.
The real achievement: we make AI learn things and we learn through AI
What was really astonishing about our research was that our agent eventually learned to walk away from the table as soon as possible.
If we start thinking about it, this makes sense: any casino game has the odds in the house’s favor — it’s how they make money. By maximizing its reward function, the agent was able to understand that any attempt at playing the game would therefore result in a negative average reward.
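A quick sanity check of this intuition, under the simplified rules described earlier (which we assume here: a uniform spin over 1 to 36 with no zero pocket, +1 for each winning bet and -1 for each losing bet), is to compute the exact expected reward of every bet combination. The names WIN_PROB and expected_reward are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Probability that each bet wins on a uniform spin over 1..36, in the order
# [even, odd, first third, second third, third third, first half, second half].
WIN_PROB = [Fraction(1, 2), Fraction(1, 2),
            Fraction(1, 3), Fraction(1, 3), Fraction(1, 3),
            Fraction(1, 2), Fraction(1, 2)]

def expected_reward(action_bits):
    """Expected one-spin reward: each placed bet contributes p*(+1) + (1-p)*(-1)."""
    return sum(2 * p - 1 for bet, p in zip(action_bits, WIN_PROB) if bet)

# Expected reward for all 128 combinations, keyed by the 7-element tuple.
values = {bits: expected_reward(bits) for bits in product((0, 1), repeat=7)}
best = max(values.values())
```

Under these assumptions no combination has a positive expected reward: every bet on a third loses 1/3 per spin on average, and combinations made only of even, odd, and half bets break even. Walking away, with a guaranteed reward of 0, is therefore never worse than playing.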
We think this application was a rather interesting research opportunity for our members: they got to learn about reinforcement learning, build their own agent, and eventually even got schooled on gambling by it.
This achievement represents another core value of DataRes: the belief that we can learn from AI, provided its learning path and objective are realistic. Pretty cool, right?
Wanna work with us? Email at firstname.lastname@example.org