Stock Market Trading With Reinforcement Learning

By Nilay Shah, Colin Curtis (Colinpcurtis), Shail Mirpuri

DataRes at UCLA
9 min read · Jan 10, 2021


In the world of deep learning, no matter how cutting-edge your models may be, you don’t get very far without clean, well-understood data. This is especially true in finance, where our dataset contains just five variables for each stock: open, high, low, adjusted close, and trading volume.

Within the first handful of graphs we made, it was not hard to tell that these raw values are insufficient for training a machine learning model. Highly correlated variables may at first sound promising, but the drawback of extremely high correlation coefficients is that the variables carry very little independent information. The dataset basically has five numbers that say nearly the same thing to the model, which makes it very difficult for the model to understand the intricacies of market movements that would allow a machine learning trader to turn a profit.

This correlation is visible in the scatter matrix below, where the diagonal panels show an estimate of each variable’s distribution.

Enter technical analysis: a toolbox of mathematics designed to transform noisy raw financial data into clean, understandable signals that quantify an asset’s momentum, volatility, volume, and other general trends. Luckily for us, a great Python library, TA, has all of these indicators and allows for easy experimenting on dataframes. We spent a lot of time trying different combinations of indicators and making our own changes to the dataset to get the best dataset we could. Below, we can see the significant effect of applying the relative strength index (RSI), whose only input is the close price, compared against the raw close price.
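To make the RSI transformation concrete, here is a minimal hand-written sketch in pandas. It is not the TA library's implementation; the 14-day window, the Wilder-style exponential smoothing, and the synthetic price series are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Relative strength index in [0, 100], computed from close prices only."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)       # upward moves, zero elsewhere
    loss = (-delta).clip(lower=0.0)    # downward moves as positive numbers
    # Wilder-style smoothing: an exponential average with alpha = 1 / window.
    avg_gain = gain.ewm(alpha=1.0 / window, min_periods=window, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1.0 / window, min_periods=window, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


# Synthetic random-walk prices stand in for real close prices (assumption).
rng = np.random.default_rng(0)
close = pd.Series(100.0 + np.cumsum(rng.normal(0.0, 1.0, 250)))
rsi_values = rsi(close)
```

Unlike the raw close price, the output is bounded between 0 and 100, which is part of what makes it an easier signal for a model to consume.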

A significant fallacy in the world of mathematical finance, however, is that there exists some perfect combination of features that will “predict” the market for you. Like many methods in data science and machine learning, technical analysis is really only there to aid the data transformation phase. This showed up in our project: eventually you just have to trust that the combination you currently have is good enough for the model to learn from. Thus, we settled on two momentum indicators, the classic relative strength index and the interestingly named Awesome Oscillator, and two trend indicators, moving average convergence divergence (MACD) and the Aroon indicator.

Momentum indicators are useful because they attempt to quantify how strong the movement of the stock is in the context of its previous prices. This may be helpful to the agent because it can attempt to learn that a gain in momentum is generally a good sign that a stock price may go up, and that it can hold the stock confidently until momentum starts to decrease. Trend indicators, on the other hand, generally form a superset of momentum indicators, in that trend tracking often involves calculations of momentum and moving averages. Usually they attempt to take the not easily quantifiable values generated by momentum and transform them into percentages, where positive and negative numbers signify upward and downward trends respectively. This setup should further aid the agent’s ability to understand the movements of a stock and hopefully learn that there can be a reasonably high likelihood of profit when both trend and momentum start to tick up.

One key discovery was the application of a signal processing filter to our data, which fits a polynomial over a fixed number of neighboring points to smooth the data considerably. This matters because even our technical analysis features were still reasonably noisy, so smoother data allows the model to work with cleaner information and make better decisions in the environment. The following plot demonstrates the large smoothing effect of the filter on the open price: it removes much of the stochastic movement that can easily confuse a model.
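The article does not name the filter. A Savitzky-Golay filter matches the description (a polynomial fitted over a fixed window of points), so the sketch below assumes that choice. The window length, polynomial order, and synthetic open prices are likewise illustrative assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic noisy "open price" series standing in for real data (assumption).
rng = np.random.default_rng(0)
t = np.arange(300)
open_price = 100.0 + 0.05 * t + rng.normal(0.0, 1.5, size=t.size)

# Fit a cubic polynomial over each 21-point window (both values are assumed).
smoothed = savgol_filter(open_price, window_length=21, polyorder=3)

# Day-to-day changes are much less erratic after smoothing.
raw_roughness = np.std(np.diff(open_price))
smooth_roughness = np.std(np.diff(smoothed))
```

One design point worth keeping in mind: a centered window like this one uses points on both sides of each day, so a live trading setup would need a variant that only looks backward.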

After the features were picked, just one more crucial aspect of preprocessing remained: normalizing our data. Although it is easy to overlook, forgetting to normalize can severely hinder model performance. Interestingly, there was no straightforward way to choose how to normalize our data, since financial numbers are unbounded, unlike, for example, images, whose pixel values lie between 0 and 255. A simple rolling-window z-score calculation solves this problem quite nicely, since a z-score maps most of our data into the reasonable range of roughly -3 to 3.
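A rolling-window z-score can be sketched in a few lines of pandas. The 20-day window and the synthetic price series are assumptions, not values taken from the project.

```python
import numpy as np
import pandas as pd


def rolling_zscore(series: pd.Series, window: int = 20) -> pd.Series:
    """Standardize each value against the mean and std of its trailing window."""
    rolling = series.rolling(window)
    return (series - rolling.mean()) / rolling.std()


# Synthetic, unbounded random-walk prices stand in for real data (assumption).
rng = np.random.default_rng(0)
prices = pd.Series(100.0 + np.cumsum(rng.normal(0.0, 1.0, 500)))
z = rolling_zscore(prices, window=20)  # the first window - 1 entries are NaN
```

Because the window only looks backward, each normalized value depends on past and current prices alone, and the result stays in a bounded range no matter how far the raw price drifts.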

Once the input tensor was decided, we moved on to the delicate phase of hyperparameter tuning and model optimization. Most deep learning models have multiple tunable hyperparameters, that is, values we specify before training that control how the model learns. Changes to these parameters arguably have the most significant effect on the model’s performance, since key moments in the model’s training are governed by these values.

Understanding the mechanics behind the Proximal Policy Optimization (PPO) framework helped us experiment with, tune, and improve the hyperparameters of the existing model. In the process, we gained in-depth insight into the relationship between certain hyperparameters and the reward the agent earned, which allowed us to judge whether or not the agent was actually learning. Through this exploration we uncovered some fascinating insights that our model learned about stock trading.

To test the relationship between the different hyperparameter values and the model’s performance, we took a scientific approach: we tested the performance of the agent while changing only one hyperparameter at a time. By keeping all other hyperparameters constant, we were able to figure out the range of each hyperparameter that most effectively allowed our agent to learn. We also controlled for the randomness of the data trained on in each trial by seeding. This ensures that any change in model performance can be attributed to the specified parameter and not to extraneous variables.

Default Parameter Values:

'n_steps': 1024,
'gamma': 0.9391973108460121,
'learning_rate': 0.0001,
'ent_coef': 0.0001123894292050861,
'cliprange': 0.2668120684510983,
'noptepochs': 5,
'lam': 0.8789545362092943
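Starting from the default values above, the one-at-a-time procedure can be sketched as follows. The candidate grid, the seed value, and the `train_and_evaluate` placeholder are hypothetical; the article does not list the exact values or training code it used.

```python
import random
from typing import Any, Dict, Iterator, List, Tuple

DEFAULTS: Dict[str, Any] = {
    "n_steps": 1024,
    "gamma": 0.9391973108460121,
    "learning_rate": 0.0001,
    "ent_coef": 0.0001123894292050861,
    "cliprange": 0.2668120684510983,
    "noptepochs": 5,
    "lam": 0.8789545362092943,
}

# Hypothetical candidate values; the exact grid searched is not given.
GRID: Dict[str, List[Any]] = {
    "n_steps": [256, 1024, 2048],
    "gamma": [0.9, 0.95, 0.99],
    "ent_coef": [1e-5, 1e-3, 1e-2, 1e-1],
    "lam": [0.9, 0.99, 0.999],
}

SEED = 42  # assumed value; any fixed seed serves the same purpose


def one_at_a_time(
    defaults: Dict[str, Any], grid: Dict[str, List[Any]]
) -> Iterator[Tuple[str, Any, Dict[str, Any]]]:
    """Yield configs that differ from the defaults in at most one hyperparameter."""
    for name, values in grid.items():
        for value in values:
            config = dict(defaults)  # copy so the defaults stay untouched
            config[name] = value
            yield name, value, config


trials = list(one_at_a_time(DEFAULTS, GRID))
for name, value, config in trials:
    random.seed(SEED)  # reset the seed so every trial sees the same randomness
    # train_and_evaluate(config) would go here; it is a placeholder, not a real function.
```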

Hyperparameters we considered:

  • n_steps: This hyperparameter sets the number of steps each environment is run before the model is updated. It essentially determines how influential a single learning experience is on the policy update. If n_steps is low, the policy changes constantly and adapts to experiences that may have been caused by random chance, so each learning experience has more influence on the policy. The problem is that this can lead to a relatively unstable policy that may never converge to its optimum. Thus, finding the right balance for this hyperparameter could help achieve better trading performance.
  • Gamma: Next we moved on to tinkering with the value of gamma. This is the discount factor, which controls how quickly the weight of future rewards decays when the agent evaluates an action. A value close to 1 makes the agent weigh distant rewards almost as heavily as immediate ones, while a lower value makes it focus on the near term. Tuning it lets us balance short-term gains against long-term returns, rather than having the agent be overly influenced by the most immediate reward.
  • Entropy coefficient: We also sought to tune the entropy coefficient, which acts as a regularization term and adds randomness to the policy. Exploration is crucial in reinforcement learning for finding a good policy. If the policy converges too rapidly, the agent may find itself stuck in a local maximum, repeatedly taking the same suboptimal action. This behavior can be rectified by tuning the entropy coefficient to prevent premature convergence and encourage exploration.
  • Lambda: Lambda is a smoothing parameter used to decrease the variance in the Generalized Advantage Estimator (GAE). The GAE uses the rewards from each time step to estimate how much better off the agent is for having taken a particular action. Lambda helps stabilize this learning by controlling how far into the future those rewards are taken into account, trading a small amount of bias for lower variance.

Key Findings

After running and fine-tuning each of the listed hyperparameters, we drew some interesting conclusions. Firstly, slightly higher n_steps values tended to produce healthier episode rewards and advantage curves. This means that our agent learned more effective trading policies when it took a greater number of steps in each environment before updating the model. Since the model seems to perform better when n_steps is higher, this could imply that the best policy is to buy a stock and hold it for a longer period of time rather than micro-trading stocks at higher frequencies.

Apart from the insights derived from tweaking n_steps, we also found that the optimal value for gamma in our model was relatively high, with performance peaking at values as high as 0.99. Gamma is the discount factor, so it controls how much weight rewards further in the future receive relative to immediate ones. The success of the larger values means that future rewards are discounted only slightly, so the agent gives short-term rewards only a small priority over long-term ones.
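A quick way to see what the two gamma values mean is to compare how much weight a reward still carries some number of steps ahead, together with the common rule-of-thumb horizon 1 / (1 - gamma). The 30-step lookahead below is an arbitrary choice for illustration.

```python
def effective_horizon(gamma: float) -> float:
    """Sum of the discount weights 1 + gamma + gamma^2 + ..., a rough planning horizon."""
    return 1.0 / (1.0 - gamma)


def reward_weight(gamma: float, steps_ahead: int) -> float:
    """Weight applied to a reward received `steps_ahead` steps in the future."""
    return gamma ** steps_ahead


default_gamma = 0.9391973108460121  # default value listed in the article
tuned_gamma = 0.99                  # high end of the range that performed best

horizon_default = effective_horizon(default_gamma)  # roughly 16 steps
horizon_tuned = effective_horizon(tuned_gamma)      # roughly 100 steps

weight_default = reward_weight(default_gamma, 30)   # about 0.15
weight_tuned = reward_weight(tuned_gamma, 30)       # about 0.74
```

With gamma at 0.99, a reward 30 steps away still counts for most of its face value, which fits the idea of an agent that cares about longer-term gains.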

Entropy regularization discourages the policy from committing to a single action too early. From tuning the entropy coefficient, we observed that raising it from the default (about 1e-4) to 0.01 led to a more stable increase in episode rewards and produced healthier advantage curves. At smaller values, in the range of 1e-5 to 1e-3, there is a rapid collapse in the entropy loss, indicating that the agent’s policy is becoming deterministic too quickly. In contrast, when the entropy coefficient is too high (0.1–0.5), the episode rewards flatten and the entropy loss does not decrease, suggesting that our agent is unable to learn because the high entropy coefficient holds the probabilities of all possible actions nearly equal. For our agent, a fairly high entropy coefficient helps guard against taking actions based on short-term market trends, which do not always translate into long-term gains.
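To illustrate what the entropy term measures, here is a small sketch for a three-action policy (buy, sell, hold). The probability vectors are made up, and the bonus function reflects the usual PPO formulation of coefficient times entropy, not code from the project.

```python
import numpy as np


def policy_entropy(probs) -> float:
    """Shannon entropy (in nats) of a discrete action distribution."""
    p = np.asarray(probs, dtype=float)
    nonzero = p[p > 0.0]  # treat 0 * log(0) as 0
    return float(-np.sum(nonzero * np.log(nonzero)))


def entropy_bonus(probs, ent_coef: float) -> float:
    """Amount added to the objective for keeping the policy spread out."""
    return ent_coef * policy_entropy(probs)


uniform = [1 / 3, 1 / 3, 1 / 3]   # buy, sell, hold equally likely
confident = [0.98, 0.01, 0.01]    # almost always the same action
deterministic = [1.0, 0.0, 0.0]   # fully collapsed policy

entropies = {
    "uniform": policy_entropy(uniform),        # maximum possible: ln(3)
    "confident": policy_entropy(confident),
    "deterministic": policy_entropy(deterministic),
}
```

A collapsing entropy curve corresponds to the policy moving from the first case toward the last. A coefficient that is too large keeps rewarding the uniform case so strongly that the agent never commits to anything.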

In varying the lambda hyperparameter, we found that it had a high optimal value range of 0.99–0.999. When lambda is set to 0, the GAE reduces to the one-step advantage estimator, which relies only on the immediate reward and the value estimates of the current and next states when making policy updates. This estimator suffers from high bias. On the other hand, if we take lambda to be 1, the GAE becomes the Monte Carlo estimator with a baseline, which can suffer from high variance. A high optimal value for lambda suggests that injecting a small amount of bias is still useful for our agent, but that it does value long-term reward. The largest growth is seen when our agent does not fall prey to short-term volatility in the market, but rather focuses on incremental gains over the long term.
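The two limiting cases can be checked with a short NumPy sketch of the GAE recursion. The rewards and value estimates are made-up numbers, and the sketch assumes a single episode that ends after the last step.

```python
import numpy as np


def gae(rewards, values, last_value: float, gamma: float, lam: float) -> np.ndarray:
    """Generalized Advantage Estimation over one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], last_value)
    deltas = rewards + gamma * next_values - values  # one-step TD residuals
    advantages = np.zeros_like(rewards)
    running = 0.0
    # Work backwards: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# Made-up trajectory for illustration.
rewards = [1.0, 0.0, -0.5, 2.0]
values = [0.5, 0.4, 0.3, 0.6]
last_value = 0.0  # the episode ends after the final step (assumption)
gamma = 0.99

one_step = gae(rewards, values, last_value, gamma, lam=0.0)     # TD residuals only
monte_carlo = gae(rewards, values, last_value, gamma, lam=1.0)  # discounted return minus value
high_lambda = gae(rewards, values, last_value, gamma, lam=0.99)
```

With lam=0.0 each advantage depends on a single reward and two value estimates. With lam=1.0 it depends on every later reward in the episode. Values such as 0.99 sit close to the second case while still damping the most distant terms.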

After sufficient hyperparameter tuning, we were able to generate runs of our policy trading on real market data, where each day the policy could buy, sell, or hold stock. The gray dots signify holding, the yellow dots buying, and the green dots selling. On the test run below, we can see that in general the policy did a good job of holding the assets it purchased for a handful of days to generate some profit, but it also experienced drawdown, where it lost some of that profit. This test run was generated using the hyperparameters suggested earlier in this article. Although the hyperparameter setup was strong, there was still a lot of volatility in the model, which suggests that strong training performance does not perfectly correlate with live results. This is a recurring theme in financial modeling. Nonetheless, it is certainly an amazing feat of reinforcement learning that our agent, which has no goal other than to maximize our objective function, was able to make a profit.

Overall, our work on this PPO stock market trader allowed us to take a deep dive into cutting-edge reinforcement learning research while also using our knowledge to solve a real-world problem. Although the problem was highly complex, each of us was able to work on the tasks that best suited our skill sets and later share our results with the rest of the team to improve the model’s performance.


