Building the Reinforcement Learning Framework

To build our reinforcement learning framework, we are going to follow the basic recipe laid out in the February 2015 Nature article “Human-level control through deep reinforcement learning” (http://dx.doi.org/10.1038/nature14236).

Reinforcement learning has been shown to reach human and superhuman performance in board games and video games. Transferring the methods and experience gained there to the use case of trading goods or securities seems promising, because the trading problem shares many characteristics:

  • interaction with an environment that represents certain aspects of the real world,
  • a limited set of actions to interact with this environment,
  • a well-defined success measure (called “reward”),
  • future rewards that are determined by past actions,
  • a finite, semi-structured definition of the state of the environment,
  • the infeasibility of directly determining the future outcome of an action, due to a prohibitively large decision tree, incomplete information and missing knowledge about the interactions between the influencing factors.

Our inference engine is going to be Deeplearning4J (DL4J, see https://deeplearning4j.org/). The DL4J website contains a brief and well-written introduction to reinforcement learning, which I highly recommend if you are not familiar with the concept yet: https://deeplearning4j.org/reinforcementlearning.
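To make the DL4J part more tangible, here is a minimal sketch of how a small network with one Q-value output per action could be defined. The layer sizes, the updater, the constants N_INPUTS and N_ACTIONS and the class name are assumptions for illustration only, and the builder API differs slightly between DL4J versions.

    import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.layers.DenseLayer;
    import org.deeplearning4j.nn.conf.layers.OutputLayer;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.learning.config.Adam;
    import org.nd4j.linalg.lossfunctions.LossFunctions;

    public class QNetworkSketch {

        // Hypothetical sizes: number of state features and number of possible actions.
        static final int N_INPUTS = 200;
        static final int N_ACTIONS = 3; // buy, sell, hold

        public static MultiLayerNetwork build() {
            MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                    .seed(42)
                    .updater(new Adam(1e-3))
                    .list()
                    // Hidden layer mapping the state vector to an internal representation.
                    .layer(0, new DenseLayer.Builder()
                            .nIn(N_INPUTS).nOut(64)
                            .activation(Activation.RELU).build())
                    // Linear output layer: one Q-value estimate per action.
                    .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
                            .nIn(64).nOut(N_ACTIONS)
                            .activation(Activation.IDENTITY).build())
                    .build();
            MultiLayerNetwork net = new MultiLayerNetwork(conf);
            net.init();
            return net;
        }
    }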

The first step in implementing an RL framework for Bitcoin trading is to map the conceptual elements of the process to our use case:

  • Action
  • Actor / Agent
  • Environment
  • State
  • Reward

Action

An action is a distinct operation with a direct impact on the state of the actor and the environment. In the case of game playing, placing a tile on a specific field on the board or moving a joystick in a certain direction are examples of actions. In the case of Bitcoin trading, the obvious actions are placing and cancelling orders to buy or sell certain amounts of Bitcoin at a given cryptocurrency exchange.

A smaller set of actions improves the learning speed, so we will restrict our action set to only three possible actions for now:

  1. Cancel all open sell orders and place a buy order at the last market price using 10% of the available USD assets.
  2. Cancel all open buy orders and place a sell order at the last market price using 10% of the available Bitcoin assets.
  3. Hold (do nothing).

In a later version we will likely extend this to:

  • treat cancelling and placing orders as distinct actions,
  • allow a larger variety of amounts (other than “10% of available assets”) for buy and sell orders,
  • support different limits above and below the last market price.

But for gaining experience, and to go easy on our computational resources, we are going to keep it simple for now.
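To make the restricted action set concrete, it could be captured in a simple enum like the following sketch. The names and the constant mirror the list above; the enum itself is hypothetical and not taken from the actual trading code.

    /** The three possible actions of the trading agent (illustrative sketch). */
    public enum TradeAction {

        /** Cancel all open sell orders, then buy at the last market price with 10% of the available USD. */
        BUY,

        /** Cancel all open buy orders, then sell at the last market price with 10% of the available BTC. */
        SELL,

        /** Do nothing in this step. */
        HOLD;

        /** Fraction of the available assets used for a buy or sell order. */
        public static final double ORDER_FRACTION = 0.10;
    }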

Actor

The actor is our trading bot, which uses the Bitstamp API to place and cancel orders. We are going to reuse existing Java code from the old trading system for this.

Environment

Since we don’t want to reinvent the data collection, and we have already collected several years’ worth of training data, the environment is given by all the data sources that we defined in the old trading system (https://notesonpersonaldatascience.wordpress.com/2016/03/06/wrapping-up-data-collection/).

State

The current state of the environment is the input for the inference engine. We can reuse the format that we used for the old Bitcoin prediction system (https://notesonpersonaldatascience.wordpress.com/2016/03/21/data-re-coding-1/). It has some issues that we might address later, but we don’t want to reinvent the wheel, so we stick with it for now.
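In practice the state is therefore just a fixed-length vector of normalized feature values, as produced by the old re-coding step. The following sketch shows how such a vector could be handed to a DL4J network; the helper class and method names are assumptions, standing in for the existing re-coding code.

    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    public class StateSketch {

        /**
         * Wraps the normalized feature values of the current step into a 1 x N row vector,
         * which is the shape MultiLayerNetwork.output() expects for a single example.
         */
        public static INDArray toStateVector(double[] normalizedFeatures) {
            return Nd4j.create(normalizedFeatures, new int[]{1, normalizedFeatures.length});
        }

        /** Q-value estimates for the current state: one entry per possible action. */
        public static INDArray qValues(MultiLayerNetwork qNetwork, double[] normalizedFeatures) {
            return qNetwork.output(toStateVector(normalizedFeatures), false);
        }
    }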

Reward

We have two possible ways to define the reward:

  • After each executed sell order: the difference between the sell price and the previous average buy price of the sold Bitcoins, minus transaction costs.
  • In each step: the difference between the current net value (USD + BTC) and the net value in the previous step.

The first option maps more closely onto the game analogy and also takes advantage of one of the key features of reinforcement learning (assessing the future outcomes of current actions), but the second option promises faster convergence, so we choose it to begin with.
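With the second option, the reward of a step is simply the change in total account value, with the BTC holdings converted to USD at the last market price. A minimal sketch, with all field and parameter names assumed for illustration:

    /** Computes the per-step reward as the change in net account value (illustrative sketch). */
    public class RewardCalculator {

        private double previousNetValue = Double.NaN;

        /**
         * @param usdBalance   USD assets, including USD bound in open orders
         * @param btcBalance   BTC assets, including BTC bound in open orders
         * @param lastBtcPrice last BTC/USD market price
         * @return the reward for this step, or 0 for the very first step
         */
        public double step(double usdBalance, double btcBalance, double lastBtcPrice) {
            double netValue = usdBalance + btcBalance * lastBtcPrice;
            double reward = Double.isNaN(previousNetValue) ? 0.0 : netValue - previousNetValue;
            previousNetValue = netValue;
            return reward;
        }
    }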

 


Transition to Reinforcement Learning

Our current goal is to introduce Reinforcement Learning into the decision-making component of an existing Bitcoin trading system. To put this into a broader context: what we are about to do is, in the flowery terms of business speak, the upgrade from “Predictive Analytics” to “Prescriptive Analytics”. Please google “Analytic Value Escalator” for a 30,000 ft view, and think for a minute about this: Is the question “What will happen?” really harder to answer than “Why did it happen?” (In data science you have to take market research seriously, even if it hurts.)

Walking up a step on this escalator with the task at hand, we are interested in going from the question “What will happen?” to “How can we make it happen?”. Again, I am not completely convinced that the latter is more difficult to answer, but it does require a big architectural change in the trading system.

To understand this, let’s have a look at the training pipeline that we have used so far.

Old Architecture: Separation of Concerns

(Figure: BtcOldPredictionPipeline, the old training and trading pipeline)

We had three servers involved:

  • A build server running Jenkins, which is used for multiple projects and not of particular interest for us here and now.
  • A server running the trading software, called “Trade Hardware”, executing a stack of shell scripts and Java programs.
  • A powerful machine for computationally intensive tasks implemented in Matlab and Java, called “Compute Hardware”.

Here is what happens at the blue numbered circles:

  1. Once a week, the build server triggered a training session. The reason for regular re-training is that we wanted current trends in market behavior to be reflected in our prediction model. Also, each week we had aggregated significantly more training data, and more training data promised better prediction results.
  2. Input variables are collected and normalized for neural network training.
  3. Target variables are calculated for training. We used a binary classifier with 19 output variables, each predicting an event like “The BTC/USD rate will go up by 2% in the next 20 minutes”.
  4. To reduce the size of the input, a PCA was performed and only the strongest factors were used as input variables. The PCA eigenvectors and the normalization factors from step 2 are stored for later, to transform raw input data in production into a format consistent with the training input (see the sketch after this list).
  5. The previous neural network model is loaded to be used as initial model for training.
  6. The training is run in Matlab. We don’t need to dive deeper into this topic, because in the new architecture, we will use Deeplearning4J instead of Matlab for the training step.
  7. The new trained model is stored.
  8. The new model is tested in an extensive trade simulation.
  9. The trading software is restarted so it uses the updated model.
  10. Normal trading goes on for the rest of the week.
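Step 4 is the part that has to be reproduced exactly in production: raw inputs are first normalized with the stored factors and then projected onto the stored principal components. Here is a minimal ND4J sketch of that transformation; the class and variable names are assumptions, not the actual pipeline code.

    import org.nd4j.linalg.api.ndarray.INDArray;

    public class InputTransform {

        private final INDArray mean;         // 1 x D, stored during training
        private final INDArray stdDev;       // 1 x D, stored during training
        private final INDArray eigenvectors; // D x K, the K strongest PCA components

        public InputTransform(INDArray mean, INDArray stdDev, INDArray eigenvectors) {
            this.mean = mean;
            this.stdDev = stdDev;
            this.eigenvectors = eigenvectors;
        }

        /** Transforms a 1 x D raw input row into the 1 x K representation used during training. */
        public INDArray transform(INDArray rawInput) {
            INDArray normalized = rawInput.sub(mean).div(stdDev); // same scaling as in training
            return normalized.mmul(eigenvectors);                 // projection onto the principal components
        }
    }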

New architecture: Tight Integration

This pipeline was built around the concept of a strict separation between trading execution and prediction. The prediction algorithm was part of a decision-making module, which itself was just a plugin of the trading software and could be replaced by another implementation encoding a different strategy. We actually used this to assess simulation results: to determine a baseline performance, the decision-making component in the simulation was replaced by one that follows a random trading strategy.

With the transition to reinforcement learning, this strict separation goes away. The learning agent learns from the interaction with the environment, so it completely assumes the role of the decision-making component. From a system design perspective, this makes our life much easier, because many hard questions in the decision-making component are now covered by machine learning: How do we interpret the predictions? Where do we set thresholds for the output variables? What percentage of the available assets do we use in a trade? The reinforcement learning agent produces a finished decision that can be directly converted into a buy or sell order.
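In code, “a finished decision” means little more than taking the action with the highest estimated Q-value and translating it into an order. A minimal sketch of this greedy step, reusing the hypothetical TradeAction enum from the Action section:

    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    public class DecisionMaker {

        private final MultiLayerNetwork qNetwork;

        public DecisionMaker(MultiLayerNetwork qNetwork) {
            this.qNetwork = qNetwork;
        }

        /** Picks the action with the highest estimated Q-value for the given state (greedy policy). */
        public TradeAction decide(INDArray stateVector) {
            INDArray qValues = qNetwork.output(stateVector, false); // 1 x number of actions
            int best = Nd4j.argMax(qValues, 1).getInt(0);           // index of the best action
            return TradeAction.values()[best];                      // BUY, SELL or HOLD
        }
    }

During the learning phase one would typically mix in random exploration (an epsilon-greedy policy) instead of always acting greedily.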

The agent also does not stop learning once it is in production. Learning is a permanent background process that takes place during trading. This means that after the initial training phase we can retire the “Compute Hardware”, because there is no need for weekly retraining anymore.

All this looks lean and efficient at first glance, but it will create problems down the road:

  • The tight integration between machine learning and business logic results in a monolithic architecture, which will be hard to maintain.
  • The interface between data science and software development has been greatly inflated. In the old architecture, the responsibility of data science ended with the prediction, and software development could take over from there. Both groups worked strictly within the bounds of their traditional realms, each with their established tools and processes, which have very little in common other than the fact that they mostly run on computers. The new architecture leads to a huge overlap of responsibilities, which will require new tools, a common language, mutual understanding and a lot of patience with each other.

Conclusion

Even without looking at the specifics of Reinforcement Learning, we can already see that the system design will become much simpler. The machine learning subsystem assumes more responsibility, leaving fewer moving parts in the rest of the system.

Project management, on the other hand, might turn out to be challenging, because the different worlds of data science and software development need to work much more closely together.

Further Reading

Deep Reinforcement Learning for Bitcoin trading

It’s been more than a year since the last entry regarding automated Bitcoin trading was published here. The series was supposed to cover a project in which we used deep learning to predict Bitcoin exchange rates for fun and profit.

We developed the system in 2014 and operated it all through 2015. It performed very well during the first three quarters of 2015, … and terribly during the last quarter. At the end of the year we stopped it. Despite the serious losses during those last three months, it can still be considered a solid overall success.

I never finished the series, but recently we deployed a new version, which includes some major changes that will hopefully turn out to be improvements:

  • We use Reinforcement Learning, following DeepMind’s basic recipe (Deep Q-learning with Experience Replay) from the iconic Atari article in Nature magazine; a minimal sketch of such a replay buffer follows after this list. This eliminates the separation of prediction and trading as distinct processes. The inference component directly creates a buy/sell decision instead of just a prediction. Furthermore, the new approach eliminates the separation of training and production (after an initial training phase). The neural network is trained continuously on the trading machine. No more downtime is needed for weekly re-training, and no separate compute hardware lies idle with nothing to do for the other six days of the week.
  • We use Deeplearning4J (DL4J) instead of Matlab code for the training of the neural network. DL4J is a Java framework for defining, training and executing machine learning models. It integrates nicely with the trading code, which is written in Java.
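For reference, the “Experience Replay” part of DeepMind’s recipe boils down to a bounded buffer of (state, action, reward, next state) transitions from which random minibatches are drawn for training, which breaks up the strong correlation between consecutive market states. A minimal, self-contained sketch (not the production code):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import java.util.Random;

    /** Fixed-size store of past transitions, sampled uniformly at random for training (sketch). */
    public class ReplayBuffer {

        /** One (state, action, reward, next state) transition. */
        public static class Transition {
            public final double[] state;
            public final int action;
            public final double reward;
            public final double[] nextState;

            public Transition(double[] state, int action, double reward, double[] nextState) {
                this.state = state;
                this.action = action;
                this.reward = reward;
                this.nextState = nextState;
            }
        }

        private final Deque<Transition> buffer = new ArrayDeque<>();
        private final int capacity;
        private final Random random = new Random();

        public ReplayBuffer(int capacity) {
            this.capacity = capacity;
        }

        /** Adds a transition, discarding the oldest one when the buffer is full. */
        public void add(Transition t) {
            if (buffer.size() == capacity) {
                buffer.removeFirst();
            }
            buffer.addLast(t);
        }

        /** Draws a random minibatch (with replacement, for simplicity); assumes the buffer is not empty. */
        public List<Transition> sample(int batchSize) {
            List<Transition> all = new ArrayList<>(buffer);
            List<Transition> batch = new ArrayList<>(batchSize);
            for (int i = 0; i < batchSize; i++) {
                batch.add(all.get(random.nextInt(all.size())));
            }
            return batch;
        }
    }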

This will change the course of this blog. Instead of finishing the report on what we did in 2014, I am now planning to write about the new system. It turns out that most of the code we have looked at so far is also in the new system, so we can just continue where we left off a year ago.