Transition to Reinforcement Learning

Our current goal is to introduce Reinforcement Learning into the decision making component of an existing Bitcoin trading system. To put this into a broader context: What we are about to do is — in the flowery terms of business speak — the upgrade from “Predictive Analytics” to “Prescriptive Analytics”. Please google for “Analytic Value Escalator” for a 30000 ft. view, and think a minute about this: Is the question “What will happen” really harder to answer then “Why did it happen”? (In data science you have to take market research serious, even if it hurts.)

Walking up a step on this escalator, with the task at hand, we are interested in going from the question “What will happen?” to “How can we make it happen?”. Again, I am not completely convinced, that the latter one is more difficult to answer, but it requires us to make a big architectural change in the trading system.

To understand this, let’s have a look on the training pipeline that we have used so far.

Old Architecture: Separation of Concerns


We had three servers involved:

  • A build server running Jenkins, which is used for multiple projects and not of particular interest for us here and now.
  • A server running the trading software, called “Trade Hardware”, executing a stack of shell scripts and Java programs.
  • A powerful machine for computationally intense tasks implemented in Matlab and Java, called “Compute Hardware”.

Here is what happens at the blue numbered circles:

  1. Once a week the build server triggered a training session. The reason for regular re-training is, that we wanted to have current trends in the market behavior to be reflected in our prediction model. Also, each week we had aggregated significantly more training data. More training data promised better prediction results.
  2. Input variables are collected and normalized for neural network training.
  3. Target variables are calculated for training. We have used a binary classifier with 19 output variables that predicted different events like this: “The BTC/USD rate will go up by 2% in the next 20 minutes”.
  4. To reduce the size of the input, a PCA was performed and only the strongest factors were used as input variables. The PCA Eigenvectors and the normalization factors from step 3 are stored for later, to transform raw input data in production to a format consistent with the training input.
  5. The previous neural network model is loaded to be used as initial model for training.
  6. The training is run in Matlab. We don’t need to dive deeper into this topic, because in the new architecture, we will use Deeplearning4J instead of Matlab for the training step.
  7. The new trained model is stored.
  8. The new model is tested in an extensive trade simulation.
  9. The trading software is restarted so it uses the updated model.
  10. Normal trading goes on for the rest of the week.

New architecture: Tight Integration

This pipeline has been built around the concept of a strict separation between trading execution and prediction. The prediction algorithm was part of a decision making module, which itself was just a plugin module of the trading software which could be replaced by another implementation that encodes another strategy. This was actually used to assess simulation results: To determine a baseline performance the decision making component in the simulation has been replaced by one that follows a random trading strategy.

With the transition to reinforcement learning, this strict separation goes away. The learning agent learns from the interaction with the environment, so it completely assumes the role of the decision making component.  From a system design perspective, this makes our life much easier, because many hard questions in the decision making component are now covered by machine learning:  How to interpret the predictions? Where to set thresholds for the output variables? What percentage of available assets to use in a trade? The reinforcement learning agent produces a finished decision that can be directly converted into a buy- or sell-order.

Also the agent does not stop learning once it is in production. The learning is a permanent background process, that takes place during trading. This means, that after the initial training phase, we can retire the “Compute Hardware”, because there is no necessity for weekly retraining.

All this looks lean and efficient at first glance, but it will create problems down the road:

  • The tight integration between machine learning and business logic results in a monolithic architecture,  which will be hard to maintain.
  • The interface between data science and software development has been largely inflated. In the old architecture, the responsibility of data science ended with the prediction, and software development could take on from there. Both groups worked strictly within the bounds of their traditional realms, each with their established tools and processes, which have very little in common, other than the fact that they mostly run on computers. The new architecture leads to a huge overlap of responsibilities, which will require new tools, a common language, mutual understanding and a lot of patience with each other.


Even without looking at the specifics of Reinforcement Learning, we already see, that the system design will become much simpler. The machine learning subsystem assumes more responsibility, leaving less moving parts in the rest of the system.

The project management on the other hand might turn out to be challenging, because the different worlds of data science and software development need to work much closer together.

Further Reading

Deep Reinforcement Learning for Bitcoin trading

It’s been more than a year, since the last entry regarding automated Bitcoin trading has been published here. The series was supposed to cover a project, in which we have used deep learning to predict Bitcoin exchange rates for fun and profit.

We have developed the system in 2014 and operated it all through the year 2015. It has performed very well during the first 3 quarters of 2015, … and terribly during the last quarter. At the end of the year we have stopped it. Despite serious losses during the last three months, it can still be considered a solid overall success.

I have never finished the series, but recently we have deployed a new version, which includes some major changes, that hopefully will turn out to be improvements:

  • We use Reinforcement Learning, following DeepMind’s basic recipe (Deep Q-learning with Experience Replay) from the iconic Atari article in Nature magazine. This eliminates the separation of prediction and trading as distinct processes. The inference component directly creates a buy/sell decision instead of just a prediction. Furthermore the new approach eliminates the separation of training and production (after an initial training phase). The neural network is trained continuously on the trading machine. No more downtime is needed for re-training once a week, and no separate compute hardware is lying idle with nothing to do for the other six days of the week.
  • We use Deeplearning4J (DL4J) instead of Matlab code for the training of the neural network. DL4J is a Java framework for defining, training and executing machine learning models. It integrates nicely with the trading code, which is written in Java.

This will change the course of this blog. Instead of finishing the report on what we have done in 2014, I am now planning to write about the new system. It turns out, that most of the code we have looked at so far, is also in the new system, so we can just continue where we left off a year ago.