Deep Reinforcement Learning for Bitcoin trading

It has been more than a year since the last entry on automated Bitcoin trading was published here. The series was supposed to cover a project in which we used deep learning to predict Bitcoin exchange rates for fun and profit.

We developed the system in 2014 and operated it throughout 2015. It performed very well during the first three quarters of 2015 … and terribly during the last quarter. At the end of the year we stopped it. Despite serious losses during the last three months, it can still be considered a solid overall success.

I never finished the series, but we recently deployed a new version that includes some major changes, which will hopefully turn out to be improvements:

  • We use Reinforcement Learning, following DeepMind’s basic recipe (Deep Q-learning with Experience Replay) from the iconic Atari article in Nature. This eliminates the separation of prediction and trading as distinct processes: the inference component directly produces a buy/sell decision instead of just a prediction. Furthermore, the new approach eliminates the separation of training and production (after an initial training phase). The neural network is trained continuously on the trading machine. No more downtime is needed for re-training once a week, and no separate compute hardware sits idle for the other six days of the week.
  • We use Deeplearning4J (DL4J) instead of Matlab code for the training of the neural network. DL4J is a Java framework for defining, training and executing machine learning models. It integrates nicely with the trading code, which is written in Java.
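
To make the first bullet a bit more tangible, here is a rough, framework-free sketch of the two ingredients it names: a replay buffer that stores past experience and the Q-learning target that the network is trained towards. This is not our trading code. The class name `ReplaySketch`, the `Transition` fields, the buffer capacity and the discount factor `gamma` are all assumptions made up for this illustration, and the sketch ignores terminal states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; all names and parameters are hypothetical.
class ReplaySketch {
    // One observed step: state, chosen action, reward, resulting state.
    static final class Transition {
        final double[] state;
        final int action;
        final double reward;
        final double[] nextState;

        Transition(double[] state, int action, double reward, double[] nextState) {
            this.state = state;
            this.action = action;
            this.reward = reward;
            this.nextState = nextState;
        }
    }

    private final List<Transition> buffer = new ArrayList<>();
    private final int capacity;
    private final Random random;
    private int next = 0; // slot that gets overwritten once the buffer is full

    ReplaySketch(int capacity, long seed) {
        this.capacity = capacity;
        this.random = new Random(seed);
    }

    // Store a transition, overwriting the oldest one when the buffer is full.
    void add(Transition t) {
        if (buffer.size() < capacity) {
            buffer.add(t);
        } else {
            buffer.set(next, t);
        }
        next = (next + 1) % capacity;
    }

    int size() {
        return buffer.size();
    }

    // Draw a random minibatch (with replacement) so that training samples
    // are not just the most recent, highly correlated steps.
    List<Transition> sample(int batchSize) {
        List<Transition> batch = new ArrayList<>(batchSize);
        for (int i = 0; i < batchSize; i++) {
            batch.add(buffer.get(random.nextInt(buffer.size())));
        }
        return batch;
    }

    // Q-learning target: reward + gamma * max over actions of Q(nextState, action).
    static double target(double reward, double gamma, double[] nextQValues) {
        double max = Double.NEGATIVE_INFINITY;
        for (double q : nextQValues) {
            max = Math.max(max, q);
        }
        return reward + gamma * max;
    }
}
```

In a continuously trained system, every new market observation would be added to the buffer, and a small random batch would be drawn from it for each training step.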

This will change the course of this blog. Instead of finishing the report on what we did in 2014, I now plan to write about the new system. It turns out that most of the code we have looked at so far is also in the new system, so we can just continue where we left off a year ago.

Data Re-Coding 1

Previously on BigNotesOnPersonalDataScienceTheory: Leonardo, being the great experimentalist that he is, has been collecting Bitcoin price data for weeks, while Sheldono has figured out why price prediction with artificial neural networks should work in theory. Now they need Howardo to cobble the theory and the data together into something that is actually usable in the real world. Meanwhile, it remains unknown what Rajesho is up to.

Howardo’s job is difficult. Leonardo has provided him with endless time series data from the past. Sheldono gave him an extended, patronizing, dismissive and snotty lecture on how easy it is to determine whether or not some limited, rather static data matches some idiosyncratic pattern in the present. And as an output, everyone expects a clear, confident prediction of the future. It seems clearly impossible to get from the input of his friends to the desired output.

Poor Howardo! His slight despair turns into utter panic when he learns that you, of all people in the world, have been assigned the task of helping him sort things out.

You recognize that Howardo has two distinct problems to solve:

  1. Convert the time series data to a format that can be used as an input for the neural network. Like the webcam picture from the previous post, it should have a fixed size and should come as a coherent block of data, not as a continuous data stream.
  2. Turn the pattern matching result into a prediction of the future.

“Wow, I didn’t notice that, thank you (rolls eyes)”, says Howardo, “but how are you going to see the future based on the matching result, without a crystal ball?”

“Well,” you say, “that’s easy. I did this for my boss before.” (Howardo gasps with relief.)

You add: “It was a disaster.” (Howardo hyperventilates.)

Turning the pattern matching result into a prediction of the future

Let’s reconsider: what went wrong with your boss’s trend prediction? The basic idea does not seem to be wrong: you recognize a trend and base your action on the assumption that the trend can be extrapolated.

This works great in daily life and is the reason why we are able to walk without falling over, to recognize when it is a good day to take an umbrella to work, and to avoid snowballs that people throw at us. All without a crystal ball. All learned by the neural network between our ears.

Your boss’s artificial neural network was really good at recognizing a straight line, but failed miserably at predicting the future, because the prediction was based on the wrong assumption that a straight line is a good predictor of future price increases. As it turns out, it is not.

Your boss only considered a chart of a whole trading day, which means that he probably bought and sold his shares just before the closing bell. What happened in the early trading hours is rather irrelevant at this point. The overall trend of the day (if there was any) is replaced by a new trend that is fed by the anticipation of what will happen overnight in other markets.

If your boss had thought this through, the filter would probably not have looked like this:

btcBlog8x8FilterHeatmapUptrend

Instead it would be more like this:

btcBlog8x8FilterHeatmapUptrendOverweightLaterHours

Chances are that with this filter as the intermediate layer weight matrix of the neural network, your boss would have earned some money. But the prediction performance would still be far from great. It would just be a tool to point out what is already obvious. Instead of a bad predictor, we now have a mediocre predictor.

This is the point where intelligent design reaches its limit. In order to make better predictions, the weights in the weight matrix must be learned from real data, and not set by you. You must get the neural network to adapt in response to success and failure, much like you learned to predict whether a snowball’s trajectory ends in your face or not: from failure and success.

“Howardo”, you hear yourself saying, “we must feed the learning algorithm with the first half of the price’s trajectory. We let the neural network predict whether the trajectory leads to a pleasant impact spot or not. When it is wrong, we will punish it.”

Howardo comes back to life. “That sounds like fun”, he says. You are not sure which part of your proposal he is referring to, but it’s probably the punishment part.

For reasons that will become apparent in a later post, you decide to call the partial price trajectory a “feature vector” and the information on whether the resulting price is pleasant or not a “class label”.

With this insight, it becomes easy to define a data format that is suitable for training the neural network:

  • The feature vector is the content of a sliding window that you pull through Leonardo’s historic Bitcoin price data. For example, you could say that for each point in time you read the previous 1000 price samples out of the time series into a feature vector.
  • For the class label you have to peek into the future. Thank goodness, time is relative, and from your point in spacetime the near future of all of Leonardo’s price samples is also in the past. So you decide that for each sample in the time series you compare the price with the price 10 samples later, which corresponds to “10 minutes later”. If the later price is higher, you choose the class label “pleasant”, otherwise the class label “unpleasant”.
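
Before we get to the real code, the two bullets above can be sketched in a few lines of Java. Take this as a rough illustration under assumptions, not as the code from the actual system: the names `WindowSketch`, `Example` and `build` are invented here, the window is assumed to include the current sample, and the prices are assumed to arrive as a plain array with one sample per minute.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; class and method names are hypothetical.
class WindowSketch {
    // One training example: a window of prices plus its class label.
    static final class Example {
        final double[] features; // the last `window` prices, oldest first
        final boolean pleasant;  // true if the price is higher `horizon` samples later

        Example(double[] features, boolean pleasant) {
            this.features = features;
            this.pleasant = pleasant;
        }
    }

    // Slide a window over the price series and label each position
    // by peeking `horizon` samples ahead.
    static List<Example> build(double[] prices, int window, int horizon) {
        List<Example> examples = new ArrayList<>();
        // t is the index of the most recent sample inside the window.
        for (int t = window - 1; t + horizon < prices.length; t++) {
            double[] features = new double[window];
            System.arraycopy(prices, t - window + 1, features, 0, window);
            boolean pleasant = prices[t + horizon] > prices[t];
            examples.add(new Example(features, pleasant));
        }
        return examples;
    }
}
```

With the numbers from the text, the call would be `WindowSketch.build(prices, 1000, 10)`. Note that the last 10 samples of the series yield no examples, because their future is not in the data yet.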

When your neural network is finally fully trained with this data, it will still not be able to look into the future. But it will hopefully be able to classify an aggregation of 1000 consecutive price samples as a member of a class of price trends that, with a certain likelihood, will lead to a higher price 10 minutes later. And that’s all we can hope for.

This brings us to a little terminological oddity. In much of the machine learning literature, “prediction” seems to be used as a synonym for “recognized class”. This leads to funny statements like “the classifier predicts that the picture shows a cat”, made after the classifier has processed the picture of the cat. Better get used to it…

In the next post, we will examine the actual Java code for the data re-coding.

Why Neural Networks work

One popular explanation of the fact that artificial neural networks can do what they can do goes along these lines:

  1. A brain is capable of doing these things.
  2. An artificial neural network is a simulation of a brain.
  3. Therefore an artificial neural network can do these things, too.

Admittedly, most things in computer science that “work” in the sense that they produce useful output for the real world are implementations of theoretical models that have been built by other sciences, so this is kind of a valid explanation. I don’t like it anyway.

My first problem with it is this: it’s not quite true that an artificial neural network (ANN) is a simulation of a brain. To be fair, some come impressively close. But in the context of this blog we unambitiously restrict ourselves to the level of sophistication that we find in most real-world ANNs, which are radical (radical!) simplifications of even the simplest natural neural networks.

Second: it does not help most people to understand why the ANN is capable of doing useful work. Unless you already understand the brain, it won’t help you much when I tell you how we are going to map gray matter to mathematical concepts. (And if you already understand the brain: Welcome to the blog. You can skip the rest of this post if you want.)

I want to come from the other side and approach the topic as an engineering problem. Buckle up. We are going to manufacture a special-purpose classification machine, and then (in a later post) we will generalize it and see if the result has any similarity to what the neurosciences know about the brain.

As a basic motivation, let’s assume that your boss has found a webcam that shows a stock market chart (like this), and came up with this brilliant idea: he will become insanely rich with a new piece of software that reads the chart and outputs some kind of likelihood that the market is in an upward trend. Your boss calls this likelihood “Zuversicht”, and we are going to stick with this term for a while, because we like German words, and because the corresponding English word (“confidence”) already has a certain meaning in statistics, and we want to prevent confusion resulting from ambiguous terminology.

OK, now our input is an image from a webcam, so we have a two-dimensional array of pixel colors. To make it easier, you convert the image to grayscale, so you only have to think about the pixels’ brightness and can ignore hue and saturation. You look at some examples of upward trends and can’t help but observe that the lines tend to start in the lower left corner and zigzag their way to the upper right corner.

btcBlog2x2FilterHeatmapUptrendStrictWithChart

Breaking down the image into quadrants, you notice that in these cases the average pixel brightness in Q2 and Q3 is higher than in Q1 and Q4. With this insight you write the following lines of code and declare your job done.

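// Assumes image[row][column], row 0 at the top; Q2 = upper right, Q3 = lower left quadrant.
static double zuversichtUptrend(double[][] image) {
  int imgHeight = image.length;
  int imgWidth = image[0].length;
  double averageBrightnessQ2 = averageBrightness(image, 0, imgHeight / 2, imgWidth / 2, imgWidth);
  double averageBrightnessQ3 = averageBrightness(image, imgHeight / 2, imgHeight, 0, imgWidth / 2);
  return averageBrightnessQ2 + averageBrightnessQ3;
}

// Mean brightness of the rows [rowFrom, rowTo) and columns [colFrom, colTo).
static double averageBrightness(double[][] image, int rowFrom, int rowTo, int colFrom, int colTo) {
  double sum = 0;
  for (int r = rowFrom; r < rowTo; r++)
    for (int c = colFrom; c < colTo; c++) sum += image[r][c];
  return sum / ((rowTo - rowFrom) * (colTo - colFrom));
}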

It works great on the test data, your boss is happy and his boss gives him a raise. But a few weeks later, he tells you that he’s not happy anymore. He has not become insanely rich!

What went wrong?

Apparently, your program has misclassified the trend on several occasions. So you have a look at the chart images for these days and see two major flaws in your approach:

btcBlogChartWebcamFilterTooCourse

  1. On some days, the chart went almost flat or turned back to negative. The chart was just low enough in the early hours to run through Q3 and just high enough in the later hours to run mostly through Q2.
  2. On other days the chart clearly went down, but your software’s Zuversicht value was very high. The reason turns out to be that the overall brightness of the picture was high on those days, illuminating Q2 and Q3 without the chart line covering much space in them.

So you start the second iteration of your engineering endeavor.

To solve problem 1, you obviously need a higher resolution. Let’s try 8×8! This partition conveniently allows us to identify each field with a chessboard notation.

btcBlog8x8FilterHeatmapUptrendStrict

A perfect upward trend, which your boss defines as a straight line from the lower left to the upper right, will result in the fields A1, B2, …, H8 being lit up while the other fields remain dark. The flat chart from problem 1 will instead light up the fields A4, B4, …, G5, H5. Great, but what about all the other possible charts that show a trend that goes upward in a non-steady, somewhat chaotic fashion? This is, after all, the norm rather than the exception.
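
For the curious, reducing the webcam image to such a chessboard could look roughly like the following sketch. The names `GridSketch`, `fieldAverages` and `fieldName` are invented for this illustration, and the code assumes a grayscale image stored as `image[row][column]` with row 0 at the top and a height and width that are divisible by the grid size.

```java
// Illustrative sketch; names are hypothetical.
class GridSketch {
    // Average brightness per field of an n×n grid laid over the image.
    // Assumes image[row][column], row 0 at the top, and that the image
    // height and width are both divisible by n.
    static double[][] fieldAverages(double[][] image, int n) {
        int fieldHeight = image.length / n;
        int fieldWidth = image[0].length / n;
        double[][] grid = new double[n][n];
        // Sum up the brightness of all pixels that fall into each field.
        for (int r = 0; r < n * fieldHeight; r++) {
            for (int c = 0; c < n * fieldWidth; c++) {
                grid[r / fieldHeight][c / fieldWidth] += image[r][c];
            }
        }
        // Turn the sums into averages.
        for (double[] row : grid) {
            for (int c = 0; c < n; c++) {
                row[c] /= fieldHeight * fieldWidth;
            }
        }
        return grid;
    }

    // Chessboard name of a grid cell: files A, B, ... from the left,
    // rank 1 at the bottom row of the image.
    static String fieldName(int row, int col, int n) {
        return "" + (char) ('A' + col) + (n - row);
    }
}
```

With `n = 8`, the bottom left cell `grid[7][0]` is field A1 and the top right cell `grid[0][7]` is field H8.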

btcBlogChartWebcamFilter8x8Strict

Let’s add some fuzziness to the system. The intuition is like this: for each field, you guess the probability that the full chart shows an overall upward trend if that particular field is lit. For example, if the lower right corner (H1) is lit, the probability of an overall positive trend is zero. If the field to the left of it (G1) is lit, the probability is close to zero, but there is still a possibility that the chart makes a radical upward turn in the remaining 1/8 of the chart. The closer you get to the perfect upward trend, the higher the probability becomes.

btcBlog8x8FilterHeatmapUptrend

You call the resulting 8×8 numbers a “weight matrix”. You can use it as a filter for the actual chart images by doing the following:

  1. For each field of the loaded picture, you multiply the actual average brightness by the corresponding value in the weight matrix. The product will be a high value when both the average brightness and the value in the weight matrix are high. Otherwise it is a low value. You repeat this step for each field, 64 times altogether.
  2. You add up all the products.

The closer the actual chart zigzags around the ideal chart, the higher the sum will be. But even when the actual chart goes astray: if it remains on a positive trajectory, we will get a relatively high result in this calculation.
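
The filter itself is just a multiply-and-add loop. The sketch below shows it together with one possible way to fill the weight matrix. Be aware that the weight formula is purely an assumption for illustration (weights that fall off linearly with the vertical distance from the ideal diagonal); it is not the set of values shown in the heatmap above. The names `FilterSketch`, `diagonalWeights` and `zuversicht` are made up as well.

```java
// Illustrative sketch; the weight formula is an assumption, not the heatmap values.
class FilterSketch {
    // Hypothetical weights for n >= 2: 1 on the ideal diagonal (A1 ... H8),
    // falling linearly to 0 with the vertical distance from that diagonal.
    // Row 0 is the top rank of the board, column 0 is file A.
    static double[][] diagonalWeights(int n) {
        double[][] weights = new double[n][n];
        for (int row = 0; row < n; row++) {
            int rank = n - 1 - row; // 0 for the bottom rank
            for (int col = 0; col < n; col++) {
                weights[row][col] = 1.0 - Math.abs(rank - col) / (double) (n - 1);
            }
        }
        return weights;
    }

    // Steps 1 and 2: multiply each field's brightness by its weight
    // and add up all the products.
    static double zuversicht(double[][] brightness, double[][] weights) {
        double sum = 0;
        for (int r = 0; r < brightness.length; r++) {
            for (int c = 0; c < brightness[r].length; c++) {
                sum += brightness[r][c] * weights[r][c];
            }
        }
        return sum;
    }
}
```

With these made-up weights, H1 gets weight 0 and G1 a weight close to 0, in line with the intuition described above, and a chart that lights up exactly A1 to H8 produces the highest possible sum.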

So far so good. Let’s look at the second problem: the tide lifts all boats and the ceiling light lights up all pixels. When someone turns on the light in the trading room, all pixels in the webcam picture become brighter. Even areas that are not crossed by the line chart seem brighter, which renders our filtering result worthless.

Let’s add a preprocessing step to fix this. If there were no line chart in the picture, all pixels would have approximately the same brightness, and they should all be black (brightness zero). If, in this case, you subtracted the overall average brightness from each field’s measured brightness, the result would be all-black fields. Subtracting the overall average brightness thus normalizes the picture to what is needed for our further processing.

Now add the line chart to your consideration. Because it covers only a very small fraction of the image, it does not change the overall average brightness too much. The light noise that illuminated the dark parts of the image also made the bright parts (the lines of the chart) brighter. So if we subtract the average overall brightness from the bright pixels, we also normalize those parts of the image to what is expected as an input for the next processing step.

Great, now you know what to do to solve problem 2. The question is: how do you do it? Wouldn’t it be great if you could implement both processing steps in a unified way? In other words: is it possible to define a weight matrix in such a way that, when we apply it to the input data, the average brightness subtraction of your preprocessing step is executed? It turns out that it is.

Imagine the following weight matrix for field A1:

  • Value at position A1: 1-1/64
  • Value at all other positions: -1/64

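Please convince yourself that this matrix performs the average subtraction for position A1: multiplying each field by its weight and summing up gives the brightness of A1 minus 1/64 of the sum of all 64 fields, which is exactly A1 minus the overall average. Of course, this works just as well for all other positions.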
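
If you would rather let the computer do the convincing, here is a small sketch that stacks the per-field weight matrices into one big matrix and applies it to a flattened input. The names `NormalizationSketch`, `centeringMatrix` and `apply` are invented for this example; for the 8×8 case described here you would call `centeringMatrix(64)`.

```java
// Illustrative sketch; names are hypothetical.
class NormalizationSketch {
    // n×n matrix with 1 - 1/n on the diagonal and -1/n everywhere else.
    // Row i is the (flattened) weight matrix for field i.
    static double[][] centeringMatrix(int n) {
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                m[i][j] = (i == j ? 1.0 : 0.0) - 1.0 / n;
            }
        }
        return m;
    }

    // Plain matrix-vector product: result[i] = sum over j of m[i][j] * x[j].
    static double[] apply(double[][] m, double[] x) {
        double[] result = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < x.length; j++) {
                result[i] += m[i][j] * x[j];
            }
        }
        return result;
    }
}
```

Applying the matrix to any input vector yields the same numbers as subtracting the vector’s mean from each element, so the output always averages to zero (up to rounding).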

btcBlogChartWebcamFilter8x8PixelNormalization

Hmmm, interesting: you just solved two seemingly unrelated problems with the same approach. It feels a little odd to define a huge matrix for a calculation that could easily be done procedurally, but you have a feeling that there might be a systematic advantage in tackling the problems in this project in a unified way. Also, of course, you know that vector (and with it matrix) calculations are the strong point of GPU data processing as well as of highly optimized software packages like Matlab (“Matrix Lab”!) and Octave. You feel that after your initial success your boss might become greedy, which will ultimately put more load on your software. Having some strong performance afterburners like these in your arsenal might come in handy later.

Your overall process has three steps now:

  1. You create an 8×8 matrix from the image data as the input data layer. (To facilitate vector operations, you “flatten” this matrix to a vector of length 64, but that’s an implementation detail.)
  2. For each field you apply the corresponding 8×8 preprocessing weight matrix to the whole 8×8 input layer matrix. The result is a new 8×8 matrix, which you call the “hidden layer”. (In the real world, you would do this with “flattened” vectors and a large 64×64 weight matrix representing all fields. This is mathematically equivalent and can be parallelized well. Again: just an implementation detail.)
  3. You apply the classification weight matrix to the hidden layer and get the Zuversicht value as output.

There you go: without thinking much about neurons, synapses and ganglia, you have handcrafted your first artificial neural network. Your new software is actually what people call a Feedforward Neural Network with a linear activation function. When you define a threshold value for the Zuversicht output, you also have a binary linear classifier.
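
Put together, the three steps and the threshold fit into a handful of small methods. The following is a minimal sketch under assumptions, not production code: the names `ForwardSketch`, `flatten`, `hiddenLayer`, `output` and `isUptrend` are invented, the weights are passed in by the caller, and the threshold value is whatever you choose it to be.

```java
// Illustrative sketch of the handcrafted network; names are hypothetical.
class ForwardSketch {
    // Step 1: flatten the grid row by row into a single input vector.
    static double[] flatten(double[][] grid) {
        int cols = grid[0].length;
        double[] x = new double[grid.length * cols];
        for (int r = 0; r < grid.length; r++) {
            System.arraycopy(grid[r], 0, x, r * cols, cols);
        }
        return x;
    }

    // Step 2: hidden layer = preprocessing weight matrix times input vector.
    // There is no nonlinearity, i.e. the activation function is linear.
    static double[] hiddenLayer(double[][] preprocessingWeights, double[] input) {
        double[] hidden = new double[preprocessingWeights.length];
        for (int i = 0; i < preprocessingWeights.length; i++) {
            for (int j = 0; j < input.length; j++) {
                hidden[i] += preprocessingWeights[i][j] * input[j];
            }
        }
        return hidden;
    }

    // Step 3: Zuversicht = sum of classification weight times hidden value.
    static double output(double[] classificationWeights, double[] hidden) {
        double sum = 0;
        for (int i = 0; i < hidden.length; i++) {
            sum += classificationWeights[i] * hidden[i];
        }
        return sum;
    }

    // A threshold on the output turns the network into a binary classifier.
    static boolean isUptrend(double zuversicht, double threshold) {
        return zuversicht > threshold;
    }
}
```

For the 8×8 case, the preprocessing weights form a 64×64 matrix and the classification weights are the flattened 8×8 weight matrix from the filtering step.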

Your neural network is still far from perfect. You will eventually get there, but not today. Let’s just mention a few things that you would need to think about before going into production:

  • It is not able to learn! It works because you were able to provide a “model” (that is, the weights in the weight matrices). This is good enough for now, but in the future we would prefer to let the computer do the work of figuring out the model data.
  • It is not well protected from eccentric input data. Imagine what happens if a camera error or a data transfer problem produces a value of 325212498434 for a single pixel instead of a value in the expected range between 0 and 1.
  • It will still fail to make your boss immeasurably rich, because it does not predict anything. It only classifies a chart as being close enough to your boss’s definition of a perfect chart. This is what he wanted, so it is partly his fault. But we can nevertheless do better.

Even with these shortcomings, you have hopefully built up some understanding of how a neural network is able to recognize a pattern. We have seen that, even without actively imitating nature, we get to a similar result when we just work our way to the best solution in a straightforward manner.

A little heads-up: In the next post, we will build the software to convert the collected Bitcoin price and market data to a format suited as an input data layer for a neural network like this. If your data collector from the previous post is not running yet, please start it soon to have some data to play with next time.