Getting started

Unfortunately, every data science project starts with the somewhat tedious task of data acquisition and organization.

To accomplish anything at all, the first thing you’ll need is training data. So before anything else, you want to start collecting a lot of it.

Now, in a professional setting, you want to take some time and think about the volume and structure of your input and output data. Therefore, I don’t recommend doing at work what we are about to do next.

We postpone the careful thinking for now because we want to get things moving, and we are pretty sure that, whatever the result of our (later) deep thinking might be, it will contain exchange rate ticker quotes for Bitcoin. Deliberating on this just a little further, we convince ourselves that other ticker quotes (currency exchange rates, stock prices, economic indicators) might also be useful for predicting the Bitcoin price, and that there could be some data points related to the quotes that may give us a little statistical advantage, too.

While musing about that, it occurs to us that, for testing purposes, we might also need a mechanism to easily generate random quotes when no data source is available. And for assessing the quality of the training results, we might later need a mechanism to replay historical quotes, so that we can run old data against an updated neural network and find out how well it would have performed during a certain time interval.

All these considerations lead us to our first little class diagram:


We see an interface for some sort of adapter (QuoteReader) with none of our favorite design patterns incorporated, and not even a data representation class around. I realize that this is scary. Get used to it, because this is not an accident. We will do a lot of number crunching and, believe it or not, in this context it is GOOD practice to sacrifice beauty and object orientation on the altar of performance. The base rhythm of our architecture will be to avoid object creation in critical areas whenever possible. We will use arrays instead of collections, unless external libraries require collections. We will use primitive data types whenever possible. It will feel a lot like 1985, with one positive side effect: we will be quite happy about these decisions when we try to communicate with the GPU later.
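To make the "arrays and primitives" idea a bit more concrete, here is a minimal sketch of a fixed-size window of quotes backed by a plain double[]. The class QuoteWindow and its methods are hypothetical and not part of the project; they only illustrate how a hot path can run without creating objects after construction.

```java
// Hypothetical helper, not part of the original project: a fixed-size
// ring buffer of quotes backed by a primitive double[].
final class QuoteWindow {
    private final double[] values; // allocated once, then reused
    private int next;              // index of the slot to overwrite next
    private int size;              // number of valid entries, at most values.length

    QuoteWindow(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        values = new double[capacity];
    }

    /** Stores a quote, overwriting the oldest one once the window is full. */
    void add(double quote) {
        values[next] = quote;
        next = (next + 1) % values.length;
        if (size < values.length) {
            size++;
        }
    }

    int size() {
        return size;
    }

    /** Arithmetic mean of the stored quotes; NaN while the window is empty. */
    double mean() {
        if (size == 0) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (int i = 0; i < size; i++) {
            sum += values[i];
        }
        return sum / size;
    }
}
```

A List<Double> would do the same job, but it stores boxed Double objects, so adding quotes in a tight loop can allocate. The array version allocates exactly once, and a flat double[] is also a layout that is easy to copy elsewhere in one piece.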

With this said, and the aesthetically minded among you properly scared, we move on to take a closer look at the interface:

package de.hsec.datascience.btctrader;

/**
 * Interface for Adapter classes to ticker information sources.
 * @author helmut
 */
public interface QuoteReader {
    /**
     * Returns the current ticker value. Either a price fixed
     * by a market maker or the last trading price.
     */
    public double getCurrentQuote();

    /** Returns the highest bid price in the order book. */
    public double getBid();

    /** Returns the lowest ask price in the order book. */
    public double getAsk();

    /** Returns the lowest price of the last 24 hours. */
    public double getMin24();

    /** Returns the highest price of the last 24 hours. */
    public double getMax24();

    /** Returns the trade volume of the observed exchange. */
    public double getVolume24();

    /** Returns the volume-weighted average price. */
    public double getVwap();
}



OK, so we can use an instance of such a QuoteReader to access what seems to be market data from some exchange. The accessor methods come without a timestamp or index parameter, so we (correctly) assume that they return current data. We’ll have to discuss our working definition of the word “current” later.
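To illustrate the two testing mechanisms mentioned earlier (random quotes and replay of historical quotes), here is a rough sketch of how both could sit behind this interface. Everything except the seven method names is an assumption of mine: the class names RandomQuoteReader and ReplayQuoteReader, the tick() and next() methods, the fixed spread, and the column layout of the recorded rows are made up for illustration and are not the project's actual implementations.

```java
import java.util.Random;

// Redeclared here (package-private, Javadoc omitted) only to keep the sketch self-contained.
interface QuoteReader {
    double getCurrentQuote();
    double getBid();
    double getAsk();
    double getMin24();
    double getMax24();
    double getVolume24();
    double getVwap();
}

/** Hypothetical random source: a seeded random walk around a start price. */
final class RandomQuoteReader implements QuoteReader {
    private final Random random;
    private double quote;
    private double min;
    private double max;

    RandomQuoteReader(double startPrice, long seed) {
        random = new Random(seed);
        quote = min = max = startPrice;
    }

    /** Moves the price by at most 0.5 percent up or down; call once per simulated tick. */
    void tick() {
        double change = (random.nextDouble() - 0.5) * 0.01;
        quote *= 1.0 + change;
        min = Math.min(min, quote);
        max = Math.max(max, quote);
    }

    public double getCurrentQuote() { return quote; }
    public double getBid() { return quote * 0.999; }   // assumed fixed spread
    public double getAsk() { return quote * 1.001; }   // assumed fixed spread
    public double getMin24() { return min; }           // minimum since construction, not a real 24h window
    public double getMax24() { return max; }           // maximum since construction, not a real 24h window
    public double getVolume24() { return 0.0; }        // no trades are simulated
    public double getVwap() { return quote; }          // placeholder, no volume to weight by
}

/** Hypothetical replay source: serves recorded rows from a primitive array. */
final class ReplayQuoteReader implements QuoteReader {
    // Column layout of one recorded row (an assumption of this sketch).
    private static final int QUOTE = 0, BID = 1, ASK = 2, MIN24 = 3,
                             MAX24 = 4, VOLUME24 = 5, VWAP = 6;
    private final double[][] rows;
    private int cursor;

    ReplayQuoteReader(double[][] rows) {
        if (rows.length == 0) {
            throw new IllegalArgumentException("no recorded quotes");
        }
        this.rows = rows;
    }

    /** Advances to the next recorded row; returns false once the history is exhausted. */
    boolean next() {
        if (cursor + 1 >= rows.length) {
            return false;
        }
        cursor++;
        return true;
    }

    public double getCurrentQuote() { return rows[cursor][QUOTE]; }
    public double getBid() { return rows[cursor][BID]; }
    public double getAsk() { return rows[cursor][ASK]; }
    public double getMin24() { return rows[cursor][MIN24]; }
    public double getMax24() { return rows[cursor][MAX24]; }
    public double getVolume24() { return rows[cursor][VOLUME24]; }
    public double getVwap() { return rows[cursor][VWAP]; }
}
```

In both sketches, "current" simply means "whatever the last call to tick() or next() selected". Code that only sees a QuoteReader cannot tell whether it is talking to a live exchange, a random walk, or last year's data, which is exactly what we want for testing.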

In the next post, we’ll take a closer look at the implementations, especially the BitstampQuoteReader.


Predicting Bitcoin Prices

In this initial blog series, I am going to report on an automated Bitcoin trading system that I built in 2014 and successfully operated during 2015.

The decision making component in this trading system incorporates machine learning methods: mainly a neural network and – in a data preparation step – principal component analysis (PCA).

The code was written in Java and Matlab. It is not always pretty, so when reading through it, please keep in mind that this started as a hobby project.

Some of the code I cannot publish, which I will explain when I come to it. But I will point out how to fill in the gaps.

I would like to encourage people to rebuild the system, use it to try out their own ideas, and share them with the rest of us. I also want to point out that, while Bitcoin trading is a good place to start, it is certainly not the only area where these methods are applicable.

Why is Bitcoin a good place to start? Because of an excellent technological infrastructure and immediate financial rewards, to name a few reasons. Also, Bitcoin is cool, which for me has some value of its own.

In the 12 months of operation, the system initiated roughly 11,000 transactions on Bitstamp, a Bitcoin exchange which, among other things, allows trading Bitcoin against fiat currency (USD). The system yielded a gross return a little above 26%. After transaction fees, a pre-tax return of around 20% remained. The result after taxes is a wholly different story, which we will talk about in a later post.

Now, a buy-and-hold strategy would have given me the same return over this time interval, with even fewer transaction fees. But I could not have known that at the beginning of the year.

The approach of the trading system is obviously completely different. It tries to predict small movements in the near future (a few minutes) based on observed market activity, news, economic data and a few other factors. In essence, it exploits the price’s volatility. The beauty of this is that it works almost as well when the overall direction is southwards.

During the first months of the year, while taking its first clumsy, inexperienced trading steps, the system recorded the input data and added it to a steadily growing body of training data. The neural network was trained and retrained several times, each time with more input data. The results turned out increasingly better. From January to April, the trading yielded net negative results while the overall market went sideways. After that, the results were positive, even during a severe market decline in November. The last training took place in May. Due to memory constraints (and because the training time had passed 24 hours), training with more data would have required a different approach. Since the results were already satisfactory, I decided to stick with what I had. So that is where we are now: with quite some room for improvement.

In the next few posts, I will very briefly lay out the theoretical foundation of the project before we take a closer look at the code.