Building the Reinforcement Learning Framework

To build our reinforcement learning framework, we are going to follow the basic recipe laid out in the February 2015 Nature article “Human-level control through deep reinforcement learning” (http://dx.doi.org/10.1038/nature14236).

Reinforcement learning has been shown to reach human and superhuman performance in board games and video games. Transferring these methods and experiences to the use case of trading goods or securities seems promising, because trading shares many of the same characteristics:

  • interaction with an environment that represents certain aspects of the real world,
  • a limited set of actions to interact with this environment,
  • a well-defined success measure (called “reward”),
  • past actions determine the future rewards,
  • a finite, semi-structured definition of the state of the environment,
  • the infeasibility of directly determining the future outcome of an action, due to a prohibitively large decision tree, incomplete information and missing knowledge about the interactions between the influencing factors.

Our inference engine is going to be Deeplearning4J (DL4J, see https://deeplearning4j.org/). The DL4J website contains a very brief and well-written introduction to reinforcement learning, which I highly recommend if you are not yet familiar with the concept: https://deeplearning4j.org/reinforcementlearning.

The first step in implementing an RL framework for Bitcoin trading is to map the conceptual elements of the process to our use case:

  • Action
  • Actor / Agent
  • Environment
  • State
  • Reward
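
Before looking at the individual elements, here is a rough sketch of how this mapping could be expressed in Java. The interface and method names are purely illustrative assumptions of mine; they are not part of DL4J or RL4J:

    /**
     * Hypothetical sketch of how the five concepts could map onto Java types.
     * The names are illustrative only; they are not part of DL4J or RL4J.
     */
    public interface TradingMdp<STATE, ACTION> {

        /** Environment: start a new episode and return its initial state. */
        STATE reset();

        /** Action: the operations the actor may choose from in a given state. */
        java.util.List<ACTION> availableActions(STATE state);

        /** Actor/agent: apply the chosen action and return the follow-up state. */
        STATE apply(STATE state, ACTION action);

        /** Reward: the success measure for moving from one state to the next. */
        double reward(STATE before, STATE after);

        /** True once an episode (e.g. one simulated trading period) is over. */
        boolean isDone(STATE state);
    }

Keeping state and action as type parameters should let us swap in richer representations later without touching the learning loop.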

Action

An action is a distinct operation with a direct impact on the state of the actor and the environment. In the case of game playing, placing a tile on a specific field on the board or moving a joystick in a certain direction are examples of actions. In the case of Bitcoin trading, the obvious actions are placing and cancelling orders to buy or sell certain amounts of Bitcoin at a given cryptocurrency exchange.

A smaller set of actions improves the learning speed, so we will restrict our action set to just three possible actions for now:

  1. Cancel all open sell orders and place a buy order at the last market price using 10% of the available USD assets.
  2. Cancel all open buy orders and place a sell order at the last market price using 10% of the available Bitcoin assets.
  3. Hold (do nothing).
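
As a small sketch, these three actions could simply be modelled as a Java enum; the constant names below are my own choice and purely illustrative:

    /** The three actions of the first version; the names are illustrative. */
    public enum TradingAction {
        /** Cancel all open sell orders, buy at the last price with 10% of the USD assets. */
        BUY_10_PERCENT_USD,
        /** Cancel all open buy orders, sell at the last price with 10% of the BTC assets. */
        SELL_10_PERCENT_BTC,
        /** Do nothing. */
        HOLD
    }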

In a later version we will likely extend this to include

  • cancelling and placing orders as distinct actions,
  • a larger variety of amounts (other than “10% of available assets”) for buy and sell orders,
  • different limit prices above and below the last market price.

But to gain experience, and to go easy on our computational resources, we are going to keep it simple for now.

Actor

The actor is our trading bot, which uses the Bitstamp API to place and cancel orders. We are going to reuse existing Java code from the old trading system for this.
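
As a sketch, the actor then becomes a thin dispatcher that translates the chosen action into calls against the exchange wrapper. BitstampClient and its methods below are placeholders standing in for that existing code, not a real Bitstamp API binding:

    import java.util.Objects;

    /**
     * Sketch of the actor: it translates the chosen action into calls against
     * the exchange. BitstampClient is a placeholder for the existing Java code
     * of the old trading system, not an actual Bitstamp API binding.
     */
    public class TradingActor {

        /** Placeholder for the existing exchange wrapper from the old system. */
        public interface BitstampClient {
            void cancelOpenSellOrders();
            void cancelOpenBuyOrders();
            void placeBuyOrder(double usdAmount, double price);
            void placeSellOrder(double btcAmount, double price);
            double lastMarketPrice();
            double availableUsd();
            double availableBtc();
        }

        private static final double FRACTION = 0.10; // “10% of the available assets”

        private final BitstampClient exchange;

        public TradingActor(BitstampClient exchange) {
            this.exchange = Objects.requireNonNull(exchange);
        }

        /** Execute one of the three actions from the list above. */
        public void execute(TradingAction action) {
            double price = exchange.lastMarketPrice();
            switch (action) {
                case BUY_10_PERCENT_USD:
                    exchange.cancelOpenSellOrders();
                    exchange.placeBuyOrder(exchange.availableUsd() * FRACTION, price);
                    break;
                case SELL_10_PERCENT_BTC:
                    exchange.cancelOpenBuyOrders();
                    exchange.placeSellOrder(exchange.availableBtc() * FRACTION, price);
                    break;
                case HOLD:
                default:
                    // do nothing
                    break;
            }
        }
    }

    // Same enum as in the Action section, repeated so this sketch compiles on its own.
    enum TradingAction { BUY_10_PERCENT_USD, SELL_10_PERCENT_BTC, HOLD }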

Environment

Since we don’t want to reinvent the data collection, and we have already collected several years’ worth of training data, the environment is given by all the data sources that we defined in the old trading system (https://notesonpersonaldatascience.wordpress.com/2016/03/06/wrapping-up-data-collection/).

State

The current state of the environment is the input for the inference machine. We can reuse the format that we used for the old Bitcoin prediction system (https://notesonpersonaldatascience.wordpress.com/2016/03/21/data-re-coding-1/). It has some issues that we might address later, but we don’t want to reinvent the wheel, so we will stick with it for now.
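
Whatever the exact re-coding looks like (see the linked post), the state eventually has to reach DL4J as a flat numeric vector. A minimal sketch of that contract, with an interface name and method of my own choosing (only Nd4j.create() is actual ND4J API):

    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    /**
     * Sketch of the state contract: however the re-coded features are produced,
     * the inference engine only sees a flat numeric vector. The interface name
     * and methods are illustrative, not part of DL4J.
     */
    public interface MarketState {

        /** The re-coded features of the current environment state. */
        double[] toFeatureVector();

        /** Convenience wrapper turning the features into an ND4J row vector. */
        default INDArray toINDArray() {
            return Nd4j.create(toFeatureVector());
        }
    }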

Reward

We have two possible ways to define the reward:

  • After each executed sell order: the difference between the sell price and the average price previously paid for the sold Bitcoins, minus transaction costs.
  • In each step: the difference between the current net value (USD + BTC) and the net value of the previous step.

The first option fits the game analogy better and also takes advantage of one of the key features of reinforcement learning: assessing the future outcomes of current actions. The second option, however, promises faster convergence, so we choose it to begin with.
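
With the second option, the reward of a step is simply the change of the total portfolio value, marked to the last market price. A minimal sketch, with illustrative names of my own:

    /** Sketch of the per-step reward: change of the net portfolio value in USD. */
    public final class RewardCalculator {

        private double previousNetValue = Double.NaN;

        /**
         * @param usdBalance current USD balance
         * @param btcBalance current BTC balance
         * @param lastPrice  last market price in USD per BTC
         * @return the reward for this step (0 on the very first call,
         *         because there is no previous net value yet)
         */
        public double step(double usdBalance, double btcBalance, double lastPrice) {
            double netValue = usdBalance + btcBalance * lastPrice;
            double reward = Double.isNaN(previousNetValue) ? 0.0 : netValue - previousNetValue;
            previousNetValue = netValue;
            return reward;
        }
    }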

 
