Challenges of Backtesting Alternative Data
Paul Humphrey, CEO, BMLL

We contributed this article to The Trade News, October 2020

London, 29 Oct 2020

Even though half of hedge fund managers are now using alternative data to gain a competitive edge, 77% of market leaders (with more than USD5 billion in assets) find that backtesting of alternative data poses the biggest challenge, according to a recent report by the Alternative Investment Management Association. “The universe of alternative data sets is expanding so quickly that many of them are not going back in time far enough for models to reveal patterns or capture signals,” the report goes on to explain. 

Alpha generation is key for hedge funds, which are turning to alternative data sources to uncover predictive signals. Yet the ability to derive meaningful insights from new alt data, such as Level 3 data, requires a dedicated, scalable environment and compute power. This is especially pertinent given that five years of Level 3 data history has only recently become more widely accessible for the first time. To date, only a handful of high frequency trading firms have had the infrastructure to curate this level of data and generate analytics and insight from full-depth order book data.

Typically, hedge funds and asset managers have some in-house data engineering capabilities. Because they operate under resource constraints, however, there is a clear need for access to data and analytics that break the so-called ‘80/20 rule’, whereby 80% of a data scientist’s valuable time is spent simply finding, cleansing, and organising data, leaving only 20% to actually perform analysis.

Unlocking the potential

Ideally, these Level 3 data sets enable participants to understand how markets behave and to position themselves to maximise their alpha from this understanding. To get there, however, it is first necessary to unlock the full predictive power of pricing data over a time horizon long enough to capture a wide spectrum of market scenarios. Quants and data scientists need at least five years of history to backtest at the depth and granularity required to analyse historical performance in a statistically robust manner and to support decision-making. They use it to derive features and data sets that help in their alpha discovery process, or feed this data into smart order routers.

But they need clean data. To obtain this sophisticated data elsewhere, firms would have to go to every single exchange, take the data set and deal with each venue’s disparate formats, intricacies and differences. This is a clear challenge: if you have a trading strategy that works perfectly well on Nasdaq, ideally you want to swap it over and see how it behaves on NYSE with a simple code change, rather than spending additional time digging through yet more data sets.

5 years’ worth

Quants and data scientists can now access five years of Level 3 data through our data and analytics platform, for use in their alpha generation workstreams. Because Level 3 is every single order on a limit order book (as opposed to Level 2, which is orders aggregated at a particular price level), a researcher can see not only the price level at which orders are being submitted, but also individual orders and their places in the queue. This allows users to measure the impact of their orders on the order book, as well as derive metrics such as the average resting time of an order or the fill probability at a particular price level of the order book. Having this visibility into the workings of the market in a ready-to-use form improves a quant’s efficiency by removing the need to spend 80% of their time cleaning data.
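
To make the Level 2 versus Level 3 distinction concrete, here is a minimal Python sketch. It assumes a toy snapshot of resting buy orders listed in arrival order and a simple price-time priority queue; the field names (order_id, price, size) are purely illustrative and are not taken from any venue specification or from BMLL's platform.

```python
from collections import defaultdict

# Hypothetical Level 3 snapshot: resting buy orders in arrival order.
# Field names are illustrative assumptions, not a real venue or vendor schema.
level3_bids = [
    {"order_id": "A", "price": 100.0, "size": 200},
    {"order_id": "B", "price": 100.0, "size": 500},
    {"order_id": "C", "price": 99.5, "size": 300},
    {"order_id": "D", "price": 100.0, "size": 100},
]


def to_level2(orders):
    """Aggregate individual orders into total size per price level."""
    depth = defaultdict(int)
    for order in orders:
        depth[order["price"]] += order["size"]
    # Best bid first: highest price at the top of the book.
    return sorted(depth.items(), reverse=True)


def queue_position(orders, order_id):
    """Return (orders ahead, size ahead) at the same price level.

    Assumes price-time priority, i.e. earlier orders at a price are filled first.
    """
    target = next(o for o in orders if o["order_id"] == order_id)
    ahead = []
    for order in orders:
        if order["order_id"] == order_id:
            break
        if order["price"] == target["price"]:
            ahead.append(order)
    return len(ahead), sum(o["size"] for o in ahead)


level2_bids = to_level2(level3_bids)         # what a Level 2 feed would show
position_d = queue_position(level3_bids, "D")  # only recoverable from Level 3
```

The Level 2 view collapses orders A, B and D into a single quantity at one price, while the Level 3 view still shows that D sits behind two other orders at that price.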

BMLL takes 5 years’ worth of raw Level 3 data from venues across the US and harmonises this data as opposed to normalising it. This approach is taken so that no attributes of the data are lost - nothing has been removed, meaning the intrinsic value that comes with the data remains. This is provided in an easy, consistent manner, delivered ‘off the shelf’ using a series of APIs and via a public cloud, which is the only feasible way to provide cost-effective scalability for that amount of data. 
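
The article does not describe how the harmonisation is implemented, so the following Python sketch is only one possible reading of "harmonising as opposed to normalising": common fields are mapped to shared names, while every venue-specific attribute is carried along rather than dropped. The venue names, message layouts and field maps are all hypothetical.

```python
# Hypothetical raw messages from two venues that name their fields differently.
venue_a_msg = {"ts": 1603965600000000000, "px": 100.0, "qty": 200,
               "side": "B", "mpid": "XYZ"}
venue_b_msg = {"timestamp_ns": 1603965600000000500, "price": 100.5,
               "shares": 300, "buy_sell": "S", "display": "Y"}

# Assumed per-venue mapping from raw field names to shared field names.
FIELD_MAPS = {
    "venue_a": {"ts": "timestamp_ns", "px": "price", "qty": "size",
                "side": "side"},
    "venue_b": {"timestamp_ns": "timestamp_ns", "price": "price",
                "shares": "size", "buy_sell": "side"},
}


def harmonise(venue, msg):
    """Rename shared fields and keep all other attributes untouched."""
    mapping = FIELD_MAPS[venue]
    record = {"venue": venue}
    extras = {}
    for key, value in msg.items():
        if key in mapping:
            record[mapping[key]] = value
        else:
            extras[key] = value  # venue-specific attribute is preserved
    record["venue_attributes"] = extras
    return record


def notional(record):
    """Venue-agnostic analytic: works on any harmonised record."""
    return record["price"] * record["size"]


record_a = harmonise("venue_a", venue_a_msg)
record_b = harmonise("venue_b", venue_b_msg)
```

Under this assumption, the same analytic runs unchanged on records from either venue, and attributes that exist on only one venue remain available under venue_attributes.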

BMLL’s 5 years of harmonised order book data can be accessed through BMLL’s Data Lab, a Python data science platform, with the ability to process that data at scale and draw inferences by drilling down into every single message. The same data can also be delivered via API or FTP, which means that metrics, such as the average resting time of an order or the fill probability at a particular level of the order book, can easily be consumed straight into a proprietary system.
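
As a rough illustration of how metrics like average resting time and fill probability can be derived from individual order messages, here is a small Python sketch. It assumes a simplified, time-ordered event stream in which each order is added once and then either fully filled or cancelled; the event layout is hypothetical and does not represent the Data Lab API or any delivery format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical order lifecycle events, sorted by time:
# (order_id, event, timestamp_seconds, price)
events = [
    ("A", "add", 0.0, 100.0),
    ("B", "add", 1.0, 100.0),
    ("A", "fill", 4.0, 100.0),
    ("C", "add", 5.0, 99.5),
    ("C", "fill", 6.0, 99.5),
    ("B", "cancel", 7.0, 100.0),
]


def order_outcomes(events):
    """Pair each add with its terminal event (fill or cancel)."""
    added = {}
    outcomes = []
    for order_id, event, ts, price in events:
        if event == "add":
            added[order_id] = ts
        elif event in ("fill", "cancel") and order_id in added:
            outcomes.append({
                "price": price,
                "resting": ts - added.pop(order_id),
                "filled": event == "fill",
            })
    return outcomes


def metrics_by_price(outcomes):
    """Average resting time and share of orders filled, per price level."""
    grouped = defaultdict(list)
    for outcome in outcomes:
        grouped[outcome["price"]].append(outcome)
    return {
        price: {
            "avg_resting_time": mean(o["resting"] for o in group),
            "fill_probability": sum(o["filled"] for o in group) / len(group),
        }
        for price, group in grouped.items()
    }


metrics = metrics_by_price(order_outcomes(events))
```

A production calculation would also need to handle partial fills, order modifications and orders still resting at the end of the sample, none of which this sketch attempts.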

80% focus on discovering new alpha

Level 3 data sets are large and complex, and every venue records data in different ways, following different standards and formats. Many firms don’t have the computing infrastructure to derive meaningful analytics from the data, especially across as much as 5 years’ worth.

BMLL has done the heavy lifting for hedge funds and asset managers by bringing together full-depth, historic order book data, scalable, cloud-based compute power and unrivalled data science to deliver unique analytics in a cost-effective manner. In effect, BMLL has reversed the conventional 80/20 rule, freeing up time for quants by taking on the technological lift of harmonising the data so they can focus on unprecedented opportunities instead.