The OpenSAFELY collaborative: a unique EHR data platform built for Covid-19 research

I joined TPP as a developer almost 15 years ago, after finishing a PhD in mathematics. During that time I’ve been pretty heavily involved in research with electronic health records; I’ve worked with data from almost every NHS system supplier, every clinical terminology, and every drug database. This has proved to be very useful as we have scaled up our own research capacity, notably working with researchers from Oxford University and London School of Hygiene and Tropical Medicine (LSHTM) since the beginning of the Covid-19 pandemic.

We’d previously done some work with Ben Goldacre’s team at the University of Oxford’s EBM DataLab. His team had found a significant NHS money saver for certain types of prescription. We were able to make front-end changes to our GP system to help realise this cost-saving. It was a really good “closing the loop” success story and we were all keen to do some further work. In early March we bumped into Ben again at a conference and ended up plotting a joint research project to analyse clinical decision support in EHRs. We bounced some ideas around, agreed to have a call as soon as we could, and left it at that.

About a week after that encounter with Ben, late on a Friday afternoon, we had a call with him. There is no doubt that Covid-19 was looming large at that time. The acceleration towards daily briefings, stronger measures and eventually lockdown was extraordinary. We did talk about the clinical decision support project, but the conversation turned quickly to a Covid research initiative. The question was essentially this: Could we jointly stand up a new, highly secure, massive, linked, de-identified EHR data platform for Covid research and could we do this very, very quickly? The subsequent conversation with the other directors at TPP happened within 5 minutes and lasted about 30 seconds. It was clear that this was a crucial thing to try and do. On both the academic and industry sides we committed to doing this without any funding – we simply didn’t have the time to wait.

The culture at TPP is really just to get on and do things. It’s definitely the same at the DataLab. Within the hour, we’d had two calls from Seb Bacon. Seb is lead software developer at DataLab and a seriously good engineer. It now feels like we know each other quite well, even though we’ve still never actually met. By the time Monday came around there was an infrastructure outline, a technical plan, and two small teams of developers at TPP and DataLab ready to go. At TPP, we’d also now been linked into Prof Liam Smeeth and the team at the LSHTM. Liam and Ben had put the plan and the ambition together from the outset.

What happened next is almost a blur now. Between Zoom calls, Slack channels, Google Docs, Hangouts, and GitHub, a project started to come together. There were some long days, late nights, and weekends, for sure. We shipped a lot of code over the next few weeks. Enormous kudos goes to both Jonny and Sam – two coders at TPP – for their role in making this happen. An extra mention goes to Jonny, who has simultaneously dealt with lockdown, work, and new twins! (Read Jonny’s thoughts on the project here).

Running in parallel with technical developments was a significant workstream on information governance and ethics. Again, this was a collaborative effort between the teams, but it was coordinated and delivered by NHS England and NHSX. The project would have gone nowhere without their hard work and support. So, only a few weeks after the first phone call, we got permission to build the research platform. The permission enabled us to link data from primary care, hospitals, emergency departments, intensive care, and laboratories, with death certificates. We’ve also had the ongoing opportunity to revisit these permissions if and when the research teams have needed more data sets as the Covid situation has evolved. Again, there was more goodwill and hard work from organisations across the country to make this actually happen, in particular, from the analytic teams at ONS, ICNARC, and NHS England.

The result of all this work is the OpenSAFELY platform. It’s a linked, de-identified EHR data platform for over 24 million people in England. It was created under the government’s emergency Control of Patient Information (COPI) regulations for the purposes of Covid research. It’s not just a new platform, though; it’s a completely new way of doing EHR research. We haven’t needed to move huge extracts of EHR data around to other research environments. The data stays where it is, inside the secure data centre where the patient records are already stored. The analytics comes to the data. No data can ever leave the existing EHR infrastructure, only the research results. We’ve gone further than this too. The epidemiology teams can only access the aggregate, low-dimensional data they need to do their research. The high-dimensional, event-level research data remains hidden from them. This forms a highly secure layer of privacy control, without sacrificing the detail in the underlying data sets. It’s a new type of trusted research environment, with a bespoke build purely for high-powered EHR research.
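To make the “analytics comes to the data” idea a bit more concrete, here is a minimal Python sketch. It is not the OpenSAFELY code, just an illustration under stated assumptions: the `PatientEvent` record, the `run_inside_secure_environment` function, and the `min_cell_size` threshold are all hypothetical names and choices invented for this example. The point it tries to show is that the researcher's analysis runs next to the event-level records, and only an aggregate result is handed back.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class PatientEvent:
    """Hypothetical event-level record; in this sketch it never leaves the secure side."""
    patient_id: int
    age_band: str
    outcome: str


def run_inside_secure_environment(
    events: Iterable[PatientEvent],
    analysis: Callable[[Iterable[PatientEvent]], Dict[str, int]],
    min_cell_size: int = 5,
) -> Dict[str, int]:
    """Run a researcher-supplied analysis beside the data and return only aggregates.

    min_cell_size is an illustrative assumption, not something stated in the
    article: groups with fewer members than this are dropped from the output.
    """
    counts = analysis(events)
    return {group: n for group, n in counts.items() if n >= min_cell_size}


def deaths_by_age_band(events: Iterable[PatientEvent]) -> Dict[str, int]:
    """Example analysis: count records with outcome 'death' in each age band."""
    return dict(Counter(e.age_band for e in events if e.outcome == "death"))


if __name__ == "__main__":
    # Entirely synthetic records, made up for the example.
    synthetic = (
        [PatientEvent(i, "80+", "death") for i in range(6)]
        + [PatientEvent(100 + i, "40-49", "death") for i in range(2)]
        + [PatientEvent(200, "80+", "recovered")]
    )
    print(run_inside_secure_environment(synthetic, deaths_by_age_band))
```

In this toy version the researcher only ever sees the dictionary of counts; the list of individual records stays inside the function that plays the role of the secure data centre.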

The first research output from the platform was the largest Covid study performed anywhere in the world, an analysis of Covid risk factors from over 17 million individuals. The study received a lot of attention globally and has now been published in Nature. A number of TPP staff co-authored the paper, and there is a big acknowledgement for the work of the technical support teams here too. In fact, the technical approach received a lot of attention itself, with articles in The Economist, New York Times, and ZDNet. The next set of outputs is now ready for publication, including important analyses into Covid treatment and prevention.

The project is built on being open and transparent. All the outputs, the code, the SQL, the disease definitions, the terminology lists, and the research protocols are available publicly on the project website and in the GitHub repo. We’re encouraging other research teams to use these, to comment on them, and to contribute back. A priority now is also to onboard other teams, from both the technical and ethical viewpoints. This is work in progress, but we are still only about 12 weeks in!

It’s been a whirlwind of a project and one of the most extraordinary things I’ve ever worked on. Bringing together the EHR vendors and the researchers, supported by a new breed of ‘developer-epidemiologists’ inside academia, has proved to be a very powerful idea. This approach, together with the security and speed that can be achieved by leaving data in situ, is a legacy I hope can continue. I’ve met some brilliant people, doing phenomenal, important research. One day, I hope to meet them in the real world too.