Big Data has yet to hit the big time
pharmafile | February 16, 2015 | Feature | Manufacturing and Production, Medical Communications, Research and Development, Sales and Marketing | Astellas, Big Data, chris poole, data, development, mark thornton
Big data has taken quite a knocking over the past year. First, some sceptics warned that big data is a big distraction from the real business of commercial research; second, the UK government’s plans to extend access to patient records through the care.data scheme backfired, creating a bit of a PR storm.
But at Astellas, we are convinced that the petabytes of real world clinical information out there have a lot to tell us. The UK government in its Plan for Growth, Science and Innovation, published just before Christmas, agrees.
Announcing a £113 million investment in the Hartree computing centre in Daresbury, Cheshire, the report declared “addressing the challenge of Big Data is not a ‘nice to have’: it is now a fundamental enabling competence”.
Of course, there has been such a momentum building up over big data these past few years, it was inevitable that some people should warn us to curb our enthusiasm. This kind of backlash does happen when key stakeholders do not feel properly consulted.
In a feature in the Financial Times last March, the ‘undercover economist’ Tim Harford argued big data had become “an obsession with entrepreneurs, scientists, governments and the media”, and that it “is a vague term, often thrown around by people with something to sell”.
He even quoted fellow statistician David Spiegelhalter, Winton professor of the public understanding of risk at Cambridge University, who has described the hype around big data as ‘absolute nonsense’.
There is merit in some of the things Tim Harford wrote – big data is not an end in itself. But he overplayed his hand. Those of us who have been analysing large data sets for years know the pitfalls and the difficulties in extracting the meaningful information from amidst the noise.
Astellas started a new project in 2013, ‘Data to Knowledge’, to see how we can maximise the value of the real world data that are out there. This is also about building the company’s internal capacity, to bring the expertise in-house, rather than using external contractors.
We are not alone. We know other industry leaders have created health informatics factories, with scores of analysts and multi-million dollar data centres. When you add in the infrastructure needed to run operations like this, and the licences for the datasets to be analysed, you can see this is serious investment.
What has changed is the science around the analytics. The statistical approaches have changed beyond recognition. As an industry, we are used to the standard set by randomised clinical trials, where statistical methods have remained essentially the same for decades.
But the noisy datasets that real world data yield have forced analysts to find ingenious solutions over the past ten years for how to handle, for example, biases that would otherwise seriously compromise our interpretation of the analyses.
The other change is the sheer growth in accumulation of data, and the computing capacity to handle it. In a traditional randomised trial, we may deal with a few thousand patients, with a specific and limited number of data items.
When we tap into the Clinical Practice Research Datalink (CPRD) here in the UK, we have access to six million current patient records, plus more in the archive, each with a huge number of fields, few of which will have been designed with our research purposes in mind.
It is a fantastic resource – the equivalent of 70 million patient years of experience. But finding the intelligence within it takes serious number crunching.
You can buy the computing hardware to do it. But you are committing yourself to millions of pounds of investment building your own data centre, and it is not obvious beforehand that the resource will be used to maximum capacity, or if it will handle the peak data flow rates when the going gets tough. What is more, in five years, the kit will be out of date and need replacing.
What we’ve been impressed by in recent years, alongside the growing capacity of the internet, is the growth in cloud computing. There are dozens of providers who rent out computing power by the hour.
Within our project, cloud computing is the obvious answer. We don’t need the upfront investment. We can move fast. And the approach is highly scalable – if the data demand grows, you just have to rent more processing power, and when throughput is slow, you are not paying for unused capacity. The important thing is to have the right database structure beforehand, and the right analytics software.
That’s the process. What about the data?
Through the CPRD, the UK government has made available a rich set of real world data, linking not only general practice records, but also hospital episode statistics, the Office for National Statistics central death register, the UK National Cancer Intelligence Network, other disease registries, and census data.
And it is those linkages that make the data ‘sing’, because they give us much more detail about the way patients move through the NHS. The CPRD’s history goes back more than 20 years, under the guise of the General Practice Research Database.
But in its ‘Plan for Growth’ of 2011, the present government foresaw that in expanded form, NHS records “could offer unique opportunities for this country’s international competitiveness in health research,” and “would support clinical innovation and strengthen evidence of effectiveness, improving health outcomes”.
And as well as bringing together several important registries into a single data warehouse, the Medicines and Healthcare products Regulatory Agency, which runs the CPRD, has made access to the raw data much easier, which presents a terrific opportunity to increase the throughput of research in the therapeutic areas that interest us.
What we are doing now is to develop the in-house infrastructure to manage that opportunity. The fact that we can access these data is a result of a number of happy accidents. UK GPs pioneered computerisation of patient records to improve efficiency, so that a family doctor did not have to thumb through pages of notes before the patient came in.
The government’s hospital episodes statistics infrastructure was introduced for accountancy purposes – to enable hospitals to be fairly paid for the treatments given to patients. No one would have imagined 30 years on that epidemiologists would be trawling through the entries to see if some health intervention or other was effective, and under what circumstances.
But hospital episodes statistics now accumulate 18 million new inpatient records every year, with over 350 data items per record. There are 40 million new outpatient records each year. Patient reported outcome measures are being included for some conditions. There is clearly a rich seam of epidemiological information to be mined.
And because every NHS patient has a unique number, even though the data are anonymised, we have a single identifier we can use to track their progress through the system and along the natural history of their disease.
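As a rough illustration of what a single identifier makes possible, the sketch below merges records from two sources into one timeline per patient. It is a minimal example under stated assumptions, not a description of how the CPRD or Astellas actually links data: the field names, the pseudonymised IDs, the records themselves and the `link_by_patient` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical, made-up records keyed by a pseudonymised patient ID.
gp_records = [
    {"patient_id": "P001", "date": "2013-02-01", "event": "GP consultation"},
    {"patient_id": "P002", "date": "2013-03-15", "event": "GP consultation"},
]
hospital_episodes = [
    {"patient_id": "P001", "date": "2013-04-10", "event": "Inpatient admission"},
    {"patient_id": "P001", "date": "2013-01-05", "event": "Outpatient visit"},
]


def link_by_patient(*sources):
    """Merge records from several sources into one date-ordered timeline per patient."""
    timelines = defaultdict(list)
    for source in sources:
        for record in source:
            timelines[record["patient_id"]].append(record)
    for events in timelines.values():
        # ISO-formatted dates sort chronologically when compared as strings.
        events.sort(key=lambda r: r["date"])
    return dict(timelines)


timelines = link_by_patient(gp_records, hospital_episodes)
for patient_id, events in timelines.items():
    print(patient_id, [e["event"] for e in events])
```

Because every source shares the same key, adding a further registry is just another argument to the function; without a common identifier, the same step would need probabilistic matching on names, dates of birth and postcodes.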
The other unique aspect of healthcare provision in the UK is that nearly everyone uses a single provider, the NHS, ensuring that even in a 10% sample such as the CPRD, we can have confidence that our analyses are representative of the wider population.
That is quite different from the equivalent data we can get from the US, such as from health insurance records. Because patients will often switch insurer when they move job, or because they find a better deal elsewhere, continuous records are much more likely to be for short periods.
And because insurers are in competition, they have no interest in sharing or linking their records. Also there are demographic biases because of who is insured, or who is on the Medicare federal system, which have to be accounted for.
In Europe the story is different, with variable practices from country to country. Scandinavian record keeping is quite similar to ours. In Spain, records are well integrated at the regional level, but not nationally so.
But other European record holders tend to be more wary of sharing the data with commercial organisations, so we have to engage with academic clinical partners to process those data at arm’s length.
Because the medical records available from the CPRD were intended for a completely different purpose, they are of more variable quality compared to the carefully designed data we get from clinical trials.
They are often incomplete. They are noisy – the data are not specific in the way we might wish. For example, in one of our therapeutic areas, pain, our trial end points might be targeted against a patient’s self-reported grading of pain on a scale from 0 to 10.
You will not find that kind of detail in the hospital episode statistics, because hospitals are not paid according to the degree of pain their patients suffer. In urology, success of a treatment could be measured in terms of number of visits to the toilet, or the volume of urine passed.
Again, those are details a clinician might hear, but not note down precisely.
We know that, because we have looked for this level of detail in the CPRD data sets, and it is not there in the structured data.
This is part of the challenge of mining big data – the statistics have to work harder to draw out the information which is only variably recorded. One of our ambitions is to combine the kind of high-quality records we generate in prospective clinical trials with the noisy, patchy information available in the retrospective records that come from healthcare systems.
We see this developing rapidly over the next two to five years. When we have those combined datasets, we will do a much better job of translating our clinical trial data into real world effectiveness measures.
It is very encouraging that the CPRD clearly recognise the value of talking to their commercial partners, and are making a substantial investment to accommodate these kinds of requests. We expect to see an acceleration of this type of joint working between pharma and data custodians like the CPRD, in order to build more detail into patient records in key therapeutic areas.
Benefits for pharma
There is a great model for the benefits from this approach in a trial initiated by GSK, the Salford Lung Study, which is trialling an experimental asthma treatment.
The project is a collaboration between GSK and a group called North West e-Health (NWeH), which is itself a partnership between Manchester University and Salford Royal NHS Foundation Trust.
What the collaboration has added is to establish a data network, developed by NWeH, linking participants’ NHS records with the trial data, giving a much closer indicator of the impact on emergency admissions.
The fact that the project is happening signals an increased willingness of NHS customers to work much more closely with pharma, which is very encouraging. And they can see not only the patient benefit in terms of improved outcomes, but also the potential benefits to clinical commissioning groups in terms of reduced costs.
This is the kind of collaboration we think would help us develop more effective trials. But it does take a lot of thought about how you integrate very different
There are a host of other ways big data can support the work of pharma companies, from health economics and testing commercial opportunities to simulations for clinical trial feasibility studies and finding leads for drug discovery.
For drug leads, the key will be to integrate DNA data as well, and the growth in biobanks is an important development here. The UK Biobank, for example, has recruited 500,000 volunteers, who have given blood, saliva and urine samples, and detailed medical histories through interviews.
Screens for 850,000 genetic variants and a wide range of biochemical markers will be made accessible this year. Further clinical records, such as X-ray and MRI scans, are being added to the repositories all the time. And, like the CPRD, UK Biobank is very outward looking.
‘Come and get it!’ the leaders invited industry in an editorial in Science Translational Medicine last year. They continued: “UK Biobank is an open-access resource that encourages researchers from around the world – including those from the commercial sector.”
More importantly, as part of the grand vision to maximise the value of UK health data, the Biobank datasets are linked to the CPRD warehouse, with its unique patient identifiers.
Call us greedy, but we would like more. The care.data initiative would give 100% coverage of the UK population, rather than the limited subset accessible through CPRD. That could greatly improve the quality of the information we could build on.
But the attempted launch of care.data a year ago created a PR furore, with newspaper headlines leaping on potential problems with privacy.
By the end of last year, it was clear that medical charities, colleges and researchers welcomed the health opportunities care.data would open up (see, for example, the November report by the All Party Parliamentary Group for patient and public involvement in health and social care). Even so, the pilot phase remains log-jammed because of concerns about how the repository would operate.
That there are difficulties to be sorted should come as no surprise. As new opportunities come our way thick and fast thanks to the growth of the digital world, so too will issues we have not had to deal with before.
We need to reassure the public that the data we use will be kept secure, that we have rigorous controls in place to make sure nothing untoward happens with the records, and that the benefits in terms of future healthcare will outweigh any hypothetical drawbacks.
We should also not oversell what big data can do for us. The sceptics are right that analytics cannot replace our other tools in innovation. But we need to embrace the ocean of information that is currently going to waste if we are to find better ways of improving health.
Chris Poole works within the health economics outcomes research unit at Astellas Pharma Europe, and Mark Thornton operates corporate strategy at