Never before has the spotlight been so firmly on the key role that technological innovation can and must play in improving quality of life as it has over the past year. The use of Big Data in the health sector is becoming increasingly widespread, albeit with some limitations, as a way to analyse billions of individual data points and find cures for a wide range of diseases and viruses.
Big data also played a key role in the Covid-19 vaccine race.
WHAT ARE BIG DATA?
When we talk about big data, we are referring to a set of data that is so large in volume and so complex that traditional software and computer architectures are not able to capture, manage and process it within a reasonable time.
While a traditional database can manage tables with millions of rows and tens or a few hundred columns, big data requires tools capable of managing the same number of records, but with thousands of columns.
Moreover, the data is often not available in structured form, easily pigeonholed into rows and columns, but is present in the form of documents, meta-data, geographic locations, values captured by IoT sensors and numerous other forms, from semi-structured to completely unstructured ones. Indeed, the data that make up big data repositories can come from heterogeneous sources, such as website navigation data, social media, desktop and mobile applications, but also from sensors embedded in thousands of objects that are part of the so-called Internet of Things (IoT).
THE ROLE OF BIG DATA IN THE FIGHT AGAINST COVID-19
IBM’s Summit supercomputer, the world’s most powerful supercomputer, is currently being used at Oak Ridge National Laboratory in Tennessee in the fight against the coronavirus. With a peak computational power of 200 petaflops (one petaflop is a unit of computing speed equal to one million billion floating-point operations per second, so 200 petaflops is equivalent to 200 million billion calculations per second), it was able to screen eight thousand compounds by simulation (so-called ‘in silico’ selection) within a few days, in order to model which of them could affect the infection process. Seventy-seven were identified as having the potential to impair Covid-19’s ability to infect host cells.
In the laboratory, where real compounds are brought into contact with the virus to observe how it reacts, this process would be too slow to be feasible, because each variable can take millions, if not billions, of possible values and multiple simulations are needed as well. However, it must be said that the processing power of the supercomputer has been judged to be comparable to about 1% of that of the human brain.
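The figures quoted for Summit can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are invented for this example, and the peak rate is a theoretical ceiling that real workloads do not sustain.

```python
# One petaflop = 10**15 floating-point operations per second
# (one million billion).
PETAFLOP = 10**15

# Peak rate quoted in the text: 200 petaflops.
SUMMIT_PEAK_FLOPS = 200 * PETAFLOP


def seconds_at_peak(total_operations, flops=SUMMIT_PEAK_FLOPS):
    """Lower bound on run time if the machine ran at its peak rate."""
    return total_operations / flops


def hit_rate(hits, screened):
    """Fraction of screened compounds flagged as promising."""
    return hits / screened


if __name__ == "__main__":
    # 200 petaflops expressed as "200 million billion" operations per second.
    print(SUMMIT_PEAK_FLOPS == 200 * 10**6 * 10**9)
    # 77 candidates out of 8,000 screened compounds, as a percentage.
    print(f"{hit_rate(77, 8000):.2%}")
```

In other words, fewer than one in a hundred of the simulated compounds was carried forward, which is the kind of narrowing that makes subsequent laboratory work tractable.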
BIG DATA AND ITS APPLICATION IN HEALTHCARE
Useful information can be extracted from Big Data to increase the knowledge available for clinical practice. This knowledge is different in nature from that produced in the past, and should be seen not as an alternative to it but as an important complement, integration and enhancement.
Knowledge about the effectiveness of treatments in medicine is generated by prospective, mostly ‘randomised’ clinical trials, i.e. by comparing experimental treatments with the known treatment considered to be the most effective. The comparison may be ‘blind’ (neither doctors nor patients know whether the patient is receiving the experimental treatment or the standard treatment) or ‘open’, and the objectives of the study are defined before it starts.
Clinical trials are expensive and complex to organise, often target groups of patients that are not representative of the real population as a whole, and focus on extremely specific questions, beyond which their results cannot readily be applied.
With Big Data, what happens in the real world is normally analysed in a non-predetermined way, by carrying out analyses on the information available, even if it is heterogeneous and was collected for other purposes; this can help us choose questions which we can then try to answer with clinical trials. The virtuous circle of knowledge is based on the identification of relevant questions, the advancement of hypotheses and their verification, repeated cyclically in a loop aimed at the continuous improvement of evidence.
In addition, real-world observation makes it possible to assess the real effect of treatments (effectiveness) and the spread and application of known best practices.
THE LIMITS OF BIG DATA IN HEALTHCARE
The implementation of IT tools is a long-standing and increasingly widespread process. The storage of complex data, recorded and used electronically in the health field, has become widespread with the systematic adoption of software for diagnostic imaging, of more or less complete and complex electronic medical record systems, and of systems for reporting and recording outpatient, diagnostic, therapeutic, bureaucratic, organisational and financial management activities.
To be useful, data produced in different areas and situations should be as homogeneous in nature as possible. Data that can be classified in a standardised way have a specific meaning thanks to the corresponding metadata, which place them in an ordered framework. Standardised classifications of the various categories are not always available, and even when they are, they are not always used in a comparable and sufficiently systematic way.
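As a minimal sketch of what standardised classification means in practice, the example below maps diagnosis labels written in different ways onto a shared code and sets aside the records that cannot be mapped. The lookup table, the record layout and the function name are all hypothetical, and the two ICD-10 categories (E11 for type 2 diabetes mellitus, I10 for essential hypertension) are used purely for illustration.

```python
# Hypothetical lookup table from locally used labels to a shared code.
LOCAL_TO_STANDARD = {
    "t2dm": "E11",
    "type 2 diabetes": "E11",
    "diabetes mellitus type ii": "E11",
    "hypertension": "I10",
    "high blood pressure": "I10",
}


def harmonise(records):
    """Split records into those with a standard code and those without."""
    mapped, unmapped = [], []
    for record in records:
        # Normalise case and surrounding whitespace before the lookup.
        key = record["diagnosis"].strip().lower()
        code = LOCAL_TO_STANDARD.get(key)
        if code is None:
            unmapped.append(record)
        else:
            mapped.append({**record, "code": code})
    return mapped, unmapped


if __name__ == "__main__":
    # Invented records from three imaginary sources.
    sample = [
        {"source": "clinic_a", "diagnosis": "T2DM"},
        {"source": "clinic_b", "diagnosis": " High blood pressure "},
        {"source": "clinic_c", "diagnosis": "sugar problems"},
    ]
    mapped, unmapped = harmonise(sample)
    print(mapped)
    print(unmapped)
```

The unmapped list is the interesting part: it makes visible how much of a dataset falls outside the agreed classification and would otherwise be silently lost or misread.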
Health information records also have other characteristics that may diminish their actual value for two main reasons:
This makes it difficult to reconstruct an entire care pathway and to describe precisely the overall history of a patient, who may in turn go through many phases of illness and even have different, concomitant and intersecting pathologies, treated in various territorial, organisational and temporal contexts.
Such heterogeneous characteristics are recorded in ways that are acceptable for rough accounting purposes, but very often inadequate for the precision required by clinical or scientific decisions and analyses, or by the organisational and economic evaluations that are decisive for efficiency and effectiveness. The quality and homogeneity of data in health care are often questionable, and interpretations can consequently be unreliable because the data are incomplete and difficult to verify.
The lack of data structuring and of identified common elements can be remedied through syntactic analysis protocols, parsing and natural language processing (NLP) algorithms, which identify information recorded in text format by means of recognition processes such as semantic annotation. Within the masses of data, search algorithms are able to identify relationships and logical structures that can fit into known ontologies.
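To give a rough idea of semantic annotation, the sketch below tags terms in a free-text note with labels taken from a tiny vocabulary. It is a deliberately simple, rule-based stand-in for real NLP pipelines: the vocabulary, the labels and the `annotate` function are assumptions made for this example and do not come from any actual ontology or library.

```python
import re

# Hypothetical mini-vocabulary standing in for a real ontology.
VOCABULARY = {
    "DRUG": ["metformin", "aspirin"],
    "CONDITION": ["hypertension", "diabetes"],
}

# A number followed by a unit, e.g. "500 mg".
DOSE_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|g|ml)\b", re.IGNORECASE)


def annotate(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    annotations = []
    for label, terms in VOCABULARY.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, re.IGNORECASE):
                annotations.append(
                    (match.start(), match.end(), label, match.group(0))
                )
    for match in DOSE_PATTERN.finditer(text):
        annotations.append((match.start(), match.end(), "DOSE", match.group(0)))
    return sorted(annotations)


if __name__ == "__main__":
    note = "Patient with diabetes started on Metformin 500 mg daily."
    for start, end, label, matched in annotate(note):
        print(label, repr(matched), start, end)
```

Exact-match rules like these miss misspellings, abbreviations and negations ("no history of diabetes"), which is one reason real systems rely on more sophisticated NLP models and curated ontologies.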
The limitation is that it is not always easy to associate a meaning with the results of such analyses, and above all it is not always possible to attribute a definite causality to a correlation. There have been conclusions obtained with Big Data that have not been confirmed by subsequent prospective studies.
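The gap between correlation and causality can be illustrated with a small numerical sketch. All the counts below are invented for the purpose of the example: a treatment is given mostly to severe cases, so a crude comparison makes it look harmful even though it does better within each severity group. Randomised trials are designed to rule out exactly this kind of confounding.

```python
# Invented counts: (recovered, total) by severity group and treatment arm.
COUNTS = {
    "mild": {"treated": (9, 10), "untreated": (72, 90)},
    "severe": {"treated": (36, 90), "untreated": (3, 10)},
}


def crude_rates(counts):
    """Recovery rate per arm, ignoring severity."""
    rates = {}
    for arm in ("treated", "untreated"):
        recovered = sum(counts[group][arm][0] for group in counts)
        total = sum(counts[group][arm][1] for group in counts)
        rates[arm] = recovered / total
    return rates


def stratified_rates(counts):
    """Recovery rate per arm within each severity group."""
    return {
        group: {arm: rec / tot for arm, (rec, tot) in arms.items()}
        for group, arms in counts.items()
    }


if __name__ == "__main__":
    # Crude view: treated patients appear to recover less often.
    print(crude_rates(COUNTS))
    # Stratified view: treated patients recover more often in both groups.
    print(stratified_rates(COUNTS))
```

Here severity acts as a confounder: it influences both who receives the treatment and who recovers, so the observed association points in the wrong direction until it is taken into account.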