
Big Data Applications for Public Health

By Kushan de Silva, Master of Public Health, Lund University



What is Big Data?

The term “big data” was first coined in 1997 by two NASA scientists, Cox and Ellsworth, to describe the pitfalls of graphically visualizing large data sets. Davenport subsequently redefined it with a positive connotation as the broad range of new and massive data types that have appeared over the last decade, and today big data has become much more than a buzzword¹. It is now generally agreed that big data are characterized by four V’s: a large volume of data, a variety of data types, a high velocity of data generation and updating, and a high value generated by analyses².

Public health applications

By enabling high-powered statistical analyses, big data has contributed immensely to the development of public health³. Its value lies in its use: compiling information from multiple sources into a single dataset makes it possible to unearth patterns and associations that would otherwise remain hidden. Big data technologies also make it possible to store huge data repositories indefinitely, so that they can be used for purposes other than those for which they were originally collected⁴.

It is now common epidemiological practice to conduct research using big data derived from pooled cohorts, -omics technologies, electronic health records, and social media⁵. Smart devices and cloud computing have given rise to new public health big data analytics approaches such as web-based epidemic surveillance, sensor-based monitoring of health conditions, and genome-wide association studies. Social media mining for public health purposes is also gaining momentum. Twitter is perhaps the most popular source in this area and has already been used for public health initiatives such as infectious disease surveillance, monitoring of mass gatherings, and building pharmacovigilance applications⁶.

An unequivocal merit of big data is that it sheds light on hidden disease patterns and causal loops that cannot be deciphered using conventional study designs. For example, a study that used a Google-driven search for big data in autoimmune geoepidemiology, covering 394,827 patients, enabled the researchers to dissect hitherto unseen connections in systemic autoimmune diseases¹. Moreover, machine learning tools that automatically detect multi-scale spatiotemporal epidemic patterns and “hotspots” have been developed, and these may eventually supersede the disease surveillance initiatives of the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO)⁷.

Molecular epidemiology, a new field, is beginning to contribute to public health by combining genomics with clinical sciences. The gap between genomics research and clinical applications is thus narrowing as machine intelligence is exploited to transfer genomics knowledge into practical use through personalized medicine and translational research efforts⁸.

Seemingly no field in public health is being left behind. Collins underscores the potential of big data to advance health economics as a discipline. He argues that larger patient datasets will allow much more real-world evidence to be generated and interactions between treatments to be better understood, while more biomonitoring and lifestyle data will enable health interventions to be tailored more effectively to individuals⁹. On a more futuristic note, Rumsfeld et al¹⁰ predict that big data analytics has massive potential to improve the quality of cardiovascular care and patient outcomes, while Young¹¹ writes ambitiously about its potential for developing new tools and methods to address the HIV epidemic and HIV prevention.

Tools in a nutshell

Big data storage and transfer are usually handled by NoSQL databases such as MongoDB, Cassandra, Bigtable, and HBase. Alongside these, computer scientists have developed efficient distributed and parallel systems, supported by hardware architectures such as multicore and cluster computing, that are capable of handling large-scale data. Cluster file systems are required to access big data stored on a cluster. The Hadoop Distributed File System (HDFS) is one commonly used cluster file system, designed for storing large amounts of data across the machines of a large-scale cluster. Programming models and platforms such as MapReduce and H2O are used for processing and generating large data sets with parallel, distributed algorithms on a cluster. Apache Hadoop is a popular open-source implementation of the MapReduce framework.
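
To make the MapReduce idea more concrete, the following is a minimal single-machine sketch in plain Python of its three steps (map, shuffle, reduce). It is not Hadoop code: the toy records and the function names map_phase, shuffle_phase, and reduce_phase are assumptions made purely for illustration, and a real framework would run the map and reduce steps in parallel across the machines of a cluster.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical toy records standing in for free-text surveillance notes.
records = [
    "fever cough",
    "cough headache",
    "fever fever rash",
]

def map_phase(record):
    """Map step: emit a (term, 1) pair for every term in one record."""
    return [(term, 1) for term in record.split()]

def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single count."""
    return key, sum(values)

def map_reduce(all_records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(r) for r in all_records)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

term_counts = map_reduce(records)
print(term_counts)
```

Because each record is mapped independently and each key is reduced independently, both steps can in principle be split across many machines, which is what makes the model suitable for cluster computing.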

Big data analytics

Different analytic methods are required to tackle the intricacies of big data; common ones include recommendation systems, deep learning, and network analyses. As the traditional frequentist school of statistics was challenged by the advent of big data, newer analytic methods such as Bayesian techniques and supervised and unsupervised machine learning algorithms have come to the rescue. In addition, there are many tools for visualizing big data, such as Tableau, MATLAB, and Python, but perhaps the best known among them is R, a powerful programming language with over 10,000 packages that support sophisticated functions. The icing on the cake is that R also provides wrappers for other visualization and analytics software, for example RCytoscape for Cytoscape, RCircos for Circos, SparkR for Spark, and RHadoop for Hadoop.
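
As a small illustration of what a Bayesian technique looks like in practice, the sketch below updates a prior belief about disease prevalence with survey data using the standard Beta-Binomial conjugate model. The survey figures (12 cases among 200 people) and the function names are hypothetical, chosen only to show the mechanics; this is one simple Bayesian method among many, not a method taken from the studies cited here.

```python
def beta_binomial_update(prior_alpha, prior_beta, cases, sample_size):
    """Return posterior Beta parameters after observing `cases` positives
    among `sample_size` individuals, starting from a Beta prior."""
    if not 0 <= cases <= sample_size:
        raise ValueError("cases must lie between 0 and sample_size")
    return prior_alpha + cases, prior_beta + (sample_size - cases)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Uniform Beta(1, 1) prior; hypothetical survey with 12 cases among 200 people.
post_alpha, post_beta = beta_binomial_update(1.0, 1.0, cases=12, sample_size=200)
posterior_mean = beta_mean(post_alpha, post_beta)
print(f"Posterior: Beta({post_alpha}, {post_beta}), mean prevalence {posterior_mean:.3f}")
```

The posterior can serve as the prior for the next batch of data, which is one reason Bayesian updating fits data that arrive continuously.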

Digital epidemiology

A new era in epidemiology has dawned as the discipline partners with big data, and this nascent, big-data-based branch is aptly referred to as digital epidemiology (DE). A major underpinning of this development is that, while the theoretical base of epidemiology will remain unaltered in the big data future, the field’s future value will depend on how well it integrates technical expertise¹². Though premised on the same objectives as traditional epidemiology, DE focuses on electronic data sources and draws on developments such as the widespread availability of internet access, mobile devices, and online sharing platforms. These constantly generate colossal amounts of health-related data, even though the data are not always collected for public health purposes. People no longer need to be in contact with the health care system or bear the hassle of traditional data collection procedures, since they can generate public health related information directly by using online services. By utilizing global real-time data, DE is starting to support prompt detection of disease outbreaks. A classic example is the 2014 Ebola virus outbreak in West Africa, which was detected by digital surveillance channels in advance of official reports¹³.
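
One simple way to picture how a real-time data stream can raise an early warning is a moving-baseline threshold rule: a day is flagged when its count exceeds the mean of the preceding days by several standard deviations. The sketch below is a deliberately simplified illustration with made-up daily counts; the function name, the 7-day baseline, and the 3-standard-deviation threshold are assumptions, and it does not reproduce the methods of the surveillance systems referred to above.

```python
from statistics import mean, stdev

def flag_aberrations(counts, baseline_days=7, threshold_sd=3.0):
    """Return the indices of days whose count exceeds the mean of the
    preceding `baseline_days` days by more than `threshold_sd` sample SDs."""
    flagged = []
    for i in range(baseline_days, len(counts)):
        baseline = counts[i - baseline_days:i]
        limit = mean(baseline) + threshold_sd * stdev(baseline)
        if counts[i] > limit:
            flagged.append(i)
    return flagged

# Hypothetical daily counts of symptom-related posts; day 9 contains a spike.
daily_mentions = [4, 5, 3, 6, 4, 5, 4, 5, 4, 21, 6]
alerts = flag_aberrations(daily_mentions)
print(alerts)
```

A rule like this is cheap enough to run continuously, but it is sensitive to noise and reporting artefacts, so flagged days would still need epidemiological follow-up.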

Ethical concerns

However, ethical concerns about big data, such as informed consent, the protection of individual privacy, confidentiality, and data re-identification, loom large⁵. Some research using social media has been criticized for a lack of informed consent. The greatest concern is that the use of personal health information, or even the intentional manipulation of behavior in digital interventional studies, is not subject to traditional federal research ethics oversight. Safe de-identification of big data is therefore critical in health care, since terms of service agreements, click-through consents, and voluntary oversight are grossly inadequate¹⁴.
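
To illustrate why re-identification is a concern even after names are removed, the sketch below checks a toy dataset for k-anonymity, one commonly used (though not sufficient on its own) de-identification criterion: every combination of quasi-identifiers should be shared by at least k records. The records, column names, and function name are hypothetical, and k-anonymity is chosen here only as an example; it is not named in the sources cited above.

```python
from collections import Counter

def smallest_group_size(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same
    quasi-identifier values; the dataset is k-anonymous for any k up to it."""
    if not rows:
        raise ValueError("rows must not be empty")
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical records with direct identifiers already removed.
rows = [
    {"age_band": "30-39", "postcode_prefix": "221", "diagnosis": "asthma"},
    {"age_band": "30-39", "postcode_prefix": "221", "diagnosis": "diabetes"},
    {"age_band": "40-49", "postcode_prefix": "223", "diagnosis": "asthma"},
]

k = smallest_group_size(rows, ["age_band", "postcode_prefix"])
print(k)
```

Here the third record is the only one in its age band and postcode area, so anyone who knows those two facts about a person could link that person to the diagnosis.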

Future of epidemiology

Big data trends will continue to influence future directions in public health. The typical research design is being reshaped as earlier definitions of study populations and their follow-up expand to encompass online populations. Traditional data collection methods, such as population-based surveillance and individual interviews, are being supplemented with advanced instruments ranging from biomarkers to mobile health, which should enhance the accuracy of data collection and interpretation. Epidemiology is shifting to a new paradigm as risk factor analyses, prediction methods, and causal inference strategies continue to make significant contributions to public health by exploiting the wealth of information concealed in big data.



  1. Ramos-Casals M, Brito-Zerón P, Kostov B, Sisó-Almirall A, Bosch X, Buss D, Trilla A, Stone JH, Khamashta MA, Shoenfeld Y. Google-driven search for big data in autoimmune geoepidemiology: analysis of 394,827 patients with systemic autoimmune diseases. Autoimmunity reviews. 2015 Aug 31;14(8):670-9.
  2. Huang T, Lan L, Fang X, An P, Min J, Wang F. Promises and challenges of big data computing in health sciences. Big Data Research. 2015 Mar 31;2(1):2-11.
  3. Dimeglio C, Kelly-Irving M, Lang T, Delpierre C. Expectations and boundaries for Big Data approaches in social medicine. Journal of Forensic and Legal Medicine. 2016 Nov 24.
  4. Rosenbaum S, Thorpe JH, Gray EA. Big data and public health: navigating privacy laws to maximize potential. Public Health Reports. 2015 Mar;130(2):171-5.
  5. Salerno J, Knoppers BM, Lee LM, Hlaing WM, Goodman KW. Ethics, Big Data and Computing in Epidemiology and Public Health. Annals of Epidemiology. 2017 May 10.
  6. Conway M, O’Connor D. Social media, big data, and mental health: current advances and ethical implications. Current opinion in psychology. 2016 Jun 30;9:77-82.
  7. Pullum LL, Ramanathan A. Oak Ridge Biosurveillance Toolkit: Scalable machine learning for public health surveillance. In: Computational Advances in Bio and Medical Sciences (ICCABS), 2014 IEEE 4th International Conference on 2014 Jun 2 (pp. 1-1). IEEE.
  8. Li X, Zhao X, Zhong M. Advancing public health genomics. In: Big Data and Information Security (IWBIS), International Workshop on 2016 Oct 18 (pp. 15-18). IEEE.
  9. Collins B. Big Data and health economics: strengths, weaknesses, opportunities and threats. PharmacoEconomics. 2016 Feb 1;34(2):101-6.
  10. Rumsfeld JS, Joynt KE, Maddox TM. Big data analytics to improve cardiovascular care: promise and challenges. Nature Reviews Cardiology. 2016 Mar 24.
  11. Young SD. A “big data” approach to HIV epidemiology and prevention. Preventive medicine. 2015 Jan 31;70:17-8.
  12. Mooney SJ, Westreich DJ, El-Sayed AM. Epidemiology in the era of big data. Epidemiology (Cambridge, Mass.). 2015 May;26(3):390.
  13. Vayena E, Salathé M, Madoff LC, Brownstein JS. Ethical challenges of big data in public health. PLoS Comput Biol. 2015 Feb 9;11(2):e1003904.
  14. Rothstein MA. Ethical issues in big data health research: currents in contemporary bioethics. The Journal of Law, Medicine & Ethics. 2015 Jun 1;43(2):425-9.


One thought on “Big Data Applications for Public Health”

  1. Hi, Kushan, I like your article very much. I love the technologies used in the data science field because they can really solve practical problems, and risk prediction is a very interesting and challenging area. Do you know of any PhD programs on risk prediction in Sweden? I may want to apply.
