Big data plays an increasingly important role in medicine and in public health studies. The adoption of electronic medical records (EMRs), arrays of continuous biometric sensors, and collaborations among global organizations that collect, analyze, and study massive datasets present a potential Pandora’s box. We can train machine learning engines with accurate, protected data to improve medical diagnosis, treatment plan effectiveness, and patient outcomes. Poor-quality or tainted datasets, however, can waste time and money and possibly cause great harm. As the saying goes, “garbage in, garbage out.” While it may sound reassuring to read that a research team trained an algorithm with 700,000 data points from previously documented cases, relying on big data for direction or decision support requires a leap of faith. How do we know where the data came from and whether it’s any good?

Questions and concerns about data quality drive the work of the Big Data Steering Group, a body established by the European Medicines Agency (EMA) and the Heads of Medicines Agencies (HMA). The group recently published its third workplan, which outlines the major components it intends to deliver from 2022 to 2025. The group’s purpose is to “enhance the efficient integration of data analysis into the evaluation of medicinal products by regulators.” Earlier work assessed the challenges and opportunities in using big data to regulate medical developments, including treatments and improved outcomes. The previous task group also presented prioritized recommendations on using and generating data.

The workplan for 2022 to 2025 includes four key components. The first deliverable is the Data Analysis and Real World Interrogation Network (DARWIN EU), which will support studies in the EU with real-world evidence. The second is a common data quality framework for all stakeholders. The third is a good practices guide for data discoverability. The last key deliverable is training for regulators in biostatistics, pharmacoepidemiology, and data science.

As digital methods and technologies continue to proliferate, the Big Data Steering Group’s work is an excellent effort to help keep all parties on track. As more studies rely on data, it’s imperative that someone watch over the data to protect the study outcomes.