Arms race to exploit personal data exposed by Biobank breach

A massive security breach at UK Biobank has exposed a growing global arms race between research institutions and criminals seeking to exploit personal health data. The incident involved researchers from three Chinese academic institutions attempting to sell sensitive biological data from more than 500,000 volunteers on e-commerce platforms.

The breach represents a new type of threat that security experts say isn’t getting enough attention. Unlike traditional data hacks or leaks, this involved trusted researchers who had been vetted by the charity but then tried to profit from the data they were given access to for legitimate research purposes.

Last week’s revelation came after UK Biobank spent 18 months fighting to maintain the security of its biological database. The charity discovered that previously vetted researchers linked to three Chinese academic institutions had created listings on Alibaba-owned e-commerce websites to sell the data.

UK Biobank quickly banned the three institutions and asked the government for help. British diplomats worked with the Chinese government and Alibaba to take down the listings. While the data contained no names, addresses or birth dates, there is still a risk that anonymized records can be combined with other information to identify individuals.
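To illustrate why removing names and addresses does not by itself rule out re-identification, the sketch below counts how many records become unique once a few ordinary attributes are combined. The records are entirely synthetic, and the field names (`age_band`, `sex`, `region`) and the `unique_fraction` helper are hypothetical; nothing here reflects the actual structure of UK Biobank data.

```python
from collections import Counter

# Entirely synthetic records for illustration; field names are hypothetical.
records = [
    {"age_band": "60-64", "sex": "F", "region": "North West"},
    {"age_band": "60-64", "sex": "F", "region": "North West"},
    {"age_band": "55-59", "sex": "M", "region": "London"},
    {"age_band": "65-69", "sex": "F", "region": "Wales"},
    {"age_band": "55-59", "sex": "M", "region": "London"},
    {"age_band": "60-64", "sex": "M", "region": "Scotland"},
]


def unique_fraction(rows, keys):
    """Fraction of rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


if __name__ == "__main__":
    # One attribute alone singles nobody out in this toy sample...
    print(unique_fraction(records, ["sex"]))
    # ...but combining three attributes isolates some records.
    print(unique_fraction(records, ["age_band", "sex", "region"]))
```

In this toy sample no record is unique on sex alone, but a third of the records are unique once age band and region are added, which is the kind of "piecing together" that worries privacy researchers.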

Naomi Allen, UK Biobank’s chief scientist, said they had been “assured” that the data wasn’t “sold to third parties.” However, the incident highlights how vulnerable large medical databases have become as they grow in size and value.

The UK Biobank breach was just one of several major data security stories that emerged last week:

  • Hackers stole 19 million records from the French agency managing driving licenses, passports and ID cards
  • Booking.com suffered a major data breach
  • Home security firm ADT was also hacked
  • The UK was hit by 8.5 million cybercrimes last year

Data breaches became a serious problem for UK Biobank in 2024 when academic journals began requiring researchers to publish the computer code they used to analyze large medical datasets. Sometimes that code, usually published on GitHub, included raw data from the charity.

“We’ve got a machine-learning algorithm that does a daily trawl of all open-source repositories and we check that it doesn’t include any data,” Allen explained. “When we do find it, we get the researcher to take it down immediately. Or if we can’t find the researcher, GitHub takes it down for us. That has been really successful.”
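The article does not describe how UK Biobank's scanner works beyond calling it a machine-learning algorithm. As a rough illustration of the general idea only, the sketch below uses a simple rule-based heuristic (not machine learning) to flag files in a checked-out repository that look like tables of numeric records. The thresholds, file suffixes and function names are all assumptions made for this example.

```python
import csv
import io
from pathlib import Path
from typing import List

# Hypothetical thresholds and suffixes, chosen for illustration only.
MIN_ROWS = 20
MIN_NUMERIC_FRACTION = 0.6
DATA_SUFFIXES = {".csv", ".tsv", ".txt"}


def _is_number(cell: str) -> bool:
    try:
        float(cell)
        return True
    except ValueError:
        return False


def looks_like_tabular_data(text: str, delimiter: str = ",") -> bool:
    """Return True if text resembles a table of mostly numeric records."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if len(rows) < MIN_ROWS:
        return False
    width = len(rows[0])
    # Require at least two columns and the same column count on every row.
    if width < 2 or any(len(r) != width for r in rows):
        return False
    body = rows[1:]  # skip a possible header row
    cells = [c.strip() for r in body for c in r]
    numeric = sum(_is_number(c) for c in cells)
    return numeric / len(cells) >= MIN_NUMERIC_FRACTION


def scan_repository(root: Path) -> List[Path]:
    """Return paths under `root` whose contents look like raw tabular data."""
    flagged = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in DATA_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        delimiter = "\t" if path.suffix.lower() == ".tsv" else ","
        if looks_like_tabular_data(text, delimiter):
            flagged.append(path)
    return flagged
```

A heuristic like this would produce false positives on any public numeric dataset, so in practice a flagged file would still need a human or a more specific check to confirm that it contains participant data.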

Ethics researchers have warned that this type of threat – data misuse rather than data leaks or hacks – isn’t taken seriously enough by biobanks. This assessment was based on a 2021 analysis of BBMRI-ERIC, an association of more than 550 European biobanks.

UK Biobank represents one of the first major attempts to gather large amounts of medical data so researchers could identify links between different biological processes. Recruitment of people aged 40 to 69, who agreed to blood tests, body and brain scans, and detailed health and lifestyle questionnaires, began in 2003.

The program has been enormously successful in advancing medical science. More than 22,000 researchers have accessed its data, producing 18,000 research papers. Doctors can now analyze heart scans in seconds using AI developed with UK Biobank data, and NHS clinics can diagnose dementia in minutes.

However, UK Biobank’s size and success also make it more vulnerable. When it began sharing data with researchers in 2012, it allowed them to download complete datasets. Now the database is nearly 40 petabytes – about 40 million gigabytes. A high-speed home internet connection would take more than 10 years to download it all.
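The figures above can be checked with a quick back-of-the-envelope calculation. The sketch below assumes decimal (SI) units and takes "high-speed home internet" to mean a sustained 1 Gbit/s connection; that speed is an assumption for illustration, not a figure from UK Biobank.

```python
# Back-of-the-envelope download time for a ~40 PB dataset.
# Assumptions (not from the article): a sustained 1 Gbit/s link
# and decimal (SI) units throughout.

PETABYTE_BYTES = 10**15
GIGABYTE_BYTES = 10**9
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def size_in_gigabytes(petabytes: float) -> float:
    """Convert petabytes to gigabytes (decimal units)."""
    return petabytes * PETABYTE_BYTES / GIGABYTE_BYTES


def download_years(petabytes: float, link_gbit_per_s: float) -> float:
    """Years needed to transfer `petabytes` over a link of the given speed."""
    total_bits = petabytes * PETABYTE_BYTES * 8
    seconds = total_bits / (link_gbit_per_s * 10**9)
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{size_in_gigabytes(40):,.0f} GB")
    print(f"{download_years(40, 1.0):.1f} years at 1 Gbit/s")
```

Under those assumptions, 40 petabytes is 40 million gigabytes and the transfer takes a little over 10 years, consistent with the article's estimate. A slower connection would take proportionally longer.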

Most large health data repositories launched since then have taken a different approach. Programs like Finland’s FinnGen, Germany’s Nako, and the US-based All of Us and Million Veteran Program keep data in the cloud and force researchers to do their analysis there instead of downloading raw data.

UK Biobank began moving to a similar cloud-based system in 2021, but the security measure proved highly unpopular with researchers because it was more expensive. Timothy Raben, a geneticist, said in 2024 that some research groups would have to abandon projects because the costs had become “prohibitive.”

Meanwhile, China’s Kadoorie Biobank still appears to allow researchers to download data directly. Industry sources say biotech firms are increasingly turning to Chinese data sources as Western repositories tighten security measures.

“The most secure data set is one that’s locked away and is never used,” Allen said. “We want to make rapid progress into the causes and treatments of disease. It’s a balance between making the data available and advancing science versus ensuring the security of the data.”

There are risks in both sharing data and not sharing it, she explained. “Scientific progress will not be made if you don’t have the global collaborative community working on these data to make those discoveries. And I think that trade-off is difficult to get right, because the technology is moving all the time.”

The incident underscores a broader challenge facing medical research as valuable health databases become targets for both cybercriminals and rogue researchers. As these repositories grow larger and more valuable, the security measures needed to protect them may increasingly conflict with the open collaboration that drives scientific progress.