The Promise of Precision Medicine
The convergence of artificial intelligence with biomedical data represents one of the most consequential technological developments in modern medicine. For decades, clinicians have recognised that medicine is practised at the population level: applying average outcomes from clinical trials to individual patients necessarily fails some portion of those patients, whose biology, genetics, social circumstances, and comorbidities differ from the trial population in ways that alter the risk-benefit calculus of treatment. Precision medicine, the project of tailoring diagnosis and treatment to the individual patient's specific characteristics, has been a long-standing aspiration. AI, operating on datasets of a scale and dimensionality that overwhelm traditional statistical methods, has made that aspiration seem newly achievable.
Topol (2019), in his survey of high-performance medicine, described the convergence of genomics, wearable sensors, electronic health records, and deep learning as creating the conditions for a "creative destruction of medicine," a phrase simultaneously optimistic about capability and honest about the disruption involved. The clinical applications Topol surveyed were genuinely impressive: deep learning systems matching or exceeding specialist performance at reading electrocardiograms, detecting diabetic retinopathy from fundus photographs, classifying skin lesions, and predicting patient deterioration from time-series vital sign data. The benchmarks were real and meaningful. The limitations, however, were frequently understated in the same literature that celebrated the achievements.
The fundamental promise of AI in medicine is that it can find signal in complexity that human cognition cannot efficiently process. A physician reviewing a patient's history, physical examination findings, laboratory results, and imaging studies is performing an impressive feat of synthesis, but one limited by memory, attention, time, and the cognitive biases that accompany human reasoning under uncertainty. An AI system trained on millions of patient records can identify correlations invisible to any individual clinician and apply those correlations consistently at scale, without fatigue, without the availability bias that makes recently-seen diagnoses disproportionately salient, and without the variation in clinical practice that makes outcomes geography-dependent in ways that are difficult to justify on clinical grounds.
These theoretical advantages are real. The empirical literature documenting AI systems that perform at or above specialist level in specific, well-defined tasks is now extensive. The path from these controlled demonstrations to routine clinical deployment is, however, far longer and more treacherous than early enthusiasm acknowledged. Performance on curated benchmark datasets does not straightforwardly predict performance on the messy, heterogeneous data of real clinical practice. The populations on which AI systems are trained are frequently not representative of the populations on which they will be deployed. The feedback loops that allow human clinicians to learn from errors do not exist in the same form for deployed AI systems. The structural features of healthcare, including the power dynamics, economic incentives, regulatory frameworks, and cultural norms, shape AI deployment in ways that the technical research literature largely does not address.
Perhaps most importantly, the aspiration of precision medicine contains within it an implicit claim about data: that more data, better integrated and more intelligently analysed, will yield better medicine. This claim is broadly correct. But it carries a corollary that is rarely examined with comparable rigour: that the benefits of data-intensive precision medicine will flow equitably to all patients. The evidence, examined carefully, does not support that corollary. The patients whose data is most abundant in training datasets, and whose outcomes AI systems are therefore best positioned to optimise, are disproportionately affluent, educated, and of European ancestry. This is not incidental. It is a predictable consequence of historical patterns of health data collection and a structural challenge that technical solutions alone cannot adequately address.
Algorithmic Bias in Clinical Decision Support
The most important single empirical contribution to the academic literature on AI and healthcare equity was published in Science in 2019 by Obermeyer and colleagues. Their study examined a widely-used commercial algorithm deployed across US health systems to identify patients who would benefit from "high-risk care management" programmes, and found that it exhibited substantial racial bias. Specifically, the algorithm consistently assigned lower risk scores to Black patients than to White patients with the same level of objective clinical need, as measured by chronic illness burden. The practical consequence was that Black patients had to be considerably sicker than White patients to qualify for care management programmes offering supplemental services that could meaningfully affect their health outcomes.
The mechanism of the bias was not a flaw in the algorithm's programming. It was a consequence of what the algorithm was trained to predict. The algorithm used healthcare costs as a proxy for health need: a seemingly reasonable choice, given that costs are readily available in administrative claims data and would presumably track the severity of medical conditions. But costs, as a reflection of actual healthcare consumption, are also a reflection of systemic inequities in access to care. Black patients, facing structural barriers to healthcare access including transportation, insurance gaps, discrimination, and the legacy of historical medical mistrust, consume less healthcare than White patients with equivalent health needs. An algorithm trained to predict costs therefore learned to predict healthcare utilisation, which is substantially determined by access, rather than healthcare need, which is substantially determined by biology and circumstance.
The Obermeyer study estimated that retargeting the algorithm to predict health need rather than healthcare costs would have increased the proportion of Black patients identified as high-risk from 17.7% to 46.5%, an increase of more than 160%. This magnitude of effect was not a marginal calibration issue. It was a fundamental failure of the system to perform its claimed function equitably, with directly harmful consequences for the patients who most needed the services it was supposed to allocate.
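The proxy-label mechanism can be made concrete with a minimal simulation, sketched below in Python on entirely synthetic data (this is an illustration of the general phenomenon, not Obermeyer and colleagues' model or dataset). Two identical learners are trained on the same features, one with observed spending as its label and one with illness burden; the group composition of the patients each flags as high-risk then diverges sharply.

```python
# Illustrative sketch of proxy-label bias on synthetic data (hypothetical
# setup, not the Obermeyer et al. model). Both groups have the same
# distribution of illness burden, but group 1 consumes less care at equal
# need because of access barriers.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 20_000
group = rng.integers(0, 2, n)                      # 0 = better access, 1 = worse access
need = rng.gamma(shape=2.0, scale=2.0, size=n)     # chronic illness burden
access = np.where(group == 0, 1.0, 0.6)            # group 1 uses ~40% less care at equal need
cost = need * access * rng.lognormal(0.0, 0.3, n)  # observed spending

features = np.column_stack([
    need + rng.normal(0.0, 0.5, n),                # noisy clinical severity markers
    need * access + rng.normal(0.0, 0.5, n),       # prior-year utilisation (encodes access, not just need)
])

model_cost = LinearRegression().fit(features, cost)  # label: spending (the proxy)
model_need = LinearRegression().fit(features, need)  # label: illness burden (the actual target)

def group1_share_of_flagged(scores, top_frac=0.03):
    """Among the top `top_frac` of risk scores, the share of patients from group 1."""
    cutoff = np.quantile(scores, 1.0 - top_frac)
    return float((group[scores >= cutoff] == 1).mean())

print("group 1 share of flagged patients, cost-trained model:",
      round(group1_share_of_flagged(model_cost.predict(features)), 3))
print("group 1 share of flagged patients, need-trained model:",
      round(group1_share_of_flagged(model_need.predict(features)), 3))
```

The point of the sketch is that nothing in the modelling step is broken: the same learner, the same features, and an honest fit produce sharply different group compositions among flagged patients purely because of which label was chosen.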
Adamson and Smith (2018), writing in JAMA Dermatology, documented a related pattern in dermatological AI. Skin condition classifiers trained predominantly on images of lighter skin tones performed significantly worse on darker skin tones. The disparity was driven by data: clinical image databases used for training were unrepresentative of the diversity of patient populations, systematically over-representing patients with lighter complexions in whom dermatological conditions had historically been more extensively photographed and documented. The consequences in deployment were predictable: higher error rates for the patients whose correct diagnosis was already most likely to be delayed by factors including inadequate training of clinicians on dermatological presentation in darker skin.
Mittelstadt and colleagues (2016) provided a philosophical framework for understanding algorithmic bias in healthcare as part of a broader mapping of the ethical challenges posed by algorithms in high-stakes domains. They mapped six dimensions of ethical concern, which in the clinical context can be summarised as: inscrutability (the difficulty of understanding how algorithmic decisions are made), incorrectness (systematic errors embedded in training data or objective functions), discrimination (differential performance across protected characteristics), privacy (the use of personal data in ways individuals have not consented to), autonomy (the displacement of individual clinical judgment), and the transformation of moral responsibility (the diffusion of accountability that algorithmic mediation creates). Each of these dimensions is present in clinical AI, and they interact in ways that compound their individual effects.
Char, Shah, and Magnus (2018), writing in the New England Journal of Medicine, drew attention to what is often called the "valley of death" between AI research publication and clinical deployment: the gap between demonstrating that a system performs well on a benchmark and demonstrating that it performs well and safely in practice, for all patient populations, in the workflow conditions of real clinical settings. They argued that the ethical challenges of clinical AI implementation, including obtaining meaningful consent, defining accountability for algorithmic errors, ensuring equitable performance across populations, and maintaining clinician oversight, were not technical problems solvable within the research paradigm and required explicit, prospective engagement from clinical ethics and health policy.
The regulatory dimension of algorithmic bias in clinical AI was addressed by Wu and colleagues (2021) in their analysis of FDA approvals of AI-enabled medical devices. They found that the majority of approved devices were validated on datasets that were non-representative of the US patient population by race, sex, and age, and that post-market performance monitoring requirements were frequently inadequate to detect differential performance across demographic groups. The approval process, designed for a world of physical medical devices with stable performance characteristics, was poorly adapted to software-based AI systems whose performance could vary significantly across deployment contexts and whose behaviour could change as the models were updated.
The Health Data Marketplace
Health data is among the most sensitive categories of personal information: a comprehensive record of vulnerabilities, behaviours, relationships, and risk factors that can affect employment, insurance, relationships, and life opportunities. It is also, by virtue of its informativeness, among the most commercially valuable. The intersection of these two characteristics has created an expanding and poorly-regulated marketplace in which health data flows far beyond the clinical relationships in which it was generated, often without the knowledge or meaningful consent of the individuals it describes.
Price and Cohen (2019), in their analysis of privacy in the age of medical big data, documented the multiple vectors through which health data enters commercial markets. Electronic health records sold or licensed to research consortia, pharmaceutical companies, and AI developers. Claims data from insurance companies aggregated and sold by data brokers. Consumer health data from wearables, apps, and search engines that is not subject to HIPAA (which governs only certain covered entities, not the broader consumer health data market). Genomic data from direct-to-consumer testing companies operating under terms of service that permit commercial use of de-identified data. The cumulative picture is of a market in which the notion of data generated exclusively for individual clinical benefit is increasingly a legal fiction.
The NHS/Google DeepMind partnership, examined by the UK Information Commissioner's Office in 2017, became a landmark case study in the gap between stated purposes and actual data flows. The Royal Free NHS Foundation Trust had provided approximately 1.6 million patient records to DeepMind for the development of an app called Streams, designed to alert clinicians to patients at risk of acute kidney injury. The ICO found that the data had been transferred without adequate legal basis and without appropriate patient notification. Patients had not been informed that their records were being used in this way, and the data transferred was substantially broader than what was required for the stated application. The case illustrated how the acquisition of health data for AI development could proceed through routine institutional channels, at massive scale, without an adequate legal basis and in violation of the reasonable expectations of the patients whose data was involved.
The commercial logic of health data is straightforward and powerful: data that enables better risk stratification, more accurate diagnosis, more effective treatment selection, or more precise drug development is worth enormous sums to pharmaceutical companies, insurers, and healthcare providers. The individuals who generated that data, through the vulnerability of illness and through encounters with clinical care systems, capture essentially none of that value. This asymmetry is not unique to health data, but it is more morally fraught in health than in most other domains, because the power differential between patients and healthcare institutions is acute, consent is compromised by the urgency of care needs, and the potential for data-derived decisions to harm the data's subjects, through insurance discrimination, employment decisions, or differential access to care, is concrete and proximate.
The re-identification of nominally anonymised health data is a persistent technical challenge that makes the concept of de-identification less protective than it sounds. Research has repeatedly demonstrated that combinations of variables that individually seem non-identifying, such as age, sex, postcode, and diagnosis dates, can uniquely identify individuals in large datasets with high probability. Genomic data presents a particularly acute challenge: it is inherently identifying (it encodes ancestry, family relationships, and individual distinctiveness), it is permanent (unlike a password, you cannot change your genome), and it is consequential (it predicts health risks, carries insurance implications, and provides identifying information about family members who have not consented to share their data).
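The underlying arithmetic of re-identification risk can be checked directly. The sketch below (in Python with pandas; the column names and records are hypothetical, not a real dataset schema) counts, for each row in a nominally de-identified extract, how many rows share the same combination of quasi-identifiers; a row whose equivalence class has size one is unique on those fields and is a candidate for linkage against any external dataset carrying the same attributes.

```python
# Minimal k-anonymity check over quasi-identifiers (illustrative sketch;
# hypothetical column names and values, not a real dataset).
import pandas as pd

def equivalence_class_sizes(df: pd.DataFrame, quasi_identifiers: list[str]) -> pd.Series:
    """For each record, the number of records (including itself) sharing the
    same combination of quasi-identifier values. Size 1 means the record is
    unique on those fields and at high re-identification risk."""
    return df.groupby(quasi_identifiers, dropna=False)[quasi_identifiers[0]].transform("size")

# A nominally "de-identified" extract: no names, but rich quasi-identifiers.
records = pd.DataFrame({
    "age": [34, 34, 71, 71, 52],
    "sex": ["F", "F", "M", "M", "F"],
    "postcode_prefix": ["SW1", "SW1", "LS2", "LS2", "SW1"],
    "diagnosis_month": ["2021-03", "2021-03", "2020-11", "2020-07", "2019-05"],
})

sizes = equivalence_class_sizes(records, ["age", "sex", "postcode_prefix", "diagnosis_month"])
print("records unique on these fields:", int((sizes == 1).sum()), "of", len(records))
print("minimum k (smallest equivalence class):", int(sizes.min()))
```

On real datasets the same check, run over combinations of attributes an adversary might plausibly hold, is what repeatedly shows large fractions of records to be unique despite the removal of names and identifiers.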
The regulatory framework for health data in the United States remains fragmented. HIPAA provides protection for data held by covered entities (healthcare providers, health plans, healthcare clearinghouses) and their business associates, but excludes the vast majority of consumer health data generated outside formal clinical settings. State-level privacy laws, including California's CPRA and Colorado's CPA, provide some additional protections but vary significantly in their scope and enforcement. The EU's GDPR, which treats health data as a special category subject to heightened protection, provides a stronger baseline and has served as a template for proposed reforms in other jurisdictions, but its enforcement has been inconsistent across member states.
AI in Diagnostic Imaging
Diagnostic imaging, broadly construed to include radiology, pathology, ophthalmology, and dermatology, is the clinical domain in which AI has achieved the most robust and extensively replicated performance benchmarks. The reasons are structural: imaging produces standardised, high-dimensional data amenable to machine learning analysis; large, well-curated training datasets are available; and classification from images is a task for which deep learning methods have demonstrated particular strength across many domains.
Rajpurkar and colleagues (2017) at Stanford demonstrated that CheXNet, a convolutional neural network trained on 112,120 chest X-rays, achieved radiologist-level performance on pneumonia detection and outperformed a panel of four radiologists on that specific task according to the paper's reported metrics. The finding was widely reported as evidence that AI would soon render radiologists redundant, a conclusion that subsequent analysis showed was substantially overstated. The comparison was between the AI system and individual radiologists working without access to the clinical context that radiologists typically have (patient history, physical examination, prior imaging, clinical notes), and the benchmark reflected performance on a curated test set that did not capture the full range of imaging presentations in clinical practice.
Esteva and colleagues (2017), also from Stanford, demonstrated that a convolutional neural network could classify skin cancer lesions at a level comparable to board-certified dermatologists, achieving equivalent sensitivity and specificity on a binary task of distinguishing malignant from benign lesions in a curated dataset. The paper appeared in Nature and attracted significant attention. Subsequent work examining the same and similar systems on external validation sets, including datasets from different institutions, different cameras, and different patient populations, found substantially degraded performance, particularly for skin tones underrepresented in the training data.
The FDA's approval process for AI-enabled medical devices accelerated significantly after 2019, with hundreds of AI/ML-enabled devices receiving marketing authorisation by 2024. The great majority were imaging-related. Wu and colleagues (2021) found that performance claims in FDA submissions were typically based on single-institution datasets, that demographic subgroup analyses were infrequently reported, and that post-market requirements for ongoing performance monitoring were typically minimal. The gap between the controlled conditions of regulatory submission and the heterogeneous conditions of clinical deployment created risks that the approval pathway was not consistently addressing.
The deployment of AI diagnostic tools in clinical practice raised fundamental questions about the allocation of responsibility for diagnostic errors. When a radiologist misses a finding on a chest CT, the error can be attributed, investigated, and learned from within existing professional accountability frameworks. When an AI system misses the same finding, the accountability is distributed among the system's developers, the institution that deployed it, the radiologist who reviewed the AI output, and the regulatory body that approved the device. This diffusion of accountability is not merely a legal problem. It affects whether errors are identified, investigated, and corrected, which determines whether AI diagnostic systems improve over time in the way that human clinicians improve through feedback.
The clinical workflow integration challenge, separate from the question of algorithmic performance, has proven to be one of the most practically significant barriers to beneficial AI deployment in radiology. AI diagnostic tools are not free-standing clinical services; they must be integrated into the radiology workflow in ways that preserve, rather than undermine, the clinician's ability to exercise independent judgment. Research on "automation bias," the tendency of human operators to over-trust automated recommendations and under-weight their own contrary observations, suggests that poorly designed AI integration can degrade overall system performance even when the AI component performs well in isolation, by anchoring clinicians to AI outputs that are incorrect and dampening the critical evaluation those outputs should receive.
Automation and Clinical Displacement
The question of whether AI will displace clinical workers, and if so which categories, has generated extensive commentary and rather less empirical certainty. The dramatic early claims that radiologists would be made redundant within a decade and that AI would render general practitioners unnecessary have not materialised on the projected timelines, and careful analysis suggests they were based on a misunderstanding of what radiologists and GPs actually do that makes them clinically valuable.
Topol (2019) drew an important distinction between tasks that AI can automate within clinical roles and the wholesale replacement of clinical roles themselves. Radiology, for example, involves image interpretation, the task where AI has demonstrated strong benchmark performance, but also clinical consultation, procedure guidance, report communication, integration of imaging findings with clinical context, and the management of incidental findings that require clinical judgment about follow-up. Automating image interpretation, even perfectly, would automate one component of radiology practice rather than the practice itself.
The displacement that has occurred has been more subtle and in some ways more troubling than wholesale role elimination. AI and digital tools have changed the composition of clinical work, increasing the proportion of time spent on documentation, quality assurance, administrative tasks, and the supervision of automated outputs, while reducing the proportion of time spent on direct patient interaction. This is not universally experienced as an improvement by clinicians, many of whom entered their professions specifically for the quality of human contact involved. Burnout data consistently identifies documentation burden and loss of clinical autonomy as primary drivers of clinician dissatisfaction. AI tools, as currently deployed, have frequently increased documentation requirements rather than reducing them.
At the level of the health workforce as a whole, the concern is not primarily about clinician displacement but about clinical task allocation. AI-enabled task automation creates the technical possibility of assigning clinical tasks to workers with lower formal qualifications, using AI as a substitute for specialist judgment rather than as an augmentation of it. The economic incentives for healthcare systems to exploit this possibility are real. Whether it results in equivalent or better patient outcomes, and for which patient populations, are empirical questions that require careful study rather than confident assertion in either direction.
Nursing and allied health professions face a different version of the automation question. Tasks that are physically performed by nurses, such as patient mobility assistance, medication administration, and vital sign monitoring, are candidates for robotic automation. Tasks that involve communication, emotional support, relationship-building, and the exercise of clinical judgment in complex interpersonal contexts are substantially more resistant to automation, though not entirely immune as conversational AI systems improve. The distribution of automation risk across nursing tasks is uneven in ways that have significant implications for nursing practice and for the therapeutic relationship that nurses maintain with patients.
Consent, Privacy, and Oversight Gaps
The informed consent framework that governs clinical research and, in principle, clinical care was developed in a context where the encounter between patient and healthcare system was bounded, identifiable, and finite. A patient consenting to a clinical procedure or a research study understood (in the idealised model) what was being proposed, what the alternatives were, what the risks were, and what the benefits might be. That information was sufficient to support a meaningful choice. AI-enabled medicine has introduced conditions that the consent framework struggles to accommodate.
When a patient's electronic health record data is used to train an AI clinical decision support system, the "consent" involved is typically buried in a terms-of-service document signed at registration, referencing data use policies that most patients have never read. The specific use, training an AI system that will make recommendations affecting the care of future patients, is neither disclosed nor explained in terms accessible to the average patient. The data may be used by the healthcare institution itself, by a technology partner, or by a third-party AI developer, with each step in the chain involving additional complexity that the original consent relationship was not designed to manage.
Price and Cohen (2019) note that HIPAA's provisions for research use of health data, through the "minimum necessary" standard and the de-identification pathway, were designed for an era of hypothesis-driven medical research in which data was used to answer specific, pre-specified questions. AI model training does not fit this paradigm cleanly. Training a large AI model involves exposing it to the entirety of available data, not a theoretically justified subset. The concept of "minimum necessary" data for AI training is difficult to operationalise meaningfully.
Regulatory oversight of AI in clinical practice has been fragmented and inconsistent. The FDA's Software as a Medical Device (SaMD) framework provided a pathway for AI clinical tools to receive regulatory approval, but the majority of AI tools deployed in clinical settings, including clinical decision support tools explicitly exempted from device regulation under the 21st Century Cures Act, did not require pre-market regulatory review. The exemption, intended to facilitate innovation in clinical software, created a category of clinical tools with significant potential to affect patient safety that received no mandatory efficacy or safety evaluation before deployment.
Mittelstadt and colleagues (2016) argued that the opacity of algorithmic decision-making, specifically the difficulty of understanding why an AI system produces a particular output, creates accountability gaps that existing governance frameworks are ill-equipped to address. In clinical contexts, this opacity has specific consequences. A physician who makes a clinical error can be asked to explain their reasoning, and that explanation can be evaluated by peers, disciplinary bodies, and courts. An AI system that produces a harmful recommendation cannot provide this explanation in the same sense. The "black box" character of deep learning systems, technically explicable to experts through attention maps and feature importance analyses but not interpretable in the way that clinical reasoning is interpretable, creates a genuine accountability deficit.
The International Medical Device Regulators Forum (IMDRF) and the FDA have worked toward frameworks for adaptive AI/ML medical devices, meaning systems whose algorithms change over time as they are exposed to new data, that attempt to address this challenge. Continuous learning systems raise particularly acute oversight questions: if a system's algorithm changes after deployment, how does the regulatory authorisation granted at the time of initial approval remain valid? How are downstream effects on patient care monitored? How are clinicians informed of model updates that may change the system's recommendations? These questions remain without fully satisfactory answers across regulatory jurisdictions.
Equitable AI in Healthcare: What It Would Require
AI in healthcare today is characterised by demonstrated technical capability in constrained domains, significant bias in deployment, inadequate oversight frameworks, and inequitable data practices. The path to a future in which AI genuinely reduces health disparities and improves outcomes equitably across populations is neither short nor simple. But the components of that path are sufficiently clear that the absence of progress along them can be recognised as a choice rather than an inevitability.
Representative training data is the most foundational requirement. AI systems trained on non-representative datasets will not perform equitably across the full diversity of patient populations, and no amount of post-hoc calibration fully compensates for training data that systematically underrepresents specific populations. This requires active investment in data collection from underrepresented communities, including communities that have historically had reason to distrust medical research institutions. The Tuskegee Syphilis Study and other historical research abuses created a legacy of justified mistrust in African American communities that continues to affect research participation and clinical data availability. Equitable AI in healthcare is inseparable from the broader project of repairing the relationship between medical institutions and underserved communities.
Prospective algorithmic auditing, meaning mandatory evaluation of AI system performance across demographic subgroups before deployment and continuously thereafter, is a technical requirement that has regulatory implications. The FDA's action plan for AI/ML-based software as a medical device, published in 2021, acknowledged the need for transparency about training data characteristics and performance across demographic groups. Implementation has been uneven. A regulatory requirement for pre-deployment demographic subgroup performance disclosure, with meaningful thresholds for acceptable disparity, would create enforceable standards where voluntary commitments have proven insufficient.
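What such an audit involves computationally is not complicated; the difficulty is institutional and regulatory. A minimal sketch of the pre-deployment step might look like the following (in Python; the metric set, group labels, and 0.05 disparity threshold are illustrative assumptions, not a regulatory standard):

```python
# Illustrative subgroup performance audit for a binary classifier.
# The groups, metrics, and disparity threshold are assumptions for the sketch.
from dataclasses import dataclass

@dataclass
class SubgroupMetrics:
    group: str
    sensitivity: float
    specificity: float
    n: int

def audit(y_true, y_pred, groups, max_gap=0.05):
    """Compute sensitivity and specificity per group, then flag any pair of
    groups whose metrics differ by more than `max_gap`."""
    results = []
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
        tn = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 0)
        fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
        sens = tp / (tp + fn) if (tp + fn) else float("nan")
        spec = tn / (tn + fp) if (tn + fp) else float("nan")
        results.append(SubgroupMetrics(g, sens, spec, len(idx)))
    flags = []
    for a in results:
        for b in results:
            if a.group >= b.group:
                continue
            if abs(a.sensitivity - b.sensitivity) > max_gap:
                flags.append((a.group, b.group, "sensitivity", round(abs(a.sensitivity - b.sensitivity), 3)))
            if abs(a.specificity - b.specificity) > max_gap:
                flags.append((a.group, b.group, "specificity", round(abs(a.specificity - b.specificity), 3)))
    return results, flags

# Hypothetical validation-set labels, model predictions, and demographic groups.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
metrics, flagged = audit(y_true, y_pred, groups)
for m in metrics:
    print(m)
print("disparities above threshold:", flagged)
```

A production audit would need more than this, notably uncertainty estimates, since small subgroups make point estimates of sensitivity and specificity unstable, but the computation itself is not where the obstacle lies.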
Meaningful consent for health data use in AI development requires a different architecture than the current terms-of-service model. Several proposals offer paths toward more genuine consent, including dynamic consent platforms that allow patients to make granular, ongoing choices about data use, and fiduciary frameworks that require healthcare institutions to act in patients' interests in data governance decisions. None has achieved widespread adoption, partly because the current system is financially advantageous to the institutions that would have to change it.
Clinical AI governance at the institutional level, including multidisciplinary review of AI tools before deployment, ongoing performance monitoring, and clear accountability structures for AI-related adverse events, is achievable within existing institutional frameworks and does not require regulatory change. The American Medical Association, the American College of Radiology, and other professional organisations have issued guidance on AI governance. Uptake has been variable, and the guidance is frequently insufficiently specific to provide operational direction. More detailed, mandatory governance frameworks, analogous to institutional review board requirements for human subjects research, would raise baseline standards across the sector.
Ultimately, equitable AI in healthcare requires confronting the fact that healthcare inequity is not primarily a data problem or an algorithm problem. It is a structural problem, rooted in the distribution of resources, power, and opportunity in the broader society. AI cannot correct inequities in access to care, in exposure to environmental health risks, in the economic determinants of health, or in the structural racism that produces differential health outcomes across racial and ethnic groups. What AI can do is avoid amplifying those inequities by learning from data that reflects them, and be designed and deployed in ways that explicitly prioritise equity rather than treating it as a secondary consideration. That requires technical choices, regulatory requirements, and institutional commitments. It also requires political will, which is ultimately a social question rather than a technical one.
References
- Obermeyer, Z., et al. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453.
- Topol, E.J. (2019). High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25, 44–56.
- Rajpurkar, P., et al. (2017). CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv:1711.05225.
- Information Commissioner's Office (ICO) (2017). Royal Free–Google DeepMind trial: investigation into the NHS Streams app data sharing.
- Esteva, A., et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115–118.
- Mittelstadt, B.D., et al. (2016). The ethics of algorithms: Mapping the debate. Big Data & Society, 3(2).
- Price, W.N., & Cohen, I.G. (2019). Privacy in the age of medical big data. Nature Medicine, 25, 37–43.
- Wu, E., et al. (2021). How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nature Medicine, 27, 582–584.
- Char, D.S., Shah, N.H., & Magnus, D. (2018). Implementing machine learning in health care: addressing ethical challenges. New England Journal of Medicine, 378, 981–983.
- Adamson, A.S., & Smith, A. (2018). Machine learning and health care disparities in dermatology. JAMA Dermatology, 154(11), 1247–1248.