Humpty Dumpty fell off a wall, and all the king’s horses and men couldn’t put him back together again. As the second law of thermodynamics predicts, Humpty could never become whole again spontaneously — fixing Humpty would require outside effort, and even then, Humpty may never return fully to his original state.
Researchers recently reported in JAMA that the Epic Sepsis Model (ESM) may be less accurate than claimed, owing to poor discrimination and calibration in predicting the onset of sepsis. The University of Michigan study evaluated the hospitalization-level performance of the model using standard measures (e.g., sensitivity, specificity, positive predictive value, negative predictive value), including the area under the curve, or AUC. [As a quick aside, the AUC’s quirky full name, “area under the receiver operating characteristic curve,” stems from the performance measurement of World War II radar receiver operators.] The Michigan finding, that Epic’s model had an AUC of 0.63, fell far short of the performance data provided by Epic. For context, an AUC of 1.0 means a model’s predictions are perfect, an AUC of 0.5 means the model performs no better than chance, and an AUC of 0.6 to 0.7 is generally considered poor.
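To make the AUC concrete, here is a minimal sketch in Python using scikit-learn. The labels and scores are invented for illustration; they are not ESM output.

# Toy example of the AUC: how well do the scores rank true cases above non-cases?
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # 1 = patient developed sepsis (invented)
y_score = [2, 4, 3, 7, 5, 6, 9, 8, 7, 1]   # hypothetical risk scores, higher = riskier

print(roc_auc_score(y_true, y_score))       # 1.0 = perfect, 0.5 = no better than chance

Ranking every actual sepsis case above every non-case would yield an AUC of 1.0; randomly assigned scores would hover around 0.5.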
The ESM’s fall should prompt health systems to revisit their important prediction models. Lives, dollars, and reputations are at stake.
1. Some models are not transparent
When prediction models are relatively simple and based on logistic regression, their developers can easily share the variables and their coefficients. But when models are based on machine learning, their developers may find it extremely challenging to describe, let alone understand, how the algorithm makes predictions. Complicating matters further, proprietary prediction models are typically opaque to protect intellectual property. ESM customers, including Michigan Medicine, apparently have access to enough details of the model to externally validate it.
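To illustrate the transparency gap, a brief Python sketch (with invented data and feature names) shows how completely a logistic-regression model can be summarized by its coefficients; a machine-learning model generally offers no comparably compact description.

# A logistic-regression model is fully described by its coefficients,
# which makes it easy to share and audit. Data and feature names are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                               # standardized vitals/labs (simulated)
y = (X @ np.array([0.8, 0.5, -0.3]) + rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
for name, coef in zip(["heart_rate", "temperature", "wbc_count"], model.coef_[0]):
    print(f"{name}: {coef:+.2f}")                           # the whole model fits on a napkin
print(f"intercept: {model.intercept_[0]:+.2f}")
# A gradient-boosted or deep-learning model offers no comparably compact summary.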
Takeaway: Health systems need to assess the transparency of a prediction model and determine whether the benefits of adopting an opaque model are worth the risks.
2. Some models lack extensive external validation
When evaluating new technologies, health systems want results from comparable health systems. External validation of prediction models is critically important to ensure that results are generalizable. Prior to the external validation of the ESM in Michigan, an external validation of the same model by a five-hospital system in Colorado measured its AUC at 0.73. Perhaps Epic furnishes prospective customers with compelling external validations not found in the public domain, but given the widespread adoption of the ESM, is that level of validation sufficient? In a JAMA editorial accompanying the Michigan study, Habib et al wisely noted that a prediction model calibrated with data from one time and place needs to be validated and recalibrated in new eras and settings. Populations and clinical practice can vary over time and place.
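As a rough sketch of what external validation involves, the Python snippet below checks both discrimination (AUC) and calibration of a vendor’s risk scores against a health system’s own outcomes. The data here are simulated placeholders, not real ESM scores, so the results will show no real signal.

# External validation: compare the vendor model's predictions with a health
# system's own outcomes, checking discrimination (AUC) and calibration.
# local_labels and local_scores are simulated placeholders, not real ESM data.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)
local_labels = rng.binomial(1, 0.07, size=2000)             # observed sepsis, ~7% prevalence
local_scores = rng.beta(2, 20, size=2000)                   # vendor-style predicted risks

print("AUC:", round(roc_auc_score(local_labels, local_scores), 2))
observed, predicted = calibration_curve(local_labels, local_scores,
                                        n_bins=10, strategy="quantile")
for p, o in zip(predicted, observed):
    print(f"predicted risk {p:.2f} vs observed rate {o:.2f}")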
Takeaway: Health systems need to assess whether the external validation of a prediction model is sufficient prior to a trial, then evaluate the model again after a demo and again after full implementation.
3. Some models lack rigorous internal validation
Further upstream, internal validation should be rigorous. As Martens et al have smartly noted, external validation is needed only when prediction models are worth it. Epic developed and validated its ESM with data from three US health systems from 2013 to 2015. Was that enough data? If data from more health systems had been used initially, the model’s results might have been more generalizable. And if the model’s output corresponded too closely to the initial three health systems (i.e., if it was overfitted), then the ESM could have been less generalizable from the get-go.
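One safeguard during internal validation is to hold out entire health systems rather than random patients, so the model is always tested on a site it never saw during training. The Python sketch below uses simulated data and hypothetical sites A, B, and C to illustrate the idea.

# Leave-one-site-out cross-validation: train on two health systems, test on the third.
# Data and site labels are simulated stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)
X = rng.normal(size=(900, 5))                      # simulated predictors
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))    # outcome driven mostly by one predictor
site = np.repeat(["A", "B", "C"], 300)             # three hypothetical health systems

for train, test in LeaveOneGroupOut().split(X, y, groups=site):
    model = LogisticRegression().fit(X[train], y[train])
    auc = roc_auc_score(y[test], model.predict_proba(X[test])[:, 1])
    print(f"held-out site {site[test][0]}: AUC = {auc:.2f}")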
Takeaway: Health systems should ensure that the sampling and analytic methods used to develop a prediction model were rigorous.
4. Some models use different definitions
Discussing the study in a JAMA podcast, Karandeep Singh, MD, one of the Michigan study coauthors, said, “In 2019, around the same time that we collected data for this study, Epic had shown us [using our data] that the model was achieving an area under the curve of 0.88 at the same time that we were demonstrating the AUC to be 0.63.” Dr. Singh believes the lower AUC reported by his team comes down to differences in how the onset of sepsis was defined. According to Dr. Singh, Epic’s model was based solely on billing codes, and “sepsis onset” was based on the clinical action taken to treat it; in other words, it assumed that the onset of sepsis occurred when clinicians recognized and treated it, not necessarily when it actually occurred. Dr. Singh sums up this point by saying, “The Epic Sepsis Model … in my view, is designed to tell clinicians what they already know.” Ouch!
Takeaway: Health systems should continuously review and, when necessary or desirable, revise definitions and then reconfirm that the resulting predictions are acceptable.
5. Some models use different thresholds
The Michigan study used an ESM score threshold (sometimes referred to as a cutoff) of 6 or higher, the same value Michigan Medicine’s operations committee selected to generate alerts. This value is apparently within the range recommended by Epic. As a result, patients assigned scores of 6 or higher are assumed to have sepsis, and patients assigned scores below 6 are assumed not to have sepsis. In contrast, the Colorado study used an ESM score threshold of 5 or higher. Lowering the threshold (e.g., to 5 or higher) increases the model’s sensitivity but also the proportion of false positives, which can increase alert fatigue; raising the threshold (e.g., to 7 or higher) increases the model’s specificity but also the proportion of false negatives, which can increase the number of avoidable adverse events. When predictions of who has and does not have a condition are less than perfect, as is the case with sepsis, the threshold is an arbitrary line that attempts to strike an appropriate balance between sensitivity and specificity. It’s a tradeoff.
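The tradeoff is easy to see in a few lines of Python. The labels and scores below are invented, and the cutoffs of 5, 6, and 7 simply echo the examples above; as the cutoff rises, sensitivity falls while specificity climbs.

# The same risk scores yield different sensitivity/specificity at different cutoffs.
# Labels and scores are invented; cutoffs 5, 6, and 7 mirror the examples above.
import numpy as np

labels = np.array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1])   # 1 = sepsis (invented)
scores = np.array([2, 5, 6, 4, 8, 7, 3, 9, 5, 6, 1, 7])   # invented risk scores

for cutoff in (5, 6, 7):
    flagged = scores >= cutoff
    sensitivity = (flagged & (labels == 1)).sum() / (labels == 1).sum()
    specificity = (~flagged & (labels == 0)).sum() / (labels == 0).sum()
    print(f"cutoff >= {cutoff}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")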
Takeaway: Health systems need to make measured decisions when selecting a threshold and account for variable thresholds when comparing and contrasting results over time and between health systems.
Putting it together
Health systems “own” the data used to develop and validate their prediction models. And large health systems, particularly those associated with academic medical centers like Michigan Medicine, have the human resources to validate these models, if not develop them themselves. Further, “R,” the open-source statistical computing and graphics software the Michigan researchers used for their advanced analytics, is free. When validating their important prediction models, health systems should not be walking on eggshells.
The second law of thermodynamics predicts that, absent outside inputs, including new data, the performance of prediction models will at best remain the same and will more likely decline over time. Health systems should embrace this reality and either continuously invest in their most important prediction models or prepare for another great fall.