DH14 - They Said, It Said: Applying User Research to Test and Refine a Machine Learning Model for Sepsis Recognition
BACKGROUND:
Research on applying Machine Learning (ML) to healthcare is expanding rapidly. The application of user-centered methods to ML has also been a growing topic of research represented by Human-Centered Machine Learning (HCML), Explainable Artificial Intelligence (XAI), Interactive Machine Learning (IML), and Human in the Loop (HITL), all of which promote the importance of trust, “explainability,” user interfaces accessible to non-expert users, and methods to engage end users in ML development.[1,2] Human factors professionals have the opportunity to contribute to this work, but there is a need for training in data science and to adapt human-centered design methods to ML.[3]
Sepsis is a major cause of severe illness and death in infants. Although sepsis is less common in healthy full-term infants, it is significantly more prevalent in those with medical complications, making diagnosis challenging. Early diagnosis and treatment of sepsis are crucial to preventing infant mortality.[4] Our data science team developed an ML model to improve sepsis recognition in neonatal intensive care unit (NICU) patients. The model was trained on data from our NICU sepsis registry, which is automatically populated with electronic health record (EHR) data including demographics, hourly vital signs, diagnoses, antibiotics, and labs from the entire hospitalization. Data from 618 patients, comprising 1,188 sepsis evaluations (110 culture-positive, 265 clinical sepsis diagnoses, 813 negative evaluations), were used to train the model.[5]
The model was trained on 28 patient features. Baseline features include age, diagnosis, mechanical support, and other information. Dynamic features include vital signs, FiO2, apnea/bradycardia events, and other symptoms. The model output includes an hourly assessment of sepsis risk and Shapley Additive Explanations (SHAP) values that indicate the importance of each feature to the prediction.
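To illustrate what a SHAP value represents, the sketch below computes exact Shapley values by brute-force subset enumeration for a tiny, hypothetical two-feature risk score (the real model has 28 features and uses an approximation library, not this enumeration; all feature names and coefficients here are invented for illustration):

```python
from itertools import combinations
from math import factorial

# Hypothetical hourly risk score over two stand-in features
# (heart rate, temperature deviation). Invented for illustration only.
def risk(active, x, baseline):
    # Features outside the active set are held at their baseline value.
    v = {f: (x[f] if f in active else baseline[f]) for f in x}
    return 0.02 * v["hr"] + 0.5 * v["temp"] + 0.01 * v["hr"] * v["temp"]

def shapley_values(x, baseline):
    """Exact Shapley values via subset enumeration (feasible only for tiny n)."""
    features = list(x)
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for s in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (risk(set(s) | {f}, x, baseline)
                                   - risk(set(s), x, baseline))
        phi[f] = total
    return phi

x = {"hr": 180, "temp": 1.5}         # current hourly observation
baseline = {"hr": 140, "temp": 0.0}  # reference ("typical") values
phi = shapley_values(x, baseline)
# Efficiency property: the per-feature contributions sum exactly to
# f(x) - f(baseline), which is what makes SHAP values interpretable
# as a decomposition of the risk prediction.
```

Each hourly prediction thus comes with one such additive decomposition, which is what the mockups described below display per feature.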
Our NIH grant (1R01LM013526-01A1, Grundmeier/Harris) with the Children’s Hospital of Philadelphia and Cincinnati Children’s Hospital aims to develop methods of representing the model to NICU clinicians to improve sepsis recognition. In initial user-centered work, we interviewed 31 NICU clinicians to elicit information on sepsis recognition, and to develop user profiles and use cases.[5] We performed an additional 30 interviews for more detailed descriptions of sepsis recognition.
METHODS:
We developed mockups to present model outputs for formative user testing. The design was based on XAI literature and strategies for displaying SHAP values for a single observation. The mockup has four components: 1) numeric/graphic presentation of model sepsis risk; 2) table of feature values (hourly); 3) graph of all features and SHAP values; 4) ordered graph of the most important features and SHAP values.
We randomly sampled 30 patients from the registry to test with the mockup. We quickly discovered that static mockups could not support the exploration of multiple patients, or of a single patient over time. To address this, we created a lower-fidelity version of the mockup in Excel that displayed over 70 model data variables and used R to produce time-series graphs of model data. This approach allowed us to step through and visualize 72 hours of model output for each patient: 48 hours prior to the sepsis evaluation and 24 hours after. The mockup supported a quick turnaround in uploading modified model data for an iterative review process.
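The 72-hour review window can be sketched as a simple time filter over hourly model output. This is a minimal illustration of the windowing described above, not the project's actual R/Excel tooling; the row structure and timestamps are assumptions:

```python
from datetime import datetime, timedelta

def review_window(rows, eval_time, pre_hours=48, post_hours=24):
    """Select hourly model-output rows in the 72-hour review window:
    48 hours before the sepsis evaluation and 24 hours after."""
    start = eval_time - timedelta(hours=pre_hours)
    end = eval_time + timedelta(hours=post_hours)
    return [r for r in rows if start <= r["time"] <= end]

# Hypothetical hourly output for one patient over five days.
t0 = datetime(2024, 1, 1)
rows = [{"time": t0 + timedelta(hours=h), "risk": 0.1} for h in range(120)]

# Evaluation at hour 60 -> window covers hours 12 through 84 inclusive.
window = review_window(rows, eval_time=t0 + timedelta(hours=60))
```

Stepping through patients then amounts to calling this selection once per sampled evaluation and re-plotting.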
Our team, including a neonatologist, informatics expert physician, data scientist, developer analyst, and human factors (HF)/human-computer interaction (HCI) specialists, reviewed multiple patients with the mockup and discovered anomalies that were explored in more detail via clinician interview results, patient chart reviews by the neonatologist, and insight into the ML by data scientists.
RESULTS:
The mockup-facilitated review revealed multiple inconsistencies between the model output and the user-research analysis. These anomalies could be classified as: 1) feature inclusion/exclusion, 2) feature importance/unimportance, and 3) feature values over time. The review resulted in over forty code changes to the model, documented in our Git repository.
Examples of feature inclusion/exclusion included lethargy and mean arterial pressure (MAP). Practically all NICU clinicians interviewed described lethargy as one of the most important indicators of neonatal sepsis. While somewhat subjective in comparison to a vital sign or other symptoms, mental status is documented in the EHR, and the model was modified to include this feature. MAP is highly correlated with systolic blood pressure (included in the model) and was determined to be redundant and removed from the model.
Issues of feature importance were exemplified by apnea/bradycardia, or episodic, events. Most interview participants described episodic events as important to sepsis recognition. Despite this, the model rarely indicated this feature as important, even in severely ill patients with confirmed sepsis where a chart review noted concern over repeated and worsening episodic events. Episodic events are charted in EHR flowsheets by type (apnea or bradycardia), frequency, and duration. Interview participants described frequency and duration as important to sepsis recognition, and many discussed the subtleties of less severe yet frequent events. By contrast, the model treated episodic events as a binary variable based on any event occurring within 12 hours. The model was adjusted to account for event frequency (though not type or duration), and this change increased the importance of the feature.
Inconsistent feature and SHAP values over time revealed issues with the model's access to EHR data. For example, clinician data entry errors in temperature and other vital signs impacted the model, often dramatically. Other anomalies revealed errors, or required modifications, in how the model accessed EHR data (e.g., errors in accessing data on central lines, and modification of a feature identifying a diagnosis of chronic lung disease).
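One common mitigation for such entry errors is a plausibility screen on incoming vital signs before they reach the model. The sketch below is a hypothetical illustration of that idea, not the project's implementation, and the bounds are illustrative rather than clinically validated:

```python
def screen_temperatures(temps_c, low=30.0, high=43.0):
    """Split recorded temperatures (Celsius) into plausible values and
    likely data-entry errors. Bounds are illustrative only."""
    kept = [t for t in temps_c if low <= t <= high]
    flagged = [t for t in temps_c if not (low <= t <= high)]
    return kept, flagged

# 98.6 is a Fahrenheit reading charted in a Celsius field --
# the kind of entry error that can swing a model prediction.
kept, flagged = screen_temperatures([36.8, 37.1, 98.6, 37.4])
```

Flagged values can then be excluded, imputed, or routed back for correction rather than silently distorting the hourly risk score.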
Overall, this approach, combining user research data, mockup user interfaces with real patient and model data, and collaboration among team members from HF/HCI, medicine, and data science, supported the identification of anomalies in the model that would not otherwise have been detected.
DISCUSSION:
The process we applied to iterative ML model refinement emerged ad hoc, but was driven by user-centered data, methods, and objectives. While by no means complete or fully developed, we believe our work demonstrates opportunities to explore the application of user-centered methods to the development and validation of ML models, incorporating data from user research and prototyping tools that support dynamic, efficient visualization and exploration of model output. However, the collaboration and joint expertise of human factors experts, clinicians, and data scientists was critical: no single area of expertise alone could fully identify, explain, or resolve the ML anomalies.
REFERENCES:
1. Fiebrink, R., & Gillies, M. (2018). Introduction to the special issue on human-centered machine learning. ACM Transactions on Interactive Intelligent Systems (TiiS), 8(2), 1-7.
2. Dudley, J. J., & Kristensson, P. O. (2018). A review of user interface design for interactive machine learning. ACM Transactions on Interactive Intelligent Systems (TiiS), 8(2), 1-37.
3. Hannon, D., …, & Lee, J. D. (2019, November). A human factors engineering education perspective on data science, machine learning and automation. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 63, No. 1, pp. 488-492). Los Angeles, CA: SAGE Publications.
4. Rudd, K. E., Johnson, S., ... & Naghavi, M. (2020). Global, regional, and national sepsis incidence and mortality, 1990–2017: analysis for the Global Burden of Disease Study. The Lancet, 395(10219), 200-211.
5. Masino, A. J., Harris, M. C., Forsyth, D., ... & Grundmeier, R. W. (2019). Machine learning models for early sepsis recognition in the neonatal intensive care unit using readily available electronic health record data. PloS one, 14(2), e0212665.
6. Karavite, D. J., Harris, M. C., … & Muthu, N. (2022). Using a Sociotechnical Model to Understand Challenges with Sepsis Recognition among Critically Ill Infants. ACI Open, 6(02), e57-e65.
Event Type: Poster Presentation
Time: Tuesday, March 26, 4:45pm - 6:15pm CDT
Location: Salon C
Digital Health
Simulation and Education
Hospital Environments
Medical and Drug Delivery Devices
Patient Safety Research and Initiatives