A Q&A with IBM’s Dr. Jimeng Sun
Dr. Jimeng Sun joined IBM Research after earning his PhD in healthcare analytics
research from Carnegie Mellon University. For the past four years, he has been
developing data mining algorithms and systems for healthcare analytic
applications. His team, in partnership with Sutter Health and Geisinger Health
Systems, recently earned a $2 million grant from the National Institutes of
Health to “develop new analytics methods to help predict and identify early
signs of risk for heart failure.”
Sun’s analytics research will help develop accurate and robust predictive models
for the early detection of heart failure using Electronic Health Records (EHR)
data.
How did your team access EHR data in order
to test this analytics technology?
It took more than six months to get access to the real EHR data needed to build
our models. Aside from the legal agreement, we also implemented a set of
elaborate procedures for receiving and accessing EHR data in order to guard the
security and privacy of actual patient data.
We also created a course on Protected Health Information (PHI), which we require
of the researchers and developers who work on such data. It’s an issue we have
to take seriously.
Medical text is also incorporated into this
analytics technology. Is this the Watson technology? If not, how is it
different?
We do utilize Unstructured Information Management Architecture (UIMA) to extract
the known signs and symptoms of heart failure from available text. A similar
Natural Language Processing (NLP) technology is also used in Watson; the
difference is how we use the data after extraction.
We use those extractions as features, along with many other features from
structured information, to feed the subsequent analysis.
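As a rough illustration of that last step, the sketch below turns symptom mentions in a clinical note into binary features and merges them with structured features. It is not UIMA and not the team's pipeline: simple keyword matching stands in for the NLP stage, and the pattern list, feature names, sample note, and the functions `text_features` and `build_feature_vector` are all hypothetical.

```python
import re

# Hypothetical patterns for a few Framingham heart failure symptoms.
# Plain keyword matching is a stand-in for a real NLP pipeline: it does
# not handle negation, so "no rales" would still count as a mention.
SYMPTOM_PATTERNS = {
    "ankle_edema": re.compile(r"\bankle (o)?edema\b", re.IGNORECASE),
    "dyspnea_on_exertion": re.compile(r"\bdyspnea on exertion\b", re.IGNORECASE),
    "rales": re.compile(r"\brales\b", re.IGNORECASE),
}


def text_features(note):
    """Return one 0/1 feature per symptom pattern found in the note."""
    return {
        f"text_{name}": int(bool(pattern.search(note)))
        for name, pattern in SYMPTOM_PATTERNS.items()
    }


def build_feature_vector(note, structured):
    """Merge structured features (diagnoses, medications) with text features."""
    features = dict(structured)
    features.update(text_features(note))
    return features


# Made-up example note and structured record.
note = "Patient reports dyspnea on exertion. Mild ankle edema noted."
structured = {"dx_hypertension": 1, "dx_diabetes": 0}
vector = build_feature_vector(note, structured)
print(vector)
```

The resulting dictionary is the kind of combined feature vector that could be passed to whatever model is used downstream.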
What kind of data from the EHRs indicated a
higher risk of heart disease?
For this specific dataset with Geisinger Health Systems, we have information on
more than 30,000 patients, of whom about 5,000 are confirmed heart failure
patients. The other patients in the dataset are controls.
These patient records account for more than 10 years of longitudinal data. While
some patient records have more information than others, it is a very impressive
and comprehensive dataset for studying heart failure.
The challenge in differentiating heart failure patients from the controls, prior
to diagnosis, is that there is no single strong indicator. But there are many
weak indicators: co-morbidities such as hypertension and diabetes, associated
medications, and Framingham heart failure symptoms that we can extract from
text. The hope is that by combining a large number of weak indicators we can
still develop an accurate and robust predictive model.
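The intuition behind combining weak indicators can be shown with a small simulation. This is a sketch on synthetic data, not the team's model or the Geisinger dataset: the indicator rates (40% in cases, 30% in controls), the cohort sizes, and the simple "count the indicators present" score are all assumptions chosen for illustration.

```python
import random


def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control (ties count half)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def simulate_patient(rng, is_case, n_indicators=20):
    """Synthetic patient: each weak indicator is only slightly more common in cases."""
    p = 0.4 if is_case else 0.3  # assumed rates, for illustration only
    return [1 if rng.random() < p else 0 for _ in range(n_indicators)]


rng = random.Random(0)
cases = [simulate_patient(rng, True) for _ in range(300)]
controls = [simulate_patient(rng, False) for _ in range(300)]

# How well does each indicator separate cases from controls on its own?
single_aucs = [
    auc([p[j] for p in cases], [p[j] for p in controls]) for j in range(20)
]
best_single_auc = max(single_aucs)

# Combined score: the number of weak indicators present for each patient.
combined_auc = auc([sum(p) for p in cases], [sum(p) for p in controls])

print(f"best single indicator AUC: {best_single_auc:.2f}")
print(f"combined score AUC:        {combined_auc:.2f}")
```

In this toy setting no single indicator separates the groups well, while the summed score does noticeably better. A real model would learn a weight for each indicator rather than weighting them equally.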
What kind of IT infrastructure does this
analytics technology need?
To facilitate and speed up model development, we use a Hadoop cluster to manage
and schedule tens of thousands of models in parallel. By developing the
predictive models in Hadoop, we are able to reduce the time for large-scale
model building, which typically takes nine days on a single computer, to just
three hours on a small cluster.
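The pattern described here, many independent model builds scheduled in parallel, can be sketched on one machine. The example below is not Hadoop and not the team's code: a thread pool stands in for the cluster scheduler, `train_and_score` is a hypothetical placeholder with a made-up scoring formula, and the configuration grid is invented. The final lines work out the speedup implied by the figures quoted above.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(config):
    """Placeholder for fitting and validating one model.

    A real task would train on patient data; this toy formula simply
    peaks at 30 features and a regularization strength of 0.1.
    """
    n_features, reg = config
    score = -abs(reg - 0.1) - 0.01 * abs(n_features - 30)
    return config, score


# Hypothetical grid of model configurations; each one is independent,
# so they can be evaluated in any order and on any worker.
configs = list(product([10, 20, 30, 40], [0.01, 0.1, 1.0]))

# The pool stands in for the cluster scheduler. Threads keep the sketch
# portable; CPU-bound training would need processes or separate machines
# to run truly in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_and_score, configs))

best_config, best_score = max(results, key=lambda r: r[1])
print("best configuration:", best_config)

# Speedup implied by the quoted figures: nine days versus three hours.
single_machine_hours = 9 * 24
cluster_hours = 3
speedup = single_machine_hours / cluster_hours
print(f"speedup: {speedup:.0f}x")  # 216 hours / 3 hours = 72x
```

Because no model depends on another's result, the work divides cleanly across workers, which is what makes a cluster effective for this kind of search.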
What would a doctor – who is using this
technology versus what is available today – see in a patient’s EHR that was not
there before?
With new patients, the data collection has to start from scratch. But overall,
as patients stay with the same doctor or clinic over a long time (which is the
case with Geisinger’s dataset), we will begin to know more and more about those
patients, and can utilize that information for prediction.
Validation in an operational clinical setting is the next step after the current
project. If a patient has a high risk of developing heart failure based on our
predictive model, the system will alert doctors with the risk level and the
associated risk factors derived from similar patients in the past.
The NIH grant is for the next three years.
What are the next steps in the partnership with Sutter Health and Geisinger
Health Systems to test these predictive methods for heart failure?
We hope to conduct a subsequent clinical trial of the resulting predictive
model. Through the trial, we want to show whether using the predictive model to
facilitate clinical decision making for a randomly selected group of patients
leads to better results than current clinical practice. Besides unstructured
text and other structured medical information, we will also look into other
data sources such as electrocardiography (ECG) and genomic data.