Researchers test the power of machine learning to study long Covid

Long Covid, with its constellation of symptoms, is proving a difficult moving target for researchers trying to conduct large studies of the syndrome. As they take aim, they are debating how to responsibly use growing piles of real-world data – data that draws on the full experiences of long Covid patients, not just their participation in supervised clinical trials.

“People have to really think carefully about what that means,” said Zack Strasser, an internist at Massachusetts General Hospital who used existing patient records to study the characteristics of long Covid. “Is that true? Isn’t that an artifact that just happens because of the people we’re looking at in the electronic health record? Because there’s bias.”

One of the biggest sources of real-world long Covid data is a unique centralized federal database of electronic health records called the National Covid Cohort Collaborative, or N3C. Launched as part of a $25 million award from the National Institutes of Health early in the pandemic, N3C now includes anonymized patient data from 72 sites across the country, representing 13 million patients and nearly 5 million Covid cases.


“If we are able to identify these kind of constellations of symptoms that make up these potential long Covid subtypes, then, first of all, we might discover that long Covid is not a disease, but it is five diseases or 10 diseases,” said Emily Pfaff, who co-leads the long Covid task force at N3C. The real-world data effort received additional funding as part of RECOVER, the NIH’s four-year initiative to study long Covid, to further characterize the syndrome.

This work has begun to paint a clearer picture of long Covid, most recently describing concurrent clusters of cardiopulmonary, neurological and metabolic diagnoses. But a firmer definition of the syndrome could also support recruitment efforts for critical long Covid trials, some of which have been slow to progress.


“There are concerns that the long Covid trials will not be successful” because the syndrome’s definition is still so diffuse, said Melissa Handel, a health informatics researcher at the University of Colorado Anschutz Medical Campus and co-lead of N3C.

Supporting more targeted recruitment is what Pfaff calls the “sweet spot” of the project. She and her colleagues hope that machine learning models could help identify potential participants who would otherwise be missed or underrepresented in prospective research. And by using algorithmic approaches to narrow down a cohort of people who are more likely to have long Covid, Pfaff said, “a research coordinator calling potential participants makes calls from a list of 200 patients, instead of 2 million patients.”

This effort is still ongoing. The team’s first attempt to build an algorithm that could identify long Covid patients, published in a preprint now accepted at Lancet Digital Health, had its limitations. At that point, “there was literally no structured way for a physician to enter ‘I think this patient has long Covid’ into their EHR,” Pfaff said. “We had to get creative and find a proxy.” They settled on the records of around 500 patients who presented to three long Covid specialty clinics.

The model performed decently when tested on records from a fourth clinic, distinguishing between long Covid clinic patients and non-patients with an area under the curve of 0.82, a performance measure used by machine learning researchers. But it was still based on a small number of patients who could be demographically skewed. And Pfaff pointed out that the data could overrepresent long Covid patients with respiratory symptoms because two of the clinics used for training the models were based in pulmonary wards.
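For readers unfamiliar with the metric, the area under the ROC curve can be read as the chance that a randomly chosen true patient gets a higher model score than a randomly chosen non-patient: 0.5 is no better than a coin flip and 1.0 is a perfect ranking. The sketch below illustrates that definition in Python. It is not the N3C team's code, and the labels and scores are made-up toy values, not study data.

```python
from itertools import product


def auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    Returns the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = long Covid clinic patient, 0 = not a clinic patient.
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical model scores (higher = more likely long Covid).
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

print(f"AUC = {auc(labels, scores):.2f}")
```

An AUC of 0.82, as reported for the fourth clinic, would mean the model ranks a true clinic patient above a non-patient in roughly 82% of such pairs.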

Since that work, medicine has gained a better awareness, if not necessarily a better understanding, of long Covid. In October, providers were finally able to track long Covid patients with a dedicated diagnostic code. This “will be very important for recruitment,” said Lorna Thorpe, co-investigator for the RECOVER Clinical Science Core at NYU Langone Health. It can both provide a simple way to identify long Covid patients – there are 16,000 with the code in N3C so far – and help develop a clearer definition of the syndrome.

“Ultimately, the idea is to characterize the subtypes of long Covid that health care providers should expect to see in their clinics,” said Charisse Madlock-Brown, a health informatician at the University of Tennessee Health Science Center and co-lead of the N3C social determinants of health team.

But the code could also be used to refine the next generation of N3C models, teaching the algorithms what to look for in electronic health records that might suggest a patient has long Covid – even if the code isn’t used.

“A big part of getting a long Covid diagnosis seems to have a lot to do with your access to care, as well as finding a doctor who even knows what long Covid is and is able to treat you,” said Pfaff. An algorithmic approach to recruitment could potentially help include patients who do not have this access.

So now the team is training models that learn both from clinic patients and from those whose doctors checked off the new diagnostic code, in hopes of defining a “best-in-breed” classifier. When the group applied the latest version to N3C records, it found 158,000 potential long Covid patients, Pfaff said.

This is not to say that the model can or should be turned immediately to patient recruitment. Researchers from N3C and the broader RECOVER initiative emphasize that algorithmic approaches are not a silver bullet and will always need to be combined with human oversight to build study cohorts.

Indeed, any bias in the data used to train a long Covid model could lead to inaccurate predictions. And while the N3C records have been cleaned up so they’re ready for analysis, “there are caveats to this data,” said Leonie Misquitta, whose clinical innovation team at the NIH’s National Center for Advancing Translational Sciences manages the data platform. There are almost twice as many female patients with long Covid codes in the system as there are male patients – which could be the result of patient behaviors, coding practices, biological realities or all of the above. In a more egregious example, a clustering algorithm initially identified sexual activity as a comorbidity of long Covid because of how one site documented its patients.

“I think it’s an important approach. I’m super supportive of that, and we’re communicating that to the NIH,” Thorpe said. “But it won’t be the perfect solution. Let’s be realistic. Recruitment will increase, it will gradually improve, with all the different strategies that are applied.”

The N3C team will continue to refine its models as new real-world data emerges. In particular, the researchers are interested in building a machine learning classifier that could identify long Covid patients with subtypes of the disease, such as those with new-onset diabetes or certain types of kidney disease. “It may be easier to find people with the most common phenotypes,” said Jasmin Divers, another leader of RECOVER’s real-world data efforts at NYU Langone. “But if you wanted to fill a specific subset that you don’t see as often, then having that enriched pool to draw from and recruit could be beneficial.”

And most importantly, they will aim to test their predictions on new data sets as they come in, to see if the results hold up across different health care systems. “In medicine, the stakes are always high,” Strasser said. “I always err on the side of making sure things are working properly beforehand and things are really validated before moving forward with technology like this.”

But while they recognize the limitations of real-world datasets and the algorithms trained on them, the N3C researchers argue that using such models to identify trial cohorts is relatively low risk. “If someone from a university was running a long Covid trial and asked me if I felt comfortable applying this model to help them build a potential recruiting roster,” Pfaff said, “I would say unequivocally yes.” They could give certain recruiting sites lists to follow up on, using a third-party intermediary to protect personally identifiable information, or give them code to run on their internal records to identify potential participants.

N3C leaders said the platform is primed to support recruitment. Integrating the group’s EHR resources with the identification of clinical cohorts was part of N3C’s initial proposals for RECOVER funding, but so far the NIH has not funded this use of the tool. “The kind of initial framing of the EHR cohorts work was rather a quick strike: Let’s understand [post-acute sequelae of SARS-CoV-2 infection], let’s characterize it. It was not in their contract with the NIH to do that,” Thorpe said.

“We have to wait for the NIH to say yes, these are the things we want you to prioritize and here’s the budget for those things,” Handel said. “Recruiting sites and the data engineering and N3C team are ready to do such things, but there needs to be resources and coordination.”
