characterising treatment pathways at scale using the OHDSI network

This post is about the paper Characterizing treatment pathways at scale using the OHDSI network from the hefty author list: George Hripcsak, Patrick B. Ryan, Jon D. Duke, Nigam H. Shah, Rae Woong Park, Vojtech Huser, Marc A. Suchard, Martijn J. Schuemie, Frank J. DeFalco, Adler Perotte, Juan M. Banda, Christian G. Reich, Lisa M. Schilling, Michael E. Matheny, Daniella Meeker, Nicole Pratt, and David Madigan.

Let's have at it. Note: including figures is needlessly time-consuming for me, so I'm going to refer to the paper assuming you have it to hand.


They looked at which medications patients received, for one of three diseases (type 2 diabetes, hypertension, depression), considering sequences of medications. Diabetes treatment is mostly dominated by metformin, and there is more variation for the other diseases. Many patients only ever receive metformin. They break it down by medical centre and find heterogeneity between centres (and thus countries). Heterogeneity suggests we attempt to generalise with care.

What is OHDSI?

Pronounced 'Odyssey', OHDSI is the Observational Health Data Sciences and Informatics collaboration. From the website, 'OHDSI has established an international network of researchers and observational health databases with a central coordinating center housed at Columbia University.' I was shamefully unaware of its existence, despite it being very relevant to my interests. Evidence-based medicine through data analysis! International collaboration! Open source! Reproducibility! All great. Fawning section over, on to the contents of the paper.

What did they do?

They analysed data from the OHDSI collection of databases to look at treatment pathways (ordered sequences of medications given to a patient) for three diseases: hypertension, diabetes mellitus type 2, and depression. Details in subsequent sections.

Why did they do it?

This feels like a proof-of-concept paper to me. The concept being that large-scale collaborations involving multiple health centres are possible, and that insights can be gained from analysis of the data. Essentially, the mission of OHDSI. More specifically, supporting the use of observational data to supplement medical research, which classically relies heavily on clinical trials. Observational data is 'free' in a sense (data-collection and storage, privacy-violating concerns temporarily aside), can cover wider populations and goes on indefinitely. Exploiting that has clear benefits. They highlight three key areas of benefit:

  1. Identifying which current therapies should be compared with a new therapy (for experimental design)
  2. Testing clinical hypotheses on observational data (acknowledging the need to do the appropriate statistical modelling)
  3. Understanding population characteristics to aid in extrapolation of results (both observational and experimental)

This study focuses mainly on the first point, as they look at medication trends.

Data resources

OHDSI, at the time of writing, has 52 databases containing 682 million patient records. For this study they used 11 databases with 250 million records. I don't know why they didn't use all the data. These databases were: (this is Table 2)

  • AUSOM (Ajou University School of Medicine, Korea)
  • CCAE (MarketScan Commercial Claims and Encounters, I guess USA)
  • CPRD (UK Clinical Practice Research Datalink)
  • CUMC (Columbia University Medical Centre, USA)
  • GE (General Electric Centricity, I guess USA)
  • INPC (Regenstrief Institute, Indiana Network for Patient Care, USA)
  • JMDC (Japan Medical Data Center)
  • MDCD (MarketScan Medicaid Multi-state, USA)
  • MDCR (MarketScan Medicare Supplement and Coordination of Benefits)
  • OPTUM (Optum ClinFormatics, I guess USA)
  • STRIDE (Stanford Translational Research Integrated Database Environment, USA)

So that's one from the UK, one from Japan, one from Korea and eight from the USA. The biggest population by far was CCAE, which contributed 119 million patients. Japan and Korea only comprised 5 million patients together, and the UK 11 million, so most of these patients are in the USA.

The databases have various types of data in them, which is of great interest to me, but in this study they just extracted medications.

Data processing

Filtering for patients

So: which patients did they include in the analysis?

Patients had to satisfy:

  • ≥ 4 continuous years in the database
    • ≥ 1 year before any treatment for that disease
    • ≥ 3 years of continuous treatment after that (this means patients who died during the period were excluded)
  • ≥ 1 diagnosis code for corresponding disease
  • 0 diagnosis codes for excluded diagnoses (these were: pregnancy for all, diabetes type 1 for diabetes type 2, and bipolar 1 disorder or schizophrenia for depression)
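
To make the criteria concrete for myself, here is a minimal sketch of the filter in Python. Everything in it is my own invention for illustration: the `Patient` record, the diagnosis labels and the `eligible` function are hypothetical and are not OHDSI's schema or the authors' code. It also only checks that the observation window is long enough on either side of the first treatment, which is my simplification; it does not check that treatment was actually continuous.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional

YEAR = timedelta(days=365)  # assumption: a "year" is 365 days

# Hypothetical labels for the excluded diagnoses listed above.
EXCLUDED = {
    "hypertension": {"pregnancy"},
    "diabetes_t2": {"pregnancy", "diabetes_t1"},
    "depression": {"pregnancy", "bipolar_1", "schizophrenia"},
}


@dataclass
class Patient:
    observed_from: date               # start of continuous observation
    observed_to: date                 # end of continuous observation
    first_treatment: Optional[date]   # first drug exposure for the disease
    diagnoses: set = field(default_factory=set)


def eligible(p: Patient, disease: str) -> bool:
    """Apply the inclusion and exclusion criteria for one disease."""
    if p.first_treatment is None:
        return False
    if disease not in p.diagnoses:          # >= 1 diagnosis code
        return False
    if p.diagnoses & EXCLUDED[disease]:     # 0 excluded diagnosis codes
        return False
    has_lookback = p.first_treatment - p.observed_from >= YEAR
    has_followup = p.observed_to - p.first_treatment >= 3 * YEAR
    return has_lookback and has_followup
```

Written this way, the survivorship issue is easy to see: anyone whose `observed_to` falls within three years of their first treatment drops out, whatever the reason.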

This resulted in 1,182,792 hypertension patients, 327,110 diabetes patients, 264,841 depression patients. I'm not sure what the breakdown by centre was.

Excluding patients who died during that period seems problematic to me, because that's probably not a random event. I worry about excluding subpopulations with more aggressive forms of the disease, or excluding badly-treated patients (although that's slightly outside the scope of this paper I think, but is a question of particular interest to me). The phenotype here is already incredibly broadly defined - what if the observed heterogeneity in treatment pathways is due to such subpopulations? I'm not sure what a better approach here would have been, though - exclude patients who died of reasons unrelated to the disease, perhaps?

Data standardisation

Diagnoses were defined by mapping SNOMED (Systematized Nomenclature of Medicine) and Medical Dictionary for Regulatory Activities to ICD-9-CM (International Classification of Diseases, ninth revision, clinical modification). Medications were defined by their ingredients using RxNorm, and grouped according to classification hierarchies (such as, they state, Anatomical Therapeutic Chemical classification and First Data Bank's terminology). I'm not especially familiar with these ontologies, except for SNOMED. Most of what I've done to date involved UMLS (which contains SNOMED and possibly everything else that has existed).

Constructing medication sequences

Having filtered to these patients they queried the OHDSI databases for the sequences of medications for these patients. Some notes on this:

  • sequences were limited to a maximum of 20 medications
  • if a patient switched from one medication and then later back to it, only the first exposure was recorded
  • combination medications (with multiple active ingredients) were treated as prescriptions of multiple single-ingredient medicines
  • I don't think the time between medications is considered - they're just ordered sequences of drugs
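
As I read them, the rules above amount to something like the following sketch. The input format (a list of `(start_date, ingredients)` pairs) and the name `build_pathway` are assumptions of mine, not anything from the paper. I also don't know how the authors ordered the ingredients of a combination medication, so here they simply keep the order they are listed in.

```python
from datetime import date

MAX_LENGTH = 20  # sequences were limited to 20 medications


def build_pathway(exposures):
    """Turn (start_date, ingredients) pairs into an ordered pathway.

    `ingredients` is a tuple of ingredient names, so a combination
    medication is simply a tuple with more than one entry.
    """
    pathway = []
    seen = set()
    # Only the order of the dates matters, not the gaps between them.
    for _, ingredients in sorted(exposures, key=lambda e: e[0]):
        for ingredient in ingredients:
            if ingredient in seen:
                continue  # switching back to an earlier drug is not recorded
            seen.add(ingredient)
            pathway.append(ingredient)
            if len(pathway) == MAX_LENGTH:
                return tuple(pathway)
    return tuple(pathway)


# Toy example, not real patient data.
exposures = [
    (date(2010, 3, 1), ("glipizide",)),
    (date(2009, 1, 1), ("metformin",)),
    (date(2011, 1, 1), ("metformin", "sitagliptin")),  # combination product
]
pathway = build_pathway(exposures)
# pathway == ("metformin", "glipizide", "sitagliptin")
```

Returning a tuple rather than a list means pathways can be used directly as dictionary keys when counting them.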

Having defined these sequences, they then counted the numbers of patients with each sequence and did other analyses. For example, they looked at medication classes, which are listed in table 1.

What did they find?

Also known as: let's look at the figures!

Figure 2

Which drugs do patients get first? Is there a standard entry into treatment-for-disease?

For diabetes, it seems yes. 76% of patients start with metformin. For hypertension, hydrochlorothiazide is sort of most popular (I am squinting at the figure), and in depression citalopram is also sort of most popular, but there's no clear winner. This is where I wonder about subpopulations. The immediate questions are: what's different about these patients? Why did they receive a different first medication? Does it vary by centre (yes - see figure 3)? By other diagnoses? Age? So many variables to consider! (I realise that this paper cannot answer all of these questions and I'm not criticising it - the results just inspire further research.)

Do patients stay on a single drug?

For diabetes, 29% of patients took only metformin. For hypertension, 6.44% took only lisinopril. For depression, 5.18% took only citalopram. Once again I wonder what this means. Was this medication especially effective for them, and if yes why? We see the potential for this large-scale observational data to shed light on differences in response to therapy that might be missed on the smaller-scale of a clinical trial. Maybe.
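
Both kinds of percentage quoted so far (the share of patients starting on a drug, and the share who only ever take it) are simple to compute once each patient is reduced to a pathway. A minimal sketch, assuming pathways are tuples of ingredient names; the function names and the toy data are mine, not the paper's:

```python
from collections import Counter


def first_line_share(pathways):
    """Most common first drug, and the fraction of patients starting on it."""
    firsts = Counter(p[0] for p in pathways if p)
    drug, n = firsts.most_common(1)[0]
    return drug, n / len(pathways)


def monotherapy_share(pathways, drug):
    """Fraction of patients whose whole pathway is just `drug`."""
    return sum(1 for p in pathways if p == (drug,)) / len(pathways)


# Toy pathways, not real data.
toy_pathways = [
    ("metformin",),
    ("metformin", "glipizide"),
    ("metformin",),
    ("glipizide",),
]
top_first = first_line_share(toy_pathways)              # ("metformin", 0.75)
only_metformin = monotherapy_share(toy_pathways, "metformin")  # 0.5
```

Note that the two numbers answer different questions: a drug can be the most common starting point without being the most common drug to stay on.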

Unique treatment pathways?

Some patients are unique in the entire dataset: 10% of diabetes patients, 24% of hypertension patients, 11% of depression patients have unique treatment pathways. Clearly doing a nearest-neighbour treatment recommendation approach would fail for these patients, although I wonder if these patients may simply have rather long sequences of medications? It might be in the supplemental data, but I wonder what the distribution of sequence length is.
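
If I had the pathways to hand, checking the sequence-length question would only take a few lines. This is a sketch under my own assumptions (pathways as tuples, hypothetical function names, toy data), not a reconstruction of the authors' analysis:

```python
from collections import Counter


def unique_fraction(pathways):
    """Fraction of patients whose pathway is shared with nobody else."""
    counts = Counter(pathways)
    return sum(1 for p in pathways if counts[p] == 1) / len(pathways)


def length_distribution(pathways, unique_only=False):
    """Histogram of pathway lengths, optionally for unique pathways only."""
    counts = Counter(pathways)
    return Counter(
        len(p) for p in pathways if not unique_only or counts[p] == 1
    )


# Toy pathways, not real data.
toy = [("a",), ("a",), ("b",), ("b",), ("a", "b"), ("a", "b", "c")]
frac = unique_fraction(toy)                          # 2 of 6 patients
lengths = length_distribution(toy, unique_only=True)  # lengths 2 and 3, once each
```

Comparing `length_distribution(toy)` with the `unique_only=True` version would show whether the unique patients are just the ones with long sequences.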

Figure 3

This is figure 2 but broken down by data centre, for some samples. We see immediately that metformin is less popular in the Japanese database than in the UK or US examples shown. I think the overall gist of this figure is that there is between-centre heterogeneity, and also (as in Figure 2) heterogeneity in the choice of second-line drugs. You could definitely look deeper into this data (hence my feeling that this paper is a proof of concept), but there is a risk (as always) of wading around without a clear hypothesis.

Figure 4

The y-axis here is a fraction of patients in the population. The fraction of interest is given by the lettering. x-axis is time, so we're looking at medicating trends.

  • A: patients on monotherapy: this became somewhat more popular
  • B: patients on monotherapy which is the most popular monotherapy for that disease: the medication is listed with the disease now (so this is a subset of the patients in A)
  • C: patients whose first medication started with the most popular starting medication for that disease (not necessarily most popular monotherapy)

The conclusion from B is that monotherapy in diabetes is somewhat dominated by metformin, whereas in hypertension and depression there is more variation.

I don't know how they decided which drug was most popular - is this over all patient trajectories over all time (I suspect yes)? It seems unlikely but the apparent absence of a dominant monotherapy in hypertension and depression could be explained by a strong bias towards some drugs being popular at some times: so at any moment in time there is a dominant monotherapy, but because its identity is always changing, it goes undetected by this analysis. Or, more simply, there is a dominant monotherapy, but it's not lisinopril/sertraline. Would this be an interesting finding? Perhaps. Discovering that medication practices are highly influenced by trends could be a cause for concern. Equally, finding that medication practices lag (between centres or behind research) could also be concerning. Or heartening. Who knows.
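
To show what I mean by a dominant monotherapy whose identity keeps changing, here is a toy sketch comparing the overall leader with the leader in each year. The record format `(start_year, pathway)`, the function names and the placeholder drugs are all made up by me; the numbers are constructed to illustrate the scenario and say nothing about the real data.

```python
from collections import Counter, defaultdict


def dominant(counter):
    """Most common drug and its share of the counted patients."""
    drug, n = counter.most_common(1)[0]
    return drug, n / sum(counter.values())


def monotherapy_leaders(records):
    """records: iterable of (start_year, pathway) pairs."""
    overall = Counter()
    by_year = defaultdict(Counter)
    for year, pathway in records:
        if len(pathway) == 1:  # monotherapy patients only
            overall[pathway[0]] += 1
            by_year[year][pathway[0]] += 1
    return dominant(overall), {y: dominant(c) for y, c in sorted(by_year.items())}


def toy_year(year, counts):
    """Expand {drug: n} into n single-drug records for that year."""
    return [(year, (drug,)) for drug, n in counts.items() for _ in range(n)]


# Constructed so that each year has a clear leader, but a different one.
records = (
    toy_year(2000, {"drug_a": 9, "drug_b": 1})
    + toy_year(2001, {"drug_a": 1, "drug_b": 8, "drug_c": 1})
    + toy_year(2002, {"drug_a": 1, "drug_b": 1, "drug_c": 8})
)
overall_leader, yearly_leaders = monotherapy_leaders(records)
# overall_leader is drug_a with 11 of 30 patients;
# each year's leader has at least 80% of that year's patients.
```

In this toy case, tracking the overall leader (as I suspect panel B does) would show it fading over time, while a per-year leader would look consistently dominant.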

Figure 5

This is figure 4 but now the data series correspond to data centres, and the different diseases get their own graphs. They bind the y axes together across rows, so there are inset graphs to give the zoomed-in views. Mmm, data visualisation.

There's so much going on here that looking at this figure fills me with vague dread. We have the potential to learn how data centres vary in their medicating trends.

Gravitating towards the most extreme-looking data series, something is going on in STRIDE (US) for monotherapy. 100% of diabetes patients in 2004 were on metformin? This is also when this database appears to begin, so I guess something strange was going on (like only data from diabetes patients on metformin was being recorded, or something)...

The authors draw attention to the lack of consistent bias between use of EHR data and claims data in what they report. This is potentially very interesting, because claims data is somewhat more 'available' from what I can tell (people seem to be publishing more with claims data[citation required]), but is biased towards billing (obviously) and less 'rich' than a full EHR. Being able to use claims data as a proxy for EHR would be good and useful. However, the analyses here draw on medication information, which is probably well covered by claims data, so the finding is probably less striking.

Figure 6

Once again, we see a fraction of something on the y-axis, with time on the x-axis. In this case, it's the fraction of medication changes in that year which were within the same structural class (these classes are not fully listed in table 1, and are definitely in the supplemental information).
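
My understanding of the quantity being plotted, as a sketch: for each year, count the medication changes and ask what fraction stayed within one class. The `(year, from_drug, to_drug)` format and the tiny class lookup below are my assumptions for illustration; the paper's actual class definitions live in its tables and supplement, and may well differ from the pharmacological labels I've used here.

```python
from collections import defaultdict

# Illustrative lookup only; not the classification used in the paper.
DRUG_CLASS = {
    "citalopram": "SSRI",
    "sertraline": "SSRI",
    "venlafaxine": "SNRI",
}


def within_class_fraction(changes, drug_class):
    """Per year, the fraction of changes that stay within one class.

    changes: iterable of (year, from_drug, to_drug) triples.
    """
    same = defaultdict(int)
    total = defaultdict(int)
    for year, old, new in changes:
        total[year] += 1
        if drug_class[old] == drug_class[new]:
            same[year] += 1
    return {year: same[year] / total[year] for year in sorted(total)}


# Toy changes, not real data.
changes = [
    (2005, "citalopram", "sertraline"),   # within class
    (2005, "citalopram", "venlafaxine"),  # across classes
    (2006, "sertraline", "citalopram"),   # within class
]
fractions = within_class_fraction(changes, DRUG_CLASS)
# fractions == {2005: 0.5, 2006: 1.0}
```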

I am not sure what to conclude from this figure. Do different structural classes correspond to very different mechanisms of action for the drug? Would changing structural class mean the doctor believes the patient's disease to be characterised differently? I am not a doctor (as might be obvious) and I'm cancer-focused so I'm speculating wildly here. There isn't much discussion of this figure in the main paper. Not much of a trend is observed, anyway.


I reiterate my feeling that this is a proof of concept paper, or possibly a paper to advertise the seemingly incredibly amazing data resource OHDSI is creating. There aren't really any hypotheses tested in this work, and I don't come away from it with a strong conclusion beyond 'heterogeneity exists'. Then again, I came into this paper with little by way of prior expectation for the findings.

There are some further avenues of research (some of which I mentioned in this blog post) prompted by this study, but whether they're truly worth pursuing requires further thought, as ever. And I'm definitely going to check out what else OHDSI is up to.