The fault is not in our stars, but in ourselves

Update

CMS has acknowledged most of the problems described below. See here for more details.

It didn’t make much of a splash, but last week the American Hospital Association (AHA) wrote a letter to the U.S. Centers for Medicare and Medicaid Services (CMS) that complained about, among other things, CMS’s Overall Hospital Ratings program:

While we continue to be concerned that CMS's [Overall Hospital Rating] methodology is flawed, our concern is amplified by the fact that further analysis performed since the star ratings were first released show that substantive errors were made in executing CMS's chosen methodology.

(emphasis mine)

AHA makes a pretty big claim here, but doesn't elaborate on it - what are the substantive errors? I sort of contributed to finding one of them, so perhaps I can explain...

TL;DR

If you want the short version, here it is:

    • There’s a CMS initiative to grade hospitals on quality called the “Overall Hospital Rating.”
    • To assign a grade to a hospital, they combine a bunch of measures into a composite score, then group the scores from “1 star” to “5 stars.”
    • The combination procedure is really complicated and kind of crazy.
    • Even if you accept the combination procedure, CMS’s partners screwed up the computer program that implements it.
    • The score grouping part is also kind of crazy.
    • Even if you accept the grouping procedure, CMS’s partners screwed up the computer program that implements it, too.

The results of the mistakes are pretty significant, and if the ratings program winds up being tied to hospital reimbursement there will be pretty big financial consequences. I have an opinion on what should be done about this; skip to the bottom if you want to see it.

I.

Grading hospitals by quality seems like a pretty good goal. The health care system in the U.S. is really confusing, and people don’t get much practice interacting with it until they really need to. I understand the desire to summarize lots of data about hospitals into something a consumer could reasonably understand.

However, if I got handed the task “take all this data we have about hospital performance and turn it into a 5 point scale,” my preferred answer would be: “no.” Suppose hospital A has a higher mortality rate for a certain procedure than hospital B. It might be because hospital A is a worse hospital, but it’s easy to think up alternate explanations.

For example: Suppose Hospital B is actually a bad hospital, and they send all their difficult cases to Hospital A. It would be perverse to give Hospital B a better ranking in this case; it would give Hospital A an incentive to stop taking difficult cases.

Or: Imagine that Hospital A is really good, but they’re located next door to a nursing home. Hospital B is not as good, but they’re located next to a college campus. Giving Hospital A a bad quality rating is punishing them for happenstance.

Let’s say I really have to produce a rating, though. What would I do? I would probably:

    • Find some experts and ask them to assign weights to my various measures on the basis of how much they contribute to quality
    • Compute a per-hospital score based on the weights and each hospital’s measures
    • Give the bottom 20% by score one star, the next 20% two stars, and so on

I’d probably massage the data in various ways, adding complications for outliers and missing data and such, but that’s the broad outline.

It wouldn’t be good, but I could explain it, and I could distribute it as an Excel spreadsheet. Hospitals would be able to tell how much each measure affected their overall rating, and could target improvements to areas where they’re lacking. (Also, I could blame the experts for assigning the wrong weights if the rating doesn’t correlate well with other quality measures.)
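
To make the simple scheme concrete, here’s roughly what it would look like in SAS (the language CMS’s actual program is written in). Every dataset name, measure name, and weight below is made up for illustration, and the measures are assumed standardized so that higher is better:

    /* One row per hospital; combine the measures using the experts' weights */
    data scored;
       set hospital_measures;               /* hypothetical input dataset */
       composite = 0.30*mortality_score
                 + 0.30*readmission_score
                 + 0.25*experience_score
                 + 0.15*timeliness_score;   /* hypothetical expert weights */
    run;

    /* Bottom 20% by score -> group 0, ..., top 20% -> group 4 */
    proc rank data=scored out=rated groups=5;
       var composite;
       ranks star_group;
    run;

    data rated;
       set rated;
       stars = star_group + 1;              /* 1 to 5 stars */
    run;

The whole thing fits on one screen, and a hospital could recompute its own rating by hand.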

Is this what CMS did? Uh, not really.

II.

One problem with my simple quality model above is that it’s just encoding human judgment. The experts assigning weights to the measures would probably just be making up numbers, because determining the relationship between something like “Patients who reported that the area around their room was quiet at night” and overall quality isn’t really what they’re trained in, because no one is trained in that.

CMS seems to have wanted a more objective method for combining the measures into a score, so they developed a “Latent Variable Model” (LVM). The intuition here is that each of the measures tells us something about the not-directly-measurable thing we call “quality.” If we see that a bunch of the measures correlate with each other (hospitals that have quiet rooms also tend to have low readmission rates, or something), we might be glimpsing the influence of some underlying quality.
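
To make that intuition a bit more concrete, a generic one-factor latent variable model (my sketch of the technique, not CMS’s exact specification) looks something like:

    Y_ij = mu_j + lambda_j * alpha_i + e_ij,    alpha_i ~ N(0, 1),    e_ij ~ N(0, sigma_j^2)

Here Y_ij is hospital i’s score on measure j, alpha_i is the hospital’s unobserved “quality,” and lambda_j (the “loading”) says how strongly measure j reflects that quality. The loadings and each hospital’s alpha_i are estimated from the data rather than assigned by experts.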

This sounds like a pretty good idea, and it might be better than my straightforward method above! CMS’s methodology report gives this description of the model they developed:

[screenshot of the model specification from CMS’s methodology report]

Yeah, this has some clarity and labeling issues. But at least it’s objective, right?

Well… it turns out that CMS didn’t make one latent variable model; they made seven. They took each of their 57 measures and put them into groups (Mortality, Readmission, Patient Experience, etc.). Then they made an LVM for each group and combined the results using… arbitrary weights picked by experts:
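
For reference, here are the weights (my reconstruction of the table, which appears in the original as an image):

    Group                               Weight
    Mortality                             22%
    Safety of Care                        22%
    Readmission                           22%
    Patient Experience                    22%
    Effectiveness of Care                  4%
    Timeliness of Care                     4%
    Efficient Use of Medical Imaging       4%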

I really love this table. I can see just how it went - there were originally four categories, and the experts said, “Eh, let’s give them equal weight.” And then somebody said, “What about Timeliness of Care?” And so they subtracted a point from each of the other categories and gave the points to Timeliness of Care. And then that repeated a few times.

(I also love that most of the measure groups are 5.5x more important than “Effectiveness of Care”)

III.

Let’s say we accept the limitations of the LVMs and their arbitrary combination. At least the method is written down and there’s a solid way to map input measures to output scores, right?

Uh, sort of. It looks like the equations from the methodology report are really translations of what a SAS program that uses the NLMIXED procedure does. It’s not clear whether the equations or the program came first, but the program is what really does the work. Here is the relevant part of the code:
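
(The original shows this as a screenshot. Below is a minimal sketch of the shape of the call; the dataset, variable, and parameter names are my own placeholders, not the actual program’s.)

    proc nlmixed data=group_measures qpoints=30;        /* <- 30 quadrature points */
       parms mu=0 lambda=1 s2=1;                        /* intercept, loading, error variance */
       pred = mu + lambda*alpha;                        /* measure score driven by latent quality */
       model score ~ normal(pred, s2);
       random alpha ~ normal(0, 1) subject=provider_id; /* one latent 'quality' per hospital */
    run;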

What’s the big deal here? There are a few issues, but the one I’ll focus on is the qpoints=30 bit in the snippet above. What’s this?

As SAS’s documentation shows, the “quadrature points” parameter is used in approximating the answers to difficult-to-evaluate integrals. You can use fewer points to get a less accurate answer more quickly, or more points to get a more accurate answer more slowly. CMS’s partners used 30, apparently because that gets the runtime of the program down to between 24 and 100 hours. (Did I mention that calculating star ratings takes 100 hours?)
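
To see why the point count matters, here’s a toy illustration; this is just a Gaussian integrated with a simple midpoint rule, not CMS’s actual integral:

    /* Approximate the integral of exp(-x**2/2) over [-5, 5] with n points */
    data quadrature_demo;
       truth = sqrt(2*constant('pi'));       /* exact answer: sqrt(2*pi) */
       do n = 5, 30, 100;
          h = 10 / n;                        /* subinterval width */
          approx = 0;
          do i = 0 to n - 1;
             x = -5 + h*(i + 0.5);           /* midpoint of each subinterval */
             approx = approx + h*exp(-x*x/2);
          end;
          error = abs(approx - truth);
          output;
       end;
       keep n approx truth error;
    run;

    proc print data=quadrature_demo; run;

With an integrand this smooth the error falls off quickly as n grows; the integrals NLMIXED has to evaluate are evidently far less forgiving.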

Is this a problem? We’re already dealing with numbers in this weird arbitrary space - does it matter if we use a more or less precise approximation when calculating them? Unfortunately yes: it seems hundreds of hospitals get different ratings if more quadrature points are used.

So now not only is our objective model of quality ruined by subjective human judgment, it takes forever to run, and is super unstable. (It so happens that there is a fast procedure to solve the relevant integrals exactly, but I’ll not dwell on that here.)

IV.

So the computation of a complicated model is complicated. Grouping its output into five buckets should be pretty straightforward, right?

Hahaha, no. We wouldn’t want to put our arbitrarily weighted numbers through an arbitrary grouping procedure like percentiles - what if the cutoff was in a weird spot in bizarro clown space? Instead, CMS groups the summary scores into five star categories using k-means clustering.

One important aspect of k-means clustering is “convergence.” Finding optimal clusters is computationally hard, so the standard algorithm iterates (reassign each point to the nearest center, recompute the centers) until the results stop changing. If you don’t run it to convergence you’ll get groupings that are, for example, very dependent on the ordering of the data points.

The CMS partners messed this up in the SAS program. They run the SAS FASTCLUS procedure for 5 buckets, but leave all other parameters at their defaults:
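
(Again, the original shows the call as a screenshot; the dataset and variable names in this sketch are my guesses.)

    proc fastclus data=summary_scores maxclusters=5 out=star_groups;
       var summary_score;
    run;

The gotcha is that FASTCLUS defaults to maxiter=1: it makes a single pass over the data and stops, whether or not the clusters have settled. Getting actual convergence means asking for it explicitly, with something like maxiter=100 converge=0 on the PROC statement.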

The result is one iteration that doesn’t converge (see the SAS documentation), meaning the final star ratings are unstable. They also differ depending on how you sort the data points. And again, hundreds of hospitals get different ratings if you do things correctly.

(I'll leave aside the fact that the summary scores are smoothly distributed by construction, so there are no natural breaks for a clustering procedure to find, and that k-means is not at all an appropriate procedure here.)

Conclusion

I could talk about how some of the individual measures of quality are negatively associated with star ratings, meaning a rational hospital should do worse to get a better star rating. Or I could talk about how hospitals don’t really get any actionable feedback on what things they should do better on to get a better rating. But I’ll stop here.

What should be done about all this? I think the answer is not “fix the problems with the SAS program.” I think it’s “either do something simple or don’t do this at all.” As far as I know money isn’t tied to these ratings yet, but if it is, billions of dollars (and potentially people’s lives!) could be affected by this bad method.

My goal here isn’t to make fun of CMS, or their partners (Yale and the Lantana Group) that perpetrated this. It’s to call attention to the problems here so they don’t wind up having big impacts. (Any fun-making is incidental.)

I think the AHA and its members don’t want to be evaluated on quality, because… who would? They are a big lobbying group aiming to protect their constituents from having to improve. But they’re right to demand that this rating program be withdrawn.

If you are at CMS or with one of their partners and want to talk to me about this, send me a note. Or better yet, skip writing to me and work on fixing this!

Q&A

Q. Who are you?

A. Just some guy who happened to fall into this rabbit hole.


Q. Are you a hospital industry shill?

A. I don’t think so. I’ve never worked in health care. My job is computer network security stuff. None of the above represents my employer’s views.

Q. What’s your interest in hospital rating?

A. None, really! I’ve thought about offering consulting services related to this stuff after exploring it so deeply, but haven’t done anything with that.

Q. Did you really figure all this stuff out?

A. No. I contributed to the quadrature points stuff, but the credit for the rest goes to mark-r-g and his co-workers at a hospital group. My views most emphatically do not represent theirs.