Regression-Based Intensity Score

The Latent Variable approach finds the direction in feature space that explains the most variation across codes — with no explicit target in mind. The regression approach makes the target explicit: given everything we can observe about how a case unfolded clinically, how much would we expect it to cost?

Thirteen clinical signals (ICU admission rate, length of stay, ventilation rate, and so on) are used to predict the fourteenth: relative charge. That predicted charge becomes the intensity score. Because the score is derived from clinical activity rather than read directly off a bill, it should be less sensitive to billing variation and provider-specific charge practices than the raw charge itself.

How it works

Step 1 — Same features. The pipeline starts with the same 14 resource-use signals computed per code. For the regression, 13 of those signals are inputs; relative charge is set aside as the outcome we want to predict.

Step 2 — Fit and compare three models. Three model families are trained on the data for each SSP:

Model	Character
Ridge regression	Linear; shows how much each signal independently shifts predicted charge
Random Forest	Non-linear; learns interactions between signals automatically
XGBoost	Gradient boosting; often the most accurate on small, structured datasets

Step 3 — Pick the best model. The model with the highest average cross-validated R² is selected for that SSP.
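Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the data is synthetic, the fold count and hyperparameters are assumptions, and scikit-learn's GradientBoostingRegressor stands in for XGBoost so the sketch needs only scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one SSP: 60 codes x 13 clinical signals (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 13))
# Hypothetical target: relative charge driven by a few signals plus noise.
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.3, size=60)

# One candidate per model family; all settings here are illustrative assumptions.
candidates = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    # Stand-in for XGBoost (assumption made to avoid an extra dependency).
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Average cross-validated R² per model, using the same folds for every model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
mean_r2 = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in candidates.items()
}

# Keep the family with the highest average R², then refit it on all codes.
best_name = max(mean_r2, key=mean_r2.get)
best_model = candidates[best_name].fit(X, y)
```

Reusing one KFold object across the three models keeps the comparison on identical splits, which matters when an SSP has only a few dozen codes.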

Step 4 — Score every code. The chosen model predicts an expected charge for every code in the SSP. That predicted value is the intensity score: a clinically grounded estimate of how resource-intensive encounters carrying this code tend to be.
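A minimal sketch of the scoring step, under stated assumptions: the code labels and data are made up, Ridge stands in for whichever model Step 3 selected, and the model is fitted on all codes in the SSP before predicting on those same codes (the source does not say whether scores are in-sample or out-of-fold).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in for one SSP: 40 codes x 13 clinical signals (not real data).
rng = np.random.default_rng(1)
codes = [f"CODE_{i:03d}" for i in range(40)]  # hypothetical code labels
X = rng.normal(size=(40, 13))
y = 1.5 * X[:, 0] + 0.8 * X[:, 3] + rng.normal(scale=0.2, size=40)

# Stand-in for the model selected in Step 3 (assumption).
chosen_model = Ridge(alpha=1.0)
chosen_model.fit(X, y)

# The predicted relative charge for each code is its intensity score.
predicted = chosen_model.predict(X)
intensity_score = dict(zip(codes, predicted.tolist()))

# Codes ordered from most to least intensive.
ranked = sorted(intensity_score, key=intensity_score.get, reverse=True)
```

Note that the score is the model's prediction, not the observed charge: two codes with similar clinical signals receive similar scores even if their billed charges differ.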

Step 5 — Cut into tiers. Identical to the PCA pipeline: 1-D KMeans groups codes into tiers by their predicted score, with the number of tiers chosen to maximise separation. Tier 1 is the least intensive; the highest tier is the most.
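The tiering step might look like the sketch below. The separation criterion is not named above, so the silhouette score is used here as an assumption, as is the candidate range of 2 to 6 tiers; the scores are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic predicted scores forming three loose groups (not real data).
rng = np.random.default_rng(2)
scores = np.concatenate([
    rng.normal(1.0, 0.1, 20),
    rng.normal(3.0, 0.1, 15),
    rng.normal(6.0, 0.1, 10),
])
values = scores.reshape(-1, 1)  # KMeans expects a 2-D array, even for 1-D data

# Try each candidate tier count and keep the one with the best silhouette score.
best_k, best_sil, best_labels, best_centers = None, -1.0, None, None
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(values)
    sil = silhouette_score(values, km.labels_)
    if sil > best_sil:
        best_k, best_sil = k, sil
        best_labels, best_centers = km.labels_, km.cluster_centers_.ravel()

# KMeans labels are arbitrary, so renumber clusters by ascending centroid:
# tier 1 is the least intensive, tier best_k the most.
order = np.argsort(best_centers)
tier_of_cluster = {int(cluster): tier + 1 for tier, cluster in enumerate(order)}
tiers = np.array([tier_of_cluster[int(label)] for label in best_labels])
```

The relabelling step is what guarantees that tier numbers increase with the predicted score; without it, cluster ids carry no ordering.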
