
Latent Variable Model

How it works

Step 1 — Measure what happened. For every diagnosis and procedure code (ICD-10-CM and ICD-10-PCS) that appears within an SSP, we look at all the claims where that code was billed. We then calculate 14 signals of resource use: ICU admission rate, average length of stay, rate of mechanical ventilation, total charges, and more. Codes with fewer than 100 supporting claims are excluded to keep the statistics reliable.

Step 2 — Estimate Intensivity. We use Principal Component Analysis (PCA), a type of latent variable model, to distill all 14 signals into a single intensity score per code. A higher score means the encounters that carry that code were, on average, more resource-intensive.

Step 3 — Cut into tiers. Codes are grouped into tiers based on their intensity score. The number of tiers is chosen automatically to maximise how cleanly separated the groups are. Tier 1 is the least intensive; the highest tier is the most.

Step 4 — Validate. For diagnosis codes, we cross-check our tiers against the CMS severity labels (Major Complication, Complication, No Complication). Well-calibrated tiers should line up: tier 1 codes should be predominantly "No Complication" and the top tier predominantly "Major Complication."


Latent Intensivity Score

Note: on this page, "Latent Variable Model" and "PCA" refer to the same method. PCA is the specific latent variable model used here.

After feature engineering produces 14 numeric signals per SSP × code, PCA compresses them into a single intensity score. The score is the coordinate of each code along the first principal component (PC1), the direction in 14-dimensional feature space that explains the most variance.

The 14 features are highly correlated: an encounter with a long LOS tends to also have ICU admission, higher charges, and more organ systems involved. Most of their shared information lives on a single latent axis. PCA finds that axis without any hand-tuned weights, and the first PC empirically aligns with overall resource intensity (LOS, ICU, ventilation, charges). A single score is then straightforward to threshold into tiers.


Step-by-step

2a — Standardize

Each feature is z-scored to zero mean and unit variance:

$$
\tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j}
$$

This is necessary because features differ widely in scale — avg_length_of_stay is in days while rate_icu is a proportion — and PCA is sensitive to scale.
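The z-scoring step can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline's code: the function name `standardize`, the toy values, and the use of the population standard deviation (`ddof=0`) are all assumptions.

```python
import numpy as np

def standardize(X):
    """Z-score each column of X (rows = codes, columns = features)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # population standard deviation (ddof=0), an assumption
    if np.any(sigma == 0):
        raise ValueError("constant feature: cannot z-score a zero-variance column")
    return (X - mu) / sigma

# Hypothetical toy data: 4 codes x 2 features
# (column 0 ~ avg_length_of_stay in days, column 1 ~ rate_icu as a proportion)
X = np.array([[2.0, 0.05],
              [4.0, 0.10],
              [9.0, 0.40],
              [15.0, 0.65]])
Z = standardize(X)
```

After this step both columns have zero mean and unit variance, so the days-valued feature no longer dominates the proportion-valued one.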

2b — PCA

PCA is fit on the full $n_{\text{codes}} \times 14$ standardized matrix. The first principal component $\mathbf{w}_1 \in \mathbb{R}^{14}$ solves:

$$
\mathbf{w}_1 = \underset{\|\mathbf{w}\|=1}{\arg\max} \; \mathbf{w}^{\top} \mathbf{C} \, \mathbf{w}
$$

where $\mathbf{C}$ is the empirical covariance matrix.

Sign orientation. PCA does not guarantee the sign of the eigenvector. After fitting, if PC1 is negatively correlated with avg_length_of_stay, scores and loadings are multiplied by $-1$ so that higher intensity score = higher resource use.
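As a rough sketch of the fit and the sign flip, the first principal component can be taken as the top eigenvector of the covariance matrix and then oriented against an anchor feature. The helper `first_pc`, the `anchor_col` parameter (standing in for the avg_length_of_stay column), and the simulated data are hypothetical. The actual pipeline may well use a library PCA implementation instead.

```python
import numpy as np

def first_pc(Z, anchor_col):
    """Return (w1, scores) for a standardized matrix Z, oriented so that
    scores correlate positively with the feature in column anchor_col."""
    C = np.cov(Z, rowvar=False)           # empirical covariance, features x features
    eigvals, eigvecs = np.linalg.eigh(C)  # symmetric input; eigenvalues ascending
    w1 = eigvecs[:, -1]                   # unit eigenvector of the largest eigenvalue
    scores = Z @ w1                       # coordinate of each row along PC1
    if np.corrcoef(scores, Z[:, anchor_col])[0, 1] < 0:
        w1, scores = -w1, -scores         # flip loadings and scores together
    return w1, scores

# Simulated data (an assumption): 3 features driven by one shared latent factor
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
X = np.column_stack([latent + 0.3 * rng.normal(size=200) for _ in range(3)])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

w1, scores = first_pc(Z, anchor_col=0)
```

Flipping both `w1` and `scores` keeps the identity `scores == Z @ w1` intact, so the per-feature decomposition of the score remains valid after orientation.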

Quality check. The proportion of variance explained by PC1 is reported in pc1_var. A value above ~40% indicates that the features share a strong common factor and a single score is a reasonable summary. The scree plot is included in each per-SSP report.
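One way to compute the quantity behind this check is to normalize the eigenvalues of the covariance matrix; the same ratios are what a scree plot displays. The sketch below uses simulated data and a hypothetical helper name. Only the name `pc1_var` and the ~40% rule of thumb come from the text above.

```python
import numpy as np

def explained_variance_ratio(Z):
    """Share of total variance captured by each principal component, descending."""
    C = np.cov(Z, rowvar=False)
    eigvals = np.linalg.eigvalsh(C)[::-1]  # eigvalsh returns ascending order; reverse it
    return eigvals / eigvals.sum()

# Simulated data (an assumption): 4 features sharing one strong common factor
rng = np.random.default_rng(1)
latent = rng.normal(size=500)
X = np.column_stack([latent + 0.5 * rng.normal(size=500) for _ in range(4)])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

ratios = explained_variance_ratio(Z)  # one value per component, for a scree plot
pc1_var = ratios[0]
single_score_ok = pc1_var > 0.40      # rule-of-thumb threshold from the text
```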

2c — Key drivers

Each intensity score decomposes exactly into per-feature contributions:

$$
s_i = \sum_{j=1}^{14} \underbrace{w_j \cdot \tilde{x}_{ij}}_{\text{contribution}_j}
$$

The top 3 features by $|\text{contribution}_j|$ are stored in key_drivers as a human-readable string, e.g.:

"icu (+0.45), length of stay (+0.38), dialysis (+0.21)"

Positive values pushed the score up (more intensive); negative values pulled it down.
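The decomposition and the string format can be illustrated as follows. The function `key_drivers`, the four feature names, and the loading and z-score values are invented for illustration, chosen so that the result matches the example string above. The pipeline's actual formatting code is not shown in this document.

```python
import numpy as np

def key_drivers(w, z_row, names, top=3):
    """Format the `top` largest |w_j * z_ij| terms as a readable string."""
    contrib = w * z_row                         # one term per feature; they sum to the score
    order = np.argsort(-np.abs(contrib))[:top]  # indices by descending magnitude
    return ", ".join(f"{names[j]} ({contrib[j]:+.2f})" for j in order)

# Hypothetical 4-feature example (the real pipeline uses 14 features)
names = ["icu", "length of stay", "dialysis", "ventilation"]
w = np.array([0.5, 0.5, 0.5, 0.5])         # unit-norm loading vector
z_row = np.array([0.90, 0.76, 0.42, -0.10])  # standardized features for one code

score = float(w @ z_row)
drivers = key_drivers(w, z_row, names)
print(drivers)  # icu (+0.45), length of stay (+0.38), dialysis (+0.21)
```

Because the contributions sum exactly to the score, the listed drivers account for a known share of it, and the omitted terms (here the small negative ventilation term) make up the remainder.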


Interactive demo

The plot below illustrates the core idea using three representative RII features: rate_icu, avg_length_of_stay, and avg_organ_system_count. In the actual pipeline all 14 features are used.

  • 3-D panel — standardized feature space. The purple arrow is the PC1 axis; dashed lines drop each code perpendicularly onto it.
  • 1-D panel — the same codes projected onto PC1. Separation between the three simulated tiers confirms that PC1 recovers the latent intensity gradient.

[Interactive figure: 3-D Feature Space. Legend: Low, Moderate, High, PC1 axis, Projection onto PC1.]

Univariate Projection — PC1

All three features are collapsed onto a single latent intensity axis. Each dot is the same code as in the 3-D plot, now positioned only by its PC1 score.

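[Interactive figure: codes projected onto PC1. Axis: PC1 (Intensivity Score), running from Low to High.]

Variance captured by PC1: 71.4% (PC2 + PC3 share the remaining 28.6%)

PC1 Feature Loadings

| Feature                | Loading (w) |
|------------------------|-------------|
| rate_icu               | +0.4140     |
| avg_length_of_stay     | +0.6489     |
| avg_organ_system_count | +0.6384     |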

All loadings are positive, which is consistent with PC1 acting as a single intensity axis. The real pipeline uses all 14 RII features; these three are shown for illustration.


Interpreting loadings

The loading $w_j$ for feature $j$ shows how much a one-standard-deviation increase in that feature contributes to the intensity score. In well-behaved SSPs the dominant loadings are clinical intensity signals (LOS, ICU rate, mechanical ventilation, total charges).


Output

The intensity score for each code is stored in rii_code_tiers.intensity_score. Tier Assignment then applies 1-D KMeans to partition codes into discrete tiers on this scalar axis.
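To make the hand-off concrete, here is a minimal NumPy sketch of 1-D k-means on a vector of intensity scores, with clusters relabeled so that tier 1 has the lowest center. The function `kmeans_1d`, the quantile initialization, and the toy scores are assumptions. The sketch takes the number of tiers `k` as given and does not reproduce how the pipeline chooses it automatically.

```python
import numpy as np

def kmeans_1d(scores, k, n_iter=100):
    """Lloyd's algorithm on a 1-D array. Returns (tiers, centers), where
    tiers run from 1 (lowest center) to k (highest center)."""
    scores = np.asarray(scores, dtype=float)
    # Initialize centers at evenly spaced quantiles (an assumption)
    centers = np.quantile(scores, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        # Assign each score to its nearest center
        labels = np.argmin(np.abs(scores[:, None] - centers[None, :]), axis=1)
        # Recompute centers; an empty cluster keeps its previous center
        new_centers = np.array([
            scores[labels == c].mean() if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Relabel so that tier numbers increase with the cluster center
    order = np.argsort(centers)
    rank = np.empty(k, dtype=int)
    rank[order] = np.arange(1, k + 1)
    return rank[labels], centers[order]

# Hypothetical intensity scores for nine codes
scores = np.array([-2.1, -1.9, -2.0, 0.0, 0.1, -0.1, 2.0, 2.2, 1.9])
tiers, centers = kmeans_1d(scores, k=3)
```

Relabeling by center order matters because k-means cluster indices are arbitrary, while tier numbers are meant to increase with intensity.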