Methodology
The sections below describe the business logic of each DAG stage. Every
stage is an Airflow TaskGroup; task names map 1:1 to SQL files under
dags/data_science/ssp/ssp_pricing/sql/<stage>/*.sql.
Stage 1 — pre_build
Produces the SSP "universe" tables that every downstream stage reads, plus sampled encounter volumes used for SSP-level popularity/revenue share.
ssp_groupings
The raw crosswalk (ssp_initial_pilot_codes_and_xwalk_v3) is enriched with
RII-based tier multipliers. For inpatient SSPs whose sub_category is coded
as -, each RII tier's average intensity score is pulled from
rii_code_tiers and converted into a multiplier centered on 1.0.
How the spread is calibrated:
- The DRG canonical-rate spread for that SSP (
max_median / min_medianacross the DRGs in the group) determines the width of the multiplier range. Wider DRG spread → wider multipliers. target_ratio = clamp(sqrt(drg_ratio), 1.2, 3.0)— the square root damps extremes.- Multipliers interpolate from
1/sqrt(target_ratio)up tosqrt(target_ratio)across the ordered tiers.
A multiplier of 1.0 means that tier is priced at the base weighted
average; values above scale up, below scale down. Explicit (non--)
subcategories are copied through untouched and get no multiplier.
rii_tier_mapping
Joins the per-tier multipliers back to the underlying
(ssp, code, code_type) rows so any downstream task can attach an
intensity-based multiplier to a specific DRG.
Average line-code units
Three SQL files build a short pipeline for per-line-code average units used per encounter:
avg_line_code_units_inst_enc— Finds anchored institutional encounters (OP:komodo_op_max_pay_linehighest-paying line = anchor CPT/HCPCS; IP: DRG match viaipps_grouper_drgs).avg_line_code_units_all_enc— Adds professional encounters that share avisit_idwith an anchored institutional encounter. These become thePROrows in the final table.avg_line_code_units— Joins every anchored encounter tomedical_service_lines_bothand averagesunitsper(ssp_grouper, fee_type, billing_code). Used later to scale anesthesia time-based fees and some line-code professional rates.
Encounter volume (sampled)
ssp_encounter_matches_op and ssp_encounter_matches_ip each take a 1%
sample of encounters (rand() < 0.01) and join to the relevant anchor
codes; ssp_encounter_volume aggregates the union to produce per-SSP
encounter_pct and revenue_pct (weighted by commercial canonical rate).
This table feeds export_ssp_descriptions. The 1% sample is a cost
optimization — the DAG doesn't need exact counts for relative share.
Stage 2 — claims
Derives line-code tables from Komodo claims data. Three outputs:
manual_institutional_line_codes, manual_professional_line_codes, and
the consolidated supplemented_sub_package_contents that downstream
line-code tasks read.
Anchor encounters
Two tables define "encounters that belong to an SSP":
ssp_op_anchor_encounters— OP facility claims (mh.claim_pos = 'OP',statement_to = statement_from,discharge_status_code = '01') where the max-pay service line matches a CPT/HCPCS anchor code.ssp_ip_anchor_encounters— IP facility claims (mh.claim_pos = 'IP') whose DRG (fromipps_grouper_drgs) matches an MS-DRG anchor code.
These are the full-population encounter matches (no 1% sample) and serve as the foundation for line-code discovery.
excluded_line_codes
A small global exclusion list that both discover_institutional and
discover_professional filter out:
| Code | Description |
|---|---|
| 36415 | Venipuncture, routine |
| 36416 | Venipuncture, capillary |
| 36410 | Venipuncture, age 3+ |
| U0002 | COVID-19 test, non-CDC |
These codes appear on nearly every claim regardless of procedure and would inflate line-code counts without adding pricing signal.
discover_institutional → manual_institutional_line_codes
For each anchor code, finds other procedure/revenue codes co-occurring
on the same institutional encounter. For each
(anchor_code, line_code, item_type) triple the SQL computes:
encounter_count— distinct encounters carrying this line code.average_charge_amount— encounter-weighted average ofline_charge.revenue_code_association_rate— share of line-code encounters carrying the modal revenue code. Low values (<0.3) signal the line code has unstable RC mapping; used by Stage 3institutional_line_codesto drop claims-derived revenue codes that aren't reliable.association_rate— encounter share relative to the max-frequency line code for that anchor.
Inclusion rules (is_included = TRUE):
association_rate > encounter_threshold(default 0.3)- Anchor had ≥ 100 encounters
- Line code is not an anchor of a non-I&S SSP (self-match is allowed — I&S anchor codes are legitimately shared across SSPs as ancillary line codes)
- Line code not in
excluded_line_codes - Line code does not end in
F(CPT Category II tracking codes, non-billable)
Combo-SSP line codes (43250 + 45384, 43251 + 45385) are hardcoded at
the end via a VALUES block.
discover_professional → manual_professional_line_codes
Same shape as institutional but sourced from professional claims sharing
a visit_id with an anchored institutional encounter. Covers both OP
(CPT/HCPCS) and IP (MS-DRG) anchors — the MS-DRG branch joins on
visit_id from ssp_ip_anchor_encounters so inpatient professional
line codes (e.g. 99223) are discovered the same way. Inclusion rules
mirror the institutional side (no Medicare-rate floor; non-I&S anchors
only allowed on their own SSP; exclusions and F-suffix filter).
Three sources are combined via BOOL_OR(is_included) so any TRUE
source wins:
- Claims-derived filter output.
- Combo-SSP professional codes (hardcoded
VALUESblock). - I&S ancillary augmentation. For every
(base_code, line_code)inmanual_institutional_line_codeswhere the line code is an I&S SSP anchor (e.g. a lab or radiology code), a syntheticis_included = TRUEprofessional row is added — even if the claims-derived row hadis_included = FALSE.
combined_line_codes → supplemented_sub_package_contents
Consolidates the two manual tables + average units into a single table on
grain (ssp_grouper, base_code, line_code, fee_type). Two
business-logic filters are applied on top of the Stage 2 is_included
flag:
- Anchors must be professionally billed. CPT/HCPCS anchors only
survive if the anchor code itself appears as an included
professional line code (
base_code = line_codeon the pro side). Ancillary pro codes (anesthesia, radiology, placeholder unlisted codes like53899) alone are not enough — the procedure itself must be professionally billed for the SSP to have a usable professional fee. MS-DRG anchors are exempt because the DRG never appears on a professional claim directly. - I&S non-anchor lines are dropped. For I&S-category SSPs, only the
anchor itself survives (
line_code = base_code). I&S SSPs price individual services, so ancillary line codes don't apply regardless of association.
Other merge rules:
- Institutional rows carry
fee_type = 'Facility Fee'. - Professional rows get
fee_type = 'Professional Fee'unless the line code is classified as Anesthesia, Lab/Path, or Radiology — those get'Optional Fee'. Anesthesia classification here is gated onanesthesia_reference_pricingonly;cats.category = 'Anesthesia'fall-through is suppressed viaNULLIFso the spines category can't accidentally upgrade a non-reference code into Anesthesia. average_unitsfalls back toavg_line_code_unitswhensub_package_contentsdoesn't carry one; defaults to 1.- Dedup keeps the
Professional Feevariant overOptional Feewhen both exist for the same line code, and prefers rows with non-nullrevenue_code. revenue_code_association_ratefrom Stage 2 is threaded through so Stage 3 can gate on it.
rc_family_proportions → revenue_code_family_proportions
Computes association_rate and avg_line_charge per
(ssp_grouper, revenue_code_family) using the anchor-encounter tables.
association_rate is capped at 1.0 and is normalized against the anchor
code's own revenue-code family (so the anchor family has association
rate 1.0 and every other family is relative to it).
Two averaging rules are applied to avg_line_charge:
- Per-encounter services (
Gastrointestinal Services,Operating Room Services,Emergency Room,Clinic) —total_charge / encounter_count. These are typically billed once per encounter, so dividing by encounters that don't carry the family would dilute the signal. - Everything else —
total_charge / total_encounters. Lab/pharmacy etc. may repeat within an encounter, so a per-total-encounter average is more stable.
Also maps 0360-0369 (Operating Room) to 0450-0459 (Emergency Room)
for SSPs whose name contains ns_ (non-surgical). See
Manual Edits for the business justification.
Stage 3 — line_codes
Enriches the raw line-code tables from Stage 2 into the price-ready tables the fee-schedule stage consumes.
institutional_line_codes
One row per (ssp_grouper, sub_category, pos, base_code, line_code, revenue_code_family).
Adds three pieces of metadata:
label— identifies the role of the line code:Anchor Code— the SSP's own anchor.Revenue Code— a revenue-code-type line (comes directly fromsub_package_contents).Carved Out: Drug— line on drug revenue code0636that is also inasp_reference_pricingwithavg_line_charge > $1,000.Carved Out: Implant— line on an implant revenue code (0275pacemakers or0278other implants) with a device code (C / Q / L prefix) fromref_aapc_cpthierarchy.
- Revenue-code family resolution — uses (in order): an override in
rc_overrides(currently44390 → 0750,21820 → 0450); the claims-derived revenue code fromsupplemented_sub_package_contentsonly when itsrevenue_code_association_rate > 0.30(otherwise dropped so the fallback takes over); therevenue_codesdictionary match on the specific code; a direct revenue-code match; an HCPCS→RC crosswalk fallback. Forns_SSPs, anchor-code revenue families in Operating Room are rewritten to Emergency Room. rc_family_avg_line_charge— looked up fromrevenue_code_family_proportions, used later by Stage 4 RC-family allocation.
ssp_line_code_service_types
Classifies every candidate professional line code into one of four service types via a priority waterfall:
- Code is in
anesthesia_reference_pricing(and not inanesthesia_exclusion_codes) → Anesthesia. - Code is in
clinical_laboratory_reference_pricing→ Lab/Path. - Code is in
labpath_radiology_codes(manual list) → Lab/Path. services_spines_cleaned.ccs_categoryin the Lab/Path list → Lab/Path.ccs_categoryin the Radiology list (orLIKE '%CT scan%','%ultrasound%','%radiology%','%X-ray%') → Radiology.- Default —
spines_services_clinical_categories.category, except'Anesthesia'is suppressed to NULL. Onlyanesthesia_reference_pricingis trusted to classify a code as Anesthesia;ccs_category = 'Anesthesia'andcats.category = 'Anesthesia'fall-throughs are intentionally disabled because they over-flagged non-anesthesia procedural codes.
ancillary_encounter_proportions
Computes the share of encounters where each ancillary service type
(Anesthesia, Lab/Path, Radiology) actually appears for an SSP. It builds
one row per anchored visit_id (IP or OP), joins professional claims on
that visit, and divides the encounter count for each service type by the
SSP's max-service-type encounter count. Used by
professional_line_codes to dampen the impact of ancillary codes that
only appear on a minority of encounters.
professional_line_codes
One row per
(ssp_grouper, sub_category, pos, base_code, line_code, fee_type, provider_id).
For each line code the table carries:
- Commercial rate from
national_benchmarks(Clear Rates PPO canonical rate for Physician Group / Professional bill type), multiplied by the line code'sunits(anesthesia time-based codes getunits = max(avg_units/15, 1), others clamp to 1). - Medicare rate from
physician_reference_pricingby state; falls back toanesthesia_reference_pricingby state, then national CLFS for lab codes. A national median row is also appended withprovider_id = '0'so providers without a state-specific rate still get a fallback. - Service type from
ssp_line_code_service_types(orProfessionaldefault). - Gross charge = commercial rate × markup ratio (if the SSP has one).
Rows are filtered to association_rate IS NULL OR association_rate > 0.3,
which keeps sub_package_contents-derived codes (no association rate
computed) while dropping rare claims-derived codes.
professional_line_code_ncci_groups
The only Python-in-the-DAG task. Pure SQL can't do connected-component clustering, so this task:
- Loads the distinct
(ssp_grouper, line_code, service_type)tuples fromprofessional_line_codesinto pandas. - Loads CMS NCCI procedure-to-procedure edits
(
tq_production.reference_legacy.ref_cms_ncci_ptp_edits_practitioner_v2) filtered to mutually exclusive / "more extensive" / preparation-for-anesthesia rationales. - Adds hardcoded mutually exclusive pairs (
_MANUAL_ME_PAIRS) — see Manual Edits. - Runs a Union-Find (a.k.a. disjoint-set) clustering per (ssp_grouper, service_type). Codes in the same component are mutually exclusive and form one NCCI group.
- Writes the resulting
(ssp_grouper, service_type, ncci_group, line_code, group_size)rows back to Trino via batched INSERTs, then runsncci_groups_append.sqlto mergencci_group+ncci_group_sizeintoprofessional_line_codes.
Stage 4 — fee_schedules
Produces the SSP- and subcategory-level fee schedules, the revenue-code-family allocation, and the combined output tables.
institutional_fee_schedule
For each (ssp_grouper, sub_category, pos, provider_id) the institutional
price is a volume-weighted average of commercial canonical rates across
codes, scaled by the tier multiplier from Stage 1:
subcategory_price = multiplier × Σ (rate_i × volume_i) / Σ volume_i
- Rates come from the
national_benchmarksCTE (Clear Ratesprod_combined_abridged, PPO networks, Hospital provider type, validated-first, fallback to backup). - Volume is encounter volume from the
acute_care_largevolume table joined toprod_combined_abridged(defaults to 1). - Each rate carries a
gross_charge_ratio(validated → backup → 1) derived from%_to_dolcanonical rates that include a gross charge, yieldingsubcategory_gross_chargealongside the commercial price.
Medicare rates use the medicare CTE (IPPS + OPPS reference pricing,
per-provider where available, falling back to a national median) and are
picked with MAX_BY(medicare_rate, volume) so the highest-volume code
wins rather than an average.
Subcategory prices are then rolled up to the SSP level using the same volume-weighted average:
ssp_grouper_price = Σ (subcategory_price_j × volume_j) / Σ volume_j
institutional_rc_family_allocation
Splits the institutional price across revenue-code families using the
proportions from Stage 2 (revenue_code_family_proportions).
- Carve-out families (
Carved Out: Drug,Carved Out: Implant) are priced independently atrc_family_avg_line_charge × 0.5. They are additive on top of the base price. - Base families share the institutional price in proportion to their
rc_family_avg_line_charge(so proportions sum to 1.0 within an SSP). - Two gating rules decide whether a family qualifies as a carve-out:
- Pharmacy (0630-0639) — carved out only if
avg_line_charge > $2,000. - Med/Surg Supplies (0270-0279) — only added if the SSP actually
has a
Carved Out: Implant-labeled line.
- Pharmacy (0630-0639) — carved out only if
The table emits both a charge_allocation (proportional share) and an
allowed_allocation (share based on Komodo relative_association); the
export stage uses both for different views.
professional_fee_schedule
NCCI-aware aggregation in three passes:
- Within each NCCI group (mutually exclusive codes) — volume-weighted average of commercial / Medicare rates. Picking one code from the group is the correct pricing model because only one can be billed on a real encounter.
- Across NCCI groups — rates are summed. Codes in different groups can be billed together.
- Across base codes — rolls to
(ssp_grouper, sub_category, pos, provider_id)with a volume-weighted average, then up to(ssp_grouper, pos, provider_id).
The final table emits separate price columns per service type:
subcategory_professional_price, subcategory_anesthesia_price,
subcategory_labpath_price, subcategory_radiology_price — plus their
Medicare + gross-charge companions.
combined_subcategory_fee_schedule + combined_ssp_fee_schedule
Joins institutional and professional fee schedules on
(ssp_grouper, sub_category, pos, provider_id) and computes the final
output columns.
Institutional:
inst_price = subcategory_price + Σ carveout_allocations
Professional components:
| Column | Formula |
|---|---|
primary_price | subcategory_professional_price |
anes_price | subcategory_anesthesia_price × 0.5 (anesthesiologist share) |
crna_price | subcategory_anesthesia_price × 0.5 (CRNA share) |
assistant_surgeon_price | primary_price × 0.16 |
assistant_nonsurgeon_price | primary_price × 0.136 |
labpath_price | subcategory_labpath_price |
radiology_price | subcategory_radiology_price |
Total professional uses the full anesthesia fee (not 50%) so the anes + CRNA halves aren't double-counted:
prof_price = primary
+ anesthesia_full
+ labpath
+ radiology
+ primary × 0.16
+ primary × 0.136
combo_ssps task
Runs four SQL files in sequence:
combo_ssps_delete.sql— removes any prior combo-SSP rows fromcombined_ssp_fee_schedule.combo_ssps_delete_subcategory.sql— same for the subcategory table.combo_ssps_insert.sql— joins the two source SSPs perprovider_idand applies multiple-procedure logic:Primary = whichever SSP has the highercombo_price = 1.0 × primary + 0.5 × secondaryinst_price + prof_pricefor that provider. Every price column is recomputed as the combo of A/B with those factors.combo_ssps_insert_subcategory.sql— same at the subcategory level (combos have a singlesub_category = '0').
Combo SSPs use segment 2 in their name
(e.g. GA.2.colonoscopy_and_egd) to distinguish them from standard
(segment 0) SSPs.
| Combo SSP | Source A | Source B | POS |
|---|---|---|---|
GA.2.colonoscopy_and_egd | GA.0.colonoscopy | GA.0.egd | OP |
New combos require matching manual line-code entries — see Manual Edits.
Stage 5 — export
Eight versioned export_* tables package the combined fee schedules for
downstream consumers (UI, QA, analytics). Each is a CREATE OR REPLACE
built from combined_ssp_fee_schedule / combined_subcategory_fee_schedule
plus a small crosswalk lookup.
| Task | Output | Purpose |
|---|---|---|
build_export_ssp_descriptions | export_ssp_descriptions_v3 | SSP-level metadata with encounter-share and revenue-share from ssp_encounter_volume |
build_export_ssp_prices | export_ssp_prices_v3 | Provider × SSP totals including allowed_price variants from the RC-family allocation |
build_export_facility_rc_families | export_facility_rc_families_v3 | Facility charges broken out per revenue-code family (useful for hospital rate build-up) |
build_export_professional_line_items | export_professional_line_items_v3 | Per-line-code professional prices (granular view for QA) |
build_export_professional_conveners | export_professional_conveners_v3 | Professional price rolled up by convener (primary / anes / labpath / radiology) |
build_export_sub_category | export_sub_category_v3 | Subcategory-level prices per provider |
build_export_all_line_items | export_all_line_items_v3 | Unified line-item view across institutional + professional |
build_export_metadata | export_metadata_v3 | Key/value: pipeline_date, ssp_version |
export_all_line_items reads from export_professional_line_items;
export_metadata depends on all other export tasks completing (so row
counts are stable).
Stage 6 — latest
Upserts each export_* artifact into a shared, generic-library table
tagged by ssp_version. These latest_* tables are how downstream
consumers read a consistent cross-version view — every pipeline run
leaves behind the prior versions and adds/refreshes rows for its own
version.
Each task uses the same pattern (_upsert_latest helper in
tasks/export.py):
DELETE FROM {schema}.latest_{name} WHERE version = '{ssp_version}'
INSERT INTO {schema}.latest_{name} SELECT '{ssp_version}', ... FROM
{schema}.export_{name}_{table_version}
The DELETE + INSERT pair makes the upsert idempotent — re-running
the pipeline for the same ssp_version replaces that version's rows
without touching other versions.
| Task | Source | Destination |
|---|---|---|
build_latest_ssp_descriptions | export_ssp_descriptions_v3 | latest_ssp_descriptions |
build_latest_ssp_prices | export_ssp_prices_v3 | latest_ssp_prices |
build_latest_facility_rc_families | export_facility_rc_families_v3 | latest_facility_rc_families |
build_latest_professional_line_items | export_professional_line_items_v3 | latest_professional_line_items |
build_latest_professional_conveners | export_professional_conveners_v3 | latest_professional_conveners |
build_latest_sub_category | export_sub_category_v3 | latest_sub_category |
build_latest_all_line_items | export_all_line_items_v3 | latest_all_line_items |
build_latest_metadata | — | latest_metadata |
build_latest_metadata is special — it gathers row counts from each
export_* into a key/value-style summary (ssp_descriptions_rows,
ssp_prices_rows, …) plus ssp_version, pipeline_date, and
export_date = current_date. The Airflow DAG gates it behind every
other latest_* task so the row counts are stable.