Skip to main content

Methodology

The sections below describe the business logic of each DAG stage. Every stage is an Airflow TaskGroup; task names map 1:1 to SQL files under dags/data_science/ssp/ssp_pricing/sql/<stage>/*.sql.


Stage 1 — pre_build

Produces the SSP "universe" tables that every downstream stage reads, plus sampled encounter volumes used for SSP-level popularity/revenue share.

ssp_groupings

The raw crosswalk (ssp_initial_pilot_codes_and_xwalk_v3) is enriched with RII-based tier multipliers. For inpatient SSPs whose sub_category is coded as -, each RII tier's average intensity score is pulled from rii_code_tiers and converted into a multiplier centered on 1.0.

How the spread is calibrated:

  • The DRG canonical-rate spread for that SSP (max_median / min_median across the DRGs in the group) determines the width of the multiplier range. Wider DRG spread → wider multipliers.
  • target_ratio = clamp(sqrt(drg_ratio), 1.2, 3.0) — the square root damps extremes.
  • Multipliers interpolate from 1/sqrt(target_ratio) up to sqrt(target_ratio) across the ordered tiers.

A multiplier of 1.0 means that tier is priced at the base weighted average; values above scale up, below scale down. Explicit (non--) subcategories are copied through untouched and get no multiplier.

rii_tier_mapping

Joins the per-tier multipliers back to the underlying (ssp, code, code_type) rows so any downstream task can attach an intensity-based multiplier to a specific DRG.

Average line-code units

Three SQL files build a short pipeline for per-line-code average units used per encounter:

  1. avg_line_code_units_inst_enc — Finds anchored institutional encounters (OP: komodo_op_max_pay_line highest-paying line = anchor CPT/HCPCS; IP: DRG match via ipps_grouper_drgs).
  2. avg_line_code_units_all_enc — Adds professional encounters that share a visit_id with an anchored institutional encounter. These become the PRO rows in the final table.
  3. avg_line_code_units — Joins every anchored encounter to medical_service_lines_both and averages units per (ssp_grouper, fee_type, billing_code). Used later to scale anesthesia time-based fees and some line-code professional rates.

Encounter volume (sampled)

ssp_encounter_matches_op and ssp_encounter_matches_ip each take a 1% sample of encounters (rand() < 0.01) and join to the relevant anchor codes; ssp_encounter_volume aggregates the union to produce per-SSP encounter_pct and revenue_pct (weighted by commercial canonical rate). This table feeds export_ssp_descriptions. The 1% sample is a cost optimization — the DAG doesn't need exact counts for relative share.


Stage 2 — claims

Derives line-code tables from Komodo claims data. Three outputs: manual_institutional_line_codes, manual_professional_line_codes, and the consolidated supplemented_sub_package_contents that downstream line-code tasks read.

Anchor encounters

Two tables define "encounters that belong to an SSP":

  • ssp_op_anchor_encounters — OP facility claims (mh.claim_pos = 'OP', statement_to = statement_from, discharge_status_code = '01') where the max-pay service line matches a CPT/HCPCS anchor code.
  • ssp_ip_anchor_encounters — IP facility claims (mh.claim_pos = 'IP') whose DRG (from ipps_grouper_drgs) matches an MS-DRG anchor code.

These are the full-population encounter matches (no 1% sample) and serve as the foundation for line-code discovery.

excluded_line_codes

A small global exclusion list that both discover_institutional and discover_professional filter out:

CodeDescription
36415Venipuncture, routine
36416Venipuncture, capillary
36410Venipuncture, age 3+
U0002COVID-19 test, non-CDC

These codes appear on nearly every claim regardless of procedure and would inflate line-code counts without adding pricing signal.

discover_institutionalmanual_institutional_line_codes

For each anchor code, finds other procedure/revenue codes co-occurring on the same institutional encounter. For each (anchor_code, line_code, item_type) triple the SQL computes:

  • encounter_count — distinct encounters carrying this line code.
  • average_charge_amount — encounter-weighted average of line_charge.
  • revenue_code_association_rate — share of line-code encounters carrying the modal revenue code. Low values (<0.3) signal the line code has unstable RC mapping; used by Stage 3 institutional_line_codes to drop claims-derived revenue codes that aren't reliable.
  • association_rate — encounter share relative to the max-frequency line code for that anchor.

Inclusion rules (is_included = TRUE):

  • association_rate > encounter_threshold (default 0.3)
  • Anchor had ≥ 100 encounters
  • Line code is not an anchor of a non-I&S SSP (self-match is allowed — I&S anchor codes are legitimately shared across SSPs as ancillary line codes)
  • Line code not in excluded_line_codes
  • Line code does not end in F (CPT Category II tracking codes, non-billable)

Combo-SSP line codes (43250 + 45384, 43251 + 45385) are hardcoded at the end via a VALUES block.

discover_professionalmanual_professional_line_codes

Same shape as institutional but sourced from professional claims sharing a visit_id with an anchored institutional encounter. Covers both OP (CPT/HCPCS) and IP (MS-DRG) anchors — the MS-DRG branch joins on visit_id from ssp_ip_anchor_encounters so inpatient professional line codes (e.g. 99223) are discovered the same way. Inclusion rules mirror the institutional side (no Medicare-rate floor; non-I&S anchors only allowed on their own SSP; exclusions and F-suffix filter).

Three sources are combined via BOOL_OR(is_included) so any TRUE source wins:

  1. Claims-derived filter output.
  2. Combo-SSP professional codes (hardcoded VALUES block).
  3. I&S ancillary augmentation. For every (base_code, line_code) in manual_institutional_line_codes where the line code is an I&S SSP anchor (e.g. a lab or radiology code), a synthetic is_included = TRUE professional row is added — even if the claims-derived row had is_included = FALSE.

combined_line_codessupplemented_sub_package_contents

Consolidates the two manual tables + average units into a single table on grain (ssp_grouper, base_code, line_code, fee_type). Two business-logic filters are applied on top of the Stage 2 is_included flag:

  1. Anchors must be professionally billed. CPT/HCPCS anchors only survive if the anchor code itself appears as an included professional line code (base_code = line_code on the pro side). Ancillary pro codes (anesthesia, radiology, placeholder unlisted codes like 53899) alone are not enough — the procedure itself must be professionally billed for the SSP to have a usable professional fee. MS-DRG anchors are exempt because the DRG never appears on a professional claim directly.
  2. I&S non-anchor lines are dropped. For I&S-category SSPs, only the anchor itself survives (line_code = base_code). I&S SSPs price individual services, so ancillary line codes don't apply regardless of association.

Other merge rules:

  • Institutional rows carry fee_type = 'Facility Fee'.
  • Professional rows get fee_type = 'Professional Fee' unless the line code is classified as Anesthesia, Lab/Path, or Radiology — those get 'Optional Fee'. Anesthesia classification here is gated on anesthesia_reference_pricing only; cats.category = 'Anesthesia' fall-through is suppressed via NULLIF so the spines category can't accidentally upgrade a non-reference code into Anesthesia.
  • average_units falls back to avg_line_code_units when sub_package_contents doesn't carry one; defaults to 1.
  • Dedup keeps the Professional Fee variant over Optional Fee when both exist for the same line code, and prefers rows with non-null revenue_code.
  • revenue_code_association_rate from Stage 2 is threaded through so Stage 3 can gate on it.

rc_family_proportionsrevenue_code_family_proportions

Computes association_rate and avg_line_charge per (ssp_grouper, revenue_code_family) using the anchor-encounter tables. association_rate is capped at 1.0 and is normalized against the anchor code's own revenue-code family (so the anchor family has association rate 1.0 and every other family is relative to it).

Two averaging rules are applied to avg_line_charge:

  • Per-encounter services (Gastrointestinal Services, Operating Room Services, Emergency Room, Clinic) — total_charge / encounter_count. These are typically billed once per encounter, so dividing by encounters that don't carry the family would dilute the signal.
  • Everything elsetotal_charge / total_encounters. Lab/pharmacy etc. may repeat within an encounter, so a per-total-encounter average is more stable.

Also maps 0360-0369 (Operating Room) to 0450-0459 (Emergency Room) for SSPs whose name contains ns_ (non-surgical). See Manual Edits for the business justification.


Stage 3 — line_codes

Enriches the raw line-code tables from Stage 2 into the price-ready tables the fee-schedule stage consumes.

institutional_line_codes

One row per (ssp_grouper, sub_category, pos, base_code, line_code, revenue_code_family).

Adds three pieces of metadata:

  1. label — identifies the role of the line code:
    • Anchor Code — the SSP's own anchor.
    • Revenue Code — a revenue-code-type line (comes directly from sub_package_contents).
    • Carved Out: Drug — line on drug revenue code 0636 that is also in asp_reference_pricing with avg_line_charge > $1,000.
    • Carved Out: Implant — line on an implant revenue code (0275 pacemakers or 0278 other implants) with a device code (C / Q / L prefix) from ref_aapc_cpthierarchy.
  2. Revenue-code family resolution — uses (in order): an override in rc_overrides (currently 44390 → 0750, 21820 → 0450); the claims-derived revenue code from supplemented_sub_package_contents only when its revenue_code_association_rate > 0.30 (otherwise dropped so the fallback takes over); the revenue_codes dictionary match on the specific code; a direct revenue-code match; an HCPCS→RC crosswalk fallback. For ns_ SSPs, anchor-code revenue families in Operating Room are rewritten to Emergency Room.
  3. rc_family_avg_line_charge — looked up from revenue_code_family_proportions, used later by Stage 4 RC-family allocation.

ssp_line_code_service_types

Classifies every candidate professional line code into one of four service types via a priority waterfall:

  1. Code is in anesthesia_reference_pricing (and not in anesthesia_exclusion_codes) → Anesthesia.
  2. Code is in clinical_laboratory_reference_pricingLab/Path.
  3. Code is in labpath_radiology_codes (manual list) → Lab/Path.
  4. services_spines_cleaned.ccs_category in the Lab/Path list → Lab/Path.
  5. ccs_category in the Radiology list (or LIKE '%CT scan%', '%ultrasound%', '%radiology%', '%X-ray%') → Radiology.
  6. Default — spines_services_clinical_categories.category, except 'Anesthesia' is suppressed to NULL. Only anesthesia_reference_pricing is trusted to classify a code as Anesthesia; ccs_category = 'Anesthesia' and cats.category = 'Anesthesia' fall-throughs are intentionally disabled because they over-flagged non-anesthesia procedural codes.

ancillary_encounter_proportions

Computes the share of encounters where each ancillary service type (Anesthesia, Lab/Path, Radiology) actually appears for an SSP. It builds one row per anchored visit_id (IP or OP), joins professional claims on that visit, and divides the encounter count for each service type by the SSP's max-service-type encounter count. Used by professional_line_codes to dampen the impact of ancillary codes that only appear on a minority of encounters.

professional_line_codes

One row per (ssp_grouper, sub_category, pos, base_code, line_code, fee_type, provider_id).

For each line code the table carries:

  • Commercial rate from national_benchmarks (Clear Rates PPO canonical rate for Physician Group / Professional bill type), multiplied by the line code's units (anesthesia time-based codes get units = max(avg_units/15, 1), others clamp to 1).
  • Medicare rate from physician_reference_pricing by state; falls back to anesthesia_reference_pricing by state, then national CLFS for lab codes. A national median row is also appended with provider_id = '0' so providers without a state-specific rate still get a fallback.
  • Service type from ssp_line_code_service_types (or Professional default).
  • Gross charge = commercial rate × markup ratio (if the SSP has one).

Rows are filtered to association_rate IS NULL OR association_rate > 0.3, which keeps sub_package_contents-derived codes (no association rate computed) while dropping rare claims-derived codes.

professional_line_code_ncci_groups

The only Python-in-the-DAG task. Pure SQL can't do connected-component clustering, so this task:

  1. Loads the distinct (ssp_grouper, line_code, service_type) tuples from professional_line_codes into pandas.
  2. Loads CMS NCCI procedure-to-procedure edits (tq_production.reference_legacy.ref_cms_ncci_ptp_edits_practitioner_v2) filtered to mutually exclusive / "more extensive" / preparation-for-anesthesia rationales.
  3. Adds hardcoded mutually exclusive pairs (_MANUAL_ME_PAIRS) — see Manual Edits.
  4. Runs a Union-Find (a.k.a. disjoint-set) clustering per (ssp_grouper, service_type). Codes in the same component are mutually exclusive and form one NCCI group.
  5. Writes the resulting (ssp_grouper, service_type, ncci_group, line_code, group_size) rows back to Trino via batched INSERTs, then runs ncci_groups_append.sql to merge ncci_group + ncci_group_size into professional_line_codes.

Stage 4 — fee_schedules

Produces the SSP- and subcategory-level fee schedules, the revenue-code-family allocation, and the combined output tables.

institutional_fee_schedule

For each (ssp_grouper, sub_category, pos, provider_id) the institutional price is a volume-weighted average of commercial canonical rates across codes, scaled by the tier multiplier from Stage 1:

subcategory_price = multiplier × Σ (rate_i × volume_i) / Σ volume_i
  • Rates come from the national_benchmarks CTE (Clear Rates prod_combined_abridged, PPO networks, Hospital provider type, validated-first, fallback to backup).
  • Volume is encounter volume from the acute_care_large volume table joined to prod_combined_abridged (defaults to 1).
  • Each rate carries a gross_charge_ratio (validated → backup → 1) derived from %_to_dol canonical rates that include a gross charge, yielding subcategory_gross_charge alongside the commercial price.

Medicare rates use the medicare CTE (IPPS + OPPS reference pricing, per-provider where available, falling back to a national median) and are picked with MAX_BY(medicare_rate, volume) so the highest-volume code wins rather than an average.

Subcategory prices are then rolled up to the SSP level using the same volume-weighted average:

ssp_grouper_price = Σ (subcategory_price_j × volume_j) / Σ volume_j

institutional_rc_family_allocation

Splits the institutional price across revenue-code families using the proportions from Stage 2 (revenue_code_family_proportions).

  • Carve-out families (Carved Out: Drug, Carved Out: Implant) are priced independently at rc_family_avg_line_charge × 0.5. They are additive on top of the base price.
  • Base families share the institutional price in proportion to their rc_family_avg_line_charge (so proportions sum to 1.0 within an SSP).
  • Two gating rules decide whether a family qualifies as a carve-out:
    • Pharmacy (0630-0639) — carved out only if avg_line_charge > $2,000.
    • Med/Surg Supplies (0270-0279) — only added if the SSP actually has a Carved Out: Implant-labeled line.

The table emits both a charge_allocation (proportional share) and an allowed_allocation (share based on Komodo relative_association); the export stage uses both for different views.

professional_fee_schedule

NCCI-aware aggregation in three passes:

  1. Within each NCCI group (mutually exclusive codes) — volume-weighted average of commercial / Medicare rates. Picking one code from the group is the correct pricing model because only one can be billed on a real encounter.
  2. Across NCCI groups — rates are summed. Codes in different groups can be billed together.
  3. Across base codes — rolls to (ssp_grouper, sub_category, pos, provider_id) with a volume-weighted average, then up to (ssp_grouper, pos, provider_id).

The final table emits separate price columns per service type: subcategory_professional_price, subcategory_anesthesia_price, subcategory_labpath_price, subcategory_radiology_price — plus their Medicare + gross-charge companions.

combined_subcategory_fee_schedule + combined_ssp_fee_schedule

Joins institutional and professional fee schedules on (ssp_grouper, sub_category, pos, provider_id) and computes the final output columns.

Institutional:

inst_price = subcategory_price + Σ carveout_allocations

Professional components:

ColumnFormula
primary_pricesubcategory_professional_price
anes_pricesubcategory_anesthesia_price × 0.5 (anesthesiologist share)
crna_pricesubcategory_anesthesia_price × 0.5 (CRNA share)
assistant_surgeon_priceprimary_price × 0.16
assistant_nonsurgeon_priceprimary_price × 0.136
labpath_pricesubcategory_labpath_price
radiology_pricesubcategory_radiology_price

Total professional uses the full anesthesia fee (not 50%) so the anes + CRNA halves aren't double-counted:

prof_price = primary
+ anesthesia_full
+ labpath
+ radiology
+ primary × 0.16
+ primary × 0.136

combo_ssps task

Runs four SQL files in sequence:

  1. combo_ssps_delete.sql — removes any prior combo-SSP rows from combined_ssp_fee_schedule.
  2. combo_ssps_delete_subcategory.sql — same for the subcategory table.
  3. combo_ssps_insert.sql — joins the two source SSPs per provider_id and applies multiple-procedure logic:
    combo_price = 1.0 × primary + 0.5 × secondary
    Primary = whichever SSP has the higher inst_price + prof_price for that provider. Every price column is recomputed as the combo of A/B with those factors.
  4. combo_ssps_insert_subcategory.sql — same at the subcategory level (combos have a single sub_category = '0').

Combo SSPs use segment 2 in their name (e.g. GA.2.colonoscopy_and_egd) to distinguish them from standard (segment 0) SSPs.

Combo SSPSource ASource BPOS
GA.2.colonoscopy_and_egdGA.0.colonoscopyGA.0.egdOP

New combos require matching manual line-code entries — see Manual Edits.


Stage 5 — export

Eight versioned export_* tables package the combined fee schedules for downstream consumers (UI, QA, analytics). Each is a CREATE OR REPLACE built from combined_ssp_fee_schedule / combined_subcategory_fee_schedule plus a small crosswalk lookup.

TaskOutputPurpose
build_export_ssp_descriptionsexport_ssp_descriptions_v3SSP-level metadata with encounter-share and revenue-share from ssp_encounter_volume
build_export_ssp_pricesexport_ssp_prices_v3Provider × SSP totals including allowed_price variants from the RC-family allocation
build_export_facility_rc_familiesexport_facility_rc_families_v3Facility charges broken out per revenue-code family (useful for hospital rate build-up)
build_export_professional_line_itemsexport_professional_line_items_v3Per-line-code professional prices (granular view for QA)
build_export_professional_convenersexport_professional_conveners_v3Professional price rolled up by convener (primary / anes / labpath / radiology)
build_export_sub_categoryexport_sub_category_v3Subcategory-level prices per provider
build_export_all_line_itemsexport_all_line_items_v3Unified line-item view across institutional + professional
build_export_metadataexport_metadata_v3Key/value: pipeline_date, ssp_version

export_all_line_items reads from export_professional_line_items; export_metadata depends on all other export tasks completing (so row counts are stable).


Stage 6 — latest

Upserts each export_* artifact into a shared, generic-library table tagged by ssp_version. These latest_* tables are how downstream consumers read a consistent cross-version view — every pipeline run leaves behind the prior versions and adds/refreshes rows for its own version.

Each task uses the same pattern (_upsert_latest helper in tasks/export.py):

DELETE FROM {schema}.latest_{name} WHERE version = '{ssp_version}'
INSERT INTO {schema}.latest_{name} SELECT '{ssp_version}', ... FROM
{schema}.export_{name}_{table_version}

The DELETE + INSERT pair makes the upsert idempotent — re-running the pipeline for the same ssp_version replaces that version's rows without touching other versions.

TaskSourceDestination
build_latest_ssp_descriptionsexport_ssp_descriptions_v3latest_ssp_descriptions
build_latest_ssp_pricesexport_ssp_prices_v3latest_ssp_prices
build_latest_facility_rc_familiesexport_facility_rc_families_v3latest_facility_rc_families
build_latest_professional_line_itemsexport_professional_line_items_v3latest_professional_line_items
build_latest_professional_convenersexport_professional_conveners_v3latest_professional_conveners
build_latest_sub_categoryexport_sub_category_v3latest_sub_category
build_latest_all_line_itemsexport_all_line_items_v3latest_all_line_items
build_latest_metadatalatest_metadata

build_latest_metadata is special — it gathers row counts from each export_* into a key/value-style summary (ssp_descriptions_rows, ssp_prices_rows, …) plus ssp_version, pipeline_date, and export_date = current_date. The Airflow DAG gates it behind every other latest_* task so the row counts are stable.