Skip to main content

Physician Groups DAG

DAG ID: spines_physician_groups

Schedule: Manual trigger only

Tags: turquoise-health, data-science, physician-groups, graph-db, provider-network

Source: airflow_dags/dags/data_science/spines/physician_groups

Overview

The spines_physician_groups DAG builds and refreshes the Physician Groups spines tables. It uses PECOS, Definitive, Komodo, and MRF data to construct a graph database of provider relationships, then derives health system affiliations and produces 6 output spines tables.

Pipeline Phases

The DAG runs sequentially through 6 phases. Most of the pipeline is straightforward SQL — the only complex part is the NetworkX graph analysis in Phase 3, which feeds one spines table (associated_health_systems). The other 5 spines tables are built directly from bridge/similarity tables and can be understood independently.

start_dag
→ Phase 1: Data Preparation ┐
→ Phase 2: HCP Similarity Analysis ├─ SQL-only
→ Phase 3: Analysis & Graph Building │ (NetworkX graph - see below)
→ Phase 4: Spines Table Creation │
→ Phase 5: QA Tests │
→ Phase 6: Promotion ┘
end_dag

Spines Pipeline (SQL-only)

These phases cover the bulk of the DAG. If you only need to understand how the spines tables are built, this section is self-contained — 5 of the 6 output tables are produced here with no dependency on the NetworkX graph.

Phase 1: Data Preparation

Prepares PECOS data and creates bridge tables that map relationships between entity types.

TaskDescriptionOutput Table
create_clean_provider_typesJoins PECOS data with manual provider type mappingspecos_data_clean_provider_types_{version}
create_physician_groupsLoads PECOS/Definitive PGs, filters to PACs with >5 reassignments. Excludes optometry, therapy & rehab, behavioral health, nurse/PAphysician_groups_{version}
create_tin_hcp_bridgeTIN ↔ HCP NPI bridge from MRF compressed_providerstin_hcp_bridge_{version}
create_pac_hcp_bridgePAC ↔ HCP NPI bridge from PECOS reassignmentspac_hcp_bridge_{version}
create_hco_hcp_bridgeHCO NPI ↔ HCP NPI bridge from Komodo claimshco_hcp_bridge_{version}
create_hs_hco_bridgeHealth System ↔ HCO NPI bridge from spines tableshs_hco_bridge_{version}

Phase 2: HCP Similarity Analysis

Runs 4 tasks in parallel, each computing Jaccard similarity, containment scores, and intersection counts between pairs of organizations based on shared HCP NPIs.

TaskEdge TypeOutput Table
create_pac_hco_similarityPAC ↔ HCOpac_hco_similarity_{version}
create_pac_tin_similarityPAC ↔ TINpac_tin_similarity_{version}
create_pac_pac_similarityPAC ↔ PACpac_pac_similarity_{version}
create_hco_tin_similarityHCO ↔ TINhco_tin_similarity_{version}

Edge filtering thresholds:

jaccard_threshold: 0.3
min_containment_threshold: 0.5
intersection_count: 50

An edge is kept if the intersection is ≥50 shared HCPs (or ≥25% of reassignments) AND either Jaccard > 0.3 or at least one containment > 0.5.

Phase 4: Spines Table Creation

Creates the 6 final output tables. 5 of 6 are pure SQL built from the bridge and similarity tables above:

TaskOutput TableInputs
create_spines_provider_physician_groupsspines_provider_physician_groupsphysician_groups, PECOS, Definitive, NUCC taxonomy
create_spines_associated_tinsspines_provider_physician_groups_associated_tinspac_tin_similarity, physician_groups
create_spines_associated_facilitiesspines_provider_physician_groups_associated_facilitiespac_hco_similarity, physician_groups, hospital spines
create_spines_associated_physiciansspines_provider_physician_groups_physicianspac_hcp_bridge, NPPES, NUCC taxonomy
create_spines_associated_npisspines_provider_physician_groups_additional_npisDefinitive PG tables, NPPES, spines provider

The remaining table depends on the NetworkX graph output:

TaskOutput TableInputs
create_spines_associated_health_systemsspines_provider_physician_groups_associated_health_systemspg_hs_affiliations (from Phase 3 — see NetworkX Graph Analysis below)

Phase 5: QA Tests

Runs 16 tests in parallel across 7 QA sub-groups:

  • spines_provider_physician_groups_qa — uniqueness on provider_id + row count checks
  • associated_tins_qa — uniqueness on (provider_id, tin), row count, containment/Jaccard/intersection rule validation
  • associated_health_systems_qa — uniqueness on (provider_id, hs_id), distinct count checks
  • spines_physicians_qa — uniqueness on (provider_id, hcp_npi), row count checks
  • additional_npis_qa — uniqueness on (provider_id, npi)
  • associated_facilities_qa — uniqueness on (provider_id, facility_npi)
  • general_qa — cross-table count validations

Phase 6: Promotion

Promotes all 6 spines tables from the dev schema to the production schema using CTAS statements.


NetworkX Graph Analysis (Phase 3)

This phase builds a NetworkX graph from the similarity tables (Phase 2) and traverses it to determine which health systems each physician group is affiliated with. Its sole downstream consumer is the associated_health_systems spines table.

TaskDescriptionOutput
create_pac_connectionsAggregates similarity tables; creates arrays of HCO neighbors, TIN neighbors, and union of HCPs per PACpac_connections_{version}
create_networkx_graphBuilds undirected graph with PAC, HCO, TIN, HS nodes and all edges. Saves to S3 as pickleS3: provider_network_graph_{version}.pkl
create_pac_to_hs_associationsTraverses graph for PAC→HCO→HS and PAC→TIN→HCO→HS paths. Classifies as "Single Affiliated" or "Multiple Affiliated"pac_to_hs_associations_{version}
create_final_outputMerges graph associations with manual HS assignments, PAC metadata, and HS name lookupspg_hs_affiliations_{version}

Affiliation logic: If the graph identifies a single health system → use it. If multiple → fall back to manual assignments → fall back to a fallback health system.

For more detail on the graph methodology, see PG Graph.

Configuration

Default DAG parameters:

ParameterDefaultDescription
jaccard_threshold0.3Jaccard similarity threshold for filtering edges
min_containment_threshold0.5Containment score threshold
intersection_count50Minimum HCP NPI overlap to include edge
sub_version2025_07Version suffix for table naming
schematq_dev.internal_dev_csong_graphWorking dev schema
prod_schematq_dev.internal_dev_csong_graphTarget promotion schema

External table config: Defined in config/table_config.yaml — maps CMS, Definitive, Spines, Komodo, and MRF source tables.

Data Lineage

PECOS Raw Data + Manual Provider Type Mappings

[clean_provider_types]

[physician_groups] ← Definitive

[Bridge Tables: PAC-HCP, HCO-HCP, TIN-HCP, HS-HCO]

[Similarity Tables: PAC-HCO, PAC-TIN, HCO-TIN, PAC-PAC]

├──→ 5 Spines Tables (SQL-only) → QA → Promotion

└──→ [NetworkX Graph] → S3

[PAC-HS Associations] ← Manual HS assignments

[pg_hs_affiliations]

associated_health_systems → QA → Promotion

Data Sources

SourceTables Used
CMS PECOSpecos_provider_revalidation, pecos_provider_revalidation_reassignment
Definitivephysician_groups_overview, physician_groups_locations, physician_groups_affiliations
Komodomedical_headers, medical_service_lines
MRFcompressed_providers (dynamically constructed per month)
Spinesspines_provider_hospitals, spines_provider_health_systems, and association tables