
Guide 04: Governance, Quality, and Contracts

Pillars

  1. Active catalog with ownership and business descriptions.
  2. End-to-end technical and semantic lineage.
  3. Versioned producer-consumer contracts.
  4. Continuous quality with agentic observability.
  5. GDPR/CCPA/AI Act compliance by design.

Catalog: DataHub setup

Installation

helm repo add datahub https://helm.datahubproject.io/
helm install datahub datahub/datahub \
  --set datahub-frontend.image.tag=v0.15.0 \
  --set datahub-gms.image.tag=v0.15.0

Connectors (Recipes)

  • Iceberg / Glue / Snowflake / BigQuery / Databricks
  • dbt (automatic lineage)
  • Airflow / Dagster
  • Kafka (topics + schemas)
  • BI: Power BI, Looker, Tableau

Conventions

  • Owners are mandatory on Gold tables (owners field).
  • Business glossary mapped to columns (tag glossary:churn_rate).
  • PII tags assigned in CI (tags PII, PII.email, PII.phone).
  • Tier classification: P0 (business-critical), P1 (important), P2 (analytical), P3 (exploratory).
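The first and last conventions lend themselves to an automated gate. The following is a minimal sketch, assuming catalog metadata has already been exported into plain Python objects; the `TableMeta` class and `convention_violations` function are hypothetical names, not part of DataHub.

```python
from dataclasses import dataclass, field

# Tier labels taken from the conventions list.
VALID_TIERS = {"P0", "P1", "P2", "P3"}


@dataclass
class TableMeta:
    """Hypothetical, simplified view of one catalog entry."""
    name: str
    layer: str                      # e.g. "gold" or "silver"
    tier: str = ""
    owners: list = field(default_factory=list)


def convention_violations(tables):
    """Return one human-readable message per violated convention."""
    problems = []
    for t in tables:
        # Owners are only mandatory on Gold tables.
        if t.layer == "gold" and not t.owners:
            problems.append(f"{t.name}: Gold table has no owner")
        # Every table is expected to carry one of the four tiers.
        if t.tier not in VALID_TIERS:
            problems.append(f"{t.name}: invalid tier {t.tier!r}")
    return problems
```

A CI job could fail the build whenever the returned list is non-empty.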

Data Contracts

Base spec (YAML)

# governance/data-contracts/customer_dim.yml
schemaVersion: v0.1.0
contract:
  id: gold.dim_customer
  version: 2.1.0
  owner: customer-platform
  consumers:
    - team: marketing-ops
      use_case: lifecycle campaigns
    - team: analytics
      use_case: BI dashboards
  description: |
    Conformed customer dimension. SCD2 with surrogate key.
    Updated hourly from CDC pipeline.

  schema:
    - name: customer_sk
      type: bigint
      constraints: [unique, not_null]
      description: Surrogate key (SCD2)
    - name: customer_id
      type: string
      constraints: [not_null]
      pii: false
    - name: email
      type: string
      constraints: [not_null]
      pii: true
      masking: hash_sha256_for_non_pii_consumers
    - name: segment
      type: string
      accepted_values: [SMB, MidMarket, Enterprise]
    - name: lifetime_value
      type: decimal(18,2)
      constraints: [non_negative]
    - name: valid_from
      type: timestamp
      constraints: [not_null]
    - name: valid_to
      type: timestamp
    - name: is_current
      type: boolean

  sla:
    freshness: 1h
    completeness: 99.9
    accuracy: 99.0
    availability: 99.9

  quality_tests:
    - name: customer_id_unique_when_current
      sql: |
        SELECT customer_id, COUNT(*) c
        FROM {{ this }}
        WHERE is_current = TRUE
        GROUP BY 1 HAVING COUNT(*) > 1
      severity: critical

  breaking_change_policy:
    type: explicit_pr_approval
    approvers: [marketing-ops, analytics]

  change_log:
    - version: 2.1.0
      date: 2026-04-15
      change: Added lifetime_value column
    - version: 2.0.0
      date: 2026-01-01
      change: Migrated to SCD2 with surrogate key
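To make the column-level rules concrete, here is a minimal sketch of how a consumer might check a single row against them. It assumes a subset of the schema has been transcribed by hand into Python data; `SCHEMA` and `validate_row` are illustrative names and do not belong to any contract tooling.

```python
from decimal import Decimal

# Hand-transcribed subset of the contract's schema section (assumption:
# a real pipeline would load this from the YAML file instead).
SCHEMA = [
    {"name": "customer_id", "constraints": ["not_null"]},
    {"name": "email", "constraints": ["not_null"]},
    {"name": "segment", "accepted_values": ["SMB", "MidMarket", "Enterprise"]},
    {"name": "lifetime_value", "constraints": ["non_negative"]},
]


def validate_row(row, schema=SCHEMA):
    """Return a list of constraint violations for one row (dict)."""
    errors = []
    for col in schema:
        name = col["name"]
        value = row.get(name)
        constraints = col.get("constraints", [])
        if "not_null" in constraints and value is None:
            errors.append(f"{name}: null not allowed")
        if value is None:
            continue  # remaining checks only apply to present values
        if "non_negative" in constraints and value < 0:
            errors.append(f"{name}: negative value {value}")
        allowed = col.get("accepted_values")
        if allowed is not None and value not in allowed:
            errors.append(f"{name}: {value!r} not in accepted values")
    return errors
```

For example, a row with a missing email, a segment of "Startup", and a negative lifetime_value yields three messages, one per violated rule.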

Enforcement in CI

# .github/workflows/contracts.yml
name: Data Contracts Check
on: [pull_request]
jobs:
  contract-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: datacontract/cli-action@v1
        with:
          command: lint governance/data-contracts/
      - name: Schema diff
        run: |
          python scripts/schema_diff.py \
            --contracts governance/data-contracts/ \
            --target prod
      - name: Block breaking changes without approval
        run: |
          python scripts/check_breaking_changes.py
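The workflow calls scripts/check_breaking_changes.py, which this guide does not show. As a rough sketch of what such a check might do, the function below compares two versions of a contract's schema list and reports removed columns, type changes, and tightened constraints. The function name and the choice of which changes count as breaking are assumptions; added columns are treated as non-breaking, in line with the minor bump to 2.1.0 in the change log.

```python
def breaking_changes(old, new):
    """Compare two schema lists (dicts with name/type/constraints)."""
    old_cols = {c["name"]: c for c in old}
    new_cols = {c["name"]: c for c in new}
    findings = []
    for name, col in old_cols.items():
        if name not in new_cols:
            findings.append(f"removed column: {name}")
            continue
        new_col = new_cols[name]
        if new_col["type"] != col["type"]:
            findings.append(
                f"type change on {name}: {col['type']} -> {new_col['type']}"
            )
        # Constraints present only in the new version can reject data
        # that the old version accepted.
        added = set(new_col.get("constraints", [])) - set(col.get("constraints", []))
        if added:
            findings.append(f"tightened constraints on {name}: {sorted(added)}")
    return findings
```

A CI step could exit non-zero when the list is non-empty and the pull request lacks approval from the teams named under breaking_change_policy.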

Quality: Soda

Setup

pip install soda-core-trino soda-core-spark

Checks (soda_checks/customer_dim.yml)

checks for gold.dim_customer:
  - row_count > 0
  - missing_count(email) = 0
  - duplicate_count(customer_id) = 0:
      filter: is_current = TRUE
  - freshness(valid_from) < 1h
  - schema:
      name: customer_dim_schema
      fail:
        when required column missing: [customer_id, email, segment]
        when forbidden column present: [ssn, credit_card]
  - invalid_count(segment) = 0:
      valid values: [SMB, MidMarket, Enterprise]
  - invalid_count(lifetime_value) = 0:
      valid min: 0
  - anomaly score for row_count < default

Execution

soda scan -d trino -c soda/configuration.yml soda_checks/

Integration with dbt

# dbt_project.yml
# soda_check is assumed to be a macro defined in this project; dbt does not ship one
on-run-end:
  - "{{ soda_check('soda_checks/') }}"

Great Expectations (for critical datasets)

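import great_expectations as gx
from great_expectations.core.expectation_configuration import ExpectationConfiguration

# Targets the GX 0.17/0.18 fluent API; GX 1.x renamed these entry points.
context = gx.get_context()
datasource = context.sources.add_pandas("orders")
suite = context.add_expectation_suite("orders_suite")

# order_id must be present
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_not_be_null",
    kwargs={"column": "order_id"},
))
# order_id must be unique
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_unique",
    kwargs={"column": "order_id"},
))
# amount must be strictly positive; the pair expectation compares two columns,
# so a literal bound needs expect_column_values_to_be_between instead
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_between",
    kwargs={"column": "amount", "min_value": 0, "strict_min": True},
))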

Observability: Gartner's 5 pillars

  1. Data quality: freshness, volume, schema, distribution, custom rules.
  2. Pipelines: runs, failures, duration anomalies.
  3. Infra/compute: Spark stage failures, OOM, query latencies.
  4. Cost: $ per query, per dataset, per domain.
  5. Usage: queries per user, deprecation candidates, hot paths.

Monte Carlo / Soda Cloud / Acceldata

  • Detect anomalies without explicit rules (ML baseline).
  • Root-cause notebooks.
  • Lineage 5 levels deep.
  • AI agents (2025-2026): Sifflet Sentinel/Sage/Forge, Acceldata AI agents, Bigeye AI Trust Platform.
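Detecting anomalies without explicit rules means learning a baseline from history rather than hard-coding thresholds. The sketch below uses a plain z-score over past daily row counts as a stand-in; it is an illustration only and does not reproduce the models any of these vendors actually use. The function name and the default threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, today, threshold=3.0):
    """Flag today's value if it deviates strongly from the historical baseline."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly constant history: any change is a deviation.
        return today != mu
    return abs(today - mu) / sigma > threshold
```

With a history of roughly 1,000 rows per day, a day with 400 rows is flagged while a day with 1,002 rows is not. Real baselines usually also account for seasonality and trend, which this sketch ignores.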

Lineage with OpenLineage

# Spark + OpenLineage
from openlineage.client import OpenLineageClient

client = OpenLineageClient.from_environment()
# Auto-instrumented via openlineage-spark-listener

Privacy and compliance

Automatic PII tagging

  • DataHub + Spark scanner detects patterns (email regex, SSN, etc.).
  • Tag applied in CI → propagated to Snowflake/UC/BigQuery masking policies.
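Pattern-based detection can be sketched as follows: sample values from a column, test them against a few regular expressions, and suggest a tag when most samples match. The regexes are deliberately loose, the `PII.ssn` tag name and the 80% threshold are assumptions, and none of this reflects how the DataHub scanner is implemented.

```python
import re

# Illustrative patterns only; production scanners use stricter rules.
PII_PATTERNS = {
    "PII.email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "PII.ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),   # US SSN layout
}


def suggest_pii_tags(samples, min_ratio=0.8):
    """Return tags whose pattern matches at least min_ratio of non-empty samples."""
    values = [s for s in samples if s]
    if not values:
        return []
    tags = []
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= min_ratio:
            tags.append(tag)
    return tags
```

Requiring a ratio instead of a single hit keeps one stray email address in a free-text column from tagging the whole column.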

Snowflake dynamic masking

CREATE OR REPLACE MASKING POLICY pii_email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('ANALYTICS_PII', 'COMPLIANCE') THEN val
    ELSE SHA2(val, 256)
  END;

ALTER TABLE gold.dim_customer
  MODIFY COLUMN email SET MASKING POLICY pii_email_mask;
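For unit-testing downstream logic outside the warehouse, the policy's behavior can be mirrored in a few lines. This is a sketch assuming the same two privileged roles as the policy; the Python function is a local stand-in and not something Snowflake executes.

```python
import hashlib

# Roles that see the clear-text value, copied from the masking policy.
PRIVILEGED_ROLES = {"ANALYTICS_PII", "COMPLIANCE"}


def mask_email(value, current_role):
    """Return the value as-is for privileged roles, else its SHA-256 hex digest."""
    if value is None:
        return None  # NULL stays NULL, as in SQL
    if current_role in PRIVILEGED_ROLES:
        return value
    return hashlib.sha256(value.encode("utf-8")).hexdigest()
```

Because the hash is deterministic, masked values still support joins and distinct counts. The flip side is that an unsalted hash of a guessable value such as an email address can be reversed by hashing candidate addresses, so it should not be treated as anonymization.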

Databricks UC row filter + column mask

CREATE FUNCTION mask_email(email STRING) RETURNS STRING
  RETURN CASE
    WHEN is_account_group_member('compliance') THEN email
    ELSE sha2(email, 256)
  END;

ALTER TABLE gold.dim_customer
  ALTER COLUMN email SET MASK mask_email;

Retention

-- Iceberg: expire old snapshots
CALL system.expire_snapshots(
  table => 'silver.events',
  older_than => current_timestamp() - INTERVAL '365' DAY,
  retain_last => 30
);
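The interaction between older_than and retain_last is easy to misread: a snapshot is removed only if it is older than the cutoff and is not among the most recent retain_last snapshots. The sketch below models that rule for a simple linear history; it ignores branches, tags, and table-level retention properties, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_expire(snapshot_times, now, max_age=timedelta(days=365), retain_last=30):
    """Return the snapshot timestamps that the retention rule would remove."""
    ordered = sorted(snapshot_times)
    # The newest retain_last snapshots are kept regardless of age.
    protected = set(ordered[-retain_last:]) if retain_last > 0 else set()
    cutoff = now - max_age
    return [ts for ts in ordered if ts < cutoff and ts not in protected]
```

With the settings shown in the CALL, a table that commits less than once every twelve days would keep snapshots older than a year, because the 30 most recent ones are always protected.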

GDPR right-to-be-forgotten

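-- With merge-on-read enabled, DELETE in Iceberg writes delete files (deletion vectors in format v3) instead of rewriting data files
DELETE FROM silver.users WHERE user_id = '<DSR-001>';
-- A new snapshot is created automatically and rollback is possible; the deleted rows stay readable in older snapshots until those are expired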

Governance metrics

  • % of Gold tables with an assigned owner: 100% (Phase 2 gate).
  • % of Gold tables with a versioned contract: 100% (Phase 2 gate).
  • MTTD for quality incidents: <15 min.
  • MTTR for P0: <2 h.
  • Catalog adoption (weekly active users): >70% of analysts.
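The two time-based targets can be computed from incident records. The following sketch assumes each incident carries three timestamps and that MTTR is measured from detection to resolution; teams that measure from incident start would need to adjust. The `Incident` class and helper names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the data issue began
    detected: datetime   # when monitoring raised it
    resolved: datetime   # when the fix was confirmed


def _mean(deltas):
    deltas = list(deltas)
    if not deltas:
        raise ValueError("no incidents to average")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: start -> detection."""
    return _mean(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to resolve: detection -> resolution (assumed definition)."""
    return _mean(i.resolved - i.detected for i in incidents)
```

Comparing the results against `timedelta(minutes=15)` and `timedelta(hours=2)` turns the targets into a pass/fail check.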