Guide 04: Governance, Quality, and Contracts
Pillars
- Active catalog with ownership and business descriptions.
- End-to-end technical and semantic lineage.
- Versioned producer-consumer contracts.
- Continuous quality with agentic observability.
- GDPR/CCPA/AI Act compliance by design.
Catalog: DataHub setup
Installation
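helm repo add datahub https://helm.datahubproject.io/
# Assumes Kafka, MySQL, and Elasticsearch are already reachable, for example
# via the datahub/datahub-prerequisites chart from the same repo.
helm install datahub datahub/datahub \
  --set datahub-frontend.image.tag=v0.15.0 \
  --set datahub-gms.image.tag=v0.15.0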
Connectors (Recipes)
- Iceberg / Glue / Snowflake / BigQuery / Databricks
- dbt (automatic lineage)
- Airflow / Dagster
- Kafka (topics + schemas)
- BI: Power BI, Looker, Tableau
Conventions
- Owners are mandatory on Gold tables (owners field).
- Business glossary mapped to columns (tag glossary:churn_rate).
- PII tags assigned in CI (tags PII, PII.email, PII.phone).
- Tier classification: P0 (business-critical), P1 (important), P2 (analytical), P3 (exploratory).
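The owner and tier conventions above lend themselves to an automated check. The following is a minimal sketch, assuming the catalog metadata has been exported to a plain dictionary; the field names (owners, tier) and the function convention_violations are hypothetical and do not reflect the DataHub API.

```python
# Hypothetical catalog export: table name -> metadata.
# The field names "owners" and "tier" are assumptions for illustration.
VALID_TIERS = {"P0", "P1", "P2", "P3"}


def convention_violations(tables: dict[str, dict]) -> list[str]:
    """Return human-readable violations of the owner and tier conventions."""
    problems = []
    for name, meta in sorted(tables.items()):
        # Gold tables must declare at least one owner.
        if name.startswith("gold.") and not meta.get("owners"):
            problems.append(f"{name}: Gold table has no owner")
        # Every table must carry one of the four tiers.
        if meta.get("tier") not in VALID_TIERS:
            problems.append(f"{name}: tier must be one of P0-P3")
    return problems


sample = {
    "gold.dim_customer": {"owners": ["customer-platform"], "tier": "P0"},
    "gold.fct_orders": {"owners": [], "tier": "P1"},
    "silver.events": {"tier": "P9"},
}

for line in convention_violations(sample):
    print(line)
```

A check like this could run in CI next to the PII tagging step, failing the build when the returned list is not empty.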
Data Contracts
Base spec (YAML)
# governance/data-contracts/customer_dim.yml
schemaVersion: v0.1.0
contract:
  id: gold.dim_customer
  version: 2.1.0
  owner: customer-platform
  consumers:
    - team: marketing-ops
      use_case: lifecycle campaigns
    - team: analytics
      use_case: BI dashboards
  description: |
    Conformed customer dimension. SCD2 with surrogate key.
    Updated hourly from CDC pipeline.
schema:
  - name: customer_sk
    type: bigint
    constraints: [unique, not_null]
    description: Surrogate key (SCD2)
  - name: customer_id
    type: string
    constraints: [not_null]
    pii: false
  - name: email
    type: string
    constraints: [not_null]
    pii: true
    masking: hash_sha256_for_non_pii_consumers
  - name: segment
    type: string
    accepted_values: [SMB, MidMarket, Enterprise]
  - name: lifetime_value
    type: decimal(18,2)
    constraints: [non_negative]
  - name: valid_from
    type: timestamp
    constraints: [not_null]
  - name: valid_to
    type: timestamp
  - name: is_current
    type: boolean
sla:
  freshness: 1h
  completeness: 99.9
  accuracy: 99.0
  availability: 99.9
quality_tests:
  - name: customer_id_unique_when_current
    # Returns one row per customer_id that has more than one current record.
    sql: |
      SELECT customer_id, COUNT(*) AS c
      FROM {{ this }}
      WHERE is_current = TRUE
      GROUP BY customer_id
      HAVING COUNT(*) > 1
    severity: critical
breaking_change_policy:
  type: explicit_pr_approval
  approvers: [marketing-ops, analytics]
change_log:
  - version: 2.1.0
    date: 2026-04-15
    change: Added lifetime_value column
  - version: 2.0.0
    date: 2026-01-01
    change: Migrated to SCD2 with surrogate key
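To illustrate how the column constraints in the spec translate into row-level checks, here is a minimal sketch in plain Python. It assumes a subset of the contract has already been loaded into a list of dictionaries; the helper validate_rows and the sample rows are hypothetical, and a real pipeline would more likely push these checks down to SQL.

```python
from decimal import Decimal

# Subset of the gold.dim_customer contract, written as plain Python data.
COLUMNS = [
    {"name": "customer_sk", "constraints": ["unique", "not_null"]},
    {"name": "customer_id", "constraints": ["not_null"]},
    {"name": "email", "constraints": ["not_null"]},
    {"name": "segment", "accepted_values": ["SMB", "MidMarket", "Enterprise"]},
    {"name": "lifetime_value", "constraints": ["non_negative"]},
]


def validate_rows(rows: list[dict]) -> list[tuple[int, str, str]]:
    """Return (row index, column, violated rule) for every broken constraint."""
    errors = []
    seen: dict[str, set] = {}  # values already observed per unique column
    for i, row in enumerate(rows):
        for col in COLUMNS:
            name = col["name"]
            value = row.get(name)
            constraints = col.get("constraints", [])
            if value is None:
                if "not_null" in constraints:
                    errors.append((i, name, "not_null"))
                continue  # remaining rules only apply to non-null values
            if "unique" in constraints:
                values = seen.setdefault(name, set())
                if value in values:
                    errors.append((i, name, "unique"))
                values.add(value)
            if "non_negative" in constraints and value < 0:
                errors.append((i, name, "non_negative"))
            accepted = col.get("accepted_values")
            if accepted is not None and value not in accepted:
                errors.append((i, name, "accepted_values"))
    return errors


rows = [
    {"customer_sk": 1, "customer_id": "C1", "email": "a@example.com",
     "segment": "SMB", "lifetime_value": Decimal("10.00")},
    {"customer_sk": 1, "customer_id": "C2", "email": None,
     "segment": "Startup", "lifetime_value": Decimal("-5.00")},
]

for error in validate_rows(rows):
    print(error)
```

The second sample row deliberately breaks four rules (duplicate surrogate key, null email, unknown segment, negative lifetime value), so each constraint type appears once in the output.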
Enforcement in CI
# .github/workflows/contracts.yml
name: Data Contracts Check
on: [pull_request]
jobs:
  contract-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: datacontract/cli-action@v1
        with:
          command: lint governance/data-contracts/
      - name: Schema diff
        run: |
          python scripts/schema_diff.py \
            --contracts governance/data-contracts/ \
            --target prod
      - name: Block breaking changes without approval
        run: |
          python scripts/check_breaking_changes.py
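The workflow calls scripts/check_breaking_changes.py without showing it. One plausible core for such a script is sketched below, under the assumption that each contract version can be reduced to a column-to-type mapping and that versions follow semantic versioning; classify_change and bump are hypothetical names, not the actual script.

```python
def classify_change(old: dict[str, str], new: dict[str, str]) -> str:
    """Compare two {column: type} mappings and return 'major', 'minor', or 'none'."""
    removed = old.keys() - new.keys()
    retyped = {c for c in old.keys() & new.keys() if old[c] != new[c]}
    added = new.keys() - old.keys()
    if removed or retyped:
        return "major"  # consumers may break: explicit approval required
    if added:
        return "minor"  # additive change: backward compatible
    return "none"


def bump(version: str, change: str) -> str:
    """Return the next semantic version for the given change class."""
    major, minor, _patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    return version


v2_0 = {"customer_sk": "bigint", "customer_id": "string", "email": "string"}
v2_1 = {**v2_0, "lifetime_value": "decimal(18,2)"}
without_email = {k: v for k, v in v2_0.items() if k != "email"}

print(bump("2.0.0", classify_change(v2_0, v2_1)))           # additive change
print(bump("2.1.0", classify_change(v2_1, without_email)))  # column removed
```

Adding lifetime_value is classified as a minor change, which matches the 2.0.0 to 2.1.0 entry in the contract's change log; dropping a column is classified as major and would be the case that the approval gate blocks.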
Quality: Soda
Setup
pip install soda-core-trino soda-core-spark
Checks (soda_checks/customer_dim.yml)
checks for gold.dim_customer:
  - row_count > 0
  - missing_count(email) = 0
  - duplicate_count(customer_id) = 0:
      filter: is_current = TRUE
  - freshness(valid_from) < 1h
  - schema:
      name: customer_dim_schema
      fail:
        when required column missing: [customer_id, email, segment]
        when forbidden column present: [ssn, credit_card]
  - invalid_count(segment) = 0:
      valid values: [SMB, MidMarket, Enterprise]
  - invalid_count(lifetime_value) = 0:
      valid min: 0
  - anomaly score for row_count < default
Execution
soda scan -d trino -c soda/configuration.yml soda_checks/
Integration with dbt
# dbt_project.yml
# soda_check is assumed to be a macro defined in this project; dbt does not ship one.
on-run-end:
  - "{{ soda_check('soda_checks/') }}"
Great Expectations (for critical datasets)
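# Assumes Great Expectations 0.17/0.18 (fluent datasource API); GX 1.x renamed these calls.
import great_expectations as gx
from great_expectations.core import ExpectationConfiguration

context = gx.get_context()
datasource = context.sources.add_pandas("orders")
suite = context.add_or_update_expectation_suite("orders_suite")

# order_id must always be present.
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_not_be_null",
    kwargs={"column": "order_id"},
))
# order_id must not repeat.
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_unique",
    kwargs={"column": "order_id"},
))
# amount must be strictly positive. The column-pair expectation compares two
# columns, so a comparison against the constant 0 uses a range check instead.
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_between",
    kwargs={"column": "amount", "min_value": 0, "strict_min": True},
))

# Persist the suite so later validation runs can load it by name.
context.add_or_update_expectation_suite(expectation_suite=suite)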
Observability: the 5 Gartner pillars
- Data quality: freshness, volume, schema, distribution, custom rules.
- Pipelines: runs, failures, duration anomalies.
- Infra/compute: Spark stage failures, OOM, query latencies.
- Cost: $ per query, per dataset, per domain.
- Usage: queries per user, deprecation candidates, hot paths.
Monte Carlo / Soda Cloud / Acceldata
- Detect anomalies without explicit rules (ML baseline).
- Root-cause notebooks.
- Lineage up to 5 levels deep.
- AI agents (2025-2026): Sifflet Sentinel/Sage/Forge, Acceldata AI agents, Bigeye AI Trust Platform.
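The idea of detecting anomalies without explicit rules can be illustrated with a simple statistical baseline. The sketch below uses a z-score over recent daily row counts as a stand-in; it is not how Monte Carlo, Soda Cloud, or Acceldata implement detection, and the function name, threshold, and sample data are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's value when it lies more than `threshold` standard
    deviations away from the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs >= 2 points
    if sigma == 0:
        return today != mu  # flat history: any deviation is suspicious
    return abs(today - mu) / sigma > threshold


# Hypothetical row counts for the last seven daily loads.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

print(is_anomalous(daily_row_counts, 1008))  # within normal variation
print(is_anomalous(daily_row_counts, 400))   # sharp volume drop
```

No threshold on the row count itself is ever written down: the baseline is learned from history, which is the property the bullet above refers to. Production tools typically also account for seasonality and trend, which this sketch ignores.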
Lineage with OpenLineage
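# Spark + OpenLineage
# Spark jobs need no client code: registering
# io.openlineage.spark.agent.OpenLineageSparkListener in spark.extraListeners
# emits lineage events automatically.
# The Python client below is for custom events from non-Spark code.
from openlineage.client import OpenLineageClient
client = OpenLineageClient.from_environment()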
Privacy and compliance
Automatic PII tagging
- DataHub plus a Spark scanner detects patterns (email regex, SSN, etc.).
- Tags applied in CI propagate to Snowflake/UC/BigQuery masking policies.
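Pattern-based detection of this kind can be sketched with regular expressions over sampled column values. The example below is a simplified illustration, not the DataHub or Spark scanner: the patterns are deliberately loose, the tag name PII.ssn and the 80% match ratio are assumptions, and suggest_tags is a hypothetical helper.

```python
import re

# Deliberately simple patterns; real scanners use stricter validation.
PII_PATTERNS = {
    "PII.email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "PII.ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),  # tag name is an assumption
}


def suggest_tags(samples: list[str], min_ratio: float = 0.8) -> list[str]:
    """Suggest PII tags for a column from a sample of its values.

    A tag is suggested when at least `min_ratio` of the non-empty
    samples match the corresponding pattern.
    """
    values = [s for s in samples if s]
    if not values:
        return []
    tags = [
        tag
        for tag, pattern in PII_PATTERNS.items()
        if sum(bool(pattern.match(v)) for v in values) / len(values) >= min_ratio
    ]
    # The generic PII tag accompanies any specific match.
    return ["PII"] + tags if tags else []


print(suggest_tags(["ana@example.com", "bo@example.org", ""]))
print(suggest_tags(["123-45-6789"]))
print(suggest_tags(["SMB", "Enterprise"]))
```

Working on a ratio of sampled values rather than a single match keeps one stray email address in a free-text column from tagging the whole column.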
Snowflake dynamic masking
CREATE OR REPLACE MASKING POLICY pii_email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('ANALYTICS_PII', 'COMPLIANCE') THEN val
    ELSE SHA2(val, 256)
  END;
ALTER TABLE gold.dim_customer
  MODIFY COLUMN email SET MASKING POLICY pii_email_mask;
Databricks UC row filter + column mask
CREATE FUNCTION mask_email(email STRING) RETURNS STRING
  RETURN CASE
    WHEN is_account_group_member('compliance') THEN email
    ELSE sha2(email, 256)
  END;
ALTER TABLE gold.dim_customer
  ALTER COLUMN email SET MASK mask_email;
Retention
-- Expire old Iceberg snapshots
CALL system.expire_snapshots(
table => 'silver.events',
older_than => current_timestamp() - INTERVAL '365' DAY,
retain_last => 30
);
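The interaction between older_than and retain_last in the call above can be easy to misread: a snapshot is only expired when it is both older than the cutoff and outside the most recent retain_last snapshots. The sketch below simulates that selection in plain Python for a linear snapshot history; it is an illustration of the rule, not Iceberg code, and snapshots_to_expire is a hypothetical helper.

```python
from datetime import datetime, timedelta


def snapshots_to_expire(
    snapshot_times: list[datetime],
    now: datetime,
    max_age_days: int = 365,
    retain_last: int = 30,
) -> list[datetime]:
    """Return the snapshots that are older than the cutoff AND not among
    the `retain_last` most recent ones (assumes a linear history)."""
    cutoff = now - timedelta(days=max_age_days)
    newest_first = sorted(snapshot_times, reverse=True)
    protected = set(newest_first[:retain_last])
    return sorted(ts for ts in snapshot_times if ts < cutoff and ts not in protected)


now = datetime(2026, 1, 1)
# Hypothetical table with one snapshot every 100 days: ages 0, 100, ..., 500 days.
times = [now - timedelta(days=100 * k) for k in range(6)]

# Two snapshots are older than 365 days, but retain_last=5 protects one of them.
print(snapshots_to_expire(times, now, max_age_days=365, retain_last=5))
# With retain_last=2, both old snapshots are expired.
print(snapshots_to_expire(times, now, max_age_days=365, retain_last=2))
```

For a table that is written rarely, retain_last can therefore keep snapshots well past the nominal 365-day window, which matters when the retention period is a compliance requirement.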
GDPR right-to-be-forgotten
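-- With merge-on-read, an Iceberg DELETE writes delete files (deletion vectors in
-- format v3) instead of rewriting data files.
DELETE FROM silver.users WHERE user_id = '<DSR-001>';
-- The DELETE creates a new snapshot, so rollback is possible. Older snapshots still
-- reference the deleted rows until they are expired and the data files are compacted.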
Governance metrics
- % of Gold tables with an assigned owner: 100% (Phase 2 gate).
- % of Gold tables with a versioned contract: 100% (Phase 2 gate).
- MTTD for quality incidents: <15 min.
- MTTR for P0: <2 h.
- Catalog adoption (weekly active users): >70% of analysts.
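The MTTD and MTTR targets above can be computed from incident timestamps. The following is a minimal sketch, assuming each incident records when it occurred, was detected, and was resolved; the record layout and sample incidents are hypothetical, and MTTR is measured here from detection to resolution, which is one of several common definitions.

```python
from datetime import datetime, timedelta


def mean_delta(pairs: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (end - start) over all pairs."""
    deltas = [end - start for start, end in pairs]
    return sum(deltas, timedelta()) / len(deltas)


# Hypothetical incident log for one day.
day = datetime(2026, 3, 2)
incidents = [
    {"tier": "P0", "occurred": day.replace(hour=9),
     "detected": day.replace(hour=9, minute=10),
     "resolved": day.replace(hour=10, minute=40)},
    {"tier": "P1", "occurred": day.replace(hour=14),
     "detected": day.replace(hour=14, minute=16),
     "resolved": day.replace(hour=15)},
    {"tier": "P0", "occurred": day.replace(hour=20),
     "detected": day.replace(hour=20, minute=4),
     "resolved": day.replace(hour=22, minute=24)},
]

# MTTD: occurrence to detection, across all incidents.
mttd = mean_delta([(i["occurred"], i["detected"]) for i in incidents])
# MTTR for P0: detection to resolution, P0 incidents only.
mttr_p0 = mean_delta(
    [(i["detected"], i["resolved"]) for i in incidents if i["tier"] == "P0"]
)

print("MTTD:", mttd, "target met:", mttd < timedelta(minutes=15))
print("MTTR P0:", mttr_p0, "target met:", mttr_p0 < timedelta(hours=2))
```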