Guide 06: Observability

5 pillars (Gartner 2025)

Data quality: freshness, volume, schema, distribution, custom rules.
Pipelines: runs, failures, duration anomalies, dependency lag.
Infra / compute: Spark stage retries, OOM, executor lost, query latencies.
Cost: $ per query, dataset, domain, team.
Usage: queries per user, hot datasets, deprecation candidates.

Tool selection

Need	OSS	Enterprise
Quality checks per dataset	Soda Core, Great Expectations, dbt tests	Monte Carlo, Bigeye, Anomalo
Pipeline monitoring	Airflow UI + Elementary	Acceldata, Sifflet
Lineage	OpenLineage + Marquez	Atlan, Collibra (lineage included)
Anomaly detection	Soda + Elementary metrics	Monte Carlo Foundations
Cost observability	OpenCost, Vantage OSS	Vantage, Apptio, CloudHealth
Agentic remediation	(build custom)	Sifflet Sentinel, Acceldata Agents, Bigeye AI Trust

Soda + Elementary (OSS stack)

Soda Cloud (free tier + paid)

soda scan -d trino \
  -c soda/configuration.yml \
  -srf soda/scan_results.json \
  soda_checks/

Elementary (dbt-native)

pip install elementary-data dbt-trino
edr report --profile elementary
edr send-report --slack-token $SLACK_TOKEN --slack-channel-name data-alerts

Metrics and SLOs

SLI	SLO target	Alert threshold
Gold dataset freshness (P95)	<1h	>30min over SLA
Pipeline first-try success rate	>95%	<90% rolling 24h
BI query latency (P95)	<5s	>10s
Incident P0 MTTD	<15min	>30min
Incident P0 MTTR	<2h	>4h
Schema drift events / week	0	>1 (P1)
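The table above can be encoded as data and checked mechanically. The following is a minimal sketch, assuming each SLI arrives as a plain number; the `Slo` class, the metric names, and the three-state result are hypothetical and not part of any tool listed in this guide. It also assumes that ">30min over SLA" means 30 minutes beyond the 1h target, i.e. 90 minutes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Slo:
    name: str
    meets_target: Callable[[float], bool]  # True when the SLO target is met
    should_alert: Callable[[float], bool]  # True when the alert threshold is crossed


# Thresholds copied from the table; the unit is part of each (hypothetical) name.
SLOS = [
    # Assumption: ">30min over SLA" is read as 60 + 30 = 90 minutes.
    Slo("gold_freshness_p95_minutes", lambda v: v < 60, lambda v: v > 90),
    Slo("pipeline_first_try_success_pct", lambda v: v > 95, lambda v: v < 90),
    Slo("bi_query_p95_seconds", lambda v: v < 5, lambda v: v > 10),
    Slo("p0_mttd_minutes", lambda v: v < 15, lambda v: v > 30),
    Slo("p0_mttr_hours", lambda v: v < 2, lambda v: v > 4),
    Slo("schema_drift_events_per_week", lambda v: v == 0, lambda v: v > 1),
]


def evaluate(slo: Slo, value: float) -> str:
    """Classify a measured SLI as 'ok', 'warn' (target missed), or 'alert'."""
    if slo.should_alert(value):
        return "alert"
    if slo.meets_target(value):
        return "ok"
    return "warn"


def evaluate_all(measurements: dict) -> dict:
    """Evaluate every SLO for which a measurement was supplied."""
    return {
        slo.name: evaluate(slo, measurements[slo.name])
        for slo in SLOS
        if slo.name in measurements
    }
```

The 'warn' state covers the gap between the SLO target and the alert threshold, which the table leaves implicit: a first-try success rate of 92% misses the target but does not yet page anyone.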

Grafana dashboards

Dataset health

# Pipeline freshness lag (minutes)
max by (dataset) (
  (time() - dataset_last_update_timestamp{tier="gold"}) / 60
)

# Volume anomaly: hourly write rate deviates more than 50% from its 7-day average
abs(
  rate(rows_written{table="gold.fct_orders"}[1h])
  - avg_over_time(rate(rows_written{table="gold.fct_orders"}[1h])[7d:1h])
) / avg_over_time(rate(rows_written{table="gold.fct_orders"}[1h])[7d:1h]) > 0.5
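The same relative-deviation rule can be reproduced outside Prometheus, for example in a unit test or a backfill check. This is a minimal sketch, assuming the baseline is available as a list of hourly write rates; the function name and the zero-baseline handling are assumptions, not behavior taken from the query above.

```python
from statistics import mean


def is_volume_anomaly(current_rate: float, baseline_rates: list, threshold: float = 0.5) -> bool:
    """Return True when current_rate deviates from the baseline mean by more than threshold."""
    if not baseline_rates:
        raise ValueError("baseline_rates must not be empty")
    baseline = mean(baseline_rates)
    if baseline == 0:
        # Relative deviation is undefined here; treat any non-zero rate as anomalous.
        return current_rate != 0
    return abs(current_rate - baseline) / baseline > threshold
```

With a baseline of 100 rows per hour, a drop to 20 gives a deviation of 0.8 and is flagged, while a drop to 80 gives 0.2 and is not.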

Cost overview

# Cost per domain per day (assumes data_compute_cost_usd_total is a cumulative counter)
sum by (domain) (
  increase(data_compute_cost_usd_total{env="prod"}[1d])
)

Incident management

Severity

P0: business-critical data down, executive dashboards broken, revenue at risk. MTTR <2h.
P1: important data with a broken SLA, BI degraded. MTTR <8h.
P2: analytical dataset with a non-blocking anomaly. MTTR <72h.
P3: exploratory issue with no impact. MTTR <2 weeks.
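The MTTR targets per severity can be kept in one place so that dashboards and reports use the same numbers. A minimal sketch, assuming resolution time is measured as a `timedelta`; the `Severity` enum and `mttr_breached` helper are hypothetical names introduced for illustration.

```python
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


# MTTR targets as listed in the severity definitions.
MTTR_TARGET = {
    Severity.P0: timedelta(hours=2),
    Severity.P1: timedelta(hours=8),
    Severity.P2: timedelta(hours=72),
    Severity.P3: timedelta(weeks=2),
}


def mttr_breached(severity: Severity, time_to_resolve: timedelta) -> bool:
    """True when resolution took at least as long as the target (targets are strict '<')."""
    return time_to_resolve >= MTTR_TARGET[severity]
```

For example, a P0 resolved in 2h 15min breaches its target, while a P1 resolved in 3h does not.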

Workflow (Slack + PagerDuty + Jira)

1. Alert fires (Monte Carlo / Soda / Elementary)
   ↓
2. PagerDuty notifies the on-call engineer if P0/P1
   ↓
3. Channel #data-incidents is created automatically
   ↓
4. Owner diagnoses; AI agent (Phase 5) suggests a root cause
   ↓
5. Fix + verification + ETA communicated every 30 min
   ↓
6. Resolved → postmortem within 5 business days if P0/P1
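Two rules in this workflow are easy to get wrong by hand: who gets paged, and when the postmortem is due. The sketch below encodes both under stated assumptions: business days are Monday to Friday with no holiday calendar, and all function names are hypothetical. It does not call the PagerDuty, Slack, or Jira APIs.

```python
from datetime import date, timedelta
from typing import Optional

# Severities that page the on-call engineer and require a postmortem.
PAGED_SEVERITIES = {"P0", "P1"}


def should_page(severity: str) -> bool:
    """Step 2: only P0 and P1 alerts page the on-call engineer."""
    return severity in PAGED_SEVERITIES


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays, skipping Saturday and Sunday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def postmortem_due(severity: str, resolved_on: date) -> Optional[date]:
    """Step 6: due date 5 business days after resolution, or None for P2/P3."""
    if severity not in PAGED_SEVERITIES:
        return None
    return add_business_days(resolved_on, 5)
```

An incident resolved on a Friday therefore has its postmortem due the following Friday, since the weekend does not count toward the five days.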

Postmortem template

# Incident YYYY-MM-DD-<short>

**Severity:** P0
**Duration:** 2h 15min
**Datasets affected:** gold.fct_orders, gold.dim_customer

## Timeline
- T0 (10:15): Anomaly detected by Monte Carlo (volume drop 80%)
- T+5: On-call paged
- T+15: Root cause identified (Kafka topic re-partitioned, consumer offset reset)
- T+90: Fix deployed (offset adjustment + replay)
- T+135: Data backfilled and validated

## Root cause
[technical detail]

## Impact
- N dashboards affected
- N internal users
- $X estimated business impact

## What went well
- Detection within 5 min
- Clear runbook for offset reset

## What went poorly
- No automated test for partition count changes
- Slack alerts noisy → primary alert missed

## Action items
- [ ] Add automated test for partition consistency (owner: @x)
- [ ] Tune alert thresholds (owner: @y)
- [ ] Document offset reset SOP (owner: @z)

Agentic observability (2026)

Sifflet Sentinel / Acceldata / Bigeye

Rule-free detection (ML baseline).
Automated root-cause analysis based on lineage.
Fix suggestions; some with an auto-PR.
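To make the lineage-based root-cause idea concrete, here is a minimal sketch of one simple heuristic: starting from the alerted dataset, follow only upstream datasets that are also failing, and report the ones with no failing parents. This is an illustration under assumptions, not how any of the vendors above implement it; the function name, the `upstream` mapping, and the `silver.*` and `bronze.*` dataset names are hypothetical.

```python
def root_cause_candidates(alerted: str, upstream: dict, failing: set) -> list:
    """Return failing datasets upstream of `alerted` that have no failing parents.

    upstream maps a dataset name to the list of datasets it reads from.
    failing is the set of datasets with currently failing checks.
    """
    failing = set(failing) | {alerted}  # the alerted dataset is failing by definition
    seen = set()
    stack = [alerted]
    candidates = []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        failing_parents = [p for p in upstream.get(node, []) if p in failing]
        if failing_parents:
            # The problem was inherited; keep walking upstream.
            stack.extend(failing_parents)
        else:
            # No failing parent: the problem most likely starts here.
            candidates.append(node)
    return sorted(candidates)


# Hypothetical lineage for gold.fct_orders.
lineage = {
    "gold.fct_orders": ["silver.orders", "silver.customers"],
    "silver.orders": ["bronze.orders_raw"],
    "silver.customers": ["bronze.customers_raw"],
}
print(root_cause_candidates(
    "gold.fct_orders",
    lineage,
    {"gold.fct_orders", "silver.orders", "bronze.orders_raw"},
))
```

In this example the walk skips the healthy `silver.customers` branch and ends at `bronze.orders_raw`, the furthest upstream dataset that is still failing.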

Custom agent (conceptual example)

# ml-ai/agents/quality_agent.py: see the full code

Observability costs

Tier	Indicative cost	Coverage
OSS (Soda Core + Elementary + Grafana)	infra only ($200-500/month)	up to 200 datasets
Soda Cloud	$1-5K/month	500-2000 datasets
Bigeye / Sifflet	$20-60K/year	1000-5000 datasets
Monte Carlo	$40-150K/year	unlimited + agentic
Acceldata	$50-200K/year	+ infra observability

Anti-patterns

❌ Alerts without an owner → ignored. → Every alert has an owner_team.
❌ Over-sensitive alerts → fatigue → muted. → Recalibrate quarterly.
❌ Observability without SLOs → no criterion for action. → Define SLOs per dataset tier.
❌ No postmortem → repeated incidents. → Blameless postmortem mandatory for P0/P1.
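The first rule (every alert has an owner_team) is simple to enforce in CI. A minimal sketch, assuming alert definitions are loaded as a list of dictionaries with `name` and `owner_team` keys; this structure is hypothetical and would need adapting to the actual alert configuration format in use.

```python
def alerts_missing_owner(alerts: list) -> list:
    """Return the names of alert definitions whose owner_team is missing or blank."""
    return [
        alert.get("name", "<unnamed>")
        for alert in alerts
        if not str(alert.get("owner_team") or "").strip()
    ]


# Hypothetical alert definitions.
alerts = [
    {"name": "gold_fct_orders_freshness", "owner_team": "data-platform"},
    {"name": "gold_fct_orders_volume"},
    {"name": "dim_customer_schema", "owner_team": "  "},
]
print(alerts_missing_owner(alerts))
```

Failing the build when this list is non-empty keeps ownerless alerts from reaching production.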