
Guide 06: Observability

5 pillars (Gartner 2025)

  1. Data quality: freshness, volume, schema, distribution, custom rules.
  2. Pipelines: runs, failures, duration anomalies, dependency lag.
  3. Infra / compute: Spark stage retries, OOM, executor lost, query latencies.
  4. Cost: $ per query, dataset, domain, team.
  5. Usage: queries per user, hot datasets, deprecation candidates.

Tool selection

| Need | OSS | Enterprise |
|---|---|---|
| Quality checks per dataset | Soda Core, Great Expectations, dbt tests | Monte Carlo, Bigeye, Anomalo |
| Pipeline monitoring | Airflow UI + Elementary | Acceldata, Sifflet |
| Lineage | OpenLineage + Marquez | Atlan, Collibra (lineage included) |
| Anomaly detection | Soda + Elementary metrics | Monte Carlo Foundations |
| Cost observability | OpenCost, Vantage OSS | Vantage, Apptio, CloudHealth |
| Agentic remediation | (build custom) | Sifflet Sentinel, Acceldata Agents, Bigeye AI Trust |

Soda + Elementary (OSS stack)

Soda Cloud (free tier + paid)

soda scan -d trino \
  -c soda/configuration.yml \
  -srf soda/scan_results.json \
  soda_checks/
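When the scan runs from an orchestrator task rather than a terminal, a thin Python wrapper can turn the process exit code into a status the pipeline can branch on. This is a minimal sketch: `build_scan_command` and `run_scan` are hypothetical helper names, and the exit-code meanings (0 pass, 1 warn, 2 fail, 3 runtime error) are an assumption that should be verified against the documentation of the installed Soda version.

```python
import subprocess

# ASSUMPTION: exit-code meanings; confirm against your Soda version's docs.
SODA_EXIT_MEANING = {0: "pass", 1: "warn", 2: "fail", 3: "error"}


def build_scan_command(data_source: str, config: str, results_file: str, checks_path: str) -> list[str]:
    """Assemble the same CLI call shown above as an argument list."""
    return ["soda", "scan", "-d", data_source, "-c", config, "-srf", results_file, checks_path]


def run_scan(command: list[str]) -> str:
    """Run the scan and map its exit code to a coarse status string."""
    completed = subprocess.run(command, check=False)
    return SODA_EXIT_MEANING.get(completed.returncode, "unknown")


if __name__ == "__main__":
    cmd = build_scan_command(
        "trino", "soda/configuration.yml", "soda/scan_results.json", "soda_checks/"
    )
    print(run_scan(cmd))
```

Passing the command as a list avoids shell quoting issues, and `check=False` keeps a failed check from raising before the status is interpreted.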

Elementary (dbt-native)

pip install elementary-data dbt-trino
edr report --profile elementary
edr send-report --slack-token $SLACK_TOKEN --slack-channel-name data-alerts

Metrics and SLOs

| SLI | SLO target | Alert threshold |
|---|---|---|
| Freshness P95, Gold datasets | <1h | >30min over SLA |
| Pipeline success rate, first try | >95% | <90% rolling 24h |
| Query P95, BI | <5s | >10s |
| Incident P0 MTTD | <15min | >30min |
| Incident P0 MTTR | <2h | >4h |
| Schema drift events / week | 0 | >1 (P1) |
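The gap between each SLO target and its alert threshold implies three states: within target, degraded but not yet alerting, and alerting. A minimal sketch of that classification follows. The SLI keys and the `classify` helper are hypothetical, and "30min over SLA" is read here as 60 + 30 = 90 minutes, which is an interpretation rather than something the table states.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Slo:
    description: str
    meets_target: Callable[[float], bool]    # SLI is inside the SLO target
    breaches_alert: Callable[[float], bool]  # SLI is past the alert threshold


# Thresholds taken from the SLO table; key names are illustrative only.
SLOS = {
    "freshness_p95_gold_min": Slo("Freshness P95, Gold (minutes)", lambda v: v < 60, lambda v: v > 90),
    "pipeline_first_try_success_pct": Slo("First-try success rate (%)", lambda v: v > 95, lambda v: v < 90),
    "query_p95_bi_s": Slo("BI query P95 (seconds)", lambda v: v < 5, lambda v: v > 10),
    "incident_p0_mttd_min": Slo("P0 MTTD (minutes)", lambda v: v < 15, lambda v: v > 30),
    "incident_p0_mttr_min": Slo("P0 MTTR (minutes)", lambda v: v < 120, lambda v: v > 240),
    "schema_drift_events_per_week": Slo("Schema drift events per week", lambda v: v == 0, lambda v: v > 1),
}


def classify(sli: str, value: float) -> str:
    """Return 'ok', 'warn' (off target, below alert) or 'alert'."""
    slo = SLOS[sli]
    if slo.breaches_alert(value):
        return "alert"
    if slo.meets_target(value):
        return "ok"
    return "warn"


if __name__ == "__main__":
    for sli, value in [("query_p95_bi_s", 4.2), ("query_p95_bi_s", 7.0), ("freshness_p95_gold_min", 95)]:
        print(sli, value, classify(sli, value))
```

The "warn" band is a design choice of this sketch: it surfaces SLO misses on a dashboard without paging anyone until the alert threshold is crossed.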

Grafana dashboards

Dataset health

# Pipeline freshness lag (minutes)
max by (dataset) (
  (time() - dataset_last_update_timestamp{tier="gold"}) / 60
)

# Volume anomaly
abs(
  rate(rows_written{table="gold.fct_orders"}[1h])
  - avg_over_time(rate(rows_written{table="gold.fct_orders"}[1h])[7d:1h])
) / avg_over_time(rate(rows_written{table="gold.fct_orders"}[1h])[7d:1h]) > 0.5
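The arithmetic behind both queries is small enough to check by hand. The sketch below mirrors it in plain Python, assuming Unix timestamps in seconds and a list of hourly write rates as the 7-day baseline; the function names are hypothetical and not part of any library.

```python
from statistics import fmean


def freshness_lag_minutes(now_ts: float, last_update_ts: float) -> float:
    """Minutes elapsed since the dataset was last updated."""
    return (now_ts - last_update_ts) / 60


def relative_deviation(current_rate: float, baseline_rates: list[float]) -> float:
    """|current - baseline mean| / baseline mean, as in the volume query."""
    if not baseline_rates:
        raise ValueError("baseline_rates must not be empty")
    baseline = fmean(baseline_rates)
    if baseline == 0:
        raise ValueError("baseline mean is zero; relative deviation is undefined")
    return abs(current_rate - baseline) / baseline


def is_volume_anomaly(current_rate: float, baseline_rates: list[float], threshold: float = 0.5) -> bool:
    """Flag the table when the write rate deviates more than `threshold` from baseline."""
    return relative_deviation(current_rate, baseline_rates) > threshold


if __name__ == "__main__":
    baseline = [100.0, 110.0, 90.0, 100.0]  # illustrative hourly rates, mean 100
    print(is_volume_anomaly(40.0, baseline))  # 60% below baseline
    print(is_volume_anomaly(80.0, baseline))  # 20% below baseline
```

Because the deviation is an absolute value, the check fires on both drops and spikes. A zero baseline is rejected explicitly, since the ratio is undefined for a table that normally receives no writes.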

Cost overview

# Cost per domain per day (assumes data_compute_cost_usd_total is a cumulative counter)
sum by (domain) (
  increase(data_compute_cost_usd_total{env="prod"}[1d])
)
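The same roll-up can be produced outside Prometheus, for example from a billing or query-log export. The sketch below groups cost records by any combination of dimensions (domain, team, dataset). The record fields `day`, `domain`, `team` and `cost_usd` are hypothetical and would need to match the real export.

```python
from collections import defaultdict


def cost_by(records: list[dict], *keys: str) -> dict[tuple, float]:
    """Sum cost_usd grouped by the given record fields."""
    totals: dict[tuple, float] = defaultdict(float)
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group] += record["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    # Illustrative records only; not real cost data.
    records = [
        {"day": "2025-01-08", "domain": "sales", "team": "bi", "cost_usd": 1.5},
        {"day": "2025-01-08", "domain": "sales", "team": "ml", "cost_usd": 2.25},
        {"day": "2025-01-08", "domain": "finance", "team": "bi", "cost_usd": 0.75},
    ]
    print(cost_by(records, "domain", "day"))
    print(cost_by(records, "team"))
```

Taking the grouping keys as arguments lets one function serve the per-domain, per-team and per-dataset views listed under the Cost pillar.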

Incident management

Severity

  • P0: business-critical data down, executive dashboards broken, revenue at risk. MTTR <2h.
  • P1: important data with a breached SLA, BI degraded. MTTR <8h.
  • P2: analytical dataset with a non-blocking anomaly. MTTR <72h.
  • P3: exploratory issue with no impact. MTTR <2 weeks.

Workflow (Slack + PagerDuty + Jira)

1. Alert fires (Monte Carlo / Soda / Elementary)
2. PagerDuty notifies the on-call engineer if P0/P1
3. #data-incidents channel is created automatically
4. Owner diagnoses; AI agent (Phase 5) suggests a root cause
5. Fix + verification, with the ETA communicated every 30min
6. Resolved → postmortem within 5 business days if P0/P1
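The routing rules in this workflow (page only for P0/P1, postmortem due within 5 business days) are simple enough to encode and test. The sketch below is one way to do so. The function names are hypothetical, the MTTR targets come from the severity list, and public holidays are deliberately ignored, which a real calendar would need to handle.

```python
from datetime import date, timedelta
from typing import Optional

# MTTR targets in hours, from the severity definitions (P3: 2 weeks).
MTTR_TARGET_HOURS = {"P0": 2, "P1": 8, "P2": 72, "P3": 14 * 24}
PAGED_SEVERITIES = {"P0", "P1"}


def should_page(severity: str) -> bool:
    """PagerDuty notifies on-call only for P0 and P1."""
    return severity in PAGED_SEVERITIES


def add_business_days(start: date, days: int) -> date:
    """Advance by `days` weekdays (Mon-Fri); holidays are not considered."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def postmortem_due(severity: str, resolved_on: date) -> Optional[date]:
    """Deadline for the postmortem, or None when none is required."""
    if severity not in PAGED_SEVERITIES:
        return None
    return add_business_days(resolved_on, 5)


if __name__ == "__main__":
    # 2025-01-08 is a Wednesday, so the deadline lands on the next Wednesday.
    print(postmortem_due("P0", date(2025, 1, 8)))  # 2025-01-15
    print(postmortem_due("P2", date(2025, 1, 8)))  # None
```

Keeping the severity policy in data (`MTTR_TARGET_HOURS`, `PAGED_SEVERITIES`) rather than in branching logic makes it easier to review when the policy changes.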

Postmortem template

# Incident YYYY-MM-DD-<short>

**Severity:** P0
**Duration:** 2h 15min
**Datasets affected:** gold.fct_orders, gold.dim_customer

## Timeline
- T0  10:15: Anomaly detected by Monte Carlo (volume drop 80%)
- T+5: On-call paged
- T+15: Root cause identified (Kafka topic re-partitioned, consumer offset reset)
- T+90: Fix deployed (offset adjustment + replay)
- T+135: Data backfilled and validated

## Root cause
[technical detail]

## Impact
- N dashboards affected
- N internal users
- $X estimated business impact

## What went well
- Detection within 5 min
- Clear runbook for offset reset

## What went poorly
- No automated test for partition count changes
- Slack alerts noisy → primary alert missed

## Action items
- [ ] Add automated test for partition consistency (owner: @x)
- [ ] Tune alert thresholds (owner: @y)
- [ ] Document offset reset SOP (owner: @z)

Agentic observability (2026)

Sifflet Sentinel / Acceldata / Bigeye

  • Rule-free detection (ML baseline).
  • Automated root cause analysis based on lineage.
  • Fix suggestions; some with auto-PR.
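Lineage-based root cause analysis can be approximated without any vendor tooling: starting from the failing dataset, follow upstream edges through datasets that are also anomalous, and report the ones whose own parents look healthy. The sketch below illustrates the idea only and does not reproduce how any of the products above work. The lineage map and the `silver.*` / `bronze.*` dataset names are hypothetical.

```python
from collections import deque


def root_cause_candidates(
    failing: str,
    upstream: dict[str, list[str]],
    anomalous: set[str],
) -> list[str]:
    """Return anomalous datasets upstream of `failing` with no anomalous parent."""
    seen = {failing}
    queue = deque([failing])
    candidates = []
    while queue:
        node = queue.popleft()
        anomalous_parents = [p for p in upstream.get(node, []) if p in anomalous]
        # An anomalous node with only healthy parents is where the problem starts.
        if node in anomalous and not anomalous_parents:
            candidates.append(node)
        for parent in anomalous_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(candidates)


if __name__ == "__main__":
    # Hypothetical lineage: dataset -> list of direct upstream datasets.
    upstream = {
        "gold.fct_orders": ["silver.orders", "silver.customers"],
        "silver.orders": ["bronze.orders_raw"],
        "silver.customers": ["bronze.crm_raw"],
    }
    anomalous = {"gold.fct_orders", "silver.orders", "bronze.orders_raw"}
    print(root_cause_candidates("gold.fct_orders", upstream, anomalous))
    # ['bronze.orders_raw']
```

In practice the `upstream` map could be built from an OpenLineage backend such as Marquez, and the `anomalous` set from the checks already running in Soda or Elementary.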

Custom agent (conceptual example)

# ml-ai/agents/quality_agent.py: see full code

Observability costs

| Tier | Indicative cost | Coverage |
|---|---|---|
| OSS (Soda Core + Elementary + Grafana) | infra only ($200-500/month) | up to 200 datasets |
| Soda Cloud | $1-5K/month | 500-2000 datasets |
| Bigeye / Sifflet | $20-60K/year | 1000-5000 datasets |
| Monte Carlo | $40-150K/year | unlimited + agentic |
| Acceldata | $50-200K/year | + infra observability |

Anti-patterns

  1. ❌ Alerts without an owner → ignored. Fix: every alert carries an owner_team.
  2. ❌ Over-sensitive alerts → fatigue → muted. Fix: recalibrate thresholds quarterly.
  3. ❌ Observability without SLOs → no criterion for what counts as healthy. Fix: define SLOs per dataset tier.
  4. ❌ No postmortem → incidents repeat. Fix: blameless postmortem mandatory for P0/P1.
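The first anti-pattern is easy to enforce mechanically, for instance as a CI check over the alert definitions. The sketch below assumes alerts are available as dictionaries with `name` and `owner_team` keys. That shape is hypothetical and would need adapting to however alerts are actually declared.

```python
def alerts_missing_owner(alerts: list[dict]) -> list[str]:
    """Names of alerts whose owner_team is absent, None or blank."""
    missing = []
    for alert in alerts:
        owner = str(alert.get("owner_team") or "").strip()
        if not owner:
            missing.append(alert.get("name", "<unnamed>"))
    return missing


if __name__ == "__main__":
    # Illustrative alert definitions only.
    alerts = [
        {"name": "fct_orders_freshness", "owner_team": "sales-data"},
        {"name": "dim_customer_volume", "owner_team": ""},
        {"name": "schema_drift_gold"},
    ]
    orphans = alerts_missing_owner(alerts)
    if orphans:
        raise SystemExit(f"Alerts without owner_team: {', '.join(orphans)}")
```

Failing the build on orphaned alerts keeps ownership from eroding as new checks are added.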