Guide 06: Observability
5 pillars (Gartner 2025)
- Data quality: freshness, volume, schema, distribution, custom rules.
- Pipelines: runs, failures, duration anomalies, dependency lag.
- Infra / compute: Spark stage retries, OOM, executor lost, query latencies.
- Cost: $ per query, dataset, domain, team.
- Usage: queries per user, hot datasets, deprecation candidates.
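As a rough illustration of the data quality pillar, the sketch below computes a freshness lag and a schema diff in plain Python. The function names and the column-to-type dictionaries are hypothetical and are not tied to any tool named in this guide.

```python
from datetime import datetime, timedelta, timezone


def freshness_lag_minutes(last_update: datetime, now: datetime) -> float:
    """Minutes elapsed since the dataset was last updated."""
    return (now - last_update).total_seconds() / 60


def schema_drift(expected: dict, observed: dict) -> dict:
    """Compare column -> type mappings; report added, removed, and retyped columns."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    retyped = sorted(
        col for col in set(expected) & set(observed) if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}
```

Any non-empty list in the result of `schema_drift` would count as one schema drift event under this sketch.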
Tool selection
| Need | OSS | Enterprise |
|---|---|---|
| Quality checks per dataset | Soda Core, Great Expectations, dbt tests | Monte Carlo, Bigeye, Anomalo |
| Pipeline monitoring | Airflow UI + Elementary | Acceldata, Sifflet |
| Lineage | OpenLineage + Marquez | Atlan, Collibra (lineage included) |
| Anomaly detection | Soda + Elementary metrics | Monte Carlo Foundations |
| Cost observability | OpenCost, Vantage OSS | Vantage, Apptio, CloudHealth |
| Agentic remediation | (build custom) | Sifflet Sentinel, Acceldata Agents, Bigeye AI Trust |
Soda + Elementary (OSS stack)
Soda Cloud (free tier + paid)
soda scan -d trino \
  -c soda/configuration.yml \
  -srf soda/scan_results.json \
  soda_checks/
Elementary (dbt-native)
pip install elementary-data dbt-trino
edr report --profile elementary
edr send-report --slack-token $SLACK_TOKEN --slack-channel-name data-alerts
Metrics and SLOs
| SLI | SLO target | Alert threshold |
|---|---|---|
| Freshness P95, Gold datasets | <1h | >30min over SLA |
| Pipeline success rate, first try | >95% | <90% rolling 24h |
| Query P95, BI | <5s | >10s |
| Incident P0 MTTD | <15min | >30min |
| Incident P0 MTTR | <2h | >4h |
| Schema drift events / week | 0 | >1 (P1) |
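The alert thresholds above can be checked mechanically. The following is a minimal sketch, assuming SLI samples are already collected as plain Python lists; the function names are hypothetical, the nearest-rank percentile is one of several possible definitions, and ">30min over SLA" is read here as a P95 lag above 60 + 30 minutes.

```python
def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # ceil(q * n / 100) in integer arithmetic
    return ordered[rank - 1]


def first_try_success_rate(runs):
    """Share of runs that succeeded without a retry; each run is (succeeded, attempts)."""
    if not runs:
        raise ValueError("first_try_success_rate() needs at least one run")
    clean = sum(1 for succeeded, attempts in runs if succeeded and attempts == 1)
    return clean / len(runs)


def breached_alerts(freshness_lags_min, runs_last_24h, bi_query_seconds):
    """Return the names of the alert thresholds that are currently breached."""
    alerts = []
    if percentile(freshness_lags_min, 95) > 60 + 30:  # assumption: 1h SLA + 30 min
        alerts.append("freshness")
    if first_try_success_rate(runs_last_24h) < 0.90:
        alerts.append("pipeline_success")
    if percentile(bi_query_seconds, 95) > 10:
        alerts.append("bi_query_latency")
    return alerts
```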
Grafana dashboards
Dataset health
# Pipeline freshness lag (minutes)
max by (dataset) (
  (time() - dataset_last_update_timestamp{tier="gold"}) / 60
)

# Volume anomaly: hourly write rate deviates more than 50% from its 7-day average
abs(
  rate(rows_written{table="gold.fct_orders"}[1h])
  - avg_over_time(rate(rows_written{table="gold.fct_orders"}[1h])[7d:1h])
) / avg_over_time(rate(rows_written{table="gold.fct_orders"}[1h])[7d:1h]) > 0.5
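The volume anomaly query compares the current write rate with its 7-day average and fires when the relative deviation exceeds 0.5. The same arithmetic, sketched in Python with hypothetical function names and made-up sample values:

```python
def relative_deviation(current, baseline_samples):
    """|current - mean(baseline)| / mean(baseline); infinite if the baseline mean is 0."""
    if not baseline_samples:
        raise ValueError("baseline_samples must not be empty")
    baseline = sum(baseline_samples) / len(baseline_samples)
    if baseline == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / baseline


def is_volume_anomaly(current, baseline_samples, threshold=0.5):
    """True when the current rate deviates from the baseline by more than `threshold`."""
    return relative_deviation(current, baseline_samples) > threshold
```

With a baseline of 100 rows/s, a drop to 40 rows/s gives a deviation of 0.6 and is flagged, while a drop to 80 rows/s (0.2) is not.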
Cost overview
# Cost per domain per day
# (assumes data_compute_cost_usd_total is a cumulative counter)
sum by (domain) (
  increase(data_compute_cost_usd_total{env="prod"}[1d])
)
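Outside Prometheus, the same per-domain, per-day breakdown can be computed from query-level cost records. A minimal sketch, assuming each record is a dict with hypothetical `day`, `domain`, and `cost_usd` keys:

```python
from collections import defaultdict


def cost_per_domain_per_day(records):
    """Sum cost_usd over records, keyed by (day, domain)."""
    totals = defaultdict(float)
    for record in records:
        totals[(record["day"], record["domain"])] += record["cost_usd"]
    return dict(totals)
```

Swapping `domain` for a `team` or `dataset` key would give the other cost breakdowns with the same function shape.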
Incident management
Severity
- P0: business-critical data down, executive dashboards broken, revenue at risk. MTTR <2h.
- P1: important data with a breached SLA, BI degraded. MTTR <8h.
- P2: analytical dataset with a non-blocking anomaly. MTTR <72h.
- P3: exploratory issue with no impact. MTTR <2 weeks.
1. Alert fires (Monte Carlo / Soda / Elementary)
↓
2. PagerDuty notifies on-call if P0/P1
↓
3. Channel #data-incidents is created automatically
↓
4. Owner diagnoses; AI agent (Phase 5) suggests root cause
↓
5. Fix + verification; ETA communicated every 30min
↓
6. Resolved → postmortem within 5 business days if P0/P1
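The severity rules and the flow above can be encoded as a small routing helper. This is only a sketch: the names are hypothetical, "business days" is taken to mean Monday to Friday, and public holidays are ignored.

```python
from datetime import date, timedelta

# MTTR targets in hours, taken from the severity list (P3: 2 weeks).
MTTR_TARGET_HOURS = {"P0": 2, "P1": 8, "P2": 72, "P3": 14 * 24}
PAGED_SEVERITIES = {"P0", "P1"}


def pages_on_call(severity):
    """PagerDuty notifies on-call only for P0 and P1."""
    return severity in PAGED_SEVERITIES


def postmortem_due(resolved_on, severity, business_days=5):
    """Due date for the postmortem, or None when no postmortem is required."""
    if severity not in PAGED_SEVERITIES:
        return None
    day = resolved_on
    remaining = business_days
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return day
```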
Postmortem template
# Incident YYYY-MM-DD-<short>
**Severity:** P0
**Duration:** 2h 15min
**Datasets affected:** gold.fct_orders, gold.dim_customer
## Timeline
- T0 10:15: Anomaly detected by Monte Carlo (volume drop 80%)
- T+5: On-call paged
- T+15: Root cause identified (Kafka topic re-partitioned, consumer offset reset)
- T+90: Fix deployed (offset adjustment + replay)
- T+135: Data backfilled and validated
## Root cause
[technical detail]
## Impact
- N dashboards affected
- N internal users
- $X estimated business impact
## What went well
- Detection within 5 min
- Clear runbook for offset reset
## What went poorly
- No automated test for partition count changes
- Slack alerts noisy → primary alert missed
## Action items
- [ ] Add automated test for partition consistency (owner: @x)
- [ ] Tune alert thresholds (owner: @y)
- [ ] Document offset reset SOP (owner: @z)
Agentic observability (2026)
Sifflet Sentinel / Acceldata / Bigeye
- Rule-free detection (ML baseline).
- Automated root cause using lineage analysis.
- Fix suggestions; some with auto-PR.
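To make "rule-free detection" concrete, here is a deliberately simple statistical stand-in: a robust z-score based on the median and the median absolute deviation (MAD) of a metric's own history. It is not what any of the vendors above implement; the function names and the 3.5 cutoff are assumptions chosen for illustration.

```python
import statistics


def robust_z_score(history, value):
    """Distance of `value` from the historical median, in scaled MAD units."""
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    if mad == 0:
        # Flat history: any change at all is infinitely far from the baseline.
        if value == med:
            return 0.0
        return float("inf") if value > med else float("-inf")
    return 0.6745 * (value - med) / mad


def is_anomalous(history, value, cutoff=3.5):
    """Flag values whose robust z-score exceeds the cutoff in absolute value."""
    return abs(robust_z_score(history, value)) > cutoff
```

Because the threshold is derived from the metric's own history, no per-dataset rule has to be written by hand.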
Custom agent (conceptual example)
# ml-ai/agents/quality_agent.py (see full code)
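Independently of the referenced file, the lineage-based root cause step can be sketched as a graph walk: starting from the dataset that alerted, follow failing upstream parents and keep the failing nodes that have no failing parent of their own. The `upstream` mapping, the dataset names in the usage note, and the function name are hypothetical; this is not the content of `quality_agent.py`.

```python
def root_cause_candidates(upstream, failing, start):
    """Walk upstream from `start` through failing datasets only.

    upstream: dict mapping a dataset to the list of its direct parents.
    failing:  set of datasets that currently have an open anomaly.
    Returns the sorted failing datasets reached that have no failing parent.
    """
    seen = set()
    stack = [start]
    candidates = []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        failing_parents = [p for p in upstream.get(node, []) if p in failing]
        if node in failing and not failing_parents:
            candidates.append(node)
        stack.extend(failing_parents)
    return sorted(candidates)
```

For a chain `bronze.orders_raw` → `silver.orders` → `gold.fct_orders` where all three are failing, the walk from `gold.fct_orders` returns only `bronze.orders_raw`. The walk stops at healthy parents, which is a design choice: it assumes an anomaly propagates through an unbroken chain of failing datasets.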
Observability costs
| Tier | Indicative cost | Coverage |
|---|---|---|
| OSS (Soda Core + Elementary + Grafana) | infra only ($200-500/month) | up to 200 datasets |
| Soda Cloud | $1-5K/month | 500-2000 datasets |
| Bigeye / Sifflet | $20-60K/year | 1000-5000 datasets |
| Monte Carlo | $40-150K/year | unlimited + agentic |
| Acceldata | $50-200K/year | + infra observability |
Anti-patterns
- ❌ Alerts without an owner → ignored. → Every alert has an owner_team.
- ❌ Over-sensitive alerts → fatigue → muted. → Recalibrate quarterly.
- ❌ Observability without SLOs → no decision criteria. → Define SLOs per dataset tier.
- ❌ No postmortem → repeated incidents. → Blameless postmortem mandatory for P0/P1.
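The first rule lends itself to an automated check, for example in CI. A minimal sketch, assuming alert definitions are loaded as dicts with a `name` and an optional `owner_team` field (the dict layout and the function name are hypothetical):

```python
def alerts_missing_owner(alerts):
    """Return the names of alert definitions with no usable owner_team."""
    return [
        alert["name"]
        for alert in alerts
        if not str(alert.get("owner_team") or "").strip()
    ]
```

A CI job could fail the build whenever this list is non-empty, so an ownerless alert never reaches production.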