Trust

Logging & Monitoring Policy

Last reviewed May 24, 2026.

This policy describes what Bella logs, how long logs are retained, what's monitored, and how alerting is wired so that anomalies, failures, and security events are detected promptly.

1. Logging coverage

Application audit log — append-only record of every authentication event, financial mutation, integration connect / disconnect, kill-switch toggle, and offboarding action. Each entry includes the actor identity, the tenant context, the action, the outcome, and a correlation ID. Updates and deletes are blocked at the database trigger layer.
Provenance log — append-only timeline of every state change a customer record passes through (proposed, classified, approved, posted, rejected). Tied to the underlying audit log and staging queue rows.
Authentication log — every login attempt, MFA challenge, password change, and session issuance / revocation event.
Application logs — structured logs from each service include the request method and path, the response status, the timing, the actor identity (where available), and a correlation ID. Sensitive fields are redacted before logging.
Request logs — web-server access logs capture the public request envelope (method, path, status, timing, network attribution).
Error reports — uncaught exceptions and rejected promises are captured with stack traces, request context, and the originating tenant when known.
Cron and worker logs — every scheduled job records start time, duration, items processed, and the outcome.
Edge logs — Cloudflare logs capture WAF action, rate-limit triggers, and threat-intel classifications.

2. Retention

Application logs: 30 days hot, 90 days cold archive, then deleted.
Request logs: 30 days, then deleted.
Error reports and crash dumps: 90 days, then deleted.
Audit log and provenance log: 7 years per the Retention Policy.
During an active incident, retention is extended for any logs relevant to the incident until resolution.

3. Monitoring and alerting

Service health — supervised process manager restarts failed services automatically and emits an event on each restart. Repeated restarts within a short window escalate to an operator alert.
Error volume — a sudden spike in 5xx responses, uncaught exceptions, or third-party integration failures triggers an operator alert.
Authentication anomalies — concentrated failed-authentication patterns trigger network-layer rate limiting and an operator alert.
Integration health — each third-party integration tracks its own health state (OAuth refresh status, webhook latency, sync cursor freshness). Stale or failed integrations surface on the admin command center.
Cron failures — missed or failed cron jobs raise an operator alert; the next scheduled run picks up automatically.
Data anomalies — tenant-realm invariant violations, staging-queue rows that grow stale past the SLA, and offboarding-export failures all generate dedicated alerts.

4. Privacy considerations

Restricted-classification data (OAuth tokens, Plaid access tokens, payment credentials) is never logged.
Confidential data is redacted to identifiers (tenant_id, user_id) in application logs; the underlying detail is reachable only through the audit log under the access control rules described in the Access Control Policy.
Log access is restricted to engineers with a job-required reason. Log access is itself logged.

5. Review

Log coverage is reviewed at least quarterly. Gaps identified during incident reviews or vulnerability triage trigger immediate review.
Alert thresholds are tuned on the basis of incident postmortems — an incident that should have alerted but did not raises the priority of fixing that detection.
This policy is reviewed at least once per calendar year and on any material change to monitoring infrastructure.