ADR-002: Production Readiness is Continuous, Not a Milestone
Date: 2026-03-29 Status: Active Decision: Resilience, observability, security, and deployment readiness are standing requirements across all milestones, not a separate phase.
Context
The original roadmap had "Production Hardening" as Milestone 4 -- a final phase where retry policies, health checks, monitoring, and containerization would be added after all features were built.
This is a waterfall pattern. It assumes you can build features first and make them production-grade later. In practice, this leads to:
- Features shipped without error handling because "we'll add it in hardening"
- Observability bolted on after the fact, missing context that should have been there from the start
- A growing backlog of "hardening" work that never feels urgent until something breaks
A dev spike audit of the current codebase identified 25 production readiness gaps, ranging from critical (race condition in file writes, no API exception handling) to nice-to-have (metrics, rate limiting). These gaps exist because production concerns were deferred to a future milestone.
Decision
Production readiness is a standing requirement, not a milestone. Every milestone must ship with:
- Error handling -- failures are recovered gracefully, logged, and don't crash the pipeline
- Input validation -- external inputs are validated at system boundaries
- Timeouts -- external calls have configured timeouts
- Health checks -- runnable services expose health endpoints
- Configuration -- settings are overridable via environment variables
- Testability -- new code ships with tests, coverage doesn't decline
The following are addressed as needed rather than in a dedicated phase:
- Retry policies -- added when an adapter calls an external service
- Correlation IDs -- added when distributed tracing becomes necessary
- Containerization -- added when deployment to a container host is needed
- Metrics/observability -- added when the pipeline runs long enough to need monitoring
- Rate limiting -- added when the API is exposed beyond localhost
Current Gaps (from dev spike)
The audit identified these as the highest priority fixes to fold into the next milestone:
Critical:
- Race condition in JsonOutputWriter (concurrent writes can lose data)
- No global exception handler in API
- No HttpClient timeout configuration
- API reads files synchronously (thread pool starvation under load)
- API deserializes all records as FeedItem (breaks for 834/Zotero)
Important:
- No retry policies on external HTTP calls
- No health check endpoints
- Configuration not environment-variable driven
- Worker schedule hardcoded to 5 minutes
These will be addressed as part of the Data Transformation milestone or as standalone fixes, not as a separate "hardening" phase.
Consequences
- The "Production Hardening" milestone is removed from the roadmap
- Each milestone's PRD must include production readiness in its success criteria
- The dev spike gaps become a backlog of fixes to address incrementally
- Code review should catch missing error handling, validation, and timeouts before merge