The Challenge
Northwind Finance was growing fast, but its settlement infrastructure was not. Every merchant payout ran through a nightly batch job that took 14–18 hours to complete. As the platform scaled toward 500 merchants, the complaints multiplied: merchants couldn't reconcile same-day transactions, support tickets for "missing funds" consumed the operations team, and two enterprise clients had already left because Northwind couldn't offer the instant payout terms that its competitors could. The deeper problem was architectural. The existing system had been built as a monolith in 2017, with settlement logic tightly coupled to a single PostgreSQL database that was already showing signs of strain at current volume. Any attempt to add real-time capability to the existing system risked destabilising the batch process that the business depended on.
Our Approach
We proposed building a parallel event-driven settlement layer alongside the existing system. Rather than replacing it in a big-bang rewrite, we would introduce a new path for real-time payouts that the existing batch system could eventually be migrated onto, module by module. The architecture centred on an idempotent ledger service that processed payment events from an AWS SQS queue. Each event was assigned a deterministic idempotency key derived from the payment ID and timestamp, so that duplicate events (inevitable in any distributed system) were recognised and applied only once instead of resulting in double payouts. A webhook fan-out service then notified merchants in real time, with exponential-backoff retries and dead-letter queues for failed deliveries.
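To make the idempotency idea concrete, here is a minimal sketch and not the production service. It assumes the timestamp in the key is the one carried by the payment event itself (a receive-time timestamp would differ between redeliveries and defeat deduplication), and it uses an in-memory SQLite table as a stand-in for the real ledger store. The function names, the table layout and the event fields (`payment_id`, `timestamp`, `merchant_id`, `amount_minor`) are all hypothetical.

```python
import hashlib
import sqlite3


def idempotency_key(payment_id: str, event_timestamp: str) -> str:
    """Derive a deterministic key: the same inputs always yield the same key."""
    raw = f"{payment_id}|{event_timestamp}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def open_ledger() -> sqlite3.Connection:
    """Create an in-memory ledger (assumption: stand-in for the real database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ledger ("
        " idempotency_key TEXT PRIMARY KEY,"
        " merchant_id TEXT NOT NULL,"
        " amount_minor INTEGER NOT NULL)"
    )
    return conn


def apply_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Record a payout once. Returns True if applied, False if it was a duplicate."""
    key = idempotency_key(event["payment_id"], event["timestamp"])
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO ledger VALUES (?, ?, ?)",
            (key, event["merchant_id"], event["amount_minor"]),
        )
    # The primary-key constraint makes a redelivered event a no-op.
    return cur.rowcount == 1


if __name__ == "__main__":
    ledger = open_ledger()
    event = {
        "payment_id": "pay_123",
        "timestamp": "2024-01-01T00:00:00Z",
        "merchant_id": "m_1",
        "amount_minor": 2500,
    }
    print(apply_event(ledger, event))  # first delivery
    print(apply_event(ledger, event))  # redelivery of the same event
```

Pushing the uniqueness check into the storage layer, rather than checking and then writing in application code, is one way to keep the guarantee intact when several consumers handle the same event at once.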
How We Built It
Discovery & threat modelling (weeks 1–2)
Conducted a two-week deep dive into the existing settlement codebase, cataloguing its failure modes and race conditions and the exact database tables that would need to be mirrored in the new system. We produced a threat model for the payment flow before writing a line of new code.
Idempotent ledger service (weeks 3–6)
Built and load-tested the core ledger service in isolation — 10,000 concurrent events, deliberate duplicate injection, and chaos engineering to validate failure recovery. Zero data inconsistencies observed across all test scenarios.
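A duplicate-injection test of this kind can be sketched roughly as follows. This is an illustrative harness, not the one used on the project: the `InMemoryLedger` class, the event shape and the default parameters (1,000 events, 30% duplicates, 16 worker threads) are assumptions chosen to keep the example small and self-contained.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryLedger:
    """Toy ledger (hypothetical) that applies each idempotency key at most once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()
        self.balances = {}

    def apply(self, key, merchant_id, amount_minor):
        # Check-and-record happens under one lock so concurrent duplicates cannot race.
        with self._lock:
            if key in self._seen:
                return False
            self._seen.add(key)
            self.balances[merchant_id] = self.balances.get(merchant_id, 0) + amount_minor
            return True


def run_duplicate_injection(n_events=1000, duplicate_rate=0.3, workers=16, seed=7):
    """Replay events plus injected duplicates concurrently and compare totals."""
    rng = random.Random(seed)
    events = [(f"evt-{i}", f"m-{i % 10}", 100 + i) for i in range(n_events)]
    duplicates = [e for e in events if rng.random() < duplicate_rate]
    stream = events + duplicates
    rng.shuffle(stream)  # duplicates arrive out of order, as they would from a queue

    ledger = InMemoryLedger()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda e: ledger.apply(*e), stream))

    return {
        "delivered": len(stream),
        "applied": sum(results),
        "expected_total": sum(amount for _, _, amount in events),
        "actual_total": sum(ledger.balances.values()),
    }


if __name__ == "__main__":
    report = run_duplicate_injection()
    # A consistent ledger applies each unique event once, so the totals match.
    print(report)
```

The check that matters is that `actual_total` equals `expected_total` however many duplicates were delivered or in what order.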
Webhook fan-out and merchant portal (weeks 7–11)
Implemented the webhook delivery system with per-merchant configurable retry policies, delivery receipts and a self-service portal where merchants could view real-time payout status and re-trigger failed webhooks without raising a support ticket.
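The retry behaviour can be illustrated with a short sketch. It is a simplified model under stated assumptions: `RetryPolicy`, `backoff_delay` and `deliver` are hypothetical names, the `send` callable stands in for the real HTTP call to a merchant endpoint, and a plain list stands in for the dead-letter queue. The policy values shown are examples, not the settings used in production.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RetryPolicy:
    """Per-merchant retry settings (illustrative defaults)."""
    max_attempts: int = 5
    base_delay_s: float = 1.0
    max_delay_s: float = 60.0


def backoff_delay(policy: RetryPolicy, attempt: int) -> float:
    """Delay after failed attempt number `attempt` (1-based): base * 2^(attempt-1), capped."""
    return min(policy.max_delay_s, policy.base_delay_s * 2 ** (attempt - 1))


def deliver(
    payload: dict,
    send: Callable[[dict], bool],
    policy: RetryPolicy,
    dead_letters: List[dict],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver a webhook; park it in `dead_letters` once attempts are exhausted."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            if send(payload):
                return True
        except Exception:
            pass  # a transport error is treated the same as a failed delivery
        if attempt < policy.max_attempts:
            sleep(backoff_delay(policy, attempt))
    dead_letters.append(payload)
    return False


if __name__ == "__main__":
    waits, dlq = [], []
    policy = RetryPolicy(max_attempts=4, base_delay_s=1.0, max_delay_s=3.0)
    # An endpoint that always fails; record the waits instead of actually sleeping.
    deliver({"payout_id": "po_1"}, lambda p: False, policy, dlq, sleep=waits.append)
    print(waits, dlq)
```

Passing `sleep` and `send` in as parameters keeps the retry logic testable without a network or real delays. A payload left in the dead-letter list is the kind of item a merchant could re-trigger from the portal.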
Parallel running and cutover (weeks 12–14)
Ran both systems in parallel for two weeks, comparing outputs on every transaction. After seeing zero discrepancies across 1.2 million test events, we cut over live traffic incrementally (5%, then 25%, then 100%), with instant rollback available at each stage.
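Both halves of this phase, output comparison and staged cutover, can be sketched in a few lines. This is an illustration only. It assumes traffic was split per merchant using a stable hash, which the engagement may or may not have done, and the names `bucket`, `use_new_path` and `find_discrepancies` are hypothetical.

```python
import hashlib

# Rollout stages from the cutover plan: 5%, then 25%, then 100%.
ROLLOUT_STAGES = (5, 25, 100)


def bucket(merchant_id: str) -> int:
    """Map a merchant to a stable bucket in [0, 100) using a hash of its ID."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_path(merchant_id: str, rollout_percent: int) -> bool:
    """Route to the new settlement path if the merchant's bucket falls inside the rollout."""
    return bucket(merchant_id) < rollout_percent


def find_discrepancies(legacy: dict, candidate: dict) -> list:
    """Return transaction IDs whose results differ or exist in only one system."""
    keys = legacy.keys() | candidate.keys()
    return sorted(k for k in keys if legacy.get(k) != candidate.get(k))


if __name__ == "__main__":
    merchants = [f"m-{i}" for i in range(1000)]
    for percent in ROLLOUT_STAGES:
        routed = sum(use_new_path(m, percent) for m in merchants)
        print(f"{percent}% stage: {routed} of {len(merchants)} merchants on the new path")
    print(find_discrepancies({"tx1": 100, "tx2": 250}, {"tx1": 100, "tx2": 260}))
```

Because the bucket is derived from the merchant ID, a merchant routed to the new path at 5% stays on it at 25% and 100%, and lowering the percentage is all a rollback requires.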