Design a Payment System
1. Requirements Clarification
Functional Requirements
| Feature | Description |
|---|---|
| Payment Processing | Accept payments via cards, bank transfers, digital wallets |
| Merchant Integration | APIs for businesses to accept payments |
| Refunds | Process full and partial refunds |
| Payouts | Transfer funds to merchant bank accounts |
| Recurring Payments | Subscriptions and scheduled payments |
| Fraud Detection | Real-time fraud screening |
| Reporting | Transaction history and analytics |
Non-Functional Requirements
- Reliability: 99.999% uptime (five nines ≈ 5.26 minutes of downtime per year)
- Consistency: EXACTLY-ONCE payment processing
- Latency: Payment authorization < 500ms
- Security: PCI-DSS Level 1 compliance
- Auditability: Complete audit trail for every transaction
Capacity Estimation
2. High-Level Architecture
3. Core Components Deep Dive
3.1 Payment Flow
3.2 Idempotency (Critical!)
Every payment operation MUST be idempotent. Network failures happen, and retries must not create duplicate charges.
3.3 Double-Entry Ledger
Every payment must be recorded using double-entry bookkeeping; this is non-negotiable for financial systems.
3.4 Payment State Machine
3.5 Risk Engine
Real-time fraud detection is essential.
3.6 Payment Routing
Intelligent routing optimizes success rates and minimizes costs.
4. Settlement and Reconciliation
Settlement Flow
5. Reliability Patterns
Exactly-Once Processing
Circuit Breaker for Payment Networks
6. Security Considerations
PCI-DSS Compliance
| Requirement | Implementation |
|---|---|
| Encrypt cardholder data | TLS in transit, AES-256 at rest |
| Never store CVV | Process and discard immediately |
| Tokenization | Replace card numbers with tokens |
| Access control | Role-based access, audit logging |
| Network segmentation | Cardholder data in isolated network |
Card Tokenization
7. Database Schema
8. Interview Tips
Common Follow-ups
How do you handle network failures during authorization?
How do you handle currency conversion?
- Lock exchange rate at payment creation time
- Store both currencies (original and converted)
- Daily rate updates from forex provider
- Margin buffer for rate fluctuation during settlement
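The rate-locking approach above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `ConvertedAmount` and `lock_conversion` are hypothetical names, the market rate is assumed to come from a forex provider, the 1% margin is an arbitrary example value, and rounding to two decimal places assumes currencies with two minor-unit digits.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP


@dataclass(frozen=True)
class ConvertedAmount:
    """Both sides of a conversion, frozen at payment creation time."""
    original_amount: Decimal
    original_currency: str
    converted_amount: Decimal
    converted_currency: str
    locked_rate: Decimal  # market rate after the margin buffer is applied


def lock_conversion(amount: Decimal, from_ccy: str, to_ccy: str,
                    market_rate: Decimal,
                    margin: Decimal = Decimal("0.01")) -> ConvertedAmount:
    # Shade the rate by the margin so small rate moves before settlement are absorbed.
    locked_rate = market_rate * (Decimal("1") - margin)
    # Assumption: the target currency uses two decimal places.
    converted = (amount * locked_rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return ConvertedAmount(amount, from_ccy, converted, to_ccy, locked_rate)


# Hypothetical example values: EUR 100.00 at a market rate of 1.10 USD per EUR.
quote = lock_conversion(Decimal("100.00"), "EUR", "USD", market_rate=Decimal("1.10"))
```

Storing the whole `ConvertedAmount` with the payment keeps both currencies and the locked rate available later, for example when a refund must be issued in the original currency.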
How do you prevent double-charging?
- Client-generated idempotency key - Required on all requests
- Database unique constraint - Prevent duplicate inserts
- Distributed lock - Prevent concurrent processing
- Audit log - Trace all operations for investigation
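As one illustration of the unique-constraint layer above, the sketch below uses an in-memory SQLite database. The table layout and the `create_payment` helper are assumptions made for the example; a production system would use its own schema and database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        id INTEGER PRIMARY KEY,
        idempotency_key TEXT NOT NULL UNIQUE,
        amount_cents INTEGER NOT NULL
    )
""")


def create_payment(idempotency_key: str, amount_cents: int) -> tuple[int, bool]:
    """Return (payment_id, created). A repeated key returns the existing row."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO payments (idempotency_key, amount_cents) VALUES (?, ?)",
                (idempotency_key, amount_cents),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a duplicate insert: look up the original.
        row = conn.execute(
            "SELECT id FROM payments WHERE idempotency_key = ?", (idempotency_key,)
        ).fetchone()
        return row[0], False


first = create_payment("key-123", 5000)
retry = create_payment("key-123", 5000)  # same key: no second row is created
```

The constraint is the last line of defense: even if application-level checks or locks are bypassed, the database refuses to store two payments under one key.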
Key Trade-offs
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Consistency | Strong (slow) | Eventual (fast) | Strong for payments — this is non-negotiable. In a financial system, the cost of inconsistency is not a bad user experience; it is actual money lost or regulatory violations. You are trading ~50ms of additional latency for correctness, which is always worth it when money is involved. |
| Ledger DB | RDBMS | Append-only log | RDBMS with immutable entries. Never UPDATE or DELETE ledger rows; only INSERT. This gives you a complete audit trail. Stripe, Square, and most serious payment systems use append-only ledger entries. The trade-off: storage grows linearly, but ledger data compresses well and storage is cheap compared to the cost of a lost audit trail. |
| Payment state | In-memory | Database | Database with cache. Payment state MUST survive process restarts. In-memory state means a server crash mid-transaction can leave payments in an unknown state. Use the database as the source of truth with Redis caching the hot path for read performance. |
| Event publishing | Sync | Async (outbox) | Outbox pattern — this is the only safe approach. If you publish events synchronously, a network failure after database commit but before event publish means your downstream systems never learn about the payment. The outbox pattern guarantees that if the payment is recorded, the event will eventually be published. |
Common Candidate Mistakes
9. Summary
Key Takeaways:
- Idempotency is non-negotiable - Every operation must be safely retryable
- Double-entry ledger - Money in must equal money out
- State machine - Enforce valid payment transitions
- Reconciliation - Trust but verify against external sources
- Circuit breakers - Fail fast when networks are unhealthy
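The state-machine takeaway can be made concrete with a small transition table. This is a sketch under assumptions: the state names follow the ones used elsewhere in this guide (PENDING, AUTHORIZED, CAPTURED, PARTIALLY_REFUNDED, REFUNDED), while `VOIDED`, `FAILED`, and the exact set of allowed transitions are illustrative choices, not a definitive model.

```python
from enum import Enum


class PaymentState(Enum):
    PENDING = "PENDING"
    AUTHORIZED = "AUTHORIZED"
    CAPTURED = "CAPTURED"
    PARTIALLY_REFUNDED = "PARTIALLY_REFUNDED"
    REFUNDED = "REFUNDED"
    VOIDED = "VOIDED"
    FAILED = "FAILED"


S = PaymentState

# Assumed transition table: each state maps to the states it may move to.
ALLOWED = {
    S.PENDING: {S.AUTHORIZED, S.FAILED, S.VOIDED},
    S.AUTHORIZED: {S.CAPTURED, S.VOIDED},
    S.CAPTURED: {S.PARTIALLY_REFUNDED, S.REFUNDED},
    S.PARTIALLY_REFUNDED: {S.PARTIALLY_REFUNDED, S.REFUNDED},
    S.REFUNDED: set(),
    S.VOIDED: set(),
    S.FAILED: set(),
}


class InvalidTransition(Exception):
    pass


def transition(current: PaymentState, target: PaymentState) -> PaymentState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target


state = transition(S.PENDING, S.AUTHORIZED)
state = transition(state, S.CAPTURED)
```

Because CAPTURED does not list itself as a target, a second capture attempt on the same payment raises `InvalidTransition` instead of silently charging again.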
Interview Deep-Dive Questions
Q1: Explain how you would implement idempotency for payment processing. What happens when a network timeout occurs after the payment processor has charged the card but before your system receives the response?
- Idempotency in payments means that retrying the same request produces the same result without creating a duplicate charge. The client generates a unique idempotency key (typically a UUID) and sends it with every payment request. The server stores a mapping from idempotency key to the result.
- The straightforward cases are simple: if the key exists and the result is SUCCESS, return the cached success. If the key exists and the result is FAILED, return the cached failure (or allow a new attempt depending on the failure type).
- The hard case is the “unknown outcome” timeout. The payment processor charged the card, but your system never received the confirmation. If you clear the idempotency record and the client retries, you risk a double charge. If you do not clear it, the payment sits in limbo.
- The correct design: (1) Mark the idempotency record as `PROCESSING` before calling the processor. (2) On timeout, mark it as `UNKNOWN`; do NOT delete it. (3) Before any retry, query the payment processor’s API using the processor’s reference ID to check the transaction status. This is called a “status inquiry” or “lookup.” (4) If the processor says it succeeded, update your ledger and return success. If it says it failed, clear the record and allow retry. If the processor also does not know (rare), escalate to manual review.
- The idempotency store should have a TTL (24-48 hours) so it does not grow unbounded, but the TTL must be longer than any possible retry window.
- Example: Stripe’s idempotency implementation stores the complete HTTP response for each idempotency key. On retry, they replay the exact same response, including headers and status code. They keep idempotency keys for 24 hours. Their documentation explicitly warns that clearing a key on failure can cause double charges if the original request actually succeeded at the processor level.
- How do you handle the case where two instances of your payment service receive the same idempotency key at the same time? What prevents both from calling the processor?
- Should the idempotency key be client-generated or server-generated? What are the security implications of trusting client-generated keys?
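The unknown-outcome handling described in this answer can be sketched as below. Everything here is a simplified assumption: `FakeProcessor` stands in for a real processor client, the idempotency key doubles as the processor reference, the store is an in-memory dict rather than a durable store with a TTL, and the code is single-threaded (it does not address concurrent requests with the same key).

```python
from enum import Enum


class KeyStatus(Enum):
    PROCESSING = "PROCESSING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"


class FakeProcessor:
    """Hypothetical processor client used only for this illustration."""

    def __init__(self, time_out_once=False):
        self.charges = {}        # reference -> amount actually charged
        self.charge_calls = 0
        self.time_out_once = time_out_once

    def charge(self, reference, amount):
        self.charge_calls += 1
        self.charges[reference] = amount           # the card is charged...
        if self.time_out_once:
            self.time_out_once = False
            raise TimeoutError("no response")      # ...but the reply is lost
        return "succeeded"

    def lookup(self, reference):
        # Status inquiry by reference ID.
        return "succeeded" if reference in self.charges else "failed"


store = {}  # idempotency_key -> KeyStatus


def pay(key, amount, processor):
    status = store.get(key)
    if status in (KeyStatus.SUCCEEDED, KeyStatus.FAILED):
        return status                              # replay the recorded outcome
    if status is KeyStatus.PROCESSING:
        return status                              # an attempt is already in flight
    if status is KeyStatus.UNKNOWN:
        outcome = processor.lookup(key)            # ask before charging again
        if outcome == "succeeded":
            store[key] = KeyStatus.SUCCEEDED
            return KeyStatus.SUCCEEDED
        if outcome != "failed":
            return KeyStatus.UNKNOWN               # processor unsure too: manual review
        del store[key]                             # confirmed failure: allow a fresh attempt
    store[key] = KeyStatus.PROCESSING              # recorded before calling the processor
    try:
        processor.charge(key, amount)
        store[key] = KeyStatus.SUCCEEDED
    except TimeoutError:
        store[key] = KeyStatus.UNKNOWN             # keep the record; never delete on timeout
    return store[key]


processor = FakeProcessor(time_out_once=True)
first = pay("key-1", 5000, processor)    # charge lands, response is lost
second = pay("key-1", 5000, processor)   # retry resolves via lookup
```

In this run the retry is answered from the status inquiry, so the fake processor's `charge` method is called only once.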
Q2: How would you design the reconciliation system for a payment platform processing 10 million transactions per day?
- Reconciliation is the process of comparing your internal records against external sources (card networks, bank statements, processor settlement reports) to find discrepancies. At 10M transactions/day, you cannot do this manually — it must be automated with human investigation only for flagged exceptions.
- Three-way reconciliation: Compare (1) your ledger entries, (2) the payment processor’s settlement report, and (3) the bank statement. All three must agree on amount, currency, date, and status for each transaction. Any mismatch is a “break” that requires investigation.
- Types of discrepancies: (a) Missing from processor — you recorded a payment but the processor has no record (likely a timeout where the payment never reached them). (b) Missing from your ledger — the processor has a record you do not (extremely dangerous — means money moved without your knowledge). (c) Amount mismatch — often caused by currency conversion differences or partial captures. (d) Status mismatch — you think it is captured, processor says it was refunded.
- The reconciliation pipeline: (1) Ingest settlement files from processors (usually CSV/SFTP delivered daily). (2) Normalize into a common format. (3) Match against your ledger using composite keys (processor reference + amount + date). (4) Flag unmatched or mismatched records. (5) Auto-resolve common known patterns (e.g., timing differences where a transaction on your side at 11:59 PM appears on the processor’s report the next day). (6) Route remaining breaks to an operations dashboard for human investigation.
- Run reconciliation on T+1 (one day after the transaction date) and again on T+3 for stragglers. Set SLAs: 99% of breaks should be auto-resolved, remaining 1% investigated within 24 hours.
- Example: Stripe runs continuous reconciliation (not just daily). They match every authorization, capture, and refund event against processor responses in near-real-time. When they detect a discrepancy, an automated system attempts resolution before any human is paged. They reportedly reconcile over $1 trillion in annual payment volume this way.
- What happens when reconciliation finds that the processor charged a customer but your ledger has no record? How do you fix this without the customer noticing?
- How would you handle reconciliation across multiple processors in different currencies with different settlement timelines (Visa settles in T+2, some bank transfers in T+5)?
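The matching and break-classification steps from this answer can be sketched as follows. It is a deliberately reduced example: it compares only two sources (ledger and processor report) instead of three, matches on the processor reference alone rather than a composite key, and uses hypothetical names (`Record`, `reconcile`) and made-up sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    reference: str      # processor reference ID
    amount_cents: int
    status: str


def reconcile(ledger, processor_report):
    """Return a list of (reference, break_type) for every discrepancy."""
    ours = {r.reference: r for r in ledger}
    theirs = {r.reference: r for r in processor_report}
    breaks = []
    for ref in sorted(ours.keys() | theirs.keys()):
        mine, other = ours.get(ref), theirs.get(ref)
        if other is None:
            breaks.append((ref, "missing_from_processor"))
        elif mine is None:
            breaks.append((ref, "missing_from_ledger"))
        elif mine.amount_cents != other.amount_cents:
            breaks.append((ref, "amount_mismatch"))
        elif mine.status != other.status:
            breaks.append((ref, "status_mismatch"))
    return breaks


ledger = [
    Record("p1", 1000, "captured"),
    Record("p2", 2500, "captured"),
    Record("p3", 700, "captured"),
    Record("p4", 900, "captured"),
]
report = [
    Record("p1", 1000, "captured"),
    Record("p2", 2400, "captured"),   # amount differs
    Record("p4", 900, "refunded"),    # status differs
    Record("p5", 300, "captured"),    # unknown to our ledger
]
breaks = reconcile(ledger, report)
```

Each break type maps to one of the discrepancy categories listed above, so the result can feed an auto-resolution step first and an operations dashboard for whatever remains.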
Q3: How do you minimize PCI-DSS scope in your payment system architecture, and what is the role of tokenization?
- PCI-DSS compliance applies to any system that stores, processes, or transmits cardholder data (card number, expiration date, CVV). The compliance burden is enormous: network segmentation, regular penetration testing, quarterly vulnerability scans, annual audits. The goal is to minimize the number of components that fall “in scope.”
- The primary technique is tokenization at the edge: capture card data in the browser or mobile app using a hosted payment form (like Stripe Elements or Braintree’s Drop-in UI). The card number goes directly from the client to the payment processor’s tokenization service, so your backend never sees it. You receive a token (`tok_abc123`) that represents the card. All subsequent operations (charge, refund, recurring billing) use the token.
- This means your entire backend (API servers, databases, message queues, logging systems) is out of PCI scope. Only the client-side JavaScript snippet (which you do not host; the processor does) touches raw card data.
- CVV must never be stored, even encrypted. You capture it at payment time, send it to the processor for verification, and discard it immediately. No database, no log, no cache should ever contain a CVV.
- For token storage: tokens are not sensitive under PCI-DSS (they are meaningless without access to the processor’s vault), so you can store them in your regular database. However, you should still encrypt them at rest and restrict access as defense-in-depth.
- Network segmentation: if any component does handle raw card data (e.g., you build your own tokenization service for some reason), it must be in an isolated network segment with strict firewall rules, dedicated logging that does not commingle with non-PCI systems, and limited human access with MFA.
- Example: When Square built their payment system, they created a separate HSM-backed “card vault” microservice that is the only system handling raw card numbers. Everything else uses tokens. This vault has its own deployment pipeline, its own audit trail, and a team of fewer than 10 engineers with access — versus hundreds of engineers working on the rest of the platform.
- A merchant wants to display the last four digits of the card on their order confirmation page. Does storing last-four digits put your system in PCI scope? What about BIN (first six digits)?
- Your logging system accidentally captured a full card number in a debug log (this happens more often than people think). What is your incident response procedure?
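To illustrate the token-for-card-number substitution discussed in this answer, here is a toy vault. It is only a sketch: `TokenVault` is a hypothetical class, the storage is an in-memory dict, and a real vault would be an isolated, access-controlled service (typically HSM-backed) rather than application code like this.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a card vault (illustration only)."""

    def __init__(self):
        self._pan_by_token = {}

    def tokenize(self, pan: str) -> str:
        # The token is random, so it reveals nothing about the card number.
        token = "tok_" + secrets.token_urlsafe(16)
        self._pan_by_token[token] = pan
        return token

    def detokenize(self, token: str) -> str:
        # Only the vault can map a token back to the card number.
        return self._pan_by_token[token]


vault = TokenVault()
token = vault.tokenize("4242424242424242")  # a commonly used test card number
```

Systems outside the vault store and pass around only `token`; the mapping back to the card number never leaves the vault.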
Q4: Design the refund flow for a payment system. What makes refunds significantly harder than payments, and how do you handle partial refunds?
- Refunds are harder than payments for several reasons: (1) A refund depends on the original payment’s state — you can only refund a CAPTURED payment, not an AUTHORIZED or PENDING one (for those, you VOID). (2) The refund cannot exceed the original charge amount, but can be partial. (3) Refunds go through different processing rails (the return path to the card) and have different timelines (3-10 business days vs. instant authorization). (4) Multiple partial refunds against a single payment must be tracked and their total must not exceed the original amount.
- The state model: a CAPTURED payment can transition to PARTIALLY_REFUNDED (some amount returned) or REFUNDED (full amount returned). Each refund is a separate ledger transaction linked to the original payment ID. The system must track `total_refunded` per payment and reject any refund request where `amount_requested + total_already_refunded > original_amount`.
- Ledger entries for a refund: reverse the original entries. If the original payment was `Customer: -$100, Merchant: +$100`, the refund creates `Customer: +$50, Merchant: -$50` (for a partial refund of $50). The double-entry invariant (sum = 0) is maintained.
- Edge cases: (a) Refund after settlement: if the merchant has already been paid out, the refund amount must be deducted from their next payout or clawed back. (b) Refund on an expired card: the card network will attempt to route the refund to the new card or issue a check. Your system must handle the “refund failed” callback. (c) Currency fluctuation: if the original payment was in EUR and the refund happens 30 days later, the exchange rate may have changed. Typically, you refund the exact original amount in the original currency, not the converted amount.
- Example: Stripe’s refund API is simple on the surface (`POST /refunds` with `payment_intent` and `amount`) but behind the scenes, they track an `amount_refundable` field that is updated atomically with each refund. They also distinguish between `refund` (goes back to the original payment method) and `credit` (applied as platform credit), each with different ledger treatment.
- A customer disputes a charge (chargeback) at the same time the merchant initiates a refund. Now you have both a refund and a chargeback for the same payment. How do you prevent the customer from getting their money back twice?
- How would you design the refund system to support “instant refunds” (crediting the customer immediately from your platform’s balance) while the actual card refund processes over 5-10 days?
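The refund rules from this answer (refund only captured payments, cap the running total at the original amount, move between PARTIALLY_REFUNDED and REFUNDED) can be sketched as below. The `Payment` class and `RefundError` are hypothetical, amounts are integer cents, and the check-then-update is not atomic here; a real system would perform it inside a database transaction or under a row lock.

```python
from dataclasses import dataclass, field


class RefundError(Exception):
    pass


@dataclass
class Payment:
    payment_id: str
    amount_cents: int
    state: str = "CAPTURED"
    total_refunded: int = 0
    refunds: list = field(default_factory=list)  # one entry per refund on this payment

    def refund(self, amount_cents: int) -> None:
        if self.state not in ("CAPTURED", "PARTIALLY_REFUNDED"):
            raise RefundError(f"cannot refund a payment in state {self.state}")
        if amount_cents <= 0:
            raise RefundError("refund amount must be positive")
        # Reject when amount_requested + total_already_refunded > original_amount.
        if amount_cents + self.total_refunded > self.amount_cents:
            raise RefundError("refund would exceed the original amount")
        self.refunds.append(amount_cents)
        self.total_refunded += amount_cents
        self.state = ("REFUNDED" if self.total_refunded == self.amount_cents
                      else "PARTIALLY_REFUNDED")


payment = Payment("pay_1", 10000)
payment.refund(5000)   # first partial refund
payment.refund(3000)   # second partial refund, 8000 of 10000 returned so far
```

A third request for 3000 would be rejected because it would bring the total to 11000, above the original 10000.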
Q5: Explain double-entry bookkeeping in the context of a payment ledger. Why is it critical, and how do you detect ledger inconsistencies?
- Double-entry bookkeeping is a 600-year-old accounting principle: every financial transaction creates at least two entries, debits and credits, that sum to zero. In a payment system, this means every dollar that moves must be accounted for on both sides. If `$100` leaves a customer’s account, exactly `$100` must arrive somewhere (merchant account, platform fee account, tax account, etc.).
- In database terms: a `ledger_entries` table where each row is one side of a transaction. A payment of `$100` with a `$3` fee creates three entries: Customer `-$100`, Merchant `+$97`, Platform Fee Account `+$3`. Sum = 0. This is enforced as a database constraint: the INSERT is wrapped in a transaction, and before committing, you verify that the sum of all entries for the transaction ID equals zero. If not, the transaction rolls back.
- Ledger entries are append-only and immutable. You never UPDATE or DELETE a ledger row. If you need to correct something, you create new entries that reverse the original (this is how refunds and adjustments work). This gives you a complete, auditable history of every financial event.
- For querying balances: the balance of any account is simply `SELECT SUM(amount) FROM ledger_entries WHERE account_id = ?`. At scale, you maintain a materialized `account_balances` table that is updated transactionally alongside new ledger entries, but the ledger entries remain the source of truth. If the balance table ever disagrees with the ledger sum, you trust the ledger.
- Detecting inconsistencies: run a periodic “balance proof” job that recalculates every account balance from the raw ledger entries and compares it against the materialized balance table. Any discrepancy means a bug or data corruption. Also run a “cross-foot” check: the global sum of ALL ledger entries across ALL accounts must equal zero. If it does not, money has appeared from or disappeared into nowhere.
- Example: Stripe’s ledger system processes millions of entries per day and maintains the zero-sum invariant at the transaction level. Their “balance transaction” API exposes this: every balance change shows both sides of the entry, making it trivial for merchants to reconcile. When they detect an imbalance (rare, but it happens due to distributed system edge cases), it triggers a P0 incident.
`balance` field on each account and increment/decrement it (this makes it impossible to audit where money came from or went, and balance corruption is undetectable).
Follow-ups:
- At 10M transactions/day, the ledger table grows by 30M+ rows/day (multiple entries per transaction). How do you handle queries like “give me this merchant’s balance” without scanning millions of rows?
- How would you handle multi-currency transactions in the ledger? If a customer pays in EUR but the merchant’s account is in USD, how many ledger entries are created and what accounts are involved?
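The zero-sum rule, the materialized balances, and the two consistency checks described in this answer can be sketched in a few lines. This is an in-memory illustration under assumptions: the `Ledger` class and its method names are hypothetical, amounts are integer cents, and a real ledger would live in a database with the entry inserts and balance updates in one transaction.

```python
from collections import defaultdict


class Ledger:
    def __init__(self):
        self.entries = []                  # append-only: (txn_id, account, amount_cents)
        self.balances = defaultdict(int)   # materialized balances, updated with each post

    def post(self, txn_id, legs):
        """legs is a list of (account, amount_cents); rejected unless it sums to zero."""
        if sum(amount for _, amount in legs) != 0:
            raise ValueError(f"transaction {txn_id} does not balance")
        for account, amount in legs:
            self.entries.append((txn_id, account, amount))
            self.balances[account] += amount

    def balance_from_entries(self, account):
        # Source of truth: recompute from the raw entries.
        return sum(a for _, acct, a in self.entries if acct == account)

    def balance_proof(self):
        """Accounts whose materialized balance disagrees with the ledger sum."""
        return [acct for acct in self.balances
                if self.balances[acct] != self.balance_from_entries(acct)]

    def cross_foot(self):
        # Global sum over all entries and accounts; must be zero.
        return sum(a for _, _, a in self.entries)


ledger = Ledger()
# A $100 payment with a $3 fee: three entries that sum to zero.
ledger.post("txn_1", [("customer", -10000), ("merchant", 9700), ("platform_fee", 300)])
# A $50 partial refund expressed as new reversing entries, not an update.
ledger.post("refund_1", [("customer", 5000), ("merchant", -5000)])
```

After these two postings the merchant balance is 4700 cents, the cross-foot is zero, and the balance proof returns an empty list.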
Q6: How would you design the payment routing system to maximize authorization success rates while minimizing processing costs?
Q7: What happens between authorization and settlement in a card payment? Why is this two-step process necessary, and what can go wrong?
Q8: How do you ensure exactly-once processing in a payment system when messages can be delivered more than once and services can crash mid-operation?
- True exactly-once delivery is impossible in a distributed system (this is a consequence of the Two Generals Problem). What we achieve is “effectively exactly-once processing” by combining at-least-once delivery with idempotent processing. Messages may arrive more than once, but processing them multiple times produces the same result as processing them once.
- The transactional outbox pattern is the key technique: (1) Within a single database transaction, write the payment record to the payments table AND write an event to an outbox table. This guarantees that either both are written or neither is. (2) A separate “outbox relay” process polls the outbox table, publishes events to Kafka, and marks them as published. (3) If the relay crashes after publishing but before marking, it will re-publish on recovery — but downstream consumers are idempotent, so this is safe.
- Every downstream consumer must be idempotent. When the settlement service receives a `payment.captured` event, it checks if it has already processed that event (using the event ID or payment ID as a deduplication key). If yes, it skips processing.
- For the payment service itself: the idempotency key prevents duplicate charges even if the API request is retried. The state machine prevents invalid transitions (you cannot capture an already-captured payment). Together, these make the entire pipeline effectively exactly-once.
- The alternative — using distributed transactions (2PC) — is impractical at scale because it requires all participants to be available and introduces a global coordinator that becomes a single point of failure. The outbox pattern trades global coordination for eventual consistency, which is acceptable because settlement and notifications do not need to be synchronous with payment processing.
- Example: Stripe uses a variant of the outbox pattern they call “reliable event delivery.” Every state change in their payment lifecycle writes to both the payments table and an events log in the same transaction. A dedicated event delivery system then fans out these events to webhooks and internal consumers, with at-least-once guarantees and consumer-side deduplication.
- The outbox relay process crashes after publishing an event to Kafka but before marking the outbox row as processed. On recovery, it re-publishes the event. The settlement service receives the duplicate. Walk through exactly how the settlement service deduplicates this.
- How long do you keep idempotency records and outbox entries? What is the trade-off between storage cost and safety window?
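The outbox flow from this answer, including the crash-and-republish case, can be sketched with an in-memory SQLite database. The schema and function names are assumptions for illustration, the `publish` callback stands in for a message broker such as Kafka, and the consumer's deduplication set would be durable storage in practice.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        event_id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def capture_payment(payment_id):
    # The payment row and the outbox event commit together or not at all.
    with db:
        db.execute("INSERT INTO payments (id, status) VALUES (?, 'CAPTURED')",
                   (payment_id,))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "payment.captured", "payment_id": payment_id}),),
        )


def relay(publish, crash_before_mark=False):
    """Publish unpublished events, then mark them as published."""
    rows = db.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0 ORDER BY event_id"
    ).fetchall()
    for event_id, payload in rows:
        publish(event_id, json.loads(payload))
        if crash_before_mark:
            return  # simulate a crash after publishing, before marking the row
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))


processed_event_ids = set()   # consumer-side deduplication keys
settled = []


def settlement_consumer(event_id, event):
    if event_id in processed_event_ids:
        return                              # duplicate delivery: skip
    processed_event_ids.add(event_id)
    settled.append(event["payment_id"])


capture_payment("pay_1")
relay(settlement_consumer, crash_before_mark=True)   # published but not marked
relay(settlement_consumer)                           # re-published on recovery
```

The event is delivered twice, but the consumer records the payment once, which is the at-least-once delivery plus idempotent processing combination described above.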