Notification System
Session 7.7 · ~5 min read
The Core Problem
A notification system delivers messages to users across multiple channels: push notifications, email, SMS, and in-app messages. That sounds like a simple "send message" API. In practice, it is a delivery pipeline that must handle user preferences, rate limiting, priority ordering, retry logic, and channel-specific failure modes, all at scale.
When Shopify runs a flash sale and needs to notify 10 million users within an hour, the system must process roughly 2,800 notifications per second sustained, with bursts much higher. Each notification might fan out to multiple channels (push and email). The pipeline must not collapse under this load, and it must not spam users who opted out of marketing messages.
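The throughput figure above is simple division, and the fan-out multiplies it. The following is a minimal sketch of that arithmetic; the function name `required_rate` and the assumption that every notification fans out to exactly two channels are illustrative, not part of the original design.

```python
def required_rate(total_notifications: int, window_seconds: int,
                  channels_per_notification: int = 1) -> float:
    """Average deliveries per second needed to finish within the window."""
    return total_notifications * channels_per_notification / window_seconds


users = 10_000_000
one_hour = 60 * 60  # seconds

# One delivery per user, spread evenly over the hour.
per_second = required_rate(users, one_hour)

# Assumed fan-out: every notification goes to both push and email.
with_fanout = required_rate(users, one_hour, channels_per_notification=2)

print(round(per_second))   # 2778
print(round(with_fanout))  # 5556
```

These are averages. Queues and workers need headroom above them to absorb bursts.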
A notification system is a delivery pipeline with user preferences as the routing table. The preferences determine which channel, the priority determines when, and the rate limiter determines how often.
Notification Channels Compared
| Channel | Latency (P95) | Cost per Message | Reliability | Best For |
|---|---|---|---|---|
| Push (APNs/FCM) | < 5 seconds | Free (platform fee only) | Medium (device must be online) | Time-sensitive alerts, engagement |
| Email (SES/SendGrid) | < 30 seconds | $0.0001 - $0.001 | High (store-and-forward) | Receipts, marketing, detailed content |
| SMS (Twilio/SNS) | < 10 seconds | $0.005 - $0.05 | High (carrier delivery) | OTP, critical alerts, non-app users |
| In-App | < 1 second (if connected) | Free | Low (user must be in app) | Contextual updates, badges |
SMS is the most expensive channel by a large margin. For example, at $0.01 per message, sending one million SMS messages a day costs $10,000 per day. A well-designed system reserves SMS for critical messages (OTPs, security alerts) and routes everything else through push or email. When choosing a channel, check the user's preferences first and cost second.
High-Level Architecture
[Diagram: notification pipeline]
- Validation and deduplication -> Preference Lookup -> Priority Router
- Priority Router -> High Priority Queue (OTP, security), Medium Priority Queue (transactional), Low Priority Queue (marketing)
- High Priority Queue -> Push Worker, SMS Worker
- Medium Priority Queue -> Push Worker, Email Worker
- Low Priority Queue -> Email Worker, Push Worker
- Push Worker -> APNs / FCM
- Email Worker -> SES / SendGrid
- SMS Worker -> Twilio / SNS
- Push Worker, Email Worker, SMS Worker -> Dead Letter Queue (failed deliveries)
The architecture separates concerns into stages. Producers submit notification requests. The API validates and deduplicates. The Preference Lookup determines which channels the user has enabled. The Priority Router assigns the notification to the correct queue. Workers consume from queues and deliver via third-party providers. Failed deliveries land in a Dead Letter Queue for retry or investigation.
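The Priority Router stage can be sketched as a mapping from notification category to queue. This is a minimal in-memory illustration, not a production design: the `PriorityRouter` class, the category names, and the choice to send unknown categories to the low-priority queue are all assumptions, and the `deque` objects stand in for a real message broker.

```python
from collections import deque

# Hypothetical mapping, based on the queue labels in the diagram.
PRIORITY_BY_CATEGORY = {
    "otp": "high",
    "security": "high",
    "transactional": "medium",
    "marketing": "low",
}


class PriorityRouter:
    def __init__(self):
        # One FIFO queue per priority level (stand-in for broker queues).
        self.queues = {"high": deque(), "medium": deque(), "low": deque()}

    def route(self, notification: dict) -> str:
        """Place the notification on a queue and return the queue name."""
        # Assumption: unknown categories are treated as low priority.
        priority = PRIORITY_BY_CATEGORY.get(notification["category"], "low")
        self.queues[priority].append(notification)
        return priority

    def next_notification(self):
        """Return the oldest notification from the highest non-empty queue."""
        for name in ("high", "medium", "low"):
            if self.queues[name]:
                return self.queues[name].popleft()
        return None


router = PriorityRouter()
router.route({"id": 1, "category": "marketing"})
router.route({"id": 2, "category": "otp"})
first = router.next_notification()
print(first["id"])  # 2: the OTP overtakes the earlier marketing message
```

Draining strictly by priority can starve the low-priority queue under sustained load. Giving each queue its own worker pool, as the diagram suggests, avoids that.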
Delivery Flow with Retry
The retry strategy uses exponential backoff with jitter. Without jitter, all failed notifications retry at the same instant, creating a thundering herd that overwhelms the downstream provider again. Adding random jitter (say, plus or minus 20% of the backoff interval) spreads retries over time.
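The retry policy described above can be sketched as follows. The base delay, the cap, the attempt limit, and the names `DeliveryError`, `backoff_delay`, and `deliver_with_retry` are illustrative assumptions. Only the doubling schedule and the plus or minus 20% jitter come from the text.

```python
import random
import time


class DeliveryError(Exception):
    """Hypothetical error a sender raises for a retryable failure."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  jitter: float = 0.2) -> float:
    """Seconds to wait before the retry that follows `attempt` (0-based)."""
    nominal = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
    # Spread the delay uniformly within +/- `jitter` of the nominal value.
    return nominal * (1 + random.uniform(-jitter, jitter))


def deliver_with_retry(send, notification, max_attempts: int = 5,
                       sleep=time.sleep) -> bool:
    """Call `send` until it succeeds or the attempts run out.

    Returns False when every attempt failed, so the caller can move the
    notification to the dead letter queue.
    """
    for attempt in range(max_attempts):
        try:
            send(notification)
            return True
        except DeliveryError:
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A queue-based worker would more likely re-enqueue the message with a delay than block a thread.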
Rate Limiting
Rate limiting operates at two levels. Per-user rate limiting prevents notification fatigue: no user should receive more than N push notifications per hour, regardless of how many services want to notify them. Per-channel rate limiting respects provider quotas: APNs and FCM impose rate limits, and exceeding them results in dropped messages or temporary bans.
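A fixed-window counter in Redis works well for per-user limits. The key is rate:{user_id}:{channel}:{hour}, incremented on each send and given an expiry slightly longer than the window so that old counters are cleaned up. If the counter exceeds the threshold, the notification is either downgraded to a lower-priority channel or deferred to the next window. Because the key is bucketed by hour, a user can briefly receive close to twice the limit around a window boundary. A true sliding window removes that edge at the cost of more state per user.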
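The per-user counter can be sketched without a Redis server by using a dictionary as a stand-in. The `FixedWindowLimiter` class and its method names are hypothetical. Only the key format comes from the text, with the hour expressed here as an integer bucket number.

```python
import time


class FixedWindowLimiter:
    """Hourly per-user, per-channel counter (in-memory stand-in for Redis)."""

    def __init__(self, limit_per_hour: int):
        self.limit = limit_per_hour
        self.counters = {}  # key -> count; Redis would hold these keys

    def _key(self, user_id: str, channel: str, now: float) -> str:
        hour_bucket = int(now // 3600)  # changes once per hour
        return f"rate:{user_id}:{channel}:{hour_bucket}"

    def allow(self, user_id: str, channel: str, now=None) -> bool:
        """Count this send attempt and report whether it is within the limit."""
        now = time.time() if now is None else now
        key = self._key(user_id, channel, now)
        # With Redis this would be INCR on the key, followed by EXPIRE so
        # the counter disappears after the window. This dict never expires.
        count = self.counters.get(key, 0) + 1
        self.counters[key] = count
        return count <= self.limit


limiter = FixedWindowLimiter(limit_per_hour=2)
results = [limiter.allow("u_12345", "push", now=0) for _ in range(3)]
print(results)                                      # [True, True, False]
print(limiter.allow("u_12345", "push", now=3600))   # True: new hour bucket
```

When `allow` returns False, the caller decides whether to downgrade the notification to another channel or defer it.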
User Preferences as the Routing Table
Each user has a preference record that determines what they receive and how. A simplified model:
{
  "user_id": "u_12345",
  "channels": {
    "push": true,
    "email": true,
    "sms": false
  },
  "categories": {
    "order_updates": ["push", "email"],
    "marketing": ["email"],
    "security": ["push", "sms", "email"]
  },
  "quiet_hours": { "start": "23:00", "end": "07:00", "timezone": "Asia/Jakarta" }
}
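When the Notification API receives a security alert, it looks up the user's preferences, finds that security notifications are routed to push, SMS, and email, and fans out to each of those channels that the user has enabled. In the record above, the top-level sms flag is false, so a system that treats the channel flags as a hard filter delivers the alert by push and email only. A marketing notification goes only to email. During quiet hours, non-critical notifications are deferred until the quiet period ends.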
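The lookup can be sketched as two small checks: which channels apply, and whether to defer. This is one possible reading of the preference record, not a definitive implementation. It assumes the top-level `channels` flags act as a hard filter over the per-category lists, that only the `security` category bypasses quiet hours, and that the caller has already converted the current time to the user's timezone. All function names are hypothetical.

```python
from datetime import time

# Assumption: only these categories ignore quiet hours.
CRITICAL_CATEGORIES = {"security"}

prefs = {
    "user_id": "u_12345",
    "channels": {"push": True, "email": True, "sms": False},
    "categories": {
        "order_updates": ["push", "email"],
        "marketing": ["email"],
        "security": ["push", "sms", "email"],
    },
    "quiet_hours": {"start": "23:00", "end": "07:00", "timezone": "Asia/Jakarta"},
}


def resolve_channels(prefs: dict, category: str) -> list:
    """Channels listed for the category that are also enabled globally."""
    wanted = prefs["categories"].get(category, [])
    return [c for c in wanted if prefs["channels"].get(c, False)]


def in_quiet_hours(local_time: time, quiet: dict) -> bool:
    """True if local_time falls in [start, end), including across midnight."""
    start = time.fromisoformat(quiet["start"])
    end = time.fromisoformat(quiet["end"])
    if start <= end:
        return start <= local_time < end
    return local_time >= start or local_time < end  # window wraps midnight


def plan_delivery(prefs: dict, category: str, local_time: time) -> dict:
    deferred = (category not in CRITICAL_CATEGORIES
                and in_quiet_hours(local_time, prefs["quiet_hours"]))
    return {"channels": resolve_channels(prefs, category), "deferred": deferred}


print(plan_delivery(prefs, "security", time(2, 0)))
# {'channels': ['push', 'email'], 'deferred': False}
print(plan_delivery(prefs, "marketing", time(23, 30)))
# {'channels': ['email'], 'deferred': True}
```

An unknown category resolves to an empty channel list, so nothing is sent unless the user has explicitly opted in.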
This preference lookup is the single most important step in the pipeline. Without it, the system is a spam cannon. With it, the system respects user intent.
Handling Downstream Failures
What happens when the push notification service (APNs or FCM) goes down entirely? The system needs a fallback strategy. If push delivery fails after all retries, the system can automatically fall back to email, which costs far less than SMS and is reliable because delivery is store-and-forward. For critical notifications like OTPs, the fallback chain might be: push, then SMS, then email. The fallback logic lives in the worker, not the producer. Producers should not need to know about delivery infrastructure.
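A worker-side fallback chain might look like the sketch below. The `ChannelDown` exception, the `FALLBACK_CHAINS` table, and the `senders` mapping are hypothetical. The sketch assumes each sender raises `ChannelDown` only after exhausting its own retries.

```python
class ChannelDown(Exception):
    """Hypothetical error: the channel failed after all of its retries."""


# Ordered channels to try per category; "default" covers everything else.
FALLBACK_CHAINS = {
    "otp": ["push", "sms", "email"],
    "default": ["push", "email"],
}


def deliver_with_fallback(notification: dict, senders: dict,
                          chains: dict = FALLBACK_CHAINS):
    """Try each channel in order; return the one that worked, or None."""
    chain = chains.get(notification["category"], chains["default"])
    for channel in chain:
        try:
            senders[channel](notification)
            return channel
        except ChannelDown:
            continue  # move on to the next channel in the chain
    return None  # every channel failed; caller sends to the dead letter queue


def push_unavailable(notification):
    raise ChannelDown("push provider outage")


sent_via = []
senders = {
    "push": push_unavailable,
    "sms": lambda n: sent_via.append("sms"),
    "email": lambda n: sent_via.append("email"),
}

print(deliver_with_fallback({"id": 1, "category": "otp"}, senders))        # sms
print(deliver_with_fallback({"id": 2, "category": "marketing"}, senders))  # email
```

In a full pipeline, the chain would also be filtered by the user's preferences, so that a fallback never reaches a channel the user has disabled.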
Further Reading
- Notification System Design: Architecture and Best Practices (MagicBell Engineering)
- Design a Scalable Notification Service (AlgoMaster)
- How to Design a Notification System: A Complete Guide (System Design Handbook)
- Designing a Notification System: Push, Email, and SMS at Scale (DEV Community)
Assignment
A flash sale starts in 1 hour. You need to notify 10 million users.
- Calculate the sustained throughput required if you want all notifications delivered within 30 minutes. How many worker instances do you need if each worker processes 500 notifications per second?
- Design the delivery pipeline. What queue system do you use? How do you partition the work across workers?
- The push notification provider (FCM) goes down 10 minutes before the sale. What is your fallback plan? How do you re-route notifications already in the push queue to email?
- A user has opted out of marketing notifications but this flash sale is for an item in their wishlist. Should you send it? Design the preference lookup logic that handles this edge case.