Building Screenshot Infrastructure at Scale
Operating a screenshot API at scale involves challenges that do not appear in small deployments. A single Puppeteer script on your laptop works fine for 10 screenshots. Serving 10,000 concurrent users requires a fundamentally different architecture.
This post covers the key lessons we have learned scaling SavePage.io.
Browser pool management
The most expensive resource in a screenshot service is browser instances. Each Chromium process consumes 100-300 MB of memory and a meaningful amount of CPU. You cannot start a fresh browser for every request.
The solution is a browser pool: a set of pre-started Chrome instances that are reused across requests. The pool manager handles:
- Initialization -- Starting N browser instances at boot
- Allocation -- Assigning an available instance to an incoming request
- Return -- Returning the instance to the pool after capture
- Recycling -- Replacing instances after a fixed number of uses (prevents memory leaks)
- Health checks -- Detecting and replacing crashed instances
The pool size is tuned to the available memory. Each worker server runs as many instances as memory allows, typically 20-40 on a 16 GB machine.
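The pool's bookkeeping (allocation, return, use-count recycling) can be sketched as follows. This is an illustrative sketch, not SavePage.io's actual implementation: `StubBrowser`, `MAX_USES`, and the launch logic are stand-ins, and a real pool would hold live Chromium handles via Puppeteer.

```python
import queue

MAX_USES = 50  # recycle after this many captures to cap leaked memory (illustrative value)

class StubBrowser:
    """Stand-in for a real Chromium handle; actual launch logic is elided."""
    launched = 0  # class-level counter, used here only to observe recycling

    def __init__(self):
        StubBrowser.launched += 1
        self.uses = 0
        self.alive = True

    def close(self):
        self.alive = False

class BrowserPool:
    def __init__(self, size):
        # Pre-start `size` instances at boot; idle instances wait in a FIFO queue.
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(StubBrowser())

    def acquire(self, timeout=5.0):
        # Blocks until an instance is free; raises queue.Empty on timeout.
        return self._idle.get(timeout=timeout)

    def release(self, browser):
        browser.uses += 1
        if browser.uses >= MAX_USES or not browser.alive:
            browser.close()            # recycle: replace rather than reuse
            browser = StubBrowser()
        self._idle.put(browser)
```

A health checker would run the same replacement path as `release` for any instance whose process has died, so crashed and worn-out browsers are handled by one code path.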
Request queuing
Not every request can be served immediately. When all browser instances are busy, new requests wait in a queue. The queue provides:
- Fairness -- No single API key can monopolize the queue
- Backpressure -- When the queue is full, new requests get a 503 response instead of waiting indefinitely
- Priority -- Pro and Enterprise requests are dequeued before Free tier requests
- Timeout -- Requests that wait too long in the queue are rejected with a 504
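All four properties fit in a small priority queue. The sketch below is a minimal version under assumed constants: the tier names, `MAX_DEPTH`, and `QUEUE_TIMEOUT` are illustrative, and the 503/504 wiring to actual HTTP responses is elided.

```python
import heapq
import itertools
import time

MAX_DEPTH = 100        # beyond this depth, reject with 503 (illustrative)
QUEUE_TIMEOUT = 20.0   # seconds a request may wait before it is dropped (illustrative)

PRIORITY = {"enterprise": 0, "pro": 1, "free": 2}  # lower value dequeues first

class RequestQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # FIFO tie-break within a tier

    def enqueue(self, request, tier):
        if len(self._heap) >= MAX_DEPTH:
            return 503                 # backpressure: fail fast instead of waiting
        heapq.heappush(
            self._heap,
            (PRIORITY[tier], next(self._seq), time.monotonic(), request),
        )
        return 202

    def dequeue(self):
        while self._heap:
            _, _, enqueued_at, request = heapq.heappop(self._heap)
            if time.monotonic() - enqueued_at > QUEUE_TIMEOUT:
                continue               # stale entry; the waiting client gets a 504
            return request
        return None
```

Fairness per API key would add a second check in `enqueue` (a per-key depth cap), omitted here for brevity.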
Failure modes
At scale, things fail constantly. The most common failures:
Browser crash. Certain pages cause Chromium to crash or hang. A malformed PDF embed, an infinite JavaScript loop, or an out-of-memory condition can kill the renderer. The pool manager detects the dead process and spawns a replacement.
Network timeout. The target URL may be slow, unreachable, or returning errors. We set aggressive timeouts: 30 seconds for page load, 10 seconds for network idle. If the page does not load, the request fails with a clear error.
Disk full. Screenshots generate large files. If the disk fills up, everything stops. We monitor disk usage and clean up temporary files on a schedule.
CDN upload failure. After capturing a screenshot, it must be uploaded to object storage. If the upload fails, we retry twice before returning an error.
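The retry-twice behavior is a small wrapper around the upload call. A sketch under assumptions: `upload` stands in for the real object-storage client call, and the exponential backoff between attempts is an addition of this example, not something the text specifies.

```python
import time

def upload_with_retries(upload, retries=2, backoff=0.5):
    # `upload` is a callable wrapping the real object-storage client (assumed).
    last_error = None
    for attempt in range(1 + retries):        # one try plus two retries
        try:
            return upload()
        except IOError as err:
            last_error = err
            time.sleep(backoff * (2 ** attempt))  # backoff is an assumption
    raise last_error                          # surfaces as the API error response
```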
Each failure is logged, categorized, and monitored. We track failure rates by error type and alert if any category exceeds its baseline.
Horizontal scaling
A single server has a fixed capacity determined by its CPU and memory. To handle more traffic, we add more servers. Each server runs its own browser pool and connects to the same request queue.
The load balancer distributes API requests across servers. The queue ensures that even if one server is overloaded, requests are eventually served by another.
Scaling up means adding servers; scaling down means removing them once their in-flight requests have drained. We auto-scale on queue depth: if requests are waiting, we add capacity.
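A queue-depth scaling rule reduces to a small pure function. The sketch below is illustrative: the per-worker capacity and the min/max bounds are assumed values, not SavePage.io's actual thresholds.

```python
def desired_workers(current, queue_depth, per_worker_capacity=30,
                    min_workers=2, max_workers=50):
    # Size the fleet so queued requests could be absorbed; constants are illustrative.
    if queue_depth == 0:
        target = current - 1   # nothing waiting: drain one server at a time
    else:
        target = current + -(-queue_depth // per_worker_capacity)  # ceil division
    return max(min_workers, min(max_workers, target))
```

Scaling down by one server per evaluation interval keeps the fleet from oscillating when traffic is bursty.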
Monitoring
The metrics we track:
- Request latency (p50, p95, p99) -- How long from request to response
- Queue depth -- How many requests are waiting
- Browser pool utilization -- What percentage of instances are active
- Error rate -- What percentage of requests fail, by error type
- Memory usage -- Per-server and per-browser-instance
- CDN upload latency -- Time to store and make screenshots available
These metrics drive both alerts and auto-scaling decisions.
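For the latency percentiles, one common choice is the nearest-rank method over a sliding window of samples. A minimal sketch; this is one standard definition, not necessarily the one our monitoring stack uses.

```python
def percentile(samples, p):
    # Nearest-rank percentile: the smallest sample such that at least
    # p percent of the window is at or below it.
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p/100 * n), 1-based
    return ordered[rank - 1]
```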
Cost optimization
The biggest cost is compute (CPU and memory for browser instances). Optimizations include:
- Instance recycling -- Reuse instances instead of starting new ones
- Request deduplication -- If two users request the same URL with the same parameters within a short window, serve the cached result
- Image compression -- PNG optimization and JPEG quality tuning reduce storage and bandwidth costs
- Spot instances -- Using interruptible compute for the rendering fleet (with graceful drain on interruption)
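Request deduplication keys on the URL plus the normalized parameters. A sketch under assumptions: the 30-second window, the hash-based key, and the class shape are illustrative, not the production implementation.

```python
import hashlib
import time

DEDUP_WINDOW = 30.0   # seconds; the window length is an assumption

class DedupCache:
    def __init__(self):
        self._cache = {}  # key -> (timestamp, result)

    def key(self, url, params):
        # Sort parameters so {"w": 1, "h": 2} and {"h": 2, "w": 1} collide.
        raw = url + "|" + "&".join(f"{k}={params[k]}" for k in sorted(params))
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, url, params, now=None):
        now = time.monotonic() if now is None else now
        entry = self._cache.get(self.key(url, params))
        if entry and now - entry[0] <= DEDUP_WINDOW:
            return entry[1]   # cache hit: reuse the recent screenshot
        return None

    def put(self, url, params, result, now=None):
        now = time.monotonic() if now is None else now
        self._cache[self.key(url, params)] = (now, result)
```

In production this cache would live in shared storage (e.g. Redis) so that deduplication works across the whole fleet, not just within one server.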
These optimizations reduce the per-screenshot cost to a level where the free tier is sustainable and the paid tiers are profitable.