Building Screenshot Infrastructure at Scale
Operating a screenshot API at scale involves challenges that do not appear in small deployments. A single Puppeteer script on your laptop works fine for 10 screenshots. Serving 10,000 concurrent users requires a fundamentally different architecture.
This post covers the key lessons we have learned scaling SavePage.io.
Browser pool management
The most expensive resource in a screenshot service is browser instances. Each Chromium process consumes 100-300 MB of memory and a meaningful amount of CPU. You cannot start a fresh browser for every request.
The solution is a browser pool: a set of pre-started Chrome instances that are reused across requests. The pool manager handles:
- Initialization -- Starting N browser instances at boot
- Allocation -- Assigning an available instance to an incoming request
- Return -- Returning the instance to the pool after capture
- Recycling -- Replacing instances after a fixed number of uses (prevents memory leaks)
- Health checks -- Detecting and replacing crashed instances
The pool size is tuned to the available memory. Each worker server runs as many instances as memory allows, typically 20-40 on a 16 GB machine.
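The pool's bookkeeping (allocation, return, use-count recycling) can be sketched as follows. This is an illustrative sketch, not SavePage.io's actual implementation: `StubBrowser`, `MAX_USES`, and the launch logic are stand-ins, and a real pool would hold live Chromium handles via Puppeteer.

```python
import queue

MAX_USES = 50  # recycle after this many captures to cap leaked memory (illustrative value)

class StubBrowser:
    """Stand-in for a real Chromium handle; actual launch logic is elided."""
    launched = 0  # class-level counter, used here only to observe recycling

    def __init__(self):
        StubBrowser.launched += 1
        self.uses = 0
        self.alive = True

    def close(self):
        self.alive = False

class BrowserPool:
    def __init__(self, size):
        # Pre-start `size` instances at boot; idle instances wait in a FIFO queue.
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(StubBrowser())

    def acquire(self, timeout=5.0):
        # Blocks until an instance is free; raises queue.Empty on timeout.
        return self._idle.get(timeout=timeout)

    def release(self, browser):
        browser.uses += 1
        if browser.uses >= MAX_USES or not browser.alive:
            browser.close()            # recycle: replace rather than reuse
            browser = StubBrowser()
        self._idle.put(browser)
```

A health checker would run the same replacement path as `release` for any instance whose process has died, so crashed and worn-out browsers are handled by one code path.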
Request queuing
Not every request can be served immediately. When all browser instances are busy, new requests wait in a queue. The queue provides:
- Fairness -- No single API key can monopolize the queue
- Backpressure -- When the queue is full, new requests get a 503 response instead of waiting indefinitely
- Priority -- Pro and Enterprise requests are dequeued before Free tier requests
- Timeout -- Requests that wait too long in the queue are rejected with a 504
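All four properties fit in a small priority queue. The sketch below is a minimal version under assumed constants: the tier names, `MAX_DEPTH`, and `QUEUE_TIMEOUT` are illustrative, and the 503/504 wiring to actual HTTP responses is elided.

```python
import heapq
import itertools
import time

MAX_DEPTH = 100        # beyond this depth, reject with 503 (illustrative)
QUEUE_TIMEOUT = 20.0   # seconds a request may wait before it is dropped (illustrative)

PRIORITY = {"enterprise": 0, "pro": 1, "free": 2}  # lower value dequeues first

class RequestQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # FIFO tie-break within a tier

    def enqueue(self, request, tier):
        if len(self._heap) >= MAX_DEPTH:
            return 503                 # backpressure: fail fast instead of waiting
        heapq.heappush(
            self._heap,
            (PRIORITY[tier], next(self._seq), time.monotonic(), request),
        )
        return 202

    def dequeue(self):
        while self._heap:
            _, _, enqueued_at, request = heapq.heappop(self._heap)
            if time.monotonic() - enqueued_at > QUEUE_TIMEOUT:
                continue               # stale entry; the waiting client gets a 504
            return request
        return None
```

Fairness per API key would add a second check in `enqueue` (a per-key depth cap), omitted here for brevity.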
Failure modes
At scale, things fail constantly. The most common failures:
Browser crash. Certain pages cause Chromium to crash or hang. A malformed PDF embed, an infinite JavaScript loop, or an out-of-memory condition can kill the renderer. The pool manager detects the dead process and spawns a replacement.
Network timeout. The target URL may be slow, unreachable, or returning errors. We set aggressive timeouts: 30 seconds for page load, 10 seconds for network idle. If the page does not load, the request fails with a clear error.
Disk full. Screenshots generate large files. If the disk fills up, everything stops. We monitor disk usage and clean up temporary files on a schedule.
CDN upload failure. After capturing a screenshot, it must be uploaded to object storage. If the upload fails, we retry twice before returning an error.
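The retry-twice behavior is a small wrapper around the upload call. A sketch under assumptions: `upload` stands in for the real object-storage client call, and the exponential backoff between attempts is an addition of this example, not something the text specifies.

```python
import time

def upload_with_retries(upload, retries=2, backoff=0.5):
    # `upload` is a callable wrapping the real object-storage client (assumed).
    last_error = None
    for attempt in range(1 + retries):        # one try plus two retries
        try:
            return upload()
        except IOError as err:
            last_error = err
            time.sleep(backoff * (2 ** attempt))  # backoff is an assumption
    raise last_error                          # surfaces as the API error response
```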
Each failure is logged, categorized, and monitored. We track failure rates by error type and alert if any category exceeds its baseline.
Horizontal scaling
A single server has a fixed capacity determined by its CPU and memory. To handle more traffic, we add more servers. Each server runs its own browser pool and connects to the same request queue.
The load balancer distributes API requests across servers. The queue ensures that even if one server is overloaded, requests are eventually served by another.
Scaling up means adding servers; scaling down means removing them once their in-flight requests have drained. We auto-scale on queue depth: if requests are waiting, we add capacity.
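A queue-depth scaling rule reduces to a small pure function. The sketch below is illustrative: the per-worker capacity and the min/max bounds are assumed values, not SavePage.io's actual thresholds.

```python
def desired_workers(current, queue_depth, per_worker_capacity=30,
                    min_workers=2, max_workers=50):
    # Size the fleet so queued requests could be absorbed; constants are illustrative.
    if queue_depth == 0:
        target = current - 1   # nothing waiting: drain one server at a time
    else:
        target = current + -(-queue_depth // per_worker_capacity)  # ceil division
    return max(min_workers, min(max_workers, target))
```

Scaling down by one server per evaluation interval keeps the fleet from oscillating when traffic is bursty.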
Monitoring
The metrics we track:
- Request latency (p50, p95, p99) -- How long from request to response
- Queue depth -- How many requests are waiting
- Browser pool utilization -- What percentage of instances are active
- Error rate -- What percentage of requests fail, by error type
- Memory usage -- Per-server and per-browser-instance
- CDN upload latency -- Time to store and make screenshots available
These metrics drive both alerts and auto-scaling decisions.
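For the latency percentiles, one common choice is the nearest-rank method over a sliding window of samples. A minimal sketch; this is one standard definition, not necessarily the one our monitoring stack uses.

```python
def percentile(samples, p):
    # Nearest-rank percentile: the smallest sample such that at least
    # p percent of the window is at or below it.
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p/100 * n), 1-based
    return ordered[rank - 1]
```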
Cost optimization
The biggest cost is compute (CPU and memory for browser instances). Optimizations include:
- Instance recycling -- Reuse instances instead of starting new ones
- Request deduplication -- If two users request the same URL with the same parameters within a short window, serve the cached result
- Image compression -- PNG optimization and JPEG quality tuning reduce storage and bandwidth costs
- Spot instances -- Using interruptible compute for the rendering fleet (with graceful drain on interruption)
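Request deduplication keys on the URL plus the normalized parameters. A sketch under assumptions: the 30-second window, the hash-based key, and the class shape are illustrative, not the production implementation.

```python
import hashlib
import time

DEDUP_WINDOW = 30.0   # seconds; the window length is an assumption

class DedupCache:
    def __init__(self):
        self._cache = {}  # key -> (timestamp, result)

    def key(self, url, params):
        # Sort parameters so {"w": 1, "h": 2} and {"h": 2, "w": 1} collide.
        raw = url + "|" + "&".join(f"{k}={params[k]}" for k in sorted(params))
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, url, params, now=None):
        now = time.monotonic() if now is None else now
        entry = self._cache.get(self.key(url, params))
        if entry and now - entry[0] <= DEDUP_WINDOW:
            return entry[1]   # cache hit: reuse the recent screenshot
        return None

    def put(self, url, params, result, now=None):
        now = time.monotonic() if now is None else now
        self._cache[self.key(url, params)] = (now, result)
```

In production this cache would live in shared storage (e.g. Redis) so that deduplication works across the whole fleet, not just within one server.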
These optimizations reduce the per-screenshot cost to a level where the free tier is sustainable and the paid tiers are profitable.