Upload services fail silently. A 502 on a chunk request doesn't crash the app — Resumable.js retries it, the user waits a little longer, and maybe it works. Or maybe it doesn't, and you never find out because there's nothing in your dashboard. The upload endpoint returns 200 for each chunk, but the final assembly never happens. The user sees 100% and assumes success. Two days later they open a support ticket because their file is missing.
This is what happens when you build an upload pipeline without observability. Metrics and logs are how you find out about problems before your users complain — or at least at the same time.
## Key metrics to track
Not every metric matters equally. Start with these and expand based on what you learn:
| Metric | What it tells you |
|---|---|
| Upload success rate | Percentage of uploads that complete all chunks and finalize. The single most important metric. |
| Chunk success rate | Percentage of individual chunk requests that return 2xx on the first attempt. Drops indicate network or server issues. |
| Chunk retry rate | How often chunks need to be retried. High retry rates signal instability even if uploads eventually succeed. |
| Throughput | Bytes received per second, aggregated across all uploads. Shows capacity utilization. |
| Chunk latency | Time from receiving a chunk request to returning a response. Track as a histogram, not an average. |
| Time-to-completion | Wall clock time from first chunk to final assembly per upload. Long completions may indicate timeout issues or stalled clients. |
| Incomplete upload count | Uploads that started (at least one chunk received) but never finalized. These are your silent failures. |
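Most of these map directly onto Prometheus counters and gauges (Prometheus reappears in the latency section below). A minimal sketch with `prometheus_client`; the metric names and label values are illustrative, not prescribed:

```python
from prometheus_client import Counter, Gauge

# Chunk-level: success rate = first-attempt 'stored' / total requests.
CHUNKS_TOTAL = Counter(
    'upload_chunks_total',
    'Chunk requests received, labeled by outcome',
    ['outcome'],  # 'stored', 'retried', 'failed'
)
BYTES_RECEIVED = Counter(
    'upload_bytes_received_total',
    'Bytes received across all uploads (rate() gives throughput)',
)

# Upload-level: incomplete count = started - completed - failed.
UPLOADS_STARTED = Counter('uploads_started_total', 'Uploads with at least one chunk received')
UPLOADS_COMPLETED = Counter('uploads_completed_total', 'Uploads that finalized successfully')
UPLOADS_ACTIVE = Gauge('uploads_active', 'Uploads currently in progress')
```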
## Structured logging
Unstructured log lines like `Received chunk 5 of upload abc123` are human-readable but machine-hostile. Structured JSON logs let you query, filter, and aggregate.
Every chunk receipt should produce a log entry like this:
```json
{
  "timestamp": "2026-05-12T09:15:23.471Z",
  "level": "info",
  "event": "chunk_received",
  "upload_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "correlation_id": "req-9f8e7d6c",
  "chunk_number": 5,
  "total_chunks": 47,
  "chunk_size_bytes": 2097152,
  "file_name": "project-assets.zip",
  "total_file_size": 98566144,
  "duration_ms": 142,
  "status": "stored",
  "storage_backend": "s3",
  "client_ip": "203.0.113.42",
  "user_agent": "Mozilla/5.0 ..."
}
```
And on completion:
```json
{
  "timestamp": "2026-05-12T09:16:01.892Z",
  "level": "info",
  "event": "upload_complete",
  "upload_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "total_chunks": 47,
  "total_bytes": 98566144,
  "duration_total_ms": 38421,
  "chunks_retried": 2,
  "final_path": "uploads/2026/05/12/a1b2c3d4.zip"
}
```
The critical fields: `upload_id` ties all chunks of one upload together, `chunk_number` and `total_chunks` let you detect gaps, `duration_ms` feeds your latency metrics, and `chunks_retried` surfaces reliability issues.
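With those fields in place, gap detection is a one-pass scan. A sketch, assuming the `chunk_received` entries for a single `upload_id` have already been parsed into dicts (Resumable.js numbers chunks from 1):

```python
def find_missing_chunks(entries):
    """Return chunk numbers that never produced a chunk_received entry."""
    if not entries:
        return []
    total = entries[0]["total_chunks"]
    seen = {e["chunk_number"] for e in entries}
    return sorted(set(range(1, total + 1)) - seen)
```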
For a Django receiver (the same pattern applies to Flask), a structured logging setup using structlog:
```python
import time

import structlog
from django.http import JsonResponse

logger = structlog.get_logger()


def receive_chunk(request):
    start = time.monotonic()

    upload_id = request.POST['resumableIdentifier']
    chunk_number = int(request.POST['resumableChunkNumber'])
    total_chunks = int(request.POST['resumableTotalChunks'])
    chunk_file = request.FILES['file']

    # Store chunk...
    store_chunk(upload_id, chunk_number, chunk_file)

    duration_ms = (time.monotonic() - start) * 1000
    logger.info(
        "chunk_received",
        upload_id=upload_id,
        chunk_number=chunk_number,
        total_chunks=total_chunks,
        chunk_size_bytes=chunk_file.size,
        duration_ms=round(duration_ms, 1),
    )
    return JsonResponse({"status": "ok"})
```
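Note that structlog's default output is not JSON. A minimal configuration that produces entries shaped like the examples above (the `merge_contextvars` processor matters for the correlation IDs in the next section):

```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,  # pull in per-request bound fields
        structlog.processors.add_log_level,       # adds "level": "info"
        structlog.processors.TimeStamper(fmt="iso", key="timestamp"),
        structlog.processors.JSONRenderer(),      # one JSON object per line
    ],
)
```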
## Correlation IDs: tracing an upload end to end
A single file upload generates dozens of HTTP requests (one per chunk, plus a test request per chunk if `testChunks` is enabled). Debugging a specific upload failure requires tracing all those requests together.
The `resumableIdentifier` parameter is your natural correlation ID — it's unique per file and present on every chunk request. Propagate it through your logs, metrics, and any downstream services (storage, virus scanning, post-processing).
On the client side, you can also capture and forward a correlation ID using Resumable.js configuration:
```javascript
const r = new Resumable({
  target: '/api/upload',
  query: {
    correlation_id: crypto.randomUUID(),
  },
});
```
This adds a unique ID to every request that you can match against server logs, CDN logs, and storage access logs.
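On the server, bind the IDs once per request so every log line emitted while handling it carries them automatically. A sketch using structlog's contextvars support, written as Django-style middleware (this assumes the `merge_contextvars` processor from the configuration above):

```python
import structlog

def correlation_middleware(get_response):
    """Bind upload and correlation IDs for the lifetime of one request."""
    def middleware(request):
        structlog.contextvars.clear_contextvars()
        # Resumable.js parameters arrive in the query string on GET test
        # requests and in the form body on POST uploads.
        params = request.POST if request.method == 'POST' else request.GET
        structlog.contextvars.bind_contextvars(
            correlation_id=params.get('correlation_id', ''),
            upload_id=params.get('resumableIdentifier', ''),
        )
        return get_response(request)
    return middleware
```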
## Client-side telemetry
The server sees chunks arrive but doesn't see the full client experience. Instrument Resumable.js events to feed client-side analytics:
```javascript
r.on('fileSuccess', (file) => {
  sendAnalytics('upload_complete', {
    file_size: file.size,
    chunk_count: file.chunks.length,
    duration_ms: Date.now() - file._startTime,
  });
});

r.on('fileError', (file, message) => {
  sendAnalytics('upload_failed', {
    file_size: file.size,
    chunks_completed: file.chunks.filter(c => c.status() === 'success').length,
    error: message,
  });
});

r.on('fileRetry', (file) => {
  sendAnalytics('chunk_retry', {
    upload_id: file.uniqueIdentifier,
    chunk: file.chunks.findIndex(c => c.status() === 'uploading'),
  });
});

// Track when upload starts
r.on('fileAdded', (file) => {
  file._startTime = Date.now();
});
```
Client-side telemetry captures what server logs miss: how long the user waited, how many retries the client attempted before giving up, and whether the user cancelled before completion.
## Latency histograms: why averages lie
Average chunk upload latency is a useless metric. If 90% of chunks complete in 100ms and 10% take 5 seconds (retries, network hiccups, GC pauses), the average is 590ms — a number that describes nobody's actual experience.
Use percentile distributions instead:
- p50 (median): The typical chunk latency. This is your baseline.
- p90: What slower-than-normal chunks look like. Expect this to be 2-3x the median.
- p99: Your worst-case performance for 1 in 100 chunks. Spikes here indicate infrastructure problems.
In Prometheus, use a histogram:
```python
from prometheus_client import Histogram

# Bucket boundaries are in seconds; tune them to bracket your real latencies.
CHUNK_DURATION = Histogram(
    'upload_chunk_duration_seconds',
    'Duration of chunk upload processing',
    ['storage_backend'],
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
)


def receive_chunk(request):
    with CHUNK_DURATION.labels(storage_backend='s3').time():
        process_chunk(request)
```
Monitor p99 chunk latency in your dashboards. A sudden spike from 500ms to 5s means something changed — a slow storage backend, a rate limit kicking in, or a network issue between your server and storage.
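The p99 line on a dashboard comes from `histogram_quantile` over the histogram's bucket counters. A typical Grafana query against the metric defined above:

```promql
histogram_quantile(0.99, sum by (le) (rate(upload_chunk_duration_seconds_bucket[5m])))
```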
## Retry metrics
Retries are the canary in your coal mine. Uploads can succeed even with high retry rates — Resumable.js is designed for that. But high retries mean:
- Users are waiting longer than necessary.
- Your server is doing redundant work (receiving and discarding duplicate chunks).
- Something in the pipeline is unstable.
Track retry rate at two levels:
- Per-chunk retry count: How many attempts it took for each chunk to succeed. A chunk that succeeds on the first try has a retry count of 0. A retry count of 3+ means the client hit `maxChunkRetries` and barely made it.
- Per-upload retry total: Sum of all chunk retries for one upload. An upload with 50 chunks and 0 retries is healthy. The same upload with 15 retries means 30% of chunks failed at least once.
Alert when the per-upload retry rate exceeds a threshold (e.g., retries > 10% of total chunks sustained for 5 minutes).
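On the server, one way to count retries is to notice duplicates: Resumable.js re-sends a failed chunk under the same `resumableChunkNumber`, so a chunk that already exists in storage is a retry. A sketch, where `chunk_exists` is a hypothetical helper over whatever store the chunks land in; this only catches retries whose first attempt was actually stored, and the client-side `fileRetry` handler above catches the rest:

```python
from prometheus_client import Counter

CHUNK_RETRIES = Counter(
    'upload_chunk_retries_total',
    'Chunk requests that duplicated an already-stored chunk',
)

def record_chunk(upload_id, chunk_number, chunk_file):
    # A chunk we already stored means the client (or a proxy) re-sent it;
    # count it as a retry before overwriting.
    if chunk_exists(upload_id, chunk_number):
        CHUNK_RETRIES.inc()
    store_chunk(upload_id, chunk_number, chunk_file)
```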
## Alerting patterns
Alerts should catch problems before they cascade. These patterns work for most upload services:
Sustained error rate: Alert if the chunk error rate (non-2xx responses / total responses) exceeds 5% for more than 3 minutes. Transient spikes happen; sustained errors indicate a real problem.
p99 latency spike: Alert if p99 chunk latency exceeds 3x its normal baseline for more than 5 minutes. This catches slow storage backends, network issues, or server overload.
Stuck uploads: Alert if an upload receives its first chunk but doesn't complete within a configurable window (e.g., 2 hours). This catches uploads abandoned by the client, stuck in processing, or missing their final chunk due to a bug.
Incomplete upload accumulation: Alert if the count of incomplete uploads (started but not finalized) grows beyond a threshold. A growing backlog means uploads are starting but not finishing — likely a server-side assembly or callback issue.
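Expressed as Prometheus alerting rules, the first two patterns look roughly like this. A sketch: the metric names follow the earlier examples, and the latency rule uses a fixed ceiling where the text suggests a baseline-relative one. The stuck-upload and accumulation patterns need application-level metrics and are not shown.

```yaml
groups:
  - name: upload-pipeline
    rules:
      - alert: ChunkErrorRateHigh
        # Non-2xx chunk responses above 5% of all chunk requests.
        expr: |
          sum(rate(upload_chunks_total{outcome="failed"}[5m]))
            / sum(rate(upload_chunks_total[5m])) > 0.05
        for: 3m
      - alert: ChunkLatencyP99High
        # p99 chunk latency over a fixed ceiling; in practice, alert
        # relative to your own measured baseline.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(upload_chunk_duration_seconds_bucket[5m]))) > 3
        for: 5m
```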
## Dashboard layout
A file upload ops dashboard should answer three questions at a glance: Is it working? Is it fast? Are there problems?
Recommended panels:
- Upload success rate — single stat, big number, green/red threshold
- Chunk throughput — time series, bytes/sec aggregated
- Chunk latency percentiles — time series, p50/p90/p99 lines
- Error rate — time series, percentage of non-2xx chunk responses
- Retry rate — time series, retries per minute
- Active uploads — gauge, currently in-progress uploads
- Incomplete uploads — counter, uploads started but not completed in the last 24 hours
- Top errors — table, most frequent error messages grouped by type
Tools: Prometheus + Grafana is the most common open-source stack. DataDog, New Relic, and Cloudflare Analytics work for managed alternatives. For structured logs, pipe JSON logs to Elasticsearch/Kibana, Loki/Grafana, or your provider's log analytics.
## Incomplete upload tracking
The most insidious failure mode: an upload starts, some chunks arrive, and then... nothing. The client disappeared (tab closed, network died, user gave up). The chunks sit in temporary storage consuming disk space and never get assembled.
Build a cleanup process:
```python
from datetime import datetime, timedelta


def cleanup_stale_uploads():
    """Find uploads with chunks but no completion, older than threshold."""
    threshold = datetime.utcnow() - timedelta(hours=24)
    stale = db.query("""
        SELECT upload_id, chunk_count, first_chunk_at, last_chunk_at
        FROM uploads
        WHERE status = 'in_progress'
          AND last_chunk_at < %s
    """, [threshold])

    for upload in stale:
        logger.warning(
            "stale_upload_cleanup",
            upload_id=upload.upload_id,
            chunks_received=upload.chunk_count,
            last_activity=upload.last_chunk_at.isoformat(),
        )
        delete_chunks(upload.upload_id)
        mark_upload_failed(upload.upload_id)
```
Run this as a cron job or scheduled task. Log every cleanup — the stale upload rate is itself a useful metric. If it spikes, something is preventing clients from completing uploads.
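For example, a crontab entry that runs the cleanup hourly (the script path is illustrative):

```
0 * * * * /usr/bin/python3 /app/cleanup_stale_uploads.py
```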
## Closing the loop
Observability isn't a one-time setup. Start with the basics — structured logs, success rate, and chunk latency — then expand as you learn where your pipeline breaks. The logs and metrics you add today are the ones that save you at 2 AM when uploads start failing and you need to figure out why.
Wire your Resumable.js events to client-side analytics. Wire your server to structured logging and metrics. Build a dashboard. Set up alerts. Then, when something breaks — and it will — you'll know about it before the support tickets arrive.
