Minwin handles a lot of visual content — profile avatars, post images, cover photos, product shots, and video. Every upload needs to be resized, optimized, and stored in multiple formats before it can be served. I built a media processing pipeline using Sharp for images and FFmpeg for video, orchestrated by BullMQ workers and stored in Cloudflare R2.
Image processing with Sharp
Every image that enters the system goes through a preset-based pipeline. Instead of ad-hoc resize calls scattered across the codebase, I defined a set of presets that map directly to how images are used in the app:
| Preset | Dimensions | Quality | Use case |
|---|---|---|---|
| avatar | 400×400 | 80 | Profile pictures |
| post | 1080×1350 | 80 | Feed posts |
| cover | 1920×1080 | 80 | Profile covers |
| product | 600×600 | 60 | Product listings |
| thumbnail | 300×300 | 60 | Preview thumbnails |
All output is WebP. The quality split is intentional — hero content (avatars, posts, covers) gets quality 80 for visual fidelity, while secondary content (products, thumbnails) gets quality 60 to keep payload sizes down.
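The preset table maps naturally to a single lookup object. A minimal sketch, with the object and key names assumed:

```javascript
// Hypothetical preset table mirroring the values above.
const PRESETS = {
  avatar:    { width: 400,  height: 400,  quality: 80 },
  post:      { width: 1080, height: 1350, quality: 80 },
  cover:     { width: 1920, height: 1080, quality: 80 },
  product:   { width: 600,  height: 600,  quality: 60 },
  thumbnail: { width: 300,  height: 300,  quality: 60 },
};
```

Keeping the presets in one place means a new image use case is a one-line addition rather than another scattered resize call.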
The Sharp pipeline for a single image looks like this:
```javascript
await sharp(buffer)
  .resize(preset.width, preset.height, { fit: 'cover', position: 'centre' })
  .webp({ quality: preset.quality })
  .toBuffer();
```
Every post upload generates two outputs: the full-size post image at 1080×1350 and a 300×300 thumbnail. The thumbnail is used in grids and previews where loading the full image would be wasteful.
Video transcoding with FFmpeg
Video is where things get more complex. Raw uploads can be anything — different codecs, resolutions, frame rates. The goal is to produce HLS (HTTP Live Streaming) output that works across all devices and adapts to the viewer’s bandwidth.
Each video gets transcoded into four renditions:
| Rendition | Resolution | Bitrate | Audio |
|---|---|---|---|
| 360p | 640×360 | 800k | 96k AAC |
| 480p | 854×480 | 1400k | 128k AAC |
| 720p | 1280×720 | 2800k | 128k AAC |
| 1080p | 1920×1080 | 5000k | 192k AAC |
The FFmpeg command generates all four renditions in a single pass, producing fragmented MP4 (fMP4) segments with 2-second durations. I chose fMP4 over traditional MPEG-TS because it supports better seeking and is the direction Apple has been pushing HLS.
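The per-rendition flags fall out of the rendition table rather than being hand-written. A sketch of how the argument list for one rendition might be assembled (the names and exact flag set here are my assumptions, not the production command, which multiplexes all renditions in one invocation):

```javascript
// Hypothetical rendition table and per-rendition FFmpeg args.
const RENDITIONS = [
  { name: '360p',  width: 640,  height: 360,  video: '800k',  audio: '96k'  },
  { name: '480p',  width: 854,  height: 480,  video: '1400k', audio: '128k' },
  { name: '720p',  width: 1280, height: 720,  video: '2800k', audio: '128k' },
  { name: '1080p', width: 1920, height: 1080, video: '5000k', audio: '192k' },
];

function renditionArgs(r) {
  return [
    '-vf', `scale=${r.width}:${r.height}`,
    '-c:v', 'libx264', '-b:v', r.video,
    '-c:a', 'aac', '-b:a', r.audio,
    '-hls_time', '2',             // 2-second segments
    '-hls_segment_type', 'fmp4',  // fragmented MP4 rather than MPEG-TS
    '-hls_playlist_type', 'vod',
    `${r.name}/playlist.m3u8`,
  ];
}
```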
The output structure for a single video:
```
video_id/
├── master.m3u8          # Master playlist (points to renditions)
├── 360p/
│   ├── playlist.m3u8    # Rendition playlist
│   └── segment_%03d.m4s # 2-second segments
├── 480p/
│   └── ...
├── 720p/
│   └── ...
└── 1080p/
    └── ...
```
The master playlist lists all renditions with their bandwidth and resolution metadata. The video player picks the appropriate rendition based on the viewer’s connection speed and switches between them seamlessly.
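Generating the master playlist is mostly string templating. A sketch under assumed names, using the video bitrate alone as the BANDWIDTH value (the HLS spec actually wants the peak total stream bitrate, audio included):

```javascript
// Hypothetical master-playlist generator for the four renditions.
const renditions = [
  { name: '360p',  width: 640,  height: 360,  bandwidth: 800000 },
  { name: '480p',  width: 854,  height: 480,  bandwidth: 1400000 },
  { name: '720p',  width: 1280, height: 720,  bandwidth: 2800000 },
  { name: '1080p', width: 1920, height: 1080, bandwidth: 5000000 },
];

function masterPlaylist() {
  const lines = ['#EXTM3U', '#EXT-X-VERSION:7']; // version 7: required for fMP4 segments
  for (const r of renditions) {
    lines.push(`#EXT-X-STREAM-INF:BANDWIDTH=${r.bandwidth},RESOLUTION=${r.width}x${r.height}`);
    lines.push(`${r.name}/playlist.m3u8`);
  }
  return lines.join('\n');
}
```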
Queue architecture
Media processing is CPU-intensive and unpredictable in duration. It can’t happen in the request path. BullMQ handles the orchestration.
There are separate queues for images and video, each with its own worker configuration:
Image queue — processes multiple jobs concurrently. Image resizing with Sharp is fast (sub-second for most presets), so parallelism is fine. Failed jobs retry 3 times with exponential backoff.
Video queue — concurrency locked to 1. Video transcoding is heavy — it saturates CPU and memory. Running multiple FFmpeg processes simultaneously would degrade quality for all of them. Each video job gets a 10-minute lock timeout to accommodate longer videos without the job being considered stale.
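In BullMQ terms, the two workers differ in just a couple of options. A sketch with assumed names and an assumed image concurrency value (`lockDuration` is BullMQ's stale-job lock, in milliseconds):

```javascript
// Hypothetical worker options; the actual job handlers are omitted.
const imageWorkerOpts = {
  concurrency: 4,       // several Sharp jobs in parallel (value assumed)
  lockDuration: 30000,  // images finish in well under 30 s
};

const videoWorkerOpts = {
  concurrency: 1,        // one FFmpeg process at a time
  lockDuration: 600000,  // 10-minute lock for long transcodes
};
```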
The retry strategy uses exponential backoff:
```javascript
{
  attempts: 3,
  backoff: {
    type: 'exponential',
    delay: 1000
  }
}
```
If a job fails three times, it moves to the failed set where I can inspect it manually. Most failures are transient — R2 upload timeouts, corrupted input frames — so the retries handle them.
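BullMQ's built-in exponential strategy computes the wait before retry n as `delay * 2^(n - 1)`:

```javascript
// BullMQ's exponential backoff: delay * 2^(retry - 1), in milliseconds.
function backoffDelay(baseDelayMs, retry) {
  return baseDelayMs * 2 ** (retry - 1);
}
```

With `attempts: 3` and a 1000 ms base delay, a persistently failing job is retried after 1 s and again after 2 s before landing in the failed set.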
Storage in Cloudflare R2
All processed media goes to Cloudflare R2. The key structure is deterministic:
```
media/{mediaId}/post.webp
media/{mediaId}/thumbnail.webp
media/{mediaId}/video/master.m3u8
media/{mediaId}/video/720p/playlist.m3u8
media/{mediaId}/video/720p/segment_001.m4s
```
R2 was chosen over S3 for zero egress fees. When a post goes viral and gets millions of views, the storage cost stays flat. The R2 bucket sits behind Cloudflare’s CDN, so content is cached at the edge and served from the nearest POP to the viewer.
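Deterministic keys mean paths can be derived on demand instead of stored per asset. A sketch of hypothetical key helpers matching the layout above:

```javascript
// Hypothetical R2 key builders for the deterministic layout.
const imageKey = (mediaId, preset) =>
  `media/${mediaId}/${preset}.webp`;

const videoKey = (mediaId, ...parts) =>
  `media/${mediaId}/video/${parts.join('/')}`;
```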
The upload flow end to end
- Client uploads the raw file to a presigned R2 URL
- The API receives a webhook confirming the upload and creates a `media` record with status `uploaded`
- A BullMQ job is enqueued for processing
- The worker downloads the raw file from R2 and processes it through the appropriate preset pipeline
- Processed outputs are uploaded back to R2
- The media record status moves to `processed`
- If the media is a post, a second job fires to generate embeddings (vision analysis → tag extraction → vector embedding)
The status field (uploaded → processing → processed) lets the frontend show appropriate loading states. Posts in processing state show a shimmer placeholder. Failed processing sets an error status that triggers a notification to the uploader.
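The status field is effectively a tiny state machine, and guarding transitions in one place keeps illegal states out of the database. A sketch with assumed names (the `error → processing` edge is my assumption for manual retries):

```javascript
// Hypothetical allowed transitions for the media status field.
const TRANSITIONS = {
  uploaded:   ['processing'],
  processing: ['processed', 'error'],
  processed:  [],
  error:      ['processing'], // assumed: a manual retry re-enters processing
};

const canTransition = (from, to) => TRANSITIONS[from].includes(to);
```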
What I’d do differently
The four-rendition HLS setup is overkill for the current scale. Most viewers are on mobile with decent connections — 720p and 1080p would cover 95% of playback. Dropping 360p and 480p would halve transcoding time and storage.
Sharp’s fit: 'cover' works well for square and portrait content but occasionally crops important details from landscape images. A smarter approach would use Sharp’s attention-based cropping or run a lightweight saliency detection pass before deciding the crop region.
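The change is small in Sharp's API: `position` accepts the string `'attention'`, which biases the crop toward regions of high saturation and luminance frequency. A sketch of the options diff only, not wired into the pipeline:

```javascript
// Current centre crop vs. a hypothetical attention-based crop for resize().
const currentCrop = { fit: 'cover', position: 'centre' };
const smartCrop   = { fit: 'cover', position: 'attention' };
```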
But the core architecture — preset-based processing, queue-driven workers, deterministic storage paths — has held up well. Adding a new image format means adding a preset. Adding a new video quality means adding a rendition config. The pipeline scales by adding workers, not by changing code.