I’m building a platform where creators share posts with affiliate-linked products. When someone purchases through those links, we earn a commission. Better recommendations mean more clicks, more purchases, and more revenue.
But recommendations are only as good as the data feeding them. To recommend well, we need to understand how users behave — what they see, what they tap, what they ignore. That requires a system that captures every impression: every post rendered on screen, every product card viewed, every affiliate link clicked.
Impression data arrives fast. A single feed scroll generates dozens of events — at a few hundred active sessions, that’s thousands of impressions per minute. The Instagrams of the world solve this with distributed streaming, ClickHouse clusters, and dedicated analytics infrastructure. We’re not at that scale, and we don’t need that complexity yet.
What we need is something that handles mid-scale ingestion reliably without overengineering — a pipeline that captures impressions, buffers the burst, and persists them for downstream use. In this article, I’ll walk through how I built one using Redis, BullMQ, and Postgres.
Event ingestion runs in two stages. On the client, impressions are buffered and sent in batches every 5 seconds — or immediately when the tab or app backgrounds — including the media ID, source channel (follow, discover, trending, chronological), position in the feed, and timestamp.
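The client-side buffer can be sketched roughly like this. The class and names are illustrative, not the app's actual code; in a real client you would wire `flush()` to a `setInterval` and to the `visibilitychange` event so batches also go out when the tab backgrounds:

```typescript
// Illustrative sketch of the client-side impression buffer.
type Impression = {
  mediaId: string;
  source: "follow" | "discover" | "trending" | "chronological";
  position: number; // rank in the feed at render time
  shownAt: string;  // ISO timestamp from the client clock
};

class ImpressionBuffer {
  private events: Impression[] = [];

  constructor(
    private sessionId: string,
    private send: (batch: { sessionId: string; events: Impression[] }) => void,
  ) {}

  record(e: Impression) {
    this.events.push(e);
  }

  // Called on a 5-second timer and on visibilitychange in a real client.
  flush() {
    if (this.events.length === 0) return;
    const batch = { sessionId: this.sessionId, events: this.events };
    this.events = []; // clear before sending so a retry doesn't double-buffer
    this.send(batch);
  }
}
```

Clearing the in-memory array before handing the batch to `send` keeps a slow or retried network call from re-flushing the same events.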
On the server, each event is deduplicated against (sessionId, mediaId, position) before being appended to a Redis Stream. Only fresh events make it through. A worker then drains that stream every 15 seconds — so an event waits at most 15 seconds — bulk-inserts into Postgres, and acknowledges the processed entries.
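The server-side dedup gate can be sketched as a first-writer-wins check in Redis. The `RedisLike` interface below is a hypothetical thin wrapper over a real client (ioredis, node-redis), injected so the logic stays testable; key and stream names are assumptions:

```typescript
// Hypothetical thin wrapper over a Redis client.
type RedisLike = {
  // SET key value EX ttl NX — resolves true only if the key was newly set
  setNx(key: string, ttlSeconds: number): Promise<boolean>;
  // XADD stream * field value ...
  xadd(stream: string, fields: Record<string, string>): Promise<void>;
};

// The dedup identity: one key per rendered slot.
const dedupKey = (sessionId: string, mediaId: string, position: number) =>
  `impr:${sessionId}:${mediaId}:${position}`;

async function ingestImpression(
  redis: RedisLike,
  sessionId: string,
  e: { mediaId: string; source: string; position: number; shownAt: string },
): Promise<boolean> {
  // First writer wins: SET NX succeeds only once per (session, media, position).
  const fresh = await redis.setNx(dedupKey(sessionId, e.mediaId, e.position), 3600);
  if (!fresh) return false; // duplicate rendered slot: drop before the stream

  await redis.xadd("impressions", {
    sessionId,
    mediaId: e.mediaId,
    source: e.source,
    position: String(e.position),
    shownAt: e.shownAt,
  });
  return true;
}
```

Because the NX check happens before the `XADD`, a client retry of the same batch never appends a second stream entry for the same slot.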
Reliability model: what happens when things fail
This pipeline is designed for at-least-once delivery. An impression can be processed more than once, but it should never be silently lost.
We handle duplicates early using a dedup key based on (sessionId, mediaId, position), so retries from the client don’t create duplicate downstream work. Fresh events are appended to the stream; duplicates are dropped.
On the consumer side, entries are acknowledged only after a successful database write. If a worker crashes mid-run, unacknowledged entries remain pending and are reclaimed by another run. This keeps ingestion durable without putting synchronous write pressure on Postgres — an impression is durable in Postgres within ~0–15s after server ingest.
That said, this is at-least-once, so duplicates are still possible if a worker crashes after writing but before acknowledging. Redis dedup minimizes them on the way in; for full control, the Postgres write needs to be idempotent via a unique key on (sessionId, mediaId, position).
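The idempotency guard on the Postgres side is a unique index plus `ON CONFLICT DO NOTHING`. Table and column names below are assumptions, not the project's actual schema:

```typescript
// Assumed schema names; the real table may differ.
// A unique index makes re-processing the same stream entry a no-op.
const CREATE_DEDUP_INDEX = `
  CREATE UNIQUE INDEX IF NOT EXISTS feed_impression_slot_uidx
    ON feed_impression (session_id, media_id, position)
`;

const INSERT_IMPRESSION = `
  INSERT INTO feed_impression (session_id, media_id, source, position, shown_at)
  VALUES ($1, $2, $3, $4, $5)
  ON CONFLICT (session_id, media_id, position) DO NOTHING
`;
```

With this in place, a worker that crashes after writing but before acknowledging simply re-inserts rows that conflict and are skipped.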
Event contract: what an impression looks like
Before building queues and retry logic, I locked down the event contract. At ingestion scale, consistency matters more than clever payloads. If every client sends the same shape, the backend can validate, deduplicate, and store predictably.
Each flush sends one session ID and a batch of impressions:
{
"sessionId": "11111111-2222-3333-4444-555555555555",
"events": [
{
"mediaId": "m_abc123",
"source": "discover",
"position": 7,
"shownAt": "2026-04-16T12:00:05.123Z"
}
]
}
Every field is intentional:
- sessionId — groups behavior for both logged-out and logged-in users under one model
- mediaId — the content that was shown
- source — where it came from (follow, discover, trending, chronological), critical for feed attribution
- position — rank in the feed at render time, used for analytics and dedup
- shownAt — client-side timestamp of when the impression actually happened
The dedup identity is (sessionId, mediaId, position). Client retries won’t create duplicate stream entries for the same rendered slot. Fresh events move forward; duplicates get dropped early.
The contract is intentionally small. No metadata blobs, no optional noise, no coupling to frontend internals. Just enough to answer: what was shown, where it was shown, and in what order.
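Because the contract is that small, validating it is a few lines. This is a hand-rolled sketch (the real service might use a schema library like zod instead); malformed events are dropped, not retried:

```typescript
// Sketch of contract validation; a schema library could replace this.
const SOURCES = new Set(["follow", "discover", "trending", "chronological"]);

function isValidImpression(e: any): boolean {
  return (
    typeof e?.mediaId === "string" &&
    SOURCES.has(e?.source) &&
    Number.isInteger(e?.position) && e.position >= 0 &&
    typeof e?.shownAt === "string" && !Number.isNaN(Date.parse(e.shownAt))
  );
}
```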
Flush mechanics: read, validate, write, ACK
The stream is only a buffer. Durability comes from the flush worker that runs every 15 seconds and drains events into Postgres in batches.
Each flush tick follows the same lifecycle:
- Read a batch from the Redis Stream consumer group.
- Validate required fields (sessionId, mediaId, source, position, shownAt) and drop malformed entries.
- Write valid rows to Postgres with a bulk insert.
- ACK the processed stream entries only after the database write succeeds.
- Delete acknowledged entries from the stream to reclaim memory.
That ordering is the core safety property. If you acknowledge before writing, a crash loses data. If you write before acknowledging and then crash, data may be written twice on retry — acceptable under at-least-once delivery, as long as writes are idempotent.
There is one more guardrail: reclaiming stale pending entries. If a worker dies mid-run, unacknowledged events stay in the pending list and are claimed by the next run, so the pipeline keeps moving without manual intervention.
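One flush tick can be sketched with its dependencies injected, so the write-then-ACK ordering is explicit. The function names stand in for the real Redis and Postgres calls noted in the comments; none of this is a specific library's API:

```typescript
// One flush tick, with clients injected (signatures assumed).
type Entry = { id: string; fields: Record<string, string> };

async function flushTick(
  readBatch: () => Promise<Entry[]>,                      // XREADGROUP on the consumer group
  writeRows: (rows: Entry["fields"][]) => Promise<void>,  // bulk INSERT ... ON CONFLICT DO NOTHING
  ack: (ids: string[]) => Promise<void>,                  // XACK
  del: (ids: string[]) => Promise<void>,                  // XDEL
): Promise<number> {
  const entries = await readBatch();
  if (entries.length === 0) return 0;

  const required = ["sessionId", "mediaId", "source", "position", "shownAt"];
  const valid = entries.filter((e) => required.every((f) => f in e.fields));

  // The safety property: the database write happens first.
  if (valid.length > 0) await writeRows(valid.map((e) => e.fields));

  // ACK everything we read: valid rows are now durable in Postgres, and
  // malformed entries are dropped rather than retried forever.
  const ids = entries.map((e) => e.id);
  await ack(ids);
  await del(ids); // reclaim stream memory for acknowledged entries
  return valid.length;
}
```

If `writeRows` throws, neither `ack` nor `del` runs, so the entries stay pending and the next run (or a reclaim of stale pending entries) picks them up.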
Engagement ingestion today: where each metric lands
At this point, it’s important to be explicit: engagement data does not flow through a single path. It enters the system through three complementary pipelines.
- Feed telemetry path — feed cards emit engagement events through trackEngagement, which are queued and written to feed_engagement. This path carries attribution context (sessionId, source, position, occurredAt) and is used for feed-quality analysis.
- Transactional social path (source of truth) — likes, saves, and comments are also written directly via the social API into transactional tables (like, saved_post, comment). These writes power product behavior immediately (counts, state, notifications) and trigger embedding updates.
- Commerce click path — affiliate redirects (/api/go/:productId) are deduplicated and queued into product_click, which captures purchase-intent behavior for analytics and ranking.
This split is intentional: telemetry for attribution, transactional writes for correctness, and click tracking for commerce outcomes.
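The commerce click path can be sketched the same way as the impression gate: dedupe, queue, redirect. The route shape comes from the article; the helper names and the per-session dedup window are hypothetical:

```typescript
// Sketch of the /api/go/:productId handler; helper names are hypothetical.
async function handleAffiliateRedirect(
  sessionId: string,
  productId: string,
  deps: {
    seenRecently: (key: string) => Promise<boolean>; // e.g. backed by Redis SET NX
    enqueueClick: (c: { sessionId: string; productId: string; clickedAt: string }) => Promise<void>;
    resolveAffiliateUrl: (productId: string) => Promise<string>;
  },
): Promise<string> {
  const key = `click:${sessionId}:${productId}`;
  if (!(await deps.seenRecently(key))) {
    // Only the first click in the window reaches the product_click queue.
    await deps.enqueueClick({ sessionId, productId, clickedAt: new Date().toISOString() });
  }
  // Tracking is queued, never awaited on the hot path beyond this point,
  // so it cannot block the redirect the user is waiting for.
  return deps.resolveAffiliateUrl(productId);
}
```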
How this pipeline feeds recommendations
Impressions are exposure truth. They answer: what was shown, where it was shown, and in what order. By themselves, they don’t prove preference.
For feed analytics, exposure (feed_impression) is paired with attributed engagement (feed_engagement) to compute source-level engagement rate, position bias, and session-level conversion patterns.
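As a sketch, the source-level pairing is a join of exposure against attributed engagement. Table and column names here are assumptions about the schema, not the actual queries:

```typescript
// Sketch query: engagement rate per feed source (schema names assumed).
// Distinct impression rows are the denominator (exposure); distinct
// attributed engagement rows are the numerator.
const ENGAGEMENT_RATE_BY_SOURCE = `
  SELECT i.source,
         COUNT(DISTINCT e.id)::float / NULLIF(COUNT(DISTINCT i.id), 0) AS engagement_rate
  FROM feed_impression i
  LEFT JOIN feed_engagement e
    ON  e.session_id = i.session_id
    AND e.media_id   = i.media_id
  GROUP BY i.source
`;
```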
For ranking, the strongest signals currently come from transactional interaction data (like, saved_post, comment) plus commerce-intent data (product_click). Those signals drive embedding and trending computations, which then influence retrieval and reranking.
So today the system is deliberately split:
- feed_impression + feed_engagement -> attribution and experiment measurement
- like + saved_post + comment + product_click -> primary recommendation features
That split keeps ingestion fast and observable while preserving stable, product-critical ranking signals.