hosenur

Show Once and Never Again: Implementing a Bloom Filter

Minwin’s feed is a multi-phase pipeline — followed creators, vector-similar discovery, chronological fallback. Across all three phases, there’s one hard requirement: a user should never see the same post twice. Not in the same session, not across sessions, not when switching between phases.

The naive approach is a database table: user_id, post_id, seen_at. Query it before every feed response, insert after every impression. At scale, this table explodes — if 10K users each see 200 posts per day, that’s 2 million rows daily. Queries against it get slower, indexes get larger, and you’re burning database I/O on something that doesn’t need to be exact.

A bloom filter solves this perfectly.

What a bloom filter does

A bloom filter is a probabilistic data structure that answers one question: “Have I seen this before?” It can tell you one of two things: “definitely not” or “probably yes.”

It never produces false negatives. If you added an item, the bloom filter will always say yes. It occasionally says yes for items you never added (false positives), but you can control how often that happens.

The trade-off: a bloom filter uses far less memory than storing every seen item explicitly, and lookups are O(1) regardless of how many items have been added.

For a feed, this trade-off is ideal. A false positive means a user occasionally misses a post they haven’t seen — barely noticeable. A false negative would mean showing a duplicate — much worse. Bloom filters guarantee no false negatives.
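To make the mechanics concrete, here's a toy bloom filter in a few dozen lines of TypeScript. This is illustrative only, not how RedisBloom is actually implemented: k hash functions set k bits on add, and a lookup returns “probably yes” only if all k bits are set.

```typescript
// Toy bloom filter: a fixed-size bit array plus k seeded hash functions.
class TinyBloom {
  private bits: Uint8Array;

  constructor(private size: number, private k: number) {
    this.bits = new Uint8Array(size);
  }

  // Simple seeded FNV-1a-style string hash; illustrative only.
  private hash(item: string, seed: number): number {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < item.length; i++) {
      h ^= item.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return Math.abs(h) % this.size;
  }

  // Set the k bits for this item.
  add(item: string): void {
    for (let s = 0; s < this.k; s++) this.bits[this.hash(item, s)] = 1;
  }

  // If any of the k bits is unset, the item was definitely never added.
  mightContain(item: string): boolean {
    for (let s = 0; s < this.k; s++) {
      if (this.bits[this.hash(item, s)] === 0) return false; // definitely not seen
    }
    return true; // probably seen (could be a false positive)
  }
}
```

The no-false-negative guarantee falls out of the structure: `add` sets every bit that `mightContain` later checks, so an added item can never come back as unseen.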

Redis bloom filter setup

I’m using Redis with the RedisBloom module, which provides native bloom filter commands. Each user gets their own filter:

await redis.sendCommand([
  'BF.RESERVE',
  `seen:${userId}`,
  '0.01',    // 1% false positive rate
  '100000'   // capacity: 100K items
]);

The parameters: 0.01 is the target false positive rate (1%), and 100000 is the expected capacity. RedisBloom sizes the underlying bit array and picks the number of hash functions from these two values.

Each filter gets a 30-day TTL. After 30 days of inactivity, the filter expires and the user starts fresh. This keeps Redis memory bounded and aligns with the practical reality that content older than 30 days is rarely surfaced in the feed anyway.
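Creation and TTL refresh can be folded into one helper. This is a sketch, assuming a connected node-redis v4 client; the helper name is mine, and the structural type on `redis` is just enough to show which calls are used:

```typescript
// Hypothetical setup helper. BF.RESERVE errors if the key already exists,
// so that error is swallowed; EXPIRE then refreshes the sliding 30-day
// window on every call, so the filter only expires after true inactivity.
const THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60; // 2,592,000

async function ensureSeenFilter(
  redis: {
    sendCommand(args: string[]): Promise<unknown>;
    expire(key: string, seconds: number): Promise<unknown>;
  },
  userId: string
): Promise<void> {
  const key = `seen:${userId}`;
  try {
    await redis.sendCommand(['BF.RESERVE', key, '0.01', '100000']);
  } catch {
    // Filter already exists; nothing to create.
  }
  await redis.expire(key, THIRTY_DAYS_SECONDS);
}
```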

Checking before serving

Before returning any batch of posts to the client, the feed service checks the bloom filter in bulk:

const results = await redis.sendCommand([
  'BF.MEXISTS',
  `seen:${userId}`,
  ...postIds
]);

BF.MEXISTS checks multiple items in a single round trip. It returns an array of 0s and 1s — 0 means definitely not seen, 1 means probably seen. Any post that returns 1 gets filtered out of the response.

This is why the feed fetches 3× the requested limit from each phase. If the client asks for 20 posts, I query 60 from the database, run them through the bloom filter, and return the first 20 that pass. The 3× multiplier accounts for the filtering loss — in practice, the hit rate varies by user, but 3× has been a reliable buffer.
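The overfetch-and-filter step can be sketched as follows. The `fetchCandidates` function stands in for whatever phase query runs against the database, and the minimal client type is an assumption; only the `BF.MEXISTS` call comes from the setup above:

```typescript
// Pure helper: keep candidates whose BF.MEXISTS flag is 0 (definitely
// unseen), then cut the survivors down to the requested page size.
function dropSeen<T>(candidates: T[], seenFlags: number[], limit: number): T[] {
  return candidates.filter((_, i) => seenFlags[i] === 0).slice(0, limit);
}

// Hypothetical wiring, assuming a node-redis client and a phase query.
async function fetchUnseen(
  redis: { sendCommand(args: string[]): Promise<number[]> },
  fetchCandidates: (n: number) => Promise<{ id: string }[]>,
  userId: string,
  limit: number
): Promise<{ id: string }[]> {
  const candidates = await fetchCandidates(limit * 3); // overfetch 3x
  if (candidates.length === 0) return [];
  const flags = await redis.sendCommand([
    'BF.MEXISTS',
    `seen:${userId}`,
    ...candidates.map((p) => p.id),
  ]);
  return dropSeen(candidates, flags, limit);
}
```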

Marking as seen

After the feed response is sent to the client, the seen posts are marked in the bloom filter:

await redis.sendCommand([
  'BF.MADD',
  `seen:${userId}`,
  ...returnedPostIds
]);

BF.MADD adds multiple items in a single command. This call is fire-and-forget — it runs after the response is already on its way to the client. If it fails (Redis hiccup, network blip), the worst case is the user sees a post twice in a future request. That’s acceptable.

The fire-and-forget pattern matters for latency. The bloom filter write doesn’t block the response. The user gets their feed immediately, and the bookkeeping happens asynchronously.
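In code, fire-and-forget just means not awaiting the promise. A sketch, with a hypothetical helper name and a minimal client type; the `.catch` is the one piece you can't skip, since an unhandled rejection would crash a Node process:

```typescript
// Fire-and-forget bloom filter write: the promise is deliberately not
// awaited, so the feed response is never blocked on this bookkeeping.
function markSeen(
  redis: { sendCommand(args: string[]): Promise<unknown> },
  userId: string,
  postIds: string[]
): void {
  if (postIds.length === 0) return; // nothing to record
  redis
    .sendCommand(['BF.MADD', `seen:${userId}`, ...postIds])
    .catch((err: Error) => {
      // A failed write only risks one future duplicate; log and move on.
      console.error('bloom mark-seen failed', err);
    });
}
```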

How it fits into the feed pipeline

The bloom filter sits between the data fetch and the response in every feed phase:

Phase 1 — Followed creators: Fetch recent posts from followed users, sorted by recency. Filter through bloom. Interleave trending posts every 5 items (also bloom-checked).

Phase 2 — Discovery: pgvector cosine similarity query finds the closest content to the user’s embedding. Fetch 3× limit, bloom-filter, return the survivors.

Phase 3 — Chronological fallback: Newest posts globally, excluding followed creators (already covered in phase 1). Bloom-filter before returning.

The cursor-based pagination carries state across phases. When phase 1 is exhausted, the cursor seamlessly transitions the client into phase 2, then phase 3. The bloom filter is the one constant — every post, from every phase, gets checked and marked.
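The phase hand-off reduces to an ordered list. A sketch with illustrative names; the real cursor carries more state than just the phase:

```typescript
// The three feed phases, in the order the cursor walks them.
type Phase = 'followed' | 'discovery' | 'chronological';
const PHASES: Phase[] = ['followed', 'discovery', 'chronological'];

// When a phase runs dry, advance to the next one; null means the feed
// is fully exhausted.
function nextPhase(current: Phase): Phase | null {
  const i = PHASES.indexOf(current);
  return i < PHASES.length - 1 ? PHASES[i + 1] : null;
}
```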

Why not a database set or Redis set

A Redis set (SADD / SISMEMBER) would work and be simpler to reason about. But the memory cost is significant — storing 100K post IDs as strings in a Redis set takes roughly 6-8 MB per user. A bloom filter with the same capacity and 1% error rate uses about 120 KB. That’s a 50× reduction.

At 10K active users, that’s the difference between 60-80 GB and 1.2 GB of Redis memory. The bloom filter makes per-user deduplication feasible at scale without dedicated infrastructure.

A database table with a unique index on (user_id, post_id) would also work but adds query latency to every feed request and creates a write-heavy table that needs regular cleanup. The bloom filter keeps this entirely in memory with O(1) operations.
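The memory figures can be sanity-checked with the textbook bloom filter sizing formulas (these are the general formulas, not RedisBloom's exact internal accounting): bits m = -n·ln(p) / (ln 2)², optimal hash count k = (m/n)·ln 2.

```typescript
// Bits needed for n items at false positive rate p.
function bloomBits(n: number, p: number): number {
  return Math.ceil((-n * Math.log(p)) / Math.log(2) ** 2);
}

// Optimal number of hash functions for that sizing.
function optimalHashes(n: number, p: number): number {
  return Math.round((bloomBits(n, p) / n) * Math.log(2));
}

// 100K items at 1% error: ~959K bits, i.e. roughly 117 KB, with k = 7
// hashes. That lines up with the ~120 KB per-user figure above.
```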

The 1% trade-off in practice

A 1% false positive rate means roughly 1 in 100 posts gets incorrectly marked as “seen” and filtered out. For a user scrolling through hundreds of posts per session, they might miss 2-3 posts they haven’t actually seen. In a feed context, this is invisible — the user doesn’t know those posts exist, and the feed still has plenty of fresh content to show.

If the false positive rate were higher — say 10% — users would start noticing gaps, especially power users who exhaust the follow phase quickly and rely on discovery for fresh content. At 1%, the filter is practically invisible.

What I’d improve

The fixed 100K capacity is a guess. Heavy users who scroll for hours might approach that limit within the 30-day TTL, at which point the false positive rate degrades beyond the configured 1%. A scaling bloom filter (Redis supports BF.RESERVE with expansion) would handle this more gracefully — it adds sub-filters as capacity fills, maintaining the target error rate at the cost of slightly more memory.
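A scaling reservation only changes the creation command. A sketch of the argument list, wrapped in a hypothetical builder; the expansion factor of 2 is my choice here, meaning each sub-filter RedisBloom adds is twice the size of the previous one:

```typescript
// Builds BF.RESERVE arguments for a scaling filter. The EXPANSION option
// tells RedisBloom to add sub-filters as capacity fills instead of letting
// the false positive rate degrade past the configured target.
function scalingReserveArgs(userId: string, expansion = 2): string[] {
  return [
    'BF.RESERVE',
    `seen:${userId}`,
    '0.01',   // target error rate, held across sub-filters
    '100000', // initial capacity
    'EXPANSION',
    String(expansion),
  ];
}

// Usage: await redis.sendCommand(scalingReserveArgs(userId));
```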

The 30-day TTL is also static. Users who visit daily would benefit from a shorter TTL (their filter fills faster), while infrequent users could keep a longer one. Adaptive TTL based on activity patterns would optimize memory usage.

But for the current scale, the simple setup works. One filter per user, 100K capacity, 1% error rate, 30-day expiry. It handles deduplication across all three feed phases, adds zero perceptible latency, and costs almost nothing in memory. The post appears once, and never again.