
Integrate openai moderation endpoint #4

Open
lukesalamone wants to merge 1 commit into BaderBC:master from lukesalamone:moderation
Conversation

@lukesalamone

No description provided.

@BaderBC
Owner

BaderBC commented May 9, 2026

@coderabbitai

@coderabbitai

coderabbitai Bot commented May 9, 2026

Tip

For best results, initiate chat on the files or code changes.

@BaderBC Hey! It looks like your comment came through without a message. How can I help? Here are some things I can do:

  • Review the PR — trigger a full review of the changes.
  • Answer questions — ask me anything about the code or the OpenAI moderation integration.
  • Generate code — help implement or improve specific parts of the PR.
  • Create an issue — open a follow-up GitHub issue for anything out of scope.

Just let me know what you need! 😊

@BaderBC
Owner

BaderBC commented May 9, 2026

@coderabbitai review from a holistic perspective, taking into account all the ways we currently moderate articles and comments. I suspect we may now have too many moderation methods running at the same time.

@coderabbitai

coderabbitai Bot commented May 9, 2026

🧩 Analysis chain

Let me dig into the codebase to map out all moderation methods before giving you a holistic perspective.


🏁 Script executed:

#!/bin/bash
echo "=== Files with moderation-related code ==="
rg -rn --type=ts -l "moderat" .

echo ""
echo "=== All moderation-related lines ==="
rg -rn --type=ts -C3 "moderat" .

echo ""
echo "=== Article/comment validation or filtering logic ==="
rg -rn --type=ts -C3 "badword|profan|filter|sanitiz|censor|toxicit|flagg" . --ignore-case

echo ""
echo "=== OpenAI usage ==="
rg -rn --type=ts -C3 "openai" . --ignore-case

echo ""
echo "=== Package.json to see dependencies ==="
cat package.json 2>/dev/null || fd "package.json" --max-depth=2 --exec cat {}

Length of output: 44025


Let me also check the exact diff of the PR to separate what's new vs what pre-existed.


🏁 Script executed:

#!/bin/bash
echo "=== Git diff of the PR (comparing to main/master) ==="
git log --oneline -10

echo ""
echo "=== Changed files ==="
git diff HEAD~1 --name-only 2>/dev/null || git diff origin/main...HEAD --name-only 2>/dev/null

echo ""
echo "=== Full diff of moderation.ts ==="
git diff HEAD~1 -- src/worker/moderation.ts 2>/dev/null || git diff origin/main...HEAD -- src/worker/moderation.ts 2>/dev/null

echo ""
echo "=== Full diff of comments.ts ==="
git diff HEAD~1 -- src/worker/comments.ts 2>/dev/null || git diff origin/main...HEAD -- src/worker/comments.ts 2>/dev/null

echo ""
echo "=== Full diff of index.ts ==="
git diff HEAD~1 -- src/worker/index.ts 2>/dev/null || git diff origin/main...HEAD -- src/worker/index.ts 2>/dev/null

echo ""
echo "=== Full moderation.ts for complete picture ==="
cat src/worker/moderation.ts

echo ""
echo "=== Article moderation path in index.ts (article generation area) ==="
rg -n "openaiModerate|openAI|OPENAI|openai|banSlugNow|isPermanentlyBlocked|isSlugBanned|countRecentBans|enqueueArticle|strikeThreshold" src/worker/index.ts -A2 -B2

Length of output: 35881


@BaderBC Here's a holistic review of the full moderation stack across articles and comments, with this PR's additions in context.


Complete Moderation Stack Map

Articles (GET /api/page/:slug, uncached path)

Request arrives
    │
    ▼
① isPermanentlyBlockedSlug()       — static regex blocklist (blocklist.ts), no DB, no LLM
    │
    ▼
② isSlugBanned()                   — DB check: previously banned slugs can't regenerate
    │
    ▼
③ countRecentBansByIp()            — IP strike gate: ≥3 bans/24 h → 429 before LLM spend
    │
    ▼
④ openaiModerate(title) [NEW]      — generic omni-moderation-latest, synchronous, title only
    │ flagged? → banSlugNow(waitUntil) + 404
    ▼
⑤ LLM generation (OpenRouter)
    │
    ▼
⑥ enqueueArticleForModeration()   → periodic /api/moderate sweep
                                      (30-item batches, Halupedia-tuned custom LLM prompt)
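The article gate ordering in the diagram above can be sketched as a pure decision function over the results of the cheap checks. This is only an illustration: the `ArticleChecks` and `Decision` types, the field names, and `decideArticleRequest` are hypothetical and not taken from the repository, and a real handler would short-circuit so that later checks (in particular the OpenAI call) never run once an earlier gate rejects.

```typescript
// Results of the pre-generation checks. Field names are illustrative only.
interface ArticleChecks {
  permanentlyBlocked: boolean; // gate 1: static regex blocklist
  slugBanned: boolean;         // gate 2: slug already banned in the DB
  recentBansByIp: number;      // gate 3: bans from this IP in the last 24 h
  titleFlagged: boolean;       // gate 4: OpenAI moderation verdict on the title
}

type Decision =
  | { action: "reject"; gate: "blocklist" | "banned-slug" | "ip-strikes" | "openai-title" }
  | { action: "generate" };

// ">=3 bans/24 h" as stated in the stack map.
const STRIKE_THRESHOLD = 3;

// Gates are evaluated cheapest first; the first one that fires wins.
function decideArticleRequest(c: ArticleChecks): Decision {
  if (c.permanentlyBlocked) return { action: "reject", gate: "blocklist" };
  if (c.slugBanned) return { action: "reject", gate: "banned-slug" };
  if (c.recentBansByIp >= STRIKE_THRESHOLD) return { action: "reject", gate: "ip-strikes" };
  if (c.titleFlagged) return { action: "reject", gate: "openai-title" };
  // Nothing fired: proceed to LLM generation, then enqueue for the async sweep.
  return { action: "generate" };
}
```

Writing the order down this way makes it easy to see that an IP already at the strike threshold is rejected before the title is ever sent to OpenAI.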

Comments (POST /api/comments/:slug)

Comment submitted
    │
    ▼
① openaiModerate(body) [NEW]       — generic, synchronous, rejects immediately if flagged
    │
    ▼
② Comment saved, 200 returned
    │
    ▼
③ moderateCommentNow (waitUntil)   — isObviousCommentSpam() regex only; if clean, leaves as 'pending'
    │                                 (no LLM call here)
    ▼
④ Periodic /api/moderate sweep     — spam detector pre-filter, then custom LLM on remainder

Issues & Redundancies

🔴 Critical: Model mismatch creates false positives on legitimate satire

The openaiModerate call uses json?.results?.[0]?.flagged === true, the aggregate flag that ORs together every OpenAI moderation category (sexual, hate, harassment, self-harm, violence, etc.). Any single category hit blocks the request.

Meanwhile your custom MODERATION_SYSTEM_PROMPT explicitly allows:

"Vulgarity, profanity, dark humor, mockery of fictional characters"
"Edgy, offensive, in-poor-taste jokes that target ideas, not people"

A satirical title like "The Great Fornication Act of 1742" or a dark-humored comment would likely be flagged by omni-moderation-latest for sexual content — but your tuned LLM would correctly allow it. The two models are operating with directly contradictory standards. This risks silently blocking legitimate submissions.

Fix: Instead of reading the top-level flagged boolean, check only the narrow categories you actually care about:

export async function openaiModerate(text: string, apiKey: string): Promise<boolean> {
  try {
    const res = await fetch("https://api.openai.com/v1/moderations", {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
      body: JSON.stringify({ model: "omni-moderation-latest", input: text }),
    });
    if (!res.ok) return false;
    const json: any = await res.json();
    const cats = json?.results?.[0]?.categories;
    if (!cats) return false;
    // Only block on categories the custom LLM sweep also removes.
    // Do NOT block on: sexual, harassment (those are allowed for satire).
    return (
      cats["hate"] === true ||
      cats["hate/threatening"] === true ||
      cats["sexual/minors"] === true ||
      cats["violence/graphic"] === true ||
      cats["illicit/violent"] === true
    );
  } catch {
    return false;
  }
}
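To make the difference between the aggregate flag and the narrow filter concrete, here is a small sketch that isolates the category decision as a pure function and runs it on a made-up result. The `shouldBlock` helper and the `satireResult` object are hypothetical; the sample values are invented for illustration and are not real API output.

```typescript
// Categories that block a request; the list mirrors the suggested fix above.
const BLOCKING_CATEGORIES = [
  "hate",
  "hate/threatening",
  "sexual/minors",
  "violence/graphic",
  "illicit/violent",
] as const;

// Pure decision over the `categories` object of a single moderation result.
function shouldBlock(categories: Record<string, boolean> | undefined): boolean {
  if (!categories) return false; // fail open on a malformed response
  return BLOCKING_CATEGORIES.some((name) => categories[name] === true);
}

// Hypothetical result for a bawdy satirical title: only `sexual` fires.
const satireResult = {
  flagged: true,
  categories: { sexual: true, hate: false, "sexual/minors": false, "violence/graphic": false },
};

console.log(satireResult.flagged);                 // true: the aggregate flag would block
console.log(shouldBlock(satireResult.categories)); // false: the narrow filter lets it through
```

Keeping the decision separate from the fetch call also makes the category list unit-testable without an API key.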

🔴 banSlugNow doesn't store the IP — bans don't contribute to IP strikes

banSlugNow writes to article_moderation without created_ip. The countRecentBansByIp query filters on created_ip = ?, so OpenAI-triggered bans are invisible to the IP strike system. A bot can have all its article titles flagged by OpenAI and still never accumulate a strike.

Fix: Pass the IP through from the handler:

// moderation.ts
export async function banSlugNow(slug: string, env: ModerationEnv, ip?: string | null): Promise<void> {
  const now = Date.now();
  try { await env.ARTICLES.delete(slug); } catch {}
  try {
    await env.DB
      .prepare(
        `INSERT INTO article_moderation (slug, status, reason, enqueued_at, checked_at, created_ip)
         VALUES (?, 'banned', ?, ?, ?, ?)
         ON CONFLICT(slug) DO UPDATE SET
           status='banned', reason=excluded.reason,
           checked_at=excluded.checked_at,
           created_ip=COALESCE(created_ip, excluded.created_ip)`
      )
      .bind(slug, "openai-moderation-precheck", now, now, ip ?? null)
      .run();
  } catch (e) {
    console.error("banSlugNow: DB write failed", slug, e);
  }
}
// index.ts — in the openaiModerate block
c.executionCtx.waitUntil(
  banSlugNow(slug, c.env, ip).catch((e) =>   // pass `ip` here
    console.error("banSlugNow failed", e)
  )
);
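The effect of the missing IP can be reproduced with an in-memory stand-in for the table. This is a sketch under assumptions: `BanRow`, `countRecentBans`, and the choice of `checked_at` as the time-window column are illustrative, and the real countRecentBansByIp query may differ in everything except the `created_ip = ?` filter described above.

```typescript
// In-memory stand-in for rows of article_moderation; only the columns needed here.
interface BanRow {
  slug: string;
  status: string;
  created_ip: string | null;
  checked_at: number; // epoch milliseconds (assumed to be the window column)
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Mirrors a `created_ip = ?` filter: rows with a null IP can never match.
function countRecentBans(rows: BanRow[], ip: string, now: number): number {
  return rows.filter(
    (r) => r.status === "banned" && r.created_ip === ip && r.checked_at >= now - DAY_MS
  ).length;
}

const now = Date.now();
const ban = (slug: string, ip: string | null): BanRow => ({
  slug,
  status: "banned",
  created_ip: ip,
  checked_at: now,
});

const clientIp = "203.0.113.7"; // documentation-range address, used as a placeholder

// Three bans written without an IP, as the current banSlugNow does.
const withoutIp = [ban("a", null), ban("b", null), ban("c", null)];
// The same three bans with the IP recorded, as in the proposed fix.
const withIp = [ban("a", clientIp), ban("b", clientIp), ban("c", clientIp)];

console.log(countRecentBans(withoutIp, clientIp, now)); // 0: no strikes accumulate
console.log(countRecentBans(withIp, clientIp, now));    // 3: the strike gate would trigger
```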

🟡 Comments: OpenAI sync check vs. custom LLM sweep — genuine redundancy

For comments, both ① (OpenAI generic sync) and ④ (custom LLM sweep) evaluate the same content (the full body) against conflicting policies. If OPENAI_API_KEY is set:

  • A comment like "This article is absolute shit lmao" → OpenAI may flag harassment; custom LLM would clearly allow it.
  • A comment with engagement-bait spam fingerprint → isObviousCommentSpam catches it deterministically without any API call.

The original design (spam regex → LLM sweep) was already coherent and well-calibrated for the site. The OpenAI layer adds a faster sync gate but at the cost of over-blocking satire/vulgarity in comments. Consider applying the category-filtering fix above here too, or scope the comment check to only the narrowest categories.


🟡 Comment visibility gap remains

moderateCommentNow (waitUntil) only runs isObviousCommentSpam. Non-spam policy-violating comments are visible until the next /api/moderate sweep fires. The OpenAI sync check partially plugs this for the categories it covers, but with the mismatch problem above, you're trading accuracy for speed.


🟢 The overall layering logic is sound

The ordering is correct and efficient: static blocklist (free) → DB guard (cheap) → IP strike (cheap) → OpenAI title check (fast, pre-LLM-spend) → LLM generation → async sweep (amortized). The fail-open behavior on OpenAI errors (catch { return false }) is the right default. The deterministic spam detector before the LLM in sweep batches is a good cost-saving pattern.


Summary

You do have two overlapping content-policy systems running simultaneously: omni-moderation-latest (generic, strict by default) and your custom LLM prompt (context-aware, permissive by design). The fix isn't to remove either layer but to scope OpenAI to the narrow categories your custom prompt also removes (hate, CSAM, violent incitement) and ignore the broad ones (sexual, harassment) that you intentionally allow. Also pass the IP to banSlugNow so the strike system works end to end.

@LynnColeArt

I pushed a conflict-clean replacement for this moderation PR as #17 in case that helps unblock review. It includes the same moderation behavior from the original intent and resolves the merge conflict against current master by preserving both and config.

@HaplessIdiot

HaplessIdiot commented May 17, 2026

Really not a fan of censorship of the Halupedia. Why are we having people come in and make useless moderation PRs when the site is working as intended? Why not focus on FEATURES, not CENSORSHIP? A dark mode for the website would be a nice start; there are plenty of better things to do than this or the other moderation PRs clogging things up now. Discord mods don't need to be involved; their power-tripping will kill the vibes of the site quick. Also, it's easy to fork... don't force it. If all machine learning safety analysts are like this guy, I'm not shocked half of them got laid off; this is a total waste of time coding this. It's a fun site for fun, it's not serious. It's no wonder Grammarly barely works when they got this dork at the helm xD. At least the lead was able to salvage this junk into #17, but hopefully we don't go off the deep end with control of the platform.

4 participants