Steffen Rörtgen 625dbe264b
fix(worker): backtick-wrap event_coord in FilterDelete
Real AMB d-tags are URLs; "30142:pub:https://example.org/?a=1&b=2"
would break Typesense's filter_by parser with "Could not parse the
filter query" (HTTP 400), losing chunks for re-published events. The
takedown path in admin.go already used backticks; align the worker
path and add a regression test using a real Edu-Sharing-style d-tag
seen during the oersi-dev backfill on 2026-05-15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-05 14:04:33 +02:00
.forgejo/workflows ci: drop registry buildcache export (returns 400 on Forgejo) 2026-05-06 17:20:43 +02:00
certs fix(indexer): trust GEANT TLS RSA 1 intermediate for rub.de fetches 2026-05-06 10:22:42 +02:00
cmd fix: address three bugs surfaced by the prod-mirror exercise 2026-05-12 10:13:13 +02:00
testdata/fixtures test: e2e integration covering HTML, PDF, and kind-30023 paths 2026-04-23 11:29:42 +02:00
.env.example Document Plan 2b HTTP surface, env vars, and refetch protocol 2026-04-23 14:16:27 +02:00
.env.indexer.example fix: address three bugs surfaced by the prod-mirror exercise 2026-05-12 10:13:13 +02:00
.gitignore chore(gitignore): ignore local Claude Code overrides 2026-05-12 14:48:28 +02:00
admin.go fix: address three bugs surfaced by the dev.oersi headless deploy 2026-05-12 16:39:17 +02:00
admin_test.go fix: address three bugs surfaced by the dev.oersi headless deploy 2026-05-12 16:39:17 +02:00
ambcache.go Read license:id with fallback, normalize CC URL license forms 2026-04-24 00:22:21 +02:00
ambcache_test.go Read license:id with fallback, normalize CC URL license forms 2026-04-24 00:22:21 +02:00
chunker.go fix: take overlap tails from pre-overlap chunks 2026-04-23 07:16:09 +02:00
chunker_test.go chore: final review fixups 2026-04-23 11:31:38 +02:00
CLAUDE.md docs(embed): point default EMBED_ENDPOINT at in-stack service 2026-05-06 09:45:37 +02:00
config.go fix: address three bugs surfaced by the dev.oersi headless deploy 2026-05-12 16:39:17 +02:00
config_test.go feat(fetch): three-tier headless render fallback for thin/failed fetches 2026-05-12 14:48:57 +02:00
Dockerfile fix(indexer): trust GEANT TLS RSA 1 intermediate for rub.de fetches 2026-05-06 10:22:42 +02:00
e2e_test.sh Add /admin/dead_letter/replay_all with reason_substr filter 2026-04-23 21:39:44 +02:00
embed.go Send embedding requests as {"texts": [...]} per endpoint contract 2026-04-23 17:07:31 +02:00
embed_test.go Send embedding requests as {"texts": [...]} per endpoint contract 2026-04-23 17:07:31 +02:00
fetch.go feat(fetch): three-tier headless render fallback for thin/failed fetches 2026-05-12 14:48:57 +02:00
fetch_headless.go feat(fetch): three-tier headless render fallback for thin/failed fetches 2026-05-12 14:48:57 +02:00
fetch_headless_test.go feat(fetch): three-tier headless render fallback for thin/failed fetches 2026-05-12 14:48:57 +02:00
fetch_html.go fix: drop heading offsets past truncation cutoff 2026-04-22 19:11:10 +02:00
fetch_nostr.go chore: final review fixups 2026-04-23 11:31:38 +02:00
fetch_nostr_test.go feat: kind-30023 fetcher via nostr REQ 2026-04-22 20:06:19 +02:00
fetch_pdf.go fix: honor TikaClient.Timeout via context deadline 2026-04-22 19:17:08 +02:00
fetch_test.go cleanup: drop redundant regex scan and use json.Marshal in nostr test helper 2026-04-22 20:09:47 +02:00
go.mod feat(fetch): three-tier headless render fallback for thin/failed fetches 2026-05-12 14:48:57 +02:00
go.sum feat(mcp): wire stdio main with flag-driven config 2026-05-05 20:29:59 +02:00
HEADLESS_FETCH_SPEC.md feat(fetch): three-tier headless render fallback for thin/failed fetches 2026-05-12 14:48:57 +02:00
http_server.go Add GET /ui comparison page 2026-04-24 00:31:06 +02:00
http_server_test.go Add GET /ui comparison page 2026-04-24 00:31:06 +02:00
license.go Read license:id with fallback, normalize CC URL license forms 2026-04-24 00:22:21 +02:00
license_test.go Read license:id with fallback, normalize CC URL license forms 2026-04-24 00:22:21 +02:00
main.go fix: address three bugs surfaced by the dev.oersi headless deploy 2026-05-12 16:39:17 +02:00
metrics.go feat(fetch): three-tier headless render fallback for thin/failed fetches 2026-05-12 14:48:57 +02:00
metrics_test.go feat(fetch): three-tier headless render fallback for thin/failed fetches 2026-05-12 14:48:57 +02:00
nostr_source.go fix(backfill): terminate only on empty page, not short pages 2026-05-15 12:11:47 +02:00
nostr_source_test.go fix(backfill): terminate only on empty page, not short pages 2026-05-15 12:11:47 +02:00
politeness.go feat: per-host concurrency and min-interval limiter 2026-04-22 18:15:30 +02:00
politeness_test.go chore: final review fixups 2026-04-23 11:31:38 +02:00
readiness.go Wire HTTP server, refetch poller, and readiness probe into main.go 2026-04-23 14:09:22 +02:00
readiness_test.go Wire HTTP server, refetch poller, and readiness probe into main.go 2026-04-23 14:09:22 +02:00
README.md fix: address three bugs surfaced by the prod-mirror exercise 2026-05-12 10:13:13 +02:00
refetch.go Wire HTTP server, refetch poller, and readiness probe into main.go 2026-04-23 14:09:22 +02:00
refetch_test.go Wire HTTP server, refetch poller, and readiness probe into main.go 2026-04-23 14:09:22 +02:00
relay_client.go fix: address three bugs surfaced by the prod-mirror exercise 2026-05-12 10:13:13 +02:00
relay_client_test.go fix: address three bugs surfaced by the prod-mirror exercise 2026-05-12 10:13:13 +02:00
schema.go feat: thin Typesense client and chunks schema 2026-04-23 07:26:54 +02:00
schema_test.go feat: thin Typesense client and chunks schema 2026-04-23 07:26:54 +02:00
search.go Normalize /search_chunks scores to 0..1 2026-04-23 21:34:47 +02:00
search_test.go Normalize /search_chunks scores to 0..1 2026-04-23 21:34:47 +02:00
ssrf.go feat: SSRF guard with CIDR, host, TLD deny lists 2026-04-22 18:03:25 +02:00
ssrf_test.go feat: SSRF guard with CIDR, host, TLD deny lists 2026-04-22 18:03:25 +02:00
store.go fix: address three bugs surfaced by the dev.oersi headless deploy 2026-05-12 16:39:17 +02:00
store_test.go fix: address three bugs surfaced by the dev.oersi headless deploy 2026-05-12 16:39:17 +02:00
tsclient.go Add TSClient.Search for hybrid keyword+vector queries 2026-04-23 13:44:28 +02:00
tsclient_test.go Add TSClient.Search for hybrid keyword+vector queries 2026-04-23 13:44:28 +02:00
ui.go Add GET /ui comparison page 2026-04-24 00:31:06 +02:00
ui.html Add GET /ui comparison page 2026-04-24 00:31:06 +02:00
ui_test.go Add GET /ui comparison page 2026-04-24 00:31:06 +02:00
worker.go fix(worker): backtick-wrap event_coord in FilterDelete 2026-06-05 14:04:33 +02:00
worker_headless.go feat(fetch): three-tier headless render fallback for thin/failed fetches 2026-05-12 14:48:57 +02:00
worker_headless_test.go feat(fetch): three-tier headless render fallback for thin/failed fetches 2026-05-12 14:48:57 +02:00
worker_test.go fix(worker): backtick-wrap event_coord in FilterDelete 2026-06-05 14:04:33 +02:00

amb-indexer

What is this

amb-indexer is a standalone Go service that consumes kind-30142 AMB (Learning Resource Metadata) events from an amb-relay, fetches the referenced resource (HTML, PDF, or kind-30023 long-form article), extracts normalized text with structural metadata, chunks and embeds it, writes chunks to a Typesense collection, and writes fulltext back to the relay via NIP-86 setcontent.

It also serves POST /search_chunks (hybrid BM25+vector search with AMB metadata enrichment) and a small admin surface for dead-letter replay, force-refetch, and takedown.

How this fits with amb-relay

amb-indexer is a paired service — it requires a running amb-relay on the other end. The relay accepts 30142 events; the indexer subscribes to them, fetches/extracts/chunks/embeds the referenced resources, and writes fulltext back via NIP-86 setcontent. Chunks go to a separate Typesense collection (amb_chunks_30142) which the indexer also serves over HTTP at /search_chunks.

For the full system diagram and operator guide see amb-relay/docs/architecture.md. For the relay's NIP-86 surface (including the setcontent / refetchcontent / listrefetch / acknowledgerefetch methods used by this service) see amb-relay/README.md.

Run locally

cp .env.example .env
# edit .env — at minimum set INDEXER_NSEC
go run .

Tests:

go test ./...

Configuration

All configuration is via environment variables. See .env.example for the full list with defaults. The required variables are:

  • RELAY_URL — WebSocket URL of the amb-relay
  • RELAY_NIP86_URL — HTTP base URL for NIP-86 calls against the relay
  • INDEXER_NSEC — nsec the indexer signs NIP-98 and setcontent events with
  • TS_HOST, TS_APIKEY — Typesense endpoint and API key
  • EMBED_ENDPOINT — HTTP endpoint of the embedding service
  • TIKA_URL — base URL of the Apache Tika server (for PDFs)

Authorization: The pubkey derived from INDEXER_NSEC must be present in the relay's ADMIN_PUBKEYS (or be the relay's own PUBKEY). Without this the relay rejects setcontent calls and no fulltext lands. See amb-relay's ADMIN_PUBKEYS.
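
A startup check for the required variables could look roughly like the sketch below. It is not taken from config.go; the helper name missingVars and the injected lookup function are assumptions made for illustration, and only the variable names come from the list above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// requiredVars mirrors the variables this README lists as required.
var requiredVars = []string{
	"RELAY_URL", "RELAY_NIP86_URL", "INDEXER_NSEC",
	"TS_HOST", "TS_APIKEY", "EMBED_ENDPOINT", "TIKA_URL",
}

// missingVars (hypothetical helper) returns every required name whose
// value is unset or blank. The lookup function is injected so the check
// can be exercised without touching the process environment.
func missingVars(lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range requiredVars {
		if v, ok := lookup(name); !ok || strings.TrimSpace(v) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingVars(os.LookupEnv); len(missing) > 0 {
		// A real service would most likely refuse to start here.
		fmt.Fprintf(os.Stderr, "missing required env vars: %s\n", strings.Join(missing, ", "))
		return
	}
	fmt.Println("configuration looks complete")
}
```

Reporting all missing names at once, rather than failing on the first, saves a restart cycle per variable when filling in a fresh .env.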

Run in docker compose

The amb-relay repo's docker-compose.yml wires amb-indexer alongside the relay, Typesense, and Tika. From the amb-relay directory:

cp ../amb-indexer/.env.indexer.example ../amb-indexer/.env.indexer
# edit .env.indexer — set INDEXER_NSEC at minimum
docker compose up --build

Deployment via published image

Forgejo CI publishes images to git.edufeed.org/edufeed/amb-indexer on push to main and on v* tags. Tag scheme:

  • main — latest commit on main
  • vX.Y.Z, vX.Y — semver releases
  • <short-sha> — pinned by commit

To consume the image instead of building from source, edit the amb-indexer service in amb-relay/docker-compose.yml:

- build: ../amb-indexer
+ image: git.edufeed.org/edufeed/amb-indexer:main

You can then run amb-relay's stack without cloning amb-indexer at all.

Where the data lives: the indexer_data named volume holds /data/indexer.db (BoltDB: cursor, event-hash, dead-letter tables).

How to check progress: run docker compose logs -f amb-indexer to follow the processing log, or query Typesense directly to see chunks:

curl -H "X-TYPESENSE-API-KEY: $TS_APIKEY" \
  "$TS_HOST/collections/amb_chunks_30142/documents/search?q=*&per_page=1"

Refetch lifecycle

Operators can mark events for re-ingestion via the relay's NIP-86 refetchcontent method. The relay records the event id in a BoltDB bucket; this indexer polls listrefetch every REFETCH_POLL_INTERVAL seconds, re-fetches the events it gets back, and calls acknowledgerefetch to clear the ids it accepted. The queue survives indexer downtime — restart the indexer and pending refetches resume on the next poll.
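
One poll cycle of this lifecycle can be sketched as follows. The RefetchQueue interface, pollOnce, pollLoop, and the in-memory memQueue are hypothetical names introduced for illustration and are not the types in refetch.go. The sketch also assumes that "accepted" means "re-ingested without error", so failed ids stay queued for the next poll.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// RefetchQueue is a hypothetical view of the relay's listrefetch and
// acknowledgerefetch NIP-86 methods.
type RefetchQueue interface {
	ListRefetch(ctx context.Context) ([]string, error)
	AcknowledgeRefetch(ctx context.Context, ids []string) error
}

// pollOnce lists pending ids, re-ingests each one, and acknowledges only
// the ids that succeeded. Failures remain queued on the relay.
func pollOnce(ctx context.Context, q RefetchQueue, reingest func(context.Context, string) error) (int, error) {
	ids, err := q.ListRefetch(ctx)
	if err != nil {
		return 0, err
	}
	var accepted []string
	for _, id := range ids {
		if err := reingest(ctx, id); err != nil {
			continue // not acknowledged, so the next poll retries it
		}
		accepted = append(accepted, id)
	}
	if len(accepted) == 0 {
		return 0, nil
	}
	if err := q.AcknowledgeRefetch(ctx, accepted); err != nil {
		return 0, err
	}
	return len(accepted), nil
}

// pollLoop runs pollOnce on a fixed interval until ctx is cancelled.
func pollLoop(ctx context.Context, q RefetchQueue, every time.Duration, reingest func(context.Context, string) error) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if _, err := pollOnce(ctx, q, reingest); err != nil {
				fmt.Println("refetch poll failed:", err)
			}
		}
	}
}

// memQueue is an in-memory stand-in for the relay, used only by the demo.
type memQueue struct{ pending []string }

func (m *memQueue) ListRefetch(context.Context) ([]string, error) {
	return append([]string(nil), m.pending...), nil
}

func (m *memQueue) AcknowledgeRefetch(_ context.Context, ids []string) error {
	done := make(map[string]bool, len(ids))
	for _, id := range ids {
		done[id] = true
	}
	kept := m.pending[:0]
	for _, id := range m.pending {
		if !done[id] {
			kept = append(kept, id)
		}
	}
	m.pending = kept
	return nil
}

// runDemo queues three ids and fails the re-ingest of "b".
func runDemo() (acked int, remaining []string) {
	q := &memQueue{pending: []string{"a", "b", "c"}}
	acked, _ = pollOnce(context.Background(), q, func(_ context.Context, id string) error {
		if id == "b" {
			return fmt.Errorf("fetch failed for %s", id)
		}
		return nil
	})
	return acked, q.pending
}

func main() {
	acked, remaining := runDemo()
	fmt.Println(acked, remaining) // prints: 2 [b]
}
```

Because the queue state lives on the relay and ids are only cleared after acknowledgement, a crash between re-ingest and acknowledge costs at most one repeated fetch.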

Mirror events from another relay

cmd/mirror-prod streams kind-30142 events from a source relay (default wss://amb-relay.edufeed.org) and republishes them verbatim to a destination relay. It paginates REQs by cursor (until = oldest_seen) and dedupes by id across pages, so it works against relays that cap per-REQ replies (prod khatru caps at 250).

Use it to stage prod corpus into a local stack for side-by-side search comparison, or to backfill a fresh relay before pointing the indexer at it.
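
The cursor pagination described above can be sketched as a small loop. This is an illustration under stated assumptions, not the code in cmd/mirror-prod: the Event type, fetchPage, paginate, and fakeRelay are hypothetical names, and the sketch assumes the until bound is inclusive, so each page overlaps the previous one and the id set absorbs the overlap. It stops on an empty page or on a page that contains nothing new.

```go
package main

import (
	"fmt"
	"math"
)

// Event carries only the fields the pagination loop needs.
type Event struct {
	ID        string
	CreatedAt int64
}

// fetchPage is a hypothetical stand-in for one REQ: it returns up to
// limit events with CreatedAt <= until, newest first.
type fetchPage func(until int64, limit int) []Event

// paginate walks backwards in time with until = oldest timestamp seen,
// deduping by id across pages.
func paginate(fetch fetchPage, pageSize, maxPages int) []Event {
	seen := make(map[string]bool)
	var out []Event
	until := int64(math.MaxInt64)
	for page := 0; page < maxPages; page++ {
		events := fetch(until, pageSize)
		if len(events) == 0 {
			break // empty page: end of stream
		}
		fresh := 0
		for _, ev := range events {
			if ev.CreatedAt < until {
				until = ev.CreatedAt
			}
			if seen[ev.ID] {
				continue
			}
			seen[ev.ID] = true
			out = append(out, ev)
			fresh++
		}
		if fresh == 0 {
			break // only duplicates: the cursor cannot advance further
		}
	}
	return out
}

// fakeRelay serves events (already sorted newest first) and silently
// caps every reply at serverCap, like a relay with a per-REQ limit.
func fakeRelay(sorted []Event, serverCap int) fetchPage {
	return func(until int64, limit int) []Event {
		if limit > serverCap {
			limit = serverCap
		}
		var page []Event
		for _, ev := range sorted {
			if len(page) == limit {
				break
			}
			if ev.CreatedAt <= until {
				page = append(page, ev)
			}
		}
		return page
	}
}

func main() {
	src := []Event{{"e5", 5}, {"e4", 4}, {"e3", 3}, {"e2", 2}, {"e1", 1}}
	got := paginate(fakeRelay(src, 2), 500, 200)
	fmt.Println(len(got)) // prints: 5
}
```

A short page is deliberately not treated as the end of the stream here, since a capped relay returns short pages long before the corpus is exhausted.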

cd cmd/mirror-prod
go run . -src wss://amb-relay.edufeed.org -dst ws://localhost:3334

Flags:

Flag                     Default                       Notes
-src                     wss://amb-relay.edufeed.org   source relay ws URL
-dst                     ws://localhost:3334           destination relay ws URL
-since                   0                             unix timestamp lower bound (0 = full corpus)
-page-size               500                           per-REQ event cap (source may cap lower)
-max-pages               200                           safety cap on pagination loop
-max-events              0                             overall cap (0 = unlimited)
-dry-run                 false                         subscribe and count, do not publish
-state-file              (empty)                       path to persistent resume checkpoint; deleted on clean exit
-empty-page-retries      3                             retries before treating an empty page as end-of-stream
-empty-page-retry-delay  2s                            sleep between empty-page retries

Common recipes:

# Full mirror, prod → local
go run ./cmd/mirror-prod

# Incremental top-up: only events newer than a known timestamp
go run ./cmd/mirror-prod -since 1798000000

# Probe without writing anything
go run ./cmd/mirror-prod -dry-run

# Local → local (e.g. staging another dev environment)
go run ./cmd/mirror-prod \
    -src ws://localhost:3334 \
    -dst ws://localhost:3335

# Resumable mirror — survives Ctrl-C, websocket drops, transient empties.
# Re-running with the same -state-file picks up where the previous run
# left off; the file is removed on clean completion.
go run ./cmd/mirror-prod -state-file /tmp/mirror.json

Re-runs are idempotent: relays dedupe by event id, and the indexer dedupes by event_id → sha256(url).

What does not travel with the events:

  • No re-signing. Events are republished verbatim, so their original signatures still verify on the destination. If the source author's pubkey is banned on the destination, those events will be rejected.
  • No content / chunks / embeddings. Only the kind-30142 events themselves move. The indexer attached to the destination relay must re-fetch each resource and rebuild content + chunks. Plan for a full re-ingest run after a large mirror.
  • No kind-5 deletions (-kinds defaults to [30142], so only kind-30142 events are requested). If the source has soft-deletions you want to honour, mirror those separately or use NIP-77 (Negentropy) instead.
  • No NIP-86 admin state (bans, allowlists, schema, semantic config). These live in BoltDB on each relay and are intentionally per-instance.

Alternatives

  • NIP-77 Negentropy. Subscribe with negentropy: true and let the protocol reconcile event sets. Cheaper than mirror-prod for small diffs. ⚠️ wss://amb-relay.edufeed.org currently caps Negentropy sessions at 250 events, so it cannot pull the full corpus this way — use mirror-prod for full backfills.
  • NIP-86 reindex. Local-only — drops the destination's Typesense collection and rebuilds from its own BoltDB. Useful after a mirror to ensure the search index reflects the freshly-arrived events, but does not move events between relays.
  • Raw BoltDB copy. If both relays are under your control and identical, copying relay.db while both are stopped is the fastest full clone — but skips the relay's accept-side validation, so use only between trusted instances.

HTTP API

When INDEXER_LISTEN is set (default :8080), the indexer exposes a small HTTP surface. Two bearer tokens gate access: INDEXER_API_TOKEN for search, INDEXER_ADMIN_TOKEN for admin. /health, /ready, and /metrics are unauthenticated.
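
The two-token gating could be wired along the lines of the sketch below. It is an assumption about shape, not the code in http_server.go: requireBearer, newMux, and statusFor are hypothetical names, the handlers are stubs, and the sketch assumes the admin token does not also unlock search.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireBearer runs next only when the request carries
// "Authorization: Bearer <token>". The comparison is constant-time, and
// an empty configured token rejects every request.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !ok || token == "" || subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newMux leaves /health open, gates /search_chunks with the API token,
// and gates everything under /admin/ with the admin token.
func newMux(apiToken, adminToken string) *http.ServeMux {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux := http.NewServeMux()
	mux.Handle("/health", ok)
	mux.Handle("/search_chunks", requireBearer(apiToken, ok))
	mux.Handle("/admin/", requireBearer(adminToken, ok))
	return mux
}

// statusFor sends one in-process request and returns the status code.
func statusFor(h http.Handler, path, token string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mux := newMux("api-secret", "admin-secret")
	fmt.Println(statusFor(mux, "/health", ""))                        // 200
	fmt.Println(statusFor(mux, "/search_chunks", ""))                 // 401
	fmt.Println(statusFor(mux, "/search_chunks", "api-secret"))       // 200
	fmt.Println(statusFor(mux, "/admin/dead_letter", "api-secret"))   // 401
	fmt.Println(statusFor(mux, "/admin/dead_letter", "admin-secret")) // 200
}
```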

curl -sS -H "Authorization: Bearer $INDEXER_API_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"q":"photosynthesis","k":5}' \
     http://localhost:8080/search_chunks | jq .

Response shape (one hit, trimmed):

{
  "total": 22,
  "hits": [
    {
      "chunk_id":    "054d419f...:96",
      "event_id":    "054d419f...",
      "event_coord": "30142:4686...:plan2b-test-html-1776955403",
      "chunk_idx":   96,
      "text":        "...full chunk text...",
      "snippet":     "...leading sentence...",
      "heading":     "References",
      "section_path":["Photosynthesis","References"],
      "source_url":  "https://en.wikipedia.org/wiki/Photosynthesis",
      "score":       1.0,
      "amb": {
        "name": "Photosynthesis (Wikipedia)",
        "description": "...",
        "license": "CC-BY-SA",
        "learning_resource_type": ["https://w3id.org/kim/hcrt/worksheet"]
      }
    }
  ]
}

score is normalized to [0, 1] (hybrid rank fusion when available, else vector-cosine fallback, else relative text_match). Non-permissive chunks have text elided but retain snippet so callers can still render a preview.

Filter by AMB metadata:

curl -sS -H "Authorization: Bearer $INDEXER_API_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "q": "climate",
       "k": 10,
       "filter": {
         "learning_resource_type": ["https://w3id.org/kim/hcrt/worksheet"],
         "license_permissive": true
       }
     }' \
     http://localhost:8080/search_chunks | jq .
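
For callers that prefer Go over curl, a minimal client might look like the sketch below. The struct and function names are assumptions made for illustration; only the JSON field names, the endpoint path, and the bearer header come from the examples above. The response struct covers the fields shown in the trimmed hit and may be incomplete.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// SearchFilter holds the two filter fields shown in the curl example.
type SearchFilter struct {
	LearningResourceType []string `json:"learning_resource_type,omitempty"`
	LicensePermissive    *bool    `json:"license_permissive,omitempty"`
}

type SearchRequest struct {
	Q      string        `json:"q"`
	K      int           `json:"k,omitempty"`
	Filter *SearchFilter `json:"filter,omitempty"`
}

// Hit mirrors the fields of the trimmed example hit.
type Hit struct {
	ChunkID     string         `json:"chunk_id"`
	EventID     string         `json:"event_id"`
	EventCoord  string         `json:"event_coord"`
	ChunkIdx    int            `json:"chunk_idx"`
	Text        string         `json:"text"`
	Snippet     string         `json:"snippet"`
	Heading     string         `json:"heading"`
	SectionPath []string       `json:"section_path"`
	SourceURL   string         `json:"source_url"`
	Score       float64        `json:"score"`
	AMB         map[string]any `json:"amb"`
}

type SearchResponse struct {
	Total int   `json:"total"`
	Hits  []Hit `json:"hits"`
}

func encodeRequest(sr SearchRequest) ([]byte, error) { return json.Marshal(sr) }

func decodeResponse(raw []byte) (*SearchResponse, error) {
	var out SearchResponse
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	return &out, nil
}

// searchChunks POSTs the request to baseURL + "/search_chunks".
func searchChunks(ctx context.Context, c *http.Client, baseURL, token string, sr SearchRequest) (*SearchResponse, error) {
	body, err := encodeRequest(sr)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/search_chunks", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	res, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("search_chunks: unexpected status %s", res.Status)
	}
	raw, err := io.ReadAll(res.Body)
	if err != nil {
		return nil, err
	}
	return decodeResponse(raw)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	permissive := true
	resp, err := searchChunks(ctx, http.DefaultClient, "http://localhost:8080",
		os.Getenv("INDEXER_API_TOKEN"),
		SearchRequest{Q: "climate", K: 10, Filter: &SearchFilter{LicensePermissive: &permissive}})
	if err != nil {
		fmt.Fprintln(os.Stderr, "search failed:", err)
		return
	}
	for _, h := range resp.Hits {
		fmt.Printf("%.2f  %s  %s\n", h.Score, h.SourceURL, h.Heading)
	}
}
```

license_permissive is a pointer so that an explicit false can be sent; with a plain bool and omitempty, false would be dropped from the request body.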

Admin

# List dead-letter entries
curl -sS -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     http://localhost:8080/admin/dead_letter | jq .

# Replay a single dead-lettered event
curl -sS -X POST -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     http://localhost:8080/admin/dead_letter/<event_id>/replay

# Bulk replay — optional reason_substr filter prevents accidentally
# replaying unrelated categories (e.g. redactions). Response:
# {total, replayed, skipped, failed, errors}.
curl -sS -X POST \
     -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"reason_substr":"no source URL","limit":100}' \
     http://localhost:8080/admin/dead_letter/replay_all

# Force re-ingest an event (also arms relay-side needs_refetch)
curl -sS -X POST -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     http://localhost:8080/admin/refetch/<event_id>

# Takedown — delete chunks + mark relay content redacted + dead-letter
curl -sS -X DELETE -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     http://localhost:8080/admin/event/<event_id>

# Reset the source-subscription cursor and force the source to re-subscribe.
# Use after bulk-loading historical events into the relay (mirror, restore,
# import) so the indexer picks them up. timestamp=0 reprocesses everything;
# any past unix timestamp catches up from that point. The event_hash cache is
# intentionally NOT cleared — already-indexed events short-circuit, so a full
# reset is cheap.
curl -sS -X POST \
     -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"timestamp":0}' \
     http://localhost:8080/admin/cursor

Probes

curl -sS http://localhost:8080/health        # liveness: 200 always
curl -sS http://localhost:8080/ready         # readiness: 200 iff deps fresh
curl -sS http://localhost:8080/metrics       # Prometheus text

Readiness returns 503 if Typesense or the relay hasn't been marked healthy within 60s. The TS prober runs every 15s; relay freshness is stamped by the refetch poller's listrefetch call (default every 30s).
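
The freshness rule can be sketched as a small tracker that each prober stamps and that /ready consults. The Freshness type and its methods are hypothetical names, not the contents of readiness.go; the 60s window comes from the paragraph above, and treating an age of exactly 60s as still fresh is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// Freshness records when each dependency was last marked healthy.
type Freshness struct {
	mu     sync.Mutex
	maxAge time.Duration
	last   map[string]time.Time
}

func NewFreshness(maxAge time.Duration, deps ...string) *Freshness {
	f := &Freshness{maxAge: maxAge, last: make(map[string]time.Time, len(deps))}
	for _, d := range deps {
		f.last[d] = time.Time{} // zero time: never marked healthy
	}
	return f
}

// Mark stamps dep as healthy at the given instant.
func (f *Freshness) Mark(dep string, at time.Time) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.last[dep] = at
}

// Ready reports whether every dependency has a stamp no older than maxAge.
func (f *Freshness) Ready(now time.Time) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, t := range f.last {
		if t.IsZero() || now.Sub(t) > f.maxAge {
			return false
		}
	}
	return true
}

// Handler answers 200 when ready and 503 otherwise.
func (f *Freshness) Handler() http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !f.Ready(time.Now()) {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// readyWithAges builds a tracker whose stamps are the given number of
// seconds old; a negative age means the dependency was never marked.
func readyWithAges(typesenseAgeSec, relayAgeSec int) bool {
	now := time.Now()
	f := NewFreshness(60*time.Second, "typesense", "relay")
	if typesenseAgeSec >= 0 {
		f.Mark("typesense", now.Add(-time.Duration(typesenseAgeSec)*time.Second))
	}
	if relayAgeSec >= 0 {
		f.Mark("relay", now.Add(-time.Duration(relayAgeSec)*time.Second))
	}
	return f.Ready(now)
}

func main() {
	fmt.Println(readyWithAges(10, 20)) // both fresh: true
	fmt.Println(readyWithAges(10, 90)) // relay stale: false
	fmt.Println(readyWithAges(-1, 5))  // typesense never marked: false
}
```

With a 15s Typesense prober and a 30s refetch poll, a 60s window tolerates one missed relay poll but not two, which matches the intervals quoted above.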

MCP server (LLM context source)

cmd/mcp is a standalone binary that exposes the indexer's /search_chunks engine as an MCP tool named search_educational_chunks. LLM clients (Claude Code, Claude Desktop, Cursor, etc.) can register the indexer as a context source in one config line.

Build:

go build -o ./bin/amb-indexer-mcp ./cmd/mcp

The binary speaks MCP over stdio. It needs:

  • -indexer-url (or INDEXER_URL) — defaults to http://localhost:8080
  • -indexer-token (or INDEXER_API_TOKEN) — same value the operator configured on the indexer for /search_chunks. Required.

Register with Claude Code

In your Claude Code config (~/.claude/mcp.json or per-project):

{
  "mcpServers": {
    "amb-indexer": {
      "command": "/abs/path/to/amb-indexer-mcp",
      "args": ["-indexer-url", "http://localhost:8080"],
      "env": {
        "INDEXER_API_TOKEN": "your-token-here"
      }
    }
  }
}

Restart Claude Code. The search_educational_chunks tool will appear in the tool list. Ask "find me OER worksheets about photosynthesis" and the model will call the tool with the appropriate filter.

Tool schema

search_educational_chunks({
  "query": "natural-language query",       // required
  "k": 10,                                 // optional, server-capped at 100
  "filter": {                              // optional, all fields optional
    "about_id":              ["https://w3id.org/kim/schulfaecher/s1017"],
    "learning_resource_type":["https://w3id.org/kim/hcrt/worksheet"],
    "license":               ["CC-BY-4.0"],
    "license_permissive":    true
  }
})
// → {"total": N, "hits": [ ... same shape as POST /search_chunks ... ]}

The tool's response schema is identical to the HTTP endpoint's SearchResponse. Non-permissive chunks have text elided but retain snippet so the LLM can still ground its answer on a fair-use fragment.

Operator notes

  • The compose file keeps port 8080 on the internal docker network only. Add an override to expose it locally, e.g. ports: ["127.0.0.1:8080:8080"].
  • Generate tokens with openssl rand -hex 32. Both tokens are required whenever INDEXER_LISTEN is non-empty; set INDEXER_LISTEN="" for a pure-ingest deployment.