Go 92%
Shell 6.4%
HTML 1.4%
Dockerfile 0.2%

Steffen Rörtgen bf9c4c6e63 All checks were successful Build and Push Docker Image / build (push) Successful in 1m27s Details Merge feat/retire-30145: drop kind 30145 — publications arrive as NKBIP-01 30040 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>		2026-07-16 20:57:00 +02:00
.forgejo/workflows	ci: drop registry buildcache export (returns 400 on Forgejo)	2026-05-06 17:20:43 +02:00
certs	fix(indexer): trust GEANT TLS RSA 1 intermediate for rub.de fetches	2026-05-06 10:22:42 +02:00
cmd	fix(indexer): adapt to upstream nostrlib; bump pin	2026-06-16 16:19:03 +02:00
testdata/fixtures	test: e2e integration covering HTML, PDF, and kind-30023 paths	2026-04-23 11:29:42 +02:00
.env.example	Document Plan 2b HTTP surface, env vars, and refetch protocol	2026-04-23 14:16:27 +02:00
.env.indexer.example	feat(license): exempt NKBIP-01 publication kinds from the license gate	2026-07-16 16:18:33 +02:00
.gitignore	chore: git-ignore .worktrees/ for local development isolation	2026-06-25 08:25:05 +02:00
admin.go	feat(indexer): auto-replay transient embed dead-letters on a health gate	2026-06-18 07:29:41 +02:00
admin_test.go	fix: address three bugs surfaced by the dev.oersi headless deploy	2026-05-12 16:39:17 +02:00
ambcache.go	feat(indexer): chunk kind-30023 content directly, kind-derived coords	2026-06-16 16:08:54 +02:00
ambcache_test.go	Read license:id with fallback, normalize CC URL license forms	2026-04-24 00:22:21 +02:00
autoreplay.go	chore: scrub stale 384/MiniLM references in test + comment	2026-06-29 14:49:34 +02:00
autoreplay_test.go	fix(indexer): classify Docker DNS recycle errors as transient embed failures	2026-06-18 07:54:31 +02:00
chunker.go	fix: take overlap tails from pre-overlap chunks	2026-04-23 07:16:09 +02:00
chunker_test.go	chore: final review fixups	2026-04-23 11:31:38 +02:00
CLAUDE.md	feat(transferkiosk)!: drop kind 30145 — publications arrive as NKBIP-01 30040	2026-07-16 19:15:17 +02:00
config.go	feat(license): exempt NKBIP-01 publication kinds from the license gate	2026-07-16 16:18:33 +02:00
config_test.go	feat(fetch): three-tier headless render fallback for thin/failed fetches	2026-05-12 14:48:57 +02:00
Dockerfile	fix(indexer): trust GEANT TLS RSA 1 intermediate for rub.de fetches	2026-05-06 10:22:42 +02:00
e2e_test.sh	Add /admin/dead_letter/replay_all with reason_substr filter	2026-04-23 21:39:44 +02:00
embed.go	feat(indexer): send input_type to embed service; chunk embedding 768-dim	2026-06-29 14:27:31 +02:00
embed_test.go	chore: scrub stale 384/MiniLM references in test + comment	2026-06-29 14:49:34 +02:00
fetch.go	feat(fetch): three-tier headless render fallback for thin/failed fetches	2026-05-12 14:48:57 +02:00
fetch_headless.go	feat(fetch): three-tier headless render fallback for thin/failed fetches	2026-05-12 14:48:57 +02:00
fetch_headless_test.go	feat(fetch): three-tier headless render fallback for thin/failed fetches	2026-05-12 14:48:57 +02:00
fetch_html.go	fix: drop heading offsets past truncation cutoff	2026-04-22 19:11:10 +02:00
fetch_nostr.go	feat(transferkiosk)!: drop kind 30145 — publications arrive as NKBIP-01 30040	2026-07-16 19:15:17 +02:00
fetch_nostr_test.go	feat(transferkiosk)!: drop kind 30145 — publications arrive as NKBIP-01 30040	2026-07-16 19:15:17 +02:00
fetch_pdf.go	fix: honor TikaClient.Timeout via context deadline	2026-04-22 19:17:08 +02:00
fetch_test.go	cleanup: drop redundant regex scan and use json.Marshal in nostr test helper	2026-04-22 20:09:47 +02:00
go.mod	chore(deps): bump nostrlib to e5-embedding-768 (input_type + 768-dim)	2026-06-29 14:55:13 +02:00
go.sum	chore(deps): bump nostrlib to e5-embedding-768 (input_type + 768-dim)	2026-06-29 14:55:13 +02:00
HEADLESS_FETCH_SPEC.md	feat(fetch): three-tier headless render fallback for thin/failed fetches	2026-05-12 14:48:57 +02:00
http_server.go	feat(indexer): log underlying error on /search_chunks 500	2026-06-18 08:07:26 +02:00
http_server_test.go	feat(indexer): log underlying error on /search_chunks 500	2026-06-18 08:07:26 +02:00
license.go	feat(license): exempt NKBIP-01 publication kinds from the license gate	2026-07-16 16:18:33 +02:00
license_test.go	feat(license): exempt NKBIP-01 publication kinds from the license gate	2026-07-16 16:18:33 +02:00
main.go	feat(transferkiosk)!: drop kind 30145 — publications arrive as NKBIP-01 30040	2026-07-16 19:15:17 +02:00
main_test.go	fix(bootstrap): retry EnsureCollection during Typesense WAL replay	2026-06-07 10:17:34 +02:00
metrics.go	feat(fetch): three-tier headless render fallback for thin/failed fetches	2026-05-12 14:48:57 +02:00
metrics_test.go	feat(fetch): three-tier headless render fallback for thin/failed fetches	2026-05-12 14:48:57 +02:00
nostr_source.go	fix(backfill): paginate each kind independently to prevent cross-kind starvation	2026-06-17 20:27:29 +02:00
nostr_source_test.go	fix(backfill): paginate each kind independently to prevent cross-kind starvation	2026-06-17 20:27:29 +02:00
politeness.go	feat: per-host concurrency and min-interval limiter	2026-04-22 18:15:30 +02:00
politeness_test.go	chore: final review fixups	2026-04-23 11:31:38 +02:00
readiness.go	Wire HTTP server, refetch poller, and readiness probe into main.go	2026-04-23 14:09:22 +02:00
readiness_test.go	Wire HTTP server, refetch poller, and readiness probe into main.go	2026-04-23 14:09:22 +02:00
README.md	fix: address three bugs surfaced by the prod-mirror exercise	2026-05-12 10:13:13 +02:00
refetch.go	Wire HTTP server, refetch poller, and readiness probe into main.go	2026-04-23 14:09:22 +02:00
refetch_test.go	Wire HTTP server, refetch poller, and readiness probe into main.go	2026-04-23 14:09:22 +02:00
relay_client.go	fix: address three bugs surfaced by the prod-mirror exercise	2026-05-12 10:13:13 +02:00
relay_client_test.go	fix(indexer): adapt to upstream nostrlib; bump pin	2026-06-16 16:19:03 +02:00
schema.go	feat(indexer): send input_type to embed service; chunk embedding 768-dim	2026-06-29 14:27:31 +02:00
schema_test.go	feat(indexer): send input_type to embed service; chunk embedding 768-dim	2026-06-29 14:27:31 +02:00
search.go	feat(indexer): send input_type to embed service; chunk embedding 768-dim	2026-06-29 14:27:31 +02:00
search_test.go	feat(indexer): send input_type to embed service; chunk embedding 768-dim	2026-06-29 14:27:31 +02:00
ssrf.go	feat: SSRF guard with CIDR, host, TLD deny lists	2026-04-22 18:03:25 +02:00
ssrf_test.go	feat: SSRF guard with CIDR, host, TLD deny lists	2026-04-22 18:03:25 +02:00
store.go	fix: address three bugs surfaced by the dev.oersi headless deploy	2026-05-12 16:39:17 +02:00
store_test.go	fix: address three bugs surfaced by the dev.oersi headless deploy	2026-05-12 16:39:17 +02:00
tsclient.go	Add TSClient.Search for hybrid keyword+vector queries	2026-04-23 13:44:28 +02:00
tsclient_test.go	Add TSClient.Search for hybrid keyword+vector queries	2026-04-23 13:44:28 +02:00
ui.go	Add GET /ui comparison page	2026-04-24 00:31:06 +02:00
ui.html	Add GET /ui comparison page	2026-04-24 00:31:06 +02:00
ui_test.go	Add GET /ui comparison page	2026-04-24 00:31:06 +02:00
worker.go	feat(transferkiosk)!: drop kind 30145 — publications arrive as NKBIP-01 30040	2026-07-16 19:15:17 +02:00
worker_headless.go	feat(fetch): three-tier headless render fallback for thin/failed fetches	2026-05-12 14:48:57 +02:00
worker_headless_test.go	feat(fetch): three-tier headless render fallback for thin/failed fetches	2026-05-12 14:48:57 +02:00
worker_test.go	test(publications): cover source-fallback URL extraction with real Alexandria event shapes	2026-07-16 16:28:52 +02:00

README.md

amb-indexer

What is this

amb-indexer is a standalone Go service that consumes kind-30142 AMB (Learning Resource Metadata) events from an amb-relay, fetches the referenced resource (HTML, PDF, or kind-30023 long-form article), extracts normalized text with structural metadata, chunks and embeds it, writes chunks to a Typesense collection, and writes fulltext back to the relay via NIP-86 setcontent.

It also serves POST /search_chunks (hybrid BM25+vector search with AMB metadata enrichment) and a small admin surface for dead-letter replay, force-refetch, and takedown.

How this fits with amb-relay

amb-indexer is a paired service — it requires a running amb-relay on the other end. The relay accepts 30142 events; the indexer subscribes to them, fetches/extracts/chunks/embeds the referenced resources, and writes fulltext back via NIP-86 setcontent. Chunks go to a separate Typesense collection (amb_chunks_30142) which the indexer also serves over HTTP at /search_chunks.

For the full system diagram and operator guide see amb-relay/docs/architecture.md. For the relay's NIP-86 surface (including the setcontent / refetchcontent / listrefetch / acknowledgerefetch methods used by this service) see amb-relay/README.md.

Run locally

cp .env.example .env
# edit .env — at minimum set INDEXER_NSEC
go run .

Tests:

go test ./...

Configuration

All configuration is via environment variables. See .env.example for the full list with defaults. The required variables are:

RELAY_URL — WebSocket URL of the amb-relay
RELAY_NIP86_URL — HTTP base URL for NIP-86 calls against the relay
INDEXER_NSEC — nsec the indexer signs NIP-98 and setcontent events with
TS_HOST, TS_APIKEY — Typesense endpoint and API key
EMBED_ENDPOINT — HTTP endpoint of the embedding service
TIKA_URL — base URL of the Apache Tika server (for PDFs)
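As a rough illustration of failing fast on incomplete configuration, the sketch below checks the required variables listed above. The function name `missingVars` and the choice to treat blank values as missing are assumptions made for this example, not the indexer's actual config loader.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// requiredVars mirrors the required variables named in the README.
var requiredVars = []string{
	"RELAY_URL", "RELAY_NIP86_URL", "INDEXER_NSEC",
	"TS_HOST", "TS_APIKEY", "EMBED_ENDPOINT", "TIKA_URL",
}

// missingVars returns the required variables for which lookup yields no
// value or only whitespace. lookup has the same shape as os.LookupEnv so the
// check can be exercised without touching the real environment.
// (Hypothetical helper, not taken from config.go.)
func missingVars(lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range requiredVars {
		if v, ok := lookup(name); !ok || strings.TrimSpace(v) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingVars(os.LookupEnv); len(missing) > 0 {
		fmt.Println("missing required variables:", strings.Join(missing, ", "))
		return
	}
	fmt.Println("all required variables are set")
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the check testable with a plain map.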

Authorization: The pubkey derived from INDEXER_NSEC must be present in the relay's ADMIN_PUBKEYS (or be the relay's own PUBKEY). Without this the relay rejects setcontent calls and no fulltext lands. See amb-relay's ADMIN_PUBKEYS.

Run in docker compose

The amb-relay repo's docker-compose.yml wires amb-indexer alongside the relay, Typesense, and Tika. From the amb-relay directory:

cp ../amb-indexer/.env.indexer.example ../amb-indexer/.env.indexer
# edit .env.indexer — set INDEXER_NSEC at minimum
docker compose up --build

Deployment via published image

Forgejo CI publishes images to git.edufeed.org/edufeed/amb-indexer on push to main and on v* tags. Tag scheme:

main — latest commit on main
vX.Y.Z, vX.Y — semver releases
<short-sha> — pinned by commit

To consume the image instead of building from source, edit the amb-indexer service in amb-relay/docker-compose.yml:

- build: ../amb-indexer
+ image: git.edufeed.org/edufeed/amb-indexer:main

You can then run amb-relay's stack without cloning amb-indexer at all.

Where the data lives: the indexer_data named volume holds /data/indexer.db (BoltDB: cursor, event-hash, and dead-letter buckets).

How to check progress: docker compose logs -f amb-indexer for the processing log; query Typesense directly to see chunks:

curl -H "X-TYPESENSE-API-KEY: $TS_APIKEY" \
  "$TS_HOST/collections/amb_chunks_30142/documents/search?q=*&per_page=1"

Refetch lifecycle

Operators can mark events for re-ingestion via the relay's NIP-86 refetchcontent method. The relay records the event id in a BoltDB bucket; this indexer polls listrefetch every REFETCH_POLL_INTERVAL seconds, re-fetches the events it gets back, and calls acknowledgerefetch to clear the ids it accepted. The queue survives indexer downtime — restart the indexer and pending refetches resume on the next poll.

Mirror events from another relay

cmd/mirror-prod streams kind-30142 events from a source relay (default wss://amb-relay.edufeed.org) and republishes them verbatim to a destination relay. It paginates REQs by cursor (until = oldest_seen) and dedupes by id across pages, so it works against relays that cap per-REQ replies (the prod khatru relay caps them at 250).

Use it to stage prod corpus into a local stack for side-by-side search comparison, or to backfill a fresh relay before pointing the indexer at it.

cd cmd/mirror-prod
go run . -src wss://amb-relay.edufeed.org -dst ws://localhost:3334
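The cursor pagination described above (until = oldest_seen, dedupe by id across pages) might look roughly like the sketch below. The `fetchPage` type, `mirrorAll`, and `cappedSource` are hypothetical names for this example, and the fake source replaces a real websocket REQ. Since NIP-01 treats `until` as inclusive, events at the page boundary come back twice, which is why the id set is needed.

```go
package main

import "fmt"

// Event holds only the fields the pagination loop needs.
type Event struct {
	ID        string
	CreatedAt int64
}

// fetchPage stands in for one REQ: it returns up to limit events with
// CreatedAt <= until, newest first. (Hypothetical; the real tool talks to a
// relay over a websocket.)
type fetchPage func(until int64, limit int) []Event

// mirrorAll pages backwards through the source, moving the cursor to the
// oldest timestamp seen and skipping ids already collected. It stops when a
// page brings nothing new or maxPages is reached.
func mirrorAll(fetch fetchPage, pageSize, maxPages int, until int64) []Event {
	seen := make(map[string]bool)
	var out []Event
	for page := 0; page < maxPages; page++ {
		fresh := 0
		oldest := until
		for _, ev := range fetch(until, pageSize) {
			if ev.CreatedAt < oldest {
				oldest = ev.CreatedAt
			}
			if seen[ev.ID] {
				continue // boundary event repeated from the previous page
			}
			seen[ev.ID] = true
			out = append(out, ev)
			fresh++
		}
		if fresh == 0 {
			break
		}
		until = oldest
	}
	return out
}

// cappedSource fakes a relay that never returns more than maxReply events.
func cappedSource(corpus []Event, maxReply int) fetchPage {
	return func(until int64, limit int) []Event {
		if limit > maxReply {
			limit = maxReply
		}
		var page []Event
		for _, ev := range corpus {
			if ev.CreatedAt <= until && len(page) < limit {
				page = append(page, ev)
			}
		}
		return page
	}
}

func main() {
	var corpus []Event
	for i := 10; i >= 1; i-- {
		corpus = append(corpus, Event{ID: fmt.Sprintf("ev%02d", i), CreatedAt: int64(i)})
	}
	got := mirrorAll(cappedSource(corpus, 4), 500, 200, 1<<62)
	fmt.Println("mirrored events:", len(got))
}
```

One limitation of this simplified loop: if more events share a single timestamp than the source returns per page, the cursor cannot advance past them and the loop ends early.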

Flags:

Flag	Default	Notes
`-src`	`wss://amb-relay.edufeed.org`	source relay ws URL
`-dst`	`ws://localhost:3334`	destination relay ws URL
`-since`	`0`	unix timestamp lower bound (0 = full corpus)
`-page-size`	`500`	per-REQ event cap (source may cap lower)
`-max-pages`	`200`	safety cap on pagination loop
`-max-events`	`0`	overall cap (0 = unlimited)
`-dry-run`	`false`	subscribe and count, do not publish
`-state-file`	(empty)	path to persistent resume checkpoint; deleted on clean exit
`-empty-page-retries`	`3`	retries before treating an empty page as end-of-stream
`-empty-page-retry-delay`	`2s`	sleep between empty-page retries

Common recipes:

# Full mirror, prod → local
go run ./cmd/mirror-prod

# Incremental top-up: only events newer than a known timestamp
go run ./cmd/mirror-prod -since 1798000000

# Probe without writing anything
go run ./cmd/mirror-prod -dry-run

# Local → local (e.g. staging another dev environment)
go run ./cmd/mirror-prod \
    -src ws://localhost:3334 \
    -dst ws://localhost:3335

# Resumable mirror — survives Ctrl-C, websocket drops, transient empties.
# Re-running with the same -state-file picks up where the previous run
# left off; the file is removed on clean completion.
go run ./cmd/mirror-prod -state-file /tmp/mirror.json

Re-runs are idempotent: relays dedupe by event id, and the indexer dedupes by event_id → sha256(url).
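One way to read the indexer-side dedupe (event_id → sha256(url)) is a map from event id to the hash of the URL last indexed for it. The sketch below follows that reading; the `hashCache` type and `shouldIndex` method are hypothetical, and the real store is persisted in BoltDB rather than held in memory.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashCache maps an event id to the hex sha256 of the resource URL that was
// last indexed for it. (Assumed layout, for illustration only.)
type hashCache map[string]string

// urlHash returns the hex-encoded sha256 of url.
func urlHash(url string) string {
	sum := sha256.Sum256([]byte(url))
	return hex.EncodeToString(sum[:])
}

// shouldIndex reports whether the event needs work. It returns false when the
// same event id was already seen with the same URL, and otherwise records the
// new hash. A production version would record the hash only after indexing
// succeeded.
func (c hashCache) shouldIndex(eventID, url string) bool {
	h := urlHash(url)
	if c[eventID] == h {
		return false
	}
	c[eventID] = h
	return true
}

func main() {
	cache := hashCache{}
	url := "https://en.wikipedia.org/wiki/Photosynthesis"
	fmt.Println(cache.shouldIndex("054d419f", url)) // first sighting
	fmt.Println(cache.shouldIndex("054d419f", url)) // replayed by a re-run
}
```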

What does not travel with the events:

No re-signing. Events are republished verbatim, so the destination's signature verification accepts them. If a source author's pubkey is banned on the destination, that author's events will be rejected.
No content / chunks / embeddings. Only the kind-30142 events themselves move. The indexer attached to the destination relay must re-fetch each resource and rebuild content + chunks. Plan for a full re-ingest run after a large mirror.
No kind-5 deletions: -kinds defaults to [30142] only. If the source has soft-deletions you want to honour, mirror those separately or use NIP-77 (Negentropy) instead.
No NIP-86 admin state (bans, allowlists, schema, semantic config). These live in BoltDB on each relay and are intentionally per-instance.

Alternatives

NIP-77 Negentropy. Subscribe with negentropy: true and let the protocol reconcile event sets. Cheaper than mirror-prod for small diffs. ⚠️ wss://amb-relay.edufeed.org currently caps Negentropy sessions at 250 events, so it cannot pull the full corpus this way — use mirror-prod for full backfills.
NIP-86 reindex. Local-only — drops the destination's Typesense collection and rebuilds from its own BoltDB. Useful after a mirror to ensure the search index reflects the freshly-arrived events, but does not move events between relays.
Raw BoltDB copy. If both relays are under your control and identical, copying relay.db while both are stopped is the fastest full clone — but skips the relay's accept-side validation, so use only between trusted instances.

HTTP API

When INDEXER_LISTEN is set (default :8080), the indexer exposes a small HTTP surface. Two bearer tokens gate access: INDEXER_API_TOKEN for search, INDEXER_ADMIN_TOKEN for admin. /health, /ready, and /metrics are unauthenticated.
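To make the two-token split concrete, here is a minimal sketch of bearer-token gating with the standard library. It is not the code in http_server.go: the `requireBearer` and `newMux` names, the placeholder handlers, and the 401 status for rejected requests are all assumptions for this example.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireBearer runs next only when the request carries
// "Authorization: Bearer <token>". An empty configured token rejects all
// requests. The comparison is constant-time to avoid leaking token contents.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !ok || token == "" || subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newMux wires the routes named in the README to a placeholder handler:
// search behind the API token, admin behind the admin token, health open.
func newMux(apiToken, adminToken string) *http.ServeMux {
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	mux := http.NewServeMux()
	mux.Handle("/health", ok)
	mux.Handle("/search_chunks", requireBearer(apiToken, ok))
	mux.Handle("/admin/", requireBearer(adminToken, ok))
	return mux
}

// statusFor issues one in-process request and returns the status code.
func statusFor(h http.Handler, path, bearer string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if bearer != "" {
		req.Header.Set("Authorization", "Bearer "+bearer)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mux := newMux("api-secret", "admin-secret")
	fmt.Println("/health, no token:", statusFor(mux, "/health", ""))
	fmt.Println("/search_chunks, no token:", statusFor(mux, "/search_chunks", ""))
	fmt.Println("/admin/dead_letter, API token:", statusFor(mux, "/admin/dead_letter", "api-secret"))
	fmt.Println("/admin/dead_letter, admin token:", statusFor(mux, "/admin/dead_letter", "admin-secret"))
}
```

Keeping the tokens separate means a leaked search token cannot trigger takedowns or replays.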

Search

curl -sS -H "Authorization: Bearer $INDEXER_API_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"q":"photosynthesis","k":5}' \
     http://localhost:8080/search_chunks | jq .

Response shape (one hit, trimmed):

{
  "total": 22,
  "hits": [
    {
      "chunk_id":    "054d419f...:96",
      "event_id":    "054d419f...",
      "event_coord": "30142:4686...:plan2b-test-html-1776955403",
      "chunk_idx":   96,
      "text":        "...full chunk text...",
      "snippet":     "...leading sentence...",
      "heading":     "References",
      "section_path":["Photosynthesis","References"],
      "source_url":  "https://en.wikipedia.org/wiki/Photosynthesis",
      "score":       1.0,
      "amb": {
        "name": "Photosynthesis (Wikipedia)",
        "description": "...",
        "license": "CC-BY-SA",
        "learning_resource_type": ["https://w3id.org/kim/hcrt/worksheet"]
      }
    }
  ]
}

score is normalized to [0, 1] (hybrid rank fusion when available, else vector-cosine fallback, else relative text_match). Non-permissive chunks have text elided but retain snippet so callers can still render a preview.
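A Go caller could decode the response with types along these lines. The struct and method names are invented for this example, and the field set is inferred from the trimmed sample above, so the real SearchResponse may carry more fields. The `Preview` helper assumes that elided text arrives as an empty or absent `text` field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AMB holds the metadata enrichment attached to each hit.
type AMB struct {
	Name                 string   `json:"name"`
	Description          string   `json:"description"`
	License              string   `json:"license"`
	LearningResourceType []string `json:"learning_resource_type"`
}

// Hit mirrors one entry of "hits" in the sample response.
type Hit struct {
	ChunkID     string   `json:"chunk_id"`
	EventID     string   `json:"event_id"`
	EventCoord  string   `json:"event_coord"`
	ChunkIdx    int      `json:"chunk_idx"`
	Text        string   `json:"text"`
	Snippet     string   `json:"snippet"`
	Heading     string   `json:"heading"`
	SectionPath []string `json:"section_path"`
	SourceURL   string   `json:"source_url"`
	Score       float64  `json:"score"`
	AMB         AMB      `json:"amb"`
}

// SearchResponse is the top-level response body.
type SearchResponse struct {
	Total int   `json:"total"`
	Hits  []Hit `json:"hits"`
}

// Preview returns the full chunk text when present and falls back to the
// snippet when the text was elided (assumed to mean an empty string).
func (h Hit) Preview() string {
	if h.Text != "" {
		return h.Text
	}
	return h.Snippet
}

// decodeSearchResponse parses a /search_chunks response body.
func decodeSearchResponse(data []byte) (SearchResponse, error) {
	var resp SearchResponse
	err := json.Unmarshal(data, &resp)
	return resp, err
}

func main() {
	sample := []byte(`{"total":1,"hits":[{"chunk_id":"054d419f...:96","chunk_idx":96,
		"text":"","snippet":"...leading sentence...","heading":"References","score":1.0}]}`)
	resp, err := decodeSearchResponse(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, h := range resp.Hits {
		fmt.Printf("%s [%s] score=%.2f preview=%q\n", h.ChunkID, h.Heading, h.Score, h.Preview())
	}
}
```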

Filter by AMB metadata:

curl -sS -H "Authorization: Bearer $INDEXER_API_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "q": "climate",
       "k": 10,
       "filter": {
         "learning_resource_type": ["https://w3id.org/kim/hcrt/worksheet"],
         "license_permissive": true
       }
     }' \
     http://localhost:8080/search_chunks | jq .

Admin

# List dead-letter entries
curl -sS -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     http://localhost:8080/admin/dead_letter | jq .

# Replay a single dead-lettered event
curl -sS -X POST -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     http://localhost:8080/admin/dead_letter/<event_id>/replay

# Bulk replay — optional reason_substr filter prevents accidentally
# replaying unrelated categories (e.g. redactions). Response:
# {total, replayed, skipped, failed, errors}.
curl -sS -X POST \
     -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"reason_substr":"no source URL","limit":100}' \
     http://localhost:8080/admin/dead_letter/replay_all

# Force re-ingest an event (also arms relay-side needs_refetch)
curl -sS -X POST -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     http://localhost:8080/admin/refetch/<event_id>

# Takedown — delete chunks + mark relay content redacted + dead-letter
curl -sS -X DELETE -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     http://localhost:8080/admin/event/<event_id>

# Reset the source-subscription cursor and force the source to re-subscribe.
# Use after bulk-loading historical events into the relay (mirror, restore,
# import) so the indexer picks them up. timestamp=0 reprocesses everything;
# any past unix timestamp catches up from that point. The event_hash cache is
# intentionally NOT cleared — already-indexed events short-circuit, so a full
# reset is cheap.
curl -sS -X POST \
     -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"timestamp":0}' \
     http://localhost:8080/admin/cursor

Probes

curl -sS http://localhost:8080/health        # liveness: 200 always
curl -sS http://localhost:8080/ready         # readiness: 200 iff deps fresh
curl -sS http://localhost:8080/metrics       # Prometheus text

Readiness returns 503 if either Typesense or the relay has not been marked healthy within the last 60s. The TS prober runs every 15s; relay freshness is stamped by the refetch poller's listrefetch call (default every 30s).

MCP server (LLM context source)

cmd/mcp is a standalone binary that exposes the indexer's /search_chunks engine as an MCP tool named search_educational_chunks. LLM clients (Claude Code, Claude Desktop, Cursor, etc.) can register the indexer as a context source in one config line.

Build:

go build -o ./bin/amb-indexer-mcp ./cmd/mcp

The binary speaks MCP over stdio. It needs:

-indexer-url (or INDEXER_URL) — defaults to http://localhost:8080
-indexer-token (or INDEXER_API_TOKEN) — same value the operator configured on the indexer for /search_chunks. Required.

Register with Claude Code

In your Claude Code config (~/.claude/mcp.json or per-project):

{
  "mcpServers": {
    "amb-indexer": {
      "command": "/abs/path/to/amb-indexer-mcp",
      "args": ["-indexer-url", "http://localhost:8080"],
      "env": {
        "INDEXER_API_TOKEN": "your-token-here"
      }
    }
  }
}

Restart Claude Code. The search_educational_chunks tool will appear in the tool list. Ask "find me OER worksheets about photosynthesis" and the model will call the tool with the appropriate filter.

Tool schema

search_educational_chunks({
  "query": "natural-language query",       // required
  "k": 10,                                 // optional, server-capped at 100
  "filter": {                              // optional, all fields optional
    "about_id":              ["https://w3id.org/kim/schulfaecher/s1017"],
    "learning_resource_type":["https://w3id.org/kim/hcrt/worksheet"],
    "license":               ["CC-BY-4.0"],
    "license_permissive":    true
  }
})
// → {"total": N, "hits": [ ... same shape as POST /search_chunks ... ]}

The tool's response schema is identical to the HTTP endpoint's SearchResponse. Non-permissive chunks have text elided but retain snippet so the LLM can still ground its answer on a fair-use fragment.

Operator notes

The compose file keeps port 8080 on the internal docker network only. Add an override to expose it locally, e.g. ports: ["127.0.0.1:8080:8080"].
Generate tokens with openssl rand -hex 32. Both tokens are required whenever INDEXER_LISTEN is non-empty; set INDEXER_LISTEN="" for a pure-ingest deployment.