amb-indexer
What is this
amb-indexer is a standalone Go service that consumes kind-30142 AMB
(Learning Resource Metadata) events from an amb-relay, fetches the
referenced resource (HTML, PDF, or kind-30023 long-form article),
extracts normalized text with structural metadata, chunks and embeds
it, writes chunks to a Typesense collection, and writes fulltext back
to the relay via NIP-86 setcontent.
It also serves POST /search_chunks (hybrid BM25+vector search with
AMB metadata enrichment) and a small admin surface for dead-letter
replay, force-refetch, and takedown.
How this fits with amb-relay
amb-indexer is a paired service — it requires a running amb-relay on the other end. The relay accepts 30142 events; the indexer subscribes to them, fetches/extracts/chunks/embeds the referenced resources, and writes fulltext back via NIP-86 setcontent. Chunks go to a separate Typesense collection (amb_chunks_30142) which the indexer also serves over HTTP at /search_chunks.
For the full system diagram and operator guide see amb-relay/docs/architecture.md. For the relay's NIP-86 surface (including the setcontent / refetchcontent / listrefetch / acknowledgerefetch methods used by this service) see amb-relay/README.md.
Run locally
cp .env.example .env
# edit .env — at minimum set INDEXER_NSEC
go run .
Tests:
go test ./...
Configuration
All configuration is via environment variables. See
.env.example for the full list with defaults. The
required variables are:
- RELAY_URL: WebSocket URL of the amb-relay
- RELAY_NIP86_URL: HTTP base URL for NIP-86 calls against the relay
- INDEXER_NSEC: nsec the indexer signs NIP-98 and setcontent events with
- TS_HOST, TS_APIKEY: Typesense endpoint and API key
- EMBED_ENDPOINT: HTTP endpoint of the embedding service
- TIKA_URL: base URL of the Apache Tika server (for PDFs)
Authorization: The pubkey derived from INDEXER_NSEC must be present in the relay's ADMIN_PUBKEYS (or be the relay's own PUBKEY). Without this the relay rejects setcontent calls and no fulltext lands. See amb-relay's ADMIN_PUBKEYS.
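A startup check for the required variables listed above could look like the following minimal sketch. The helper name missingRequired and the injected lookup function are illustrative assumptions, not the indexer's actual config.go.

```go
package main

import (
	"fmt"
	"os"
)

// requiredVars lists the variables this README marks as required.
var requiredVars = []string{
	"RELAY_URL",
	"RELAY_NIP86_URL",
	"INDEXER_NSEC",
	"TS_HOST",
	"TS_APIKEY",
	"EMBED_ENDPOINT",
	"TIKA_URL",
}

// missingRequired (hypothetical helper) returns the required variables that
// are unset or empty. The lookup function is injected so the check can run
// without touching the real process environment.
func missingRequired(getenv func(string) string) []string {
	var missing []string
	for _, name := range requiredVars {
		if getenv(name) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	missing := missingRequired(os.Getenv)
	if len(missing) > 0 {
		fmt.Println("missing required configuration:", missing)
		return
	}
	fmt.Println("all required variables are set")
}
```

Reporting every missing variable at once, rather than failing on the first, saves an operator several restart cycles when bootstrapping a fresh .env.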
Run in docker compose
The amb-relay repo's docker-compose.yml wires amb-indexer alongside the
relay, Typesense, and Tika. From the amb-relay directory:
cp ../amb-indexer/.env.indexer.example ../amb-indexer/.env.indexer
# edit .env.indexer — set INDEXER_NSEC at minimum
docker compose up --build
Deployment via published image
Forgejo CI publishes images to git.edufeed.org/edufeed/amb-indexer on push to main and on v* tags. Tag scheme:
- main: latest commit on main
- vX.Y.Z, vX.Y: semver releases
- <short-sha>: pinned by commit
To consume the image instead of building from source, edit the amb-indexer service in amb-relay/docker-compose.yml:
- build: ../amb-indexer
+ image: git.edufeed.org/edufeed/amb-indexer:main
You can then run amb-relay's stack without cloning amb-indexer at all.
Where the data lives: the indexer_data named volume holds
/data/indexer.db (BoltDB: cursor, event-hash, dead-letter tables).
How to check progress: docker compose logs -f amb-indexer for the
processing log; query Typesense directly to see chunks:
curl -H "X-TYPESENSE-API-KEY: $TS_APIKEY" \
"$TS_HOST/collections/amb_chunks_30142/documents/search?q=*&per_page=1"
Refetch lifecycle
Operators can mark events for re-ingestion via the relay's NIP-86 refetchcontent method. The relay records the event id in a BoltDB bucket; this indexer polls listrefetch every REFETCH_POLL_INTERVAL seconds, re-fetches the events it gets back, and calls acknowledgerefetch to clear the ids it accepted. The queue survives indexer downtime — restart the indexer and pending refetches resume on the next poll.
Mirror events from another relay
cmd/mirror-prod streams kind-30142 events from a source relay (default
wss://amb-relay.edufeed.org) and republishes them verbatim to a
destination relay. It paginates REQs by cursor (until = oldest_seen)
and dedupes by id across pages, so it works against relays that cap
per-REQ replies (prod khatru caps at 250).
Use it to stage prod corpus into a local stack for side-by-side search comparison, or to backfill a fresh relay before pointing the indexer at it.
cd cmd/mirror-prod
go run . -src wss://amb-relay.edufeed.org -dst ws://localhost:3334
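The cursor pagination described above (until = oldest_seen, dedupe by id across pages) can be sketched as below. The Event type, the fetchPage callback, and the fakeRelay helper are illustrative assumptions; the real tool talks to a relay over WebSocket REQs. The sketch treats until as inclusive, which is why consecutive pages overlap and need deduplication.

```go
package main

import (
	"fmt"
	"sort"
)

// Event carries only the fields the pagination loop needs.
type Event struct {
	ID        string
	CreatedAt int64
}

// fetchPage is a hypothetical per-REQ call: it returns at most limit events
// with CreatedAt <= until, newest first.
type fetchPage func(until int64, limit int) []Event

// mirror walks backwards in time, moving the cursor to the oldest timestamp
// seen so far and skipping ids that an earlier page already delivered.
func mirror(fetch fetchPage, until int64, pageSize, maxPages int) []Event {
	seen := make(map[string]bool)
	var out []Event
	for page := 0; page < maxPages; page++ {
		fresh := 0
		for _, ev := range fetch(until, pageSize) {
			if seen[ev.ID] {
				continue
			}
			seen[ev.ID] = true
			out = append(out, ev)
			fresh++
			if ev.CreatedAt < until {
				until = ev.CreatedAt
			}
		}
		if fresh == 0 {
			break // nothing new: treat as end of stream
		}
	}
	return out
}

// fakeRelay serves pages from an in-memory slice, like a relay with a cap.
func fakeRelay(all []Event) fetchPage {
	sorted := append([]Event(nil), all...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].CreatedAt > sorted[j].CreatedAt
	})
	return func(until int64, limit int) []Event {
		var page []Event
		for _, ev := range sorted {
			if ev.CreatedAt <= until && len(page) < limit {
				page = append(page, ev)
			}
		}
		return page
	}
}

func main() {
	relay := fakeRelay([]Event{
		{"e1", 50}, {"e2", 40}, {"e3", 40}, {"e4", 30}, {"e5", 20},
	})
	for _, ev := range mirror(relay, 100, 3, 10) {
		fmt.Println(ev.ID, ev.CreatedAt)
	}
}
```

A known limitation of this simple loop: if a full page consists only of events sharing one timestamp, the cursor cannot advance and the loop stops early. A page size comfortably above the largest same-second burst avoids that.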
Flags:
| Flag | Default | Notes |
|---|---|---|
| -src | wss://amb-relay.edufeed.org | source relay ws URL |
| -dst | ws://localhost:3334 | destination relay ws URL |
| -since | 0 | unix timestamp lower bound (0 = full corpus) |
| -page-size | 500 | per-REQ event cap (source may cap lower) |
| -max-pages | 200 | safety cap on pagination loop |
| -max-events | 0 | overall cap (0 = unlimited) |
| -dry-run | false | subscribe and count, do not publish |
| -state-file | (empty) | path to persistent resume checkpoint; deleted on clean exit |
| -empty-page-retries | 3 | retries before treating an empty page as end-of-stream |
| -empty-page-retry-delay | 2s | sleep between empty-page retries |
Common recipes:
# Full mirror, prod → local
go run ./cmd/mirror-prod
# Incremental top-up: only events newer than a known timestamp
go run ./cmd/mirror-prod -since 1798000000
# Probe without writing anything
go run ./cmd/mirror-prod -dry-run
# Local → local (e.g. staging another dev environment)
go run ./cmd/mirror-prod \
-src ws://localhost:3334 \
-dst ws://localhost:3335
# Resumable mirror — survives Ctrl-C, websocket drops, transient empties.
# Re-running with the same -state-file picks up where the previous run
# left off; the file is removed on clean completion.
go run ./cmd/mirror-prod -state-file /tmp/mirror.json
Re-runs are idempotent: relays dedupe by event id, the indexer dedupes
by event_id → sha256(url).
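One possible reading of the event_id → sha256(url) dedupe is a cache that maps each event id to a hash of its source URL and short-circuits when the pair is already known. The sketch below illustrates that reading only; hashCache and shouldProcess are hypothetical names, and the indexer's real store (BoltDB) and exact semantics may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashCache is a hypothetical in-memory stand-in for the event-hash table:
// event id -> hex-encoded sha256 of the source URL.
type hashCache map[string]string

// urlHash returns the hex-encoded SHA-256 digest of a URL string.
func urlHash(u string) string {
	sum := sha256.Sum256([]byte(u))
	return hex.EncodeToString(sum[:])
}

// shouldProcess reports whether the (event, URL) pair is new, and records it.
// A repeated pair returns false, which is what makes re-runs idempotent.
func (c hashCache) shouldProcess(eventID, sourceURL string) bool {
	h := urlHash(sourceURL)
	if c[eventID] == h {
		return false
	}
	c[eventID] = h
	return true
}

func main() {
	cache := hashCache{}
	fmt.Println(cache.shouldProcess("ev1", "https://example.org/a")) // first sighting
	fmt.Println(cache.shouldProcess("ev1", "https://example.org/a")) // repeat
	fmt.Println(cache.shouldProcess("ev1", "https://example.org/b")) // URL changed
}
```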
What does not travel with the events:
- No re-signing. Events are republished verbatim, so destination signature verification accepts them. If the source author's pubkey is banned on the destination, those events will be rejected.
- No content / chunks / embeddings. Only the kind-30142 events themselves move. The indexer attached to the destination relay must re-fetch each resource and rebuild content + chunks. Plan for a full re-ingest run after a large mirror.
- No kind-5 deletions (-kinds defaults to [30142] only). If the source has soft-deletions you want to honour, mirror those separately or use NIP-77 (Negentropy) instead.
- No NIP-86 admin state (bans, allowlists, schema, semantic config). These live in BoltDB on each relay and are intentionally per-instance.
Alternatives
- NIP-77 Negentropy. Subscribe with negentropy: true and let the protocol reconcile event sets. Cheaper than mirror-prod for small diffs. ⚠️ wss://amb-relay.edufeed.org currently caps Negentropy sessions at 250 events, so it cannot pull the full corpus this way; use mirror-prod for full backfills.
- NIP-86 reindex. Local-only: drops the destination's Typesense collection and rebuilds from its own BoltDB. Useful after a mirror to ensure the search index reflects the freshly-arrived events, but does not move events between relays.
- Raw BoltDB copy. If both relays are under your control and identical, copying relay.db while both are stopped is the fastest full clone, but it skips the relay's accept-side validation, so use only between trusted instances.
HTTP API
When INDEXER_LISTEN is set (default :8080), the indexer exposes a
small HTTP surface. Two bearer tokens gate access: INDEXER_API_TOKEN
for search, INDEXER_ADMIN_TOKEN for admin. /health, /ready, and
/metrics are unauthenticated.
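The two-token gating described above can be illustrated with a small net/http middleware. This is a sketch under stated assumptions, not the indexer's http_server.go: requireBearer, newMux, and statusFor are hypothetical names, and the choice of 401 for a rejected request is a guess, since the README does not state the status code.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireBearer wraps a handler and rejects requests whose Authorization
// header does not carry the expected bearer token. An empty configured token
// never matches, so a misconfigured server fails closed.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		header := r.Header.Get("Authorization")
		got := strings.TrimPrefix(header, "Bearer ")
		valid := token != "" && strings.HasPrefix(header, "Bearer ") &&
			subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
		if !valid {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newMux wires one token to search, another to admin, and leaves probes open.
func newMux(apiToken, adminToken string) *http.ServeMux {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	mux := http.NewServeMux()
	mux.Handle("/health", ok)
	mux.Handle("/search_chunks", requireBearer(apiToken, ok))
	mux.Handle("/admin/", requireBearer(adminToken, ok))
	return mux
}

// statusFor issues an in-process request and returns the response status.
func statusFor(h http.Handler, path, bearer string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if bearer != "" {
		req.Header.Set("Authorization", "Bearer "+bearer)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mux := newMux("api-secret", "admin-secret")
	fmt.Println(statusFor(mux, "/health", ""))
	fmt.Println(statusFor(mux, "/search_chunks", "api-secret"))
	fmt.Println(statusFor(mux, "/admin/dead_letter", "api-secret"))
}
```

Keeping the search and admin tokens separate means a leaked search token cannot trigger takedowns or cursor resets; the constant-time comparison avoids leaking token prefixes through response timing.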
Search
curl -sS -H "Authorization: Bearer $INDEXER_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"q":"photosynthesis","k":5}' \
http://localhost:8080/search_chunks | jq .
Response shape (one hit, trimmed):
{
  "total": 22,
  "hits": [
    {
      "chunk_id": "054d419f...:96",
      "event_id": "054d419f...",
      "event_coord": "30142:4686...:plan2b-test-html-1776955403",
      "chunk_idx": 96,
      "text": "...full chunk text...",
      "snippet": "...leading sentence...",
      "heading": "References",
      "section_path": ["Photosynthesis", "References"],
      "source_url": "https://en.wikipedia.org/wiki/Photosynthesis",
      "score": 1.0,
      "amb": {
        "name": "Photosynthesis (Wikipedia)",
        "description": "...",
        "license": "CC-BY-SA",
        "learning_resource_type": ["https://w3id.org/kim/hcrt/worksheet"]
      }
    }
  ]
}
score is normalized to [0, 1] (hybrid rank fusion when available,
else vector-cosine fallback, else relative text_match). Non-permissive
chunks have text elided but retain snippet so callers can still
render a preview.
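A Go caller might decode the response shape above with structs like these. The field names follow the sample JSON; the Go types and the Preview helper are assumptions, including the assumption that elided text arrives as an empty or absent string.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AMB mirrors the "amb" object in the sample response.
type AMB struct {
	Name                 string   `json:"name"`
	Description          string   `json:"description"`
	License              string   `json:"license"`
	LearningResourceType []string `json:"learning_resource_type"`
}

// Hit mirrors one entry of "hits".
type Hit struct {
	ChunkID     string   `json:"chunk_id"`
	EventID     string   `json:"event_id"`
	EventCoord  string   `json:"event_coord"`
	ChunkIdx    int      `json:"chunk_idx"`
	Text        string   `json:"text"`
	Snippet     string   `json:"snippet"`
	Heading     string   `json:"heading"`
	SectionPath []string `json:"section_path"`
	SourceURL   string   `json:"source_url"`
	Score       float64  `json:"score"`
	AMB         AMB      `json:"amb"`
}

// SearchResponse mirrors the top-level object.
type SearchResponse struct {
	Total int   `json:"total"`
	Hits  []Hit `json:"hits"`
}

// Preview returns the full text when present and falls back to the snippet
// for non-permissive chunks whose text was elided (assumed empty).
func (h Hit) Preview() string {
	if h.Text != "" {
		return h.Text
	}
	return h.Snippet
}

func decodeSearchResponse(data []byte) (SearchResponse, error) {
	var resp SearchResponse
	err := json.Unmarshal(data, &resp)
	return resp, err
}

// sample is a trimmed hit without "text", as for a non-permissive chunk.
const sample = `{"total":22,"hits":[{"chunk_id":"054d419f...:96","chunk_idx":96,
"snippet":"...leading sentence...","heading":"References",
"section_path":["Photosynthesis","References"],"score":1.0,
"amb":{"license":"CC-BY-SA"}}]}`

func main() {
	resp, err := decodeSearchResponse([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, h := range resp.Hits {
		fmt.Printf("%.2f %s: %s\n", h.Score, h.Heading, h.Preview())
	}
}
```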
Filter by AMB metadata:
curl -sS -H "Authorization: Bearer $INDEXER_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"q": "climate",
"k": 10,
"filter": {
"learning_resource_type": ["https://w3id.org/kim/hcrt/worksheet"],
"license_permissive": true
}
}' \
http://localhost:8080/search_chunks | jq .
Admin
# List dead-letter entries
curl -sS -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
http://localhost:8080/admin/dead_letter | jq .
# Replay a single dead-lettered event
curl -sS -X POST -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
http://localhost:8080/admin/dead_letter/<event_id>/replay
# Bulk replay — optional reason_substr filter prevents accidentally
# replaying unrelated categories (e.g. redactions). Response:
# {total, replayed, skipped, failed, errors}.
curl -sS -X POST \
-H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"reason_substr":"no source URL","limit":100}' \
http://localhost:8080/admin/dead_letter/replay_all
# Force re-ingest an event (also arms relay-side needs_refetch)
curl -sS -X POST -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
http://localhost:8080/admin/refetch/<event_id>
# Takedown — delete chunks + mark relay content redacted + dead-letter
curl -sS -X DELETE -H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
http://localhost:8080/admin/event/<event_id>
# Reset the source-subscription cursor and force the source to re-subscribe.
# Use after bulk-loading historical events into the relay (mirror, restore,
# import) so the indexer picks them up. timestamp=0 reprocesses everything;
# any past unix timestamp catches up from that point. The event_hash cache is
# intentionally NOT cleared — already-indexed events short-circuit, so a full
# reset is cheap.
curl -sS -X POST \
-H "Authorization: Bearer $INDEXER_ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"timestamp":0}' \
http://localhost:8080/admin/cursor
Probes
curl -sS http://localhost:8080/health # liveness: 200 always
curl -sS http://localhost:8080/ready # readiness: 200 iff deps fresh
curl -sS http://localhost:8080/metrics # Prometheus text
Readiness returns 503 if Typesense or the relay hasn't been marked
healthy within 60s. The TS prober runs every 15s; relay freshness is
stamped by the refetch poller's listrefetch call (default every 30s).
MCP server (LLM context source)
cmd/mcp is a standalone binary that exposes the indexer's
/search_chunks engine as an MCP
tool named search_educational_chunks. LLM clients (Claude Code,
Claude Desktop, Cursor, etc.) can register the indexer as a context
source in one config line.
Build:
go build -o ./bin/amb-indexer-mcp ./cmd/mcp
The binary speaks MCP over stdio. It needs:
- -indexer-url (or INDEXER_URL): defaults to http://localhost:8080
- -indexer-token (or INDEXER_API_TOKEN): same value the operator configured on the indexer for /search_chunks. Required.
Register with Claude Code
In your Claude Code config (~/.claude/mcp.json or per-project):
{
  "mcpServers": {
    "amb-indexer": {
      "command": "/abs/path/to/amb-indexer-mcp",
      "args": ["-indexer-url", "http://localhost:8080"],
      "env": {
        "INDEXER_API_TOKEN": "your-token-here"
      }
    }
  }
}
Restart Claude Code. The search_educational_chunks tool will appear
in the tool list. Ask "find me OER worksheets about photosynthesis"
and the model will call the tool with the appropriate filter.
Tool schema
search_educational_chunks({
  "query": "natural-language query",   // required
  "k": 10,                             // optional, server-capped at 100
  "filter": {                          // optional, all fields optional
    "about_id": ["https://w3id.org/kim/schulfaecher/s1017"],
    "learning_resource_type": ["https://w3id.org/kim/hcrt/worksheet"],
    "license": ["CC-BY-4.0"],
    "license_permissive": true
  }
})
// → {"total": N, "hits": [ ... same shape as POST /search_chunks ... ]}
The tool's response schema is identical to the HTTP endpoint's
SearchResponse. Non-permissive chunks have text elided but retain
snippet so the LLM can still ground its answer on a fair-use fragment.
Operator notes
- The compose file keeps port 8080 on the internal docker network only. Add an override to expose it locally, e.g. ports: ["127.0.0.1:8080:8080"].
- Generate tokens with openssl rand -hex 32. Both tokens are required whenever INDEXER_LISTEN is non-empty; set INDEXER_LISTEN="" for a pure-ingest deployment.