amb-sitemap-parser

A modern TypeScript library for parsing sitemaps and extracting educational metadata from web pages. Built with pure ESM for Node.js 20+.

Features

  • Parse XML sitemaps (regular sitemaps and sitemap indexes)
  • Concurrent page fetching with rate limiting
  • Extract JSON-LD metadata from HTML pages
  • Filter educational content automatically
  • Full TypeScript support with exported types
  • Pure ESM for modern JavaScript environments
  • Tested with Vitest
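
Concurrency limiting of the kind PageFetcher provides can be sketched in a few lines. The following is an illustrative TypeScript sketch only, not the library's actual implementation; `mapWithConcurrency` is a hypothetical helper name:

```typescript
// Run an async function over a list of items, keeping at most `limit`
// promises in flight at once -- a worker-pool pattern similar in spirit
// to PageFetcher's maxConcurrency option.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker pulls the next unclaimed index until the list is exhausted.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  // Start min(limit, items.length) workers and wait for all of them.
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```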

Installation

npm install amb-sitemap-parser

Requirements

  • Node.js >= 20.0.0
  • ESM-compatible project (set "type": "module" in package.json)
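
A minimal package.json for a consuming project might look like this (the version range is illustrative):

```json
{
  "type": "module",
  "dependencies": {
    "amb-sitemap-parser": "^1.0.0"
  }
}
```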

CLI Usage

The package includes a command-line interface for parsing sitemaps and extracting educational metadata.

Installation

# Install globally for CLI usage
npm install -g amb-sitemap-parser

# Or use with npx (no installation needed)
npx amb-sitemap-parser --help

Commands

parse - Parse a sitemap and list URLs

# Basic usage
amb-sitemap-parser parse https://example.com/sitemap.xml

# Limit to first 10 URLs
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 10

# Pretty-print JSON output
amb-sitemap-parser parse https://example.com/sitemap.xml --pretty

# Combine options
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 20 --pretty

Output:

{
  "urls": ["url1", "url2", "url3"],
  "isIndex": false,
  "count": 3
}

extract - Extract educational metadata

Runs the full pipeline: parse sitemap → fetch pages → extract metadata → filter educational content.

# Basic usage
amb-sitemap-parser extract https://example.com/sitemap.xml

# With verbose logging (logs to stderr)
amb-sitemap-parser extract https://example.com/sitemap.xml --verbose

# Limit URLs and adjust concurrency
amb-sitemap-parser extract https://example.com/sitemap.xml --limit 20 --max-concurrency 10

# Pretty-print output
amb-sitemap-parser extract https://example.com/sitemap.xml --pretty

# Custom timeout (in milliseconds)
amb-sitemap-parser extract https://example.com/sitemap.xml --timeout 60000

# Save to file
amb-sitemap-parser extract https://example.com/sitemap.xml > results.json

Output:

{
  "metadata": [
    {
      "url": "https://example.com/course",
      "title": "Introduction to Python",
      "description": "Learn Python basics",
      "jsonLdData": [
        {
          "@type": "Course",
          "name": "Introduction to Python"
        }
      ],
      "extractionTime": 45
    }
  ],
  "summary": {
    "totalUrls": 100,
    "fetched": 95,
    "withMetadata": 78,
    "educational": 42
  }
}

fetch - Fetch URL(s) directly and extract metadata

Extract educational metadata from specific URLs without needing a sitemap.

# Fetch a single URL
amb-sitemap-parser fetch https://example.com/course

# Fetch multiple URLs
amb-sitemap-parser fetch https://example.com/course1 https://example.com/course2

# With verbose logging
amb-sitemap-parser fetch https://example.com/course --verbose

# Adjust concurrency for multiple URLs
amb-sitemap-parser fetch https://example.com/course1 https://example.com/course2 --max-concurrency 10

# Pretty-print output
amb-sitemap-parser fetch https://example.com/course --pretty

# Custom timeout (in milliseconds)
amb-sitemap-parser fetch https://example.com/course --timeout 60000

# Save to file
amb-sitemap-parser fetch https://example.com/course > result.json

Output: (same format as extract command)

{
  "metadata": [
    {
      "url": "https://example.com/course",
      "title": "Introduction to Python",
      "description": "Learn Python basics",
      "jsonLdData": [
        {
          "@type": "Course",
          "name": "Introduction to Python"
        }
      ],
      "extractionTime": 45
    }
  ],
  "summary": {
    "totalUrls": 1,
    "fetched": 1,
    "withMetadata": 1,
    "educational": 1
  }
}

CLI Options Reference

parse command options:

  • -l, --limit <number> - Limit to first N URLs
  • -p, --pretty - Pretty-print JSON output
  • -k, --insecure - Skip SSL certificate verification

extract command options:

  • -l, --limit <number> - Limit URLs to process
  • -c, --max-concurrency <number> - Maximum concurrent requests (default: 5)
  • -t, --timeout <number> - Request timeout in milliseconds (default: 30000)
  • -o, --output <filepath> - Save metadata to JSONL file (one JSON-LD object per line)
  • --jsonl - Stream one JSON-LD resource per line to stdout
  • -p, --pretty - Pretty-print JSON output
  • -v, --verbose - Show progress logs (written to stderr)
  • -q, --quiet - Suppress progress output
  • -k, --insecure - Skip SSL certificate verification

fetch command options:

  • -c, --max-concurrency <number> - Maximum concurrent requests (default: 5)
  • -t, --timeout <number> - Request timeout in milliseconds (default: 30000)
  • -o, --output <filepath> - Save metadata to JSONL file (one JSON-LD object per line)
  • --jsonl - Stream one JSON-LD resource per line to stdout
  • -p, --pretty - Pretty-print JSON output
  • -v, --verbose - Show progress logs (written to stderr)
  • -q, --quiet - Suppress progress output
  • -k, --insecure - Skip SSL certificate verification

JSONL Output Format

When using the --output option, metadata is saved in JSON Lines (JSONL) format - one JSON-LD object per line. This format is ideal for streaming, piping, and processing with standard Unix tools.

How It Works

Each JSON-LD object from a page becomes a separate line in the file:

{"@type":"Course","name":"Python 101","url":"https://example.com/course1"}
{"@type":"LearningResource","name":"Exercise 1","url":"https://example.com/course1"}
{"@type":"Course","name":"JavaScript Basics","url":"https://example.com/course2"}

Each line is a complete, independent record. If a page has multiple JSON-LD objects, each gets its own line, which makes the output easy to process with standard Unix tools such as jq, grep, and awk.
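
For instance, the sample records above can be filtered by type with jq (assuming jq is installed):

```shell
# Filter JSONL records by @type and print the names of Course entries.
# The sample data matches the records shown above.
printf '%s\n' \
  '{"@type":"Course","name":"Python 101","url":"https://example.com/course1"}' \
  '{"@type":"LearningResource","name":"Exercise 1","url":"https://example.com/course1"}' \
  '{"@type":"Course","name":"JavaScript Basics","url":"https://example.com/course2"}' \
  | jq -r 'select(."@type" == "Course") | .name'
# Prints:
# Python 101
# JavaScript Basics
```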

Output Behavior

With --output flag:

amb-sitemap-parser extract https://example.com/sitemap.xml --output results.jsonl

  • File (results.jsonl): contains JSONL records (one JSON-LD object per line)
  • stdout: summary statistics only

{
  "summary": {
    "totalUrls": 100,
    "fetched": 95,
    "withMetadata": 78,
    "educational": 45,
    "recordsWritten": 67
  }
}
Without --output flag:

amb-sitemap-parser extract https://example.com/sitemap.xml
  • stdout: full metadata array plus summary (the default behavior)

CLI Examples

# Quick test: parse and see first 5 URLs
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 5 --pretty

# Extract metadata from first 10 URLs with verbose logging
amb-sitemap-parser extract https://example.com/sitemap.xml --limit 10 --verbose --pretty

# High-performance extraction: 20 concurrent requests
amb-sitemap-parser extract https://example.com/sitemap.xml --max-concurrency 20

# Save to JSONL file
amb-sitemap-parser extract https://example.com/sitemap.xml --output results.jsonl --verbose

# Save from direct URL fetch
amb-sitemap-parser fetch https://example.com/course --output course.jsonl

# Pipeline with jq for further processing
amb-sitemap-parser extract https://example.com/sitemap.xml | jq '.metadata[] | select(.jsonLdData != null)'

Integration with AMB-Nostr-Converter

Extract AMB resources and convert to Nostr events with amb-convert:

# Extract to JSONL, then convert to signed Nostr events
amb-sitemap-parser extract https://example.com/sitemap.xml -o resources.jsonl
cat resources.jsonl | amb-convert amb:nostr --nsec $NOSTR_NSEC -o events.jsonl

# Or pipe directly
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl | \
  amb-convert amb:nostr --nsec $NOSTR_NSEC -o events.jsonl

Programmatic Usage (Node.js)

Quick Start

import { SitemapParser, PageFetcher, MetadataExtractor } from 'amb-sitemap-parser';

// Parse a sitemap
const parser = new SitemapParser();
const sitemap = await parser.parseFromUrl('https://example.com/sitemap.xml');

// Fetch pages
const fetcher = new PageFetcher({ maxConcurrency: 5 });
const results = await fetcher.fetchPages(sitemap.urls.slice(0, 10));

// Extract metadata
const extractor = new MetadataExtractor();
const metadata = await extractor.extractFromPages(results);

// Filter educational content
const educational = MetadataExtractor.filterEducationalMetadata(metadata);
console.log(educational);

API Reference

SitemapParser

Parse XML sitemaps and extract URLs.

const parser = new SitemapParser({
  logger: (msg, level) => console.log(`[${level}] ${msg}`)
});

// Parse from string
const sitemap = await parser.parseSitemap(xmlContent);

// Parse from URL
const sitemap = await parser.parseFromUrl('https://example.com/sitemap.xml');

// Validate URL
const isValid = SitemapParser.isValidSitemapUrl(url);

// Filter educational URLs
const filtered = SitemapParser.filterEducationalUrls(sitemap.urls);

PageFetcher

Fetch web pages with concurrency control and rate limiting.

const fetcher = new PageFetcher({
  maxConcurrency: 5,
  timeout: 30000,
  delayBetweenRequests: 100,
  retryAttempts: 2,
  retryDelay: 1000,
  logger: (msg, level) => console.log(`[${level}] ${msg}`)
});

// Fetch multiple pages
const results = await fetcher.fetchPages(urls);

// Fetch single page
const result = await fetcher.fetchSinglePage('https://example.com/page');

// Validate URL
const isValid = PageFetcher.isValidUrl(url);

// Filter valid URLs
const validUrls = PageFetcher.filterValidUrls(urls);

MetadataExtractor

Extract metadata including JSON-LD from HTML pages.

const extractor = new MetadataExtractor({
  validateSchema: false,
  logger: (msg, level) => console.log(`[${level}] ${msg}`)
});

// Extract from multiple pages
const metadata = await extractor.extractFromPages(fetchResults);

// Extract from single page
const metadata = await extractor.extractFromPage(fetchResult);

// Filter educational metadata
const educational = MetadataExtractor.filterEducationalMetadata(metadata);

// Check if has valid content
const hasContent = MetadataExtractor.hasValidContent(metadata);

Types

All types are exported and can be imported:

import type {
  SitemapUrl,
  ParsedSitemap,
  FetchResult,
  FetchOptions,
  ExtractedMetadata,
  LoggerFunction,
} from 'amb-sitemap-parser';

Tree-Shakeable Imports

Import only what you need for smaller bundle sizes:

import { SitemapParser } from 'amb-sitemap-parser/sitemap';
import { PageFetcher } from 'amb-sitemap-parser/fetcher';
import { MetadataExtractor } from 'amb-sitemap-parser/extractor';

Examples

Complete Workflow

import { SitemapParser, PageFetcher, MetadataExtractor } from 'amb-sitemap-parser';

async function processSitemap(sitemapUrl: string) {
  // Initialize components
  const parser = new SitemapParser();
  const fetcher = new PageFetcher({ maxConcurrency: 5 });
  const extractor = new MetadataExtractor();

  // Parse sitemap
  const sitemap = await parser.parseFromUrl(sitemapUrl);
  console.log(`Found ${sitemap.urls.length} URLs`);

  // Limit to first 50 URLs
  const urlsToProcess = sitemap.urls.slice(0, 50);

  // Fetch pages
  const results = await fetcher.fetchPages(urlsToProcess);
  console.log(`Fetched ${results.filter(r => r.success).length} pages successfully`);

  // Extract metadata
  const metadata = await extractor.extractFromPages(results);
  
  // Filter educational content
  const educational = MetadataExtractor.filterEducationalMetadata(metadata);
  console.log(`Found ${educational.length} educational resources`);

  return educational;
}

With Custom Logger

const logger = (message: string, level: 'info' | 'warn' | 'error') => {
  const timestamp = new Date().toISOString();
  console.log(`[${timestamp}] [${level.toUpperCase()}] ${message}`);
};

const parser = new SitemapParser({ logger });
const fetcher = new PageFetcher({ logger, maxConcurrency: 3 });
const extractor = new MetadataExtractor({ logger });

Development

# Install dependencies
npm install

# Build the library
npm run build

# Run tests
npm test

# Run tests in watch mode
npm run test:watch

# Run tests with UI
npm run test:ui

# Generate coverage report
npm run test:coverage

# Lint code
npm run lint

# Format code
npm run format

License

Unlicense

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.