amb-sitemap-parser
A modern TypeScript library for parsing sitemaps and extracting educational metadata from web pages. Built with pure ESM for Node.js 20+.
Features
- Parse XML sitemaps (regular sitemaps and sitemap indexes)
- Concurrent page fetching with rate limiting
- Extract JSON-LD metadata from HTML pages
- Filter educational content automatically
- Full TypeScript support with exported types
- Pure ESM for modern JavaScript environments
- Tested with Vitest
Installation
npm install amb-sitemap-parser
Requirements
- Node.js >= 20.0.0
- ESM-compatible project (set "type": "module" in package.json)
CLI Usage
The package includes a command-line interface for parsing sitemaps and extracting educational metadata.
Installation
# Install globally for CLI usage
npm install -g amb-sitemap-parser
# Or use with npx (no installation needed)
npx amb-sitemap-parser --help
Commands
parse - Parse a sitemap and list URLs
# Basic usage
amb-sitemap-parser parse https://example.com/sitemap.xml
# Limit to first 10 URLs
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 10
# Pretty-print JSON output
amb-sitemap-parser parse https://example.com/sitemap.xml --pretty
# Combine options
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 20 --pretty
Output:
{
"urls": ["url1", "url2", "url3"],
"isIndex": false,
"count": 3
}
extract - Extract educational metadata
Full pipeline: parse sitemap → fetch pages → extract metadata → filter educational content
# Basic usage
amb-sitemap-parser extract https://example.com/sitemap.xml
# With verbose logging (logs to stderr)
amb-sitemap-parser extract https://example.com/sitemap.xml --verbose
# Limit URLs and adjust concurrency
amb-sitemap-parser extract https://example.com/sitemap.xml --limit 20 --max-concurrency 10
# Pretty-print output
amb-sitemap-parser extract https://example.com/sitemap.xml --pretty
# Custom timeout (in milliseconds)
amb-sitemap-parser extract https://example.com/sitemap.xml --timeout 60000
# Save to file
amb-sitemap-parser extract https://example.com/sitemap.xml > results.json
Output:
{
"metadata": [
{
"url": "https://example.com/course",
"title": "Introduction to Python",
"description": "Learn Python basics",
"jsonLdData": [
{
"@type": "Course",
"name": "Introduction to Python"
}
],
"extractionTime": 45
}
],
"summary": {
"totalUrls": 100,
"fetched": 95,
"withMetadata": 78,
"educational": 42
}
}
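The JSON above can be consumed from TypeScript once it has been saved or piped. The following is a minimal sketch: the `PageMetadata` and `ExtractOutput` interfaces and the two helper functions are hypothetical names inferred from the sample output, not the library's exported types, and `title` and `description` are assumed to be optional.

```typescript
// Shapes inferred from the sample output above. These are assumptions,
// not the type definitions exported by amb-sitemap-parser.
interface PageMetadata {
  url: string;
  title?: string;
  description?: string;
  jsonLdData: Array<Record<string, unknown>>;
  extractionTime: number;
}

interface ExtractOutput {
  metadata: PageMetadata[];
  summary: {
    totalUrls: number;
    fetched: number;
    withMetadata: number;
    educational: number;
  };
}

// Collect the distinct string "@type" values found across all pages.
function collectTypes(output: ExtractOutput): string[] {
  const seen: string[] = [];
  for (const page of output.metadata) {
    for (const item of page.jsonLdData) {
      const type = item["@type"];
      if (typeof type === "string" && seen.indexOf(type) === -1) {
        seen.push(type);
      }
    }
  }
  return seen;
}

// Share of fetched pages counted as educational (0 when nothing was fetched).
function educationalRate(output: ExtractOutput): number {
  const { fetched, educational } = output.summary;
  return fetched === 0 ? 0 : educational / fetched;
}

// The sample output from above, parsed from a string instead of a file.
const sampleOutput: ExtractOutput = JSON.parse(`{
  "metadata": [
    {
      "url": "https://example.com/course",
      "title": "Introduction to Python",
      "description": "Learn Python basics",
      "jsonLdData": [{ "@type": "Course", "name": "Introduction to Python" }],
      "extractionTime": 45
    }
  ],
  "summary": { "totalUrls": 100, "fetched": 95, "withMetadata": 78, "educational": 42 }
}`);

console.log(collectTypes(sampleOutput), educationalRate(sampleOutput));
```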
fetch - Fetch URL(s) directly and extract metadata
Extract educational metadata from specific URLs without needing a sitemap.
# Fetch a single URL
amb-sitemap-parser fetch https://example.com/course
# Fetch multiple URLs
amb-sitemap-parser fetch https://example.com/course1 https://example.com/course2
# With verbose logging
amb-sitemap-parser fetch https://example.com/course --verbose
# Adjust concurrency for multiple URLs
amb-sitemap-parser fetch https://example.com/course1 https://example.com/course2 --max-concurrency 10
# Pretty-print output
amb-sitemap-parser fetch https://example.com/course --pretty
# Custom timeout (in milliseconds)
amb-sitemap-parser fetch https://example.com/course --timeout 60000
# Save to file
amb-sitemap-parser fetch https://example.com/course > result.json
Output: (same format as extract command)
{
"metadata": [
{
"url": "https://example.com/course",
"title": "Introduction to Python",
"description": "Learn Python basics",
"jsonLdData": [
{
"@type": "Course",
"name": "Introduction to Python"
}
],
"extractionTime": 45
}
],
"summary": {
"totalUrls": 1,
"fetched": 1,
"withMetadata": 1,
"educational": 1
}
}
CLI Options Reference
parse command options:
- -l, --limit <number>: Limit to first N URLs
- -p, --pretty: Pretty-print JSON output
- -k, --insecure: Skip SSL certificate verification
extract command options:
- -l, --limit <number>: Limit URLs to process
- -c, --max-concurrency <number>: Maximum concurrent requests (default: 5)
- -t, --timeout <number>: Request timeout in milliseconds (default: 30000)
- -o, --output <filepath>: Save metadata to JSONL file (one JSON-LD object per line)
- --jsonl: Stream one JSON-LD resource per line to stdout
- -p, --pretty: Pretty-print JSON output
- -v, --verbose: Show progress logs (written to stderr)
- -q, --quiet: Suppress progress output
- -k, --insecure: Skip SSL certificate verification
fetch command options:
- -c, --max-concurrency <number>: Maximum concurrent requests (default: 5)
- -t, --timeout <number>: Request timeout in milliseconds (default: 30000)
- -o, --output <filepath>: Save metadata to JSONL file (one JSON-LD object per line)
- --jsonl: Stream one JSON-LD resource per line to stdout
- -p, --pretty: Pretty-print JSON output
- -v, --verbose: Show progress logs (written to stderr)
- -q, --quiet: Suppress progress output
- -k, --insecure: Skip SSL certificate verification
JSONL Output Format
When using the --output option, metadata is saved in JSON Lines (JSONL) format - one JSON-LD object per line. This format is ideal for streaming, piping, and processing with standard Unix tools.
How It Works
Each JSON-LD object from a page becomes a separate line in the file:
{"@type":"Course","name":"Python 101","url":"https://example.com/course1"}
{"@type":"LearningResource","name":"Exercise 1","url":"https://example.com/course1"}
{"@type":"Course","name":"JavaScript Basics","url":"https://example.com/course2"}
Each line is a complete, independent record. If a page has multiple JSON-LD objects, each gets its own line, so the file works well with line-oriented tools such as jq, grep, and awk.
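Because every line is standalone JSON, a JSONL file can be processed with a few lines of code. The sketch below parses JSONL text and counts records per `@type`. The `JsonLdRecord` interface and both function names are illustrative assumptions, and the input is an inline string rather than a file so the example stays self-contained.

```typescript
// Hypothetical record shape: only "@type" is assumed here. Real records
// carry whatever JSON-LD properties the source page provided.
interface JsonLdRecord {
  "@type"?: string | string[];
  [key: string]: unknown;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): JsonLdRecord[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as JsonLdRecord);
}

// Count records per "@type" value. A record with several types is counted
// once for each of them; a record without a type is counted as "(untyped)".
function countByType(records: JsonLdRecord[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const record of records) {
    const raw = record["@type"];
    const types: string[] = Array.isArray(raw)
      ? raw
      : raw !== undefined
        ? [raw]
        : ["(untyped)"];
    for (const type of types) {
      counts[type] = (counts[type] ?? 0) + 1;
    }
  }
  return counts;
}

// The three example lines from above.
const sampleJsonl = [
  '{"@type":"Course","name":"Python 101","url":"https://example.com/course1"}',
  '{"@type":"LearningResource","name":"Exercise 1","url":"https://example.com/course1"}',
  '{"@type":"Course","name":"JavaScript Basics","url":"https://example.com/course2"}',
].join("\n");

console.log(countByType(parseJsonl(sampleJsonl)));
```

Reading line by line also means a consumer can stop early or skip a malformed line without losing the rest of the file.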
Output Behavior
With --output flag:
amb-sitemap-parser extract https://example.com/sitemap.xml --output results.jsonl
- File (results.jsonl): Contains JSONL records (one JSON-LD object per line)
- stdout: Summary statistics only
{ "summary": { "totalUrls": 100, "fetched": 95, "withMetadata": 78, "educational": 45, "recordsWritten": 67 } }
Without --output flag:
amb-sitemap-parser extract https://example.com/sitemap.xml
- stdout: Full metadata array + summary (default behavior)
CLI Examples
# Quick test: parse and see first 5 URLs
amb-sitemap-parser parse https://example.com/sitemap.xml --limit 5 --pretty
# Extract metadata from first 10 URLs with verbose logging
amb-sitemap-parser extract https://example.com/sitemap.xml --limit 10 --verbose --pretty
# High-performance extraction: 20 concurrent requests
amb-sitemap-parser extract https://example.com/sitemap.xml --max-concurrency 20
# Save to JSONL file
amb-sitemap-parser extract https://example.com/sitemap.xml --output results.jsonl --verbose
# Save from direct URL fetch
amb-sitemap-parser fetch https://example.com/course --output course.jsonl
# Pipeline with jq for further processing
amb-sitemap-parser extract https://example.com/sitemap.xml | jq '.metadata[] | select(.jsonLdData != null)'
Integration with AMB-Nostr-Converter
Extract AMB resources and convert to Nostr events with amb-convert:
# Extract to JSONL, then convert to signed Nostr events
amb-sitemap-parser extract https://example.com/sitemap.xml -o resources.jsonl
cat resources.jsonl | amb-convert amb:nostr --nsec $NOSTR_NSEC -o events.jsonl
# Or pipe directly
amb-sitemap-parser extract https://example.com/sitemap.xml --jsonl | \
amb-convert amb:nostr --nsec $NOSTR_NSEC -o events.jsonl
Programmatic Usage (Node.js)
Quick Start
import { SitemapParser, PageFetcher, MetadataExtractor } from 'amb-sitemap-parser';
// Parse a sitemap
const parser = new SitemapParser();
const sitemap = await parser.parseFromUrl('https://example.com/sitemap.xml');
// Fetch pages
const fetcher = new PageFetcher({ maxConcurrency: 5 });
const results = await fetcher.fetchPages(sitemap.urls.slice(0, 10));
// Extract metadata
const extractor = new MetadataExtractor();
const metadata = await extractor.extractFromPages(results);
// Filter educational content
const educational = MetadataExtractor.filterEducationalMetadata(metadata);
console.log(educational);
API Reference
SitemapParser
Parse XML sitemaps and extract URLs.
const parser = new SitemapParser({
logger: (msg, level) => console.log(`[${level}] ${msg}`)
});
// Parse from string
const fromString = await parser.parseSitemap(xmlContent);
// Parse from URL
const sitemap = await parser.parseFromUrl('https://example.com/sitemap.xml');
// Validate URL
const isValid = SitemapParser.isValidSitemapUrl(url);
// Filter educational URLs
const filtered = SitemapParser.filterEducationalUrls(sitemap.urls);
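For intuition about what sitemap parsing involves, the sketch below pulls `<loc>` entries out of sitemap XML and reports whether the document is a sitemap index, mirroring the `urls` and `isIndex` fields shown in the CLI output. It is a simplified regex-based illustration, not how `SitemapParser` is implemented; `parseSitemapXml`, `decodeXmlEntities`, and `SimpleSitemap` are hypothetical names, and CDATA sections and XML namespaces prefixes are not handled.

```typescript
interface SimpleSitemap {
  urls: string[];
  isIndex: boolean;
}

// Decode the five predefined XML entities. "&amp;" is decoded last so that
// text such as "&amp;lt;" is not decoded twice.
function decodeXmlEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&");
}

// Extract every <loc> value and detect a <sitemapindex> root element.
function parseSitemapXml(xml: string): SimpleSitemap {
  const isIndex = /<sitemapindex[\s>]/i.test(xml);
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(decodeXmlEntities(match[1]));
  }
  return { urls, isIndex };
}

const sampleXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/course1</loc></url>
  <url><loc>https://example.com/search?q=python&amp;page=2</loc></url>
</urlset>`;

console.log(parseSitemapXml(sampleXml));
```

In a sitemap index, each `<loc>` points at another sitemap rather than a page, so a caller would parse those documents in turn.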
PageFetcher
Fetch web pages with concurrency control and rate limiting.
const fetcher = new PageFetcher({
maxConcurrency: 5,
timeout: 30000,
delayBetweenRequests: 100,
retryAttempts: 2,
retryDelay: 1000,
logger: (msg, level) => console.log(`[${level}] ${msg}`)
});
// Fetch multiple pages
const results = await fetcher.fetchPages(urls);
// Fetch single page
const result = await fetcher.fetchSinglePage('https://example.com/page');
// Validate URL
const isValid = PageFetcher.isValidUrl(url);
// Filter valid URLs
const validUrls = PageFetcher.filterValidUrls(urls);
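To illustrate what options such as `maxConcurrency` and `delayBetweenRequests` control, here is a generic worker-pool sketch. It is not the library's implementation: `mapWithConcurrency` and `sleep` are hypothetical helpers, and the task is a stand-in that does no network I/O.

```typescript
// Resolve after `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Run `task` over `items` with at most `maxConcurrency` tasks in flight.
// Each worker pauses `delayMs` after finishing a task before taking the next.
// Results are returned in the same order as `items`.
async function mapWithConcurrency<T, R>(
  items: T[],
  maxConcurrency: number,
  delayMs: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all workers

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
      if (delayMs > 0) {
        await sleep(delayMs);
      }
    }
  }

  const workerCount = Math.max(1, Math.min(maxConcurrency, items.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}

// Stand-in for a page fetch: reports success without touching the network.
const demoUrls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"];
mapWithConcurrency(demoUrls, 2, 0, async (url) => ({ url, success: true })).then((results) =>
  console.log(results),
);
```

A fixed pool of workers keeps the number of open requests bounded even for large URL lists, which is the main reason to cap concurrency when crawling a single host.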
MetadataExtractor
Extract metadata including JSON-LD from HTML pages.
const extractor = new MetadataExtractor({
validateSchema: false,
logger: (msg, level) => console.log(`[${level}] ${msg}`)
});
// Extract from multiple pages
const metadata = await extractor.extractFromPages(fetchResults);
// Extract from single page
const pageMetadata = await extractor.extractFromPage(fetchResult);
// Filter educational metadata
const educational = MetadataExtractor.filterEducationalMetadata(metadata);
// Check if has valid content
const hasContent = MetadataExtractor.hasValidContent(metadata);
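As a rough picture of what JSON-LD extraction and educational filtering involve, the sketch below scans HTML for `application/ld+json` script blocks and keeps items whose `@type` is in a small list. It is an illustration only, not `MetadataExtractor`'s actual logic: the function names are hypothetical, the regex is a simplification of real HTML parsing, and the `EDUCATIONAL_TYPES` list is an assumption based on the types that appear in this README.

```typescript
type JsonLdItem = Record<string, unknown>;

// Assumed list for illustration. The library's real criteria may differ.
const EDUCATIONAL_TYPES = ["Course", "LearningResource"];

// Find <script type="application/ld+json"> blocks and parse their contents.
// A block may hold one object or an array of objects; invalid JSON is skipped.
function extractJsonLd(html: string): JsonLdItem[] {
  const pattern =
    /<script[^>]*type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const found: JsonLdItem[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      const parsed: unknown = JSON.parse(match[1]);
      const items: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
      for (const item of items) {
        if (typeof item === "object" && item !== null) {
          found.push(item as JsonLdItem);
        }
      }
    } catch {
      // Not valid JSON: ignore this block and continue with the next one.
    }
  }
  return found;
}

// True when "@type" (a string or an array of strings) names an assumed educational type.
function isEducational(item: JsonLdItem): boolean {
  const raw = item["@type"];
  const types: unknown[] = Array.isArray(raw) ? raw : [raw];
  return types.some((t) => typeof t === "string" && EDUCATIONAL_TYPES.indexOf(t) !== -1);
}

const sampleHtml = `<html><head>
<script type="application/ld+json">{"@type":"Course","name":"Introduction to Python"}</script>
<script type="application/ld+json">{"@type":"WebSite","name":"Example"}</script>
</head><body></body></html>`;

console.log(extractJsonLd(sampleHtml).filter(isEducational));
```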
Types
All types are exported and can be imported:
import type {
SitemapUrl,
ParsedSitemap,
FetchResult,
FetchOptions,
ExtractedMetadata,
LoggerFunction,
} from 'amb-sitemap-parser';
Tree-Shakeable Imports
Import only what you need for smaller bundle sizes:
import { SitemapParser } from 'amb-sitemap-parser/sitemap';
import { PageFetcher } from 'amb-sitemap-parser/fetcher';
import { MetadataExtractor } from 'amb-sitemap-parser/extractor';
Examples
Complete Workflow
import { SitemapParser, PageFetcher, MetadataExtractor } from 'amb-sitemap-parser';
async function processSitemap(sitemapUrl: string) {
// Initialize components
const parser = new SitemapParser();
const fetcher = new PageFetcher({ maxConcurrency: 5 });
const extractor = new MetadataExtractor();
// Parse sitemap
const sitemap = await parser.parseFromUrl(sitemapUrl);
console.log(`Found ${sitemap.urls.length} URLs`);
// Limit to first 50 URLs
const urlsToProcess = sitemap.urls.slice(0, 50);
// Fetch pages
const results = await fetcher.fetchPages(urlsToProcess);
console.log(`Fetched ${results.filter(r => r.success).length} pages successfully`);
// Extract metadata
const metadata = await extractor.extractFromPages(results);
// Filter educational content
const educational = MetadataExtractor.filterEducationalMetadata(metadata);
console.log(`Found ${educational.length} educational resources`);
return educational;
}
With Custom Logger
const logger = (message: string, level: 'info' | 'warn' | 'error') => {
const timestamp = new Date().toISOString();
console.log(`[${timestamp}] [${level.toUpperCase()}] ${message}`);
};
const parser = new SitemapParser({ logger });
const fetcher = new PageFetcher({ logger, maxConcurrency: 3 });
const extractor = new MetadataExtractor({ logger });
Development
# Install dependencies
npm install
# Build the library
npm run build
# Run tests
npm test
# Run tests in watch mode
npm run test:watch
# Run tests with UI
npm run test:ui
# Generate coverage report
npm run test:coverage
# Lint code
npm run lint
# Format code
npm run format
Related Projects
- AMB-Nostr Converter - Convert AMB metadata to Nostr events (kind 30142)
- AMB Specification - General Metadata Profile for Learning Resources
- AMB-NIP (kind 30142) - Nostr event spec for AMB
License
Unlicense
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.