Crawler Script Protocol v1
Crawler scripts are external processes. The Go backend is the host: it handles dedupe, downloading, catalog writes, thumbnails, preview videos, fingerprints, task status and cancellation.
Invocation
Every script must declare a static crawler name near the top of the Python file. The admin page reads this value when importing the script; users do not type the crawler name manually.
CRAWLER_NAME = "Example Crawler"
The backend runs:
python3 /path/to/crawler.py --job /path/to/job.json
job.json:
{
  "protocol": "crawler.v1",
  "mode": "crawl",
  "run_id": "20260609T120000Z",
  "crawler_id": "example",
  "target_new": 10,
  "seen_source_ids_file": "/data/scriptcrawlers/example/.crawl/seen.txt",
  "output_dir": "/data/scriptcrawlers/example/output",
  "config": {
    "category": "hot"
  },
  "network": {
    "proxy_url": "http://127.0.0.1:7890"
  }
}
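A minimal sketch of how a script might read the job file passed via --job. The helper names load_job and proxies_from_job are hypothetical and not part of the protocol. The proxies mapping follows the shape many Python HTTP clients accept, and treating a missing network.proxy_url as "no proxy" is an assumption.

```python
import argparse
import json


def load_job(argv):
    """Parse --job and return the decoded job.json as a dict."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--job", required=True, help="path to job.json")
    args = parser.parse_args(argv)
    with open(args.job, encoding="utf-8") as fh:
        job = json.load(fh)
    # Refuse to run against a protocol version this script does not understand.
    if job.get("protocol") != "crawler.v1":
        raise ValueError(f"unsupported protocol: {job.get('protocol')!r}")
    return job


def proxies_from_job(job):
    """Return a proxies mapping for HTTP clients, or {} when no proxy is set.

    Assumption: an absent or empty network.proxy_url means "connect directly".
    """
    proxy_url = (job.get("network") or {}).get("proxy_url") or ""
    return {"http": proxy_url, "https": proxy_url} if proxy_url else {}
```

A script would typically call load_job(sys.argv[1:]) once at startup and then read target_new, config and output_dir from the returned dict.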
Importing Scripts
Crawler scripts are configured from the admin crawler page. A script can be uploaded as a local file or imported from an HTTP(S) URL.
Imported scripts are copied into crawler-scripts/ next to the configured local
preview data directory. The import API currently accepts Python files only
(.py) and rejects empty files, files larger than 2 MiB, or scripts without
CRAWLER_NAME.
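The import checks above can be approximated before uploading a script. This is an illustrative sketch only: the real validation lives in the Go backend, and how it locates CRAWLER_NAME is not specified here. The sketch assumes a module-level string assignment, a case-insensitive .py suffix, and that a blank name counts as missing. The names validate_script and read_crawler_name are hypothetical.

```python
import ast

MAX_SCRIPT_BYTES = 2 * 1024 * 1024  # the 2 MiB limit stated above


def read_crawler_name(source):
    """Return the module-level CRAWLER_NAME string, or None if absent."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return None
    for node in tree.body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id == "CRAWLER_NAME":
                value = node.value
                # Only a static string literal counts as a declared name.
                if isinstance(value, ast.Constant) and isinstance(value.value, str):
                    return value.value.strip() or None
    return None


def validate_script(filename, data):
    """Return (crawler_name, error); exactly one of the two is None."""
    if not filename.lower().endswith(".py"):
        return None, "only .py files are accepted"
    if len(data) == 0:
        return None, "file is empty"
    if len(data) > MAX_SCRIPT_BYTES:
        return None, "file is larger than 2 MiB"
    name = read_crawler_name(data.decode("utf-8", errors="replace"))
    if name is None:
        return None, "CRAWLER_NAME is missing"
    return name, None
```

Parsing with ast rather than executing the file keeps the check free of side effects, which matters for scripts fetched from a URL.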
Output
stdout must be JSON Lines. Logs must go to stderr.
Recommended item event:
{
  "type": "item",
  "title": "Video title",
  "media_url": "https://cdn.example.test/video.mp4",
  "thumbnail_url": "https://cdn.example.test/cover.jpg",
  "source_id": "site-native-id",
  "headers": {
    "Referer": "https://example.test/"
  }
}
Minimum item event:
{"type":"item","title":"Video title","media_url":"https://cdn.example.test/video.mp4"}
If a line contains item fields such as title and media_url, the backend also
treats it as an item when type is omitted.
The item fields may also be wrapped inside "item" if that is more convenient:
{"type":"item","item":{"title":"Video title","media_url":"https://cdn.example.test/video.mp4"}}
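The three accepted item shapes (typed, untyped, and wrapped in "item") can be reduced to one flat form. The sketch below is an illustration in Python and not the Go backend's actual parser. The name parse_item_line is hypothetical, it only covers the simple title plus media_url case, and accepting a wrapped item whose type is omitted is an assumption.

```python
import json


def parse_item_line(line):
    """Return a flat item dict if the JSON line is an item, else None."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    if event.get("type") not in (None, "", "item"):
        return None  # progress, done, or some other event type
    # Unwrap {"type": "item", "item": {...}} when the wrapper is present.
    inner = event.get("item")
    fields = inner if isinstance(inner, dict) else event
    # Simple case only: both title and media_url must be non-empty.
    if not fields.get("title") or not fields.get("media_url"):
        return None
    return {key: value for key, value in fields.items() if key != "type"}
```

With this normalization, all three minimal examples above yield the same dict with the keys title and media_url.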
Optional progress/done events:
{"type":"progress","checked":20,"emitted":3}
{"type":"done","stats":{"emitted":10}}
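On the script side, the output rules come down to writing one JSON object per line on stdout and sending everything else to stderr. A minimal sketch follows; the helper names emit, emit_item and log are hypothetical, and dropping empty optional fields is a stylistic choice rather than a protocol requirement.

```python
import json
import sys


def emit(event, out=None):
    """Write one event as a single JSON line and flush immediately."""
    out = out if out is not None else sys.stdout
    # json.dumps without indent never produces a raw newline, so each
    # event occupies exactly one line of output.
    out.write(json.dumps(event, ensure_ascii=False) + "\n")
    out.flush()


def log(message):
    """Diagnostics go to stderr so stdout stays valid JSON Lines."""
    print(message, file=sys.stderr, flush=True)


def emit_item(title, media_url, out=None, **extra):
    """Emit an item event, omitting optional fields that are empty."""
    event = {"type": "item", "title": title, "media_url": media_url}
    event.update({k: v for k, v in extra.items() if v not in (None, "", {}, [])})
    emit(event, out)
```

Flushing after every line lets the host see each item as soon as it is produced instead of waiting for the process to exit.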
Simple Field Rules
- title is required.
- media_url is required for normal scripts. The backend downloads the video.
- thumbnail_url is optional. If it is empty, the backend generates a thumbnail from the downloaded video.
- source_id is optional but recommended. If present, it should be stable within one crawler and lets the backend skip known videos before downloading. If it is empty, the backend creates an internal auto-... ID and later relies on the existing video fingerprint dedupe path.
- headers is optional and is applied to both video and thumbnail downloads. Use it for Referer, cookies or anti-hotlinking requirements.
Advanced Fields
- detail_url, author, tags, category, quality, duration_seconds, description and published_at are optional metadata fields.
- If video and thumbnail need different headers, use media_headers and thumbnail_headers.
- Existing nested fields are still supported for compatibility: media.url, media.local_file, media.headers, thumbnail.url, thumbnail.local_file, thumbnail.headers.
- Advanced scripts may download into job.output_dir and return media_local_file or media.local_file. The path must stay inside output_dir.
- Scripts can read seen_source_ids_file and skip known IDs when they provide stable source_id values. The backend still dedupes every item.
- The backend stops the process after target_new new videos are imported.
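Two of the advanced rules above lend themselves to small helpers: skipping known IDs and keeping local files inside output_dir. The sketch below is hedged on one point in particular: the format of seen_source_ids_file is not specified here, so one ID per line is an assumption. The names load_seen_ids and is_inside are hypothetical, and the backend performs its own checks regardless.

```python
import os


def load_seen_ids(path):
    """Return the set of known source IDs.

    Assumption: the file holds one source ID per line; blank lines are ignored.
    """
    if not path or not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def is_inside(output_dir, candidate):
    """True when candidate resolves to a path strictly inside output_dir."""
    root = os.path.realpath(output_dir)
    # Relative candidates are resolved against output_dir; absolute ones
    # replace it in os.path.join and are then checked as given.
    target = os.path.realpath(os.path.join(root, candidate))
    return target != root and os.path.commonpath([root, target]) == root
```

A script could call load_seen_ids(job["seen_source_ids_file"]) once and skip any candidate whose source_id is already in the set, and check is_inside(job["output_dir"], path) before returning media_local_file.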