feat(v2): add snapshot extractor framework and run command

Alfredo Di Stasio
2026-03-13 14:24:54 +01:00
parent 6fc583c79f
commit 850e4de71b
10 changed files with 796 additions and 0 deletions


@@ -169,6 +169,55 @@ Import history is visible in Django admin:
- `ImportRun`
- `ImportFile`
## Extractor Framework (v2)
v2 keeps extraction and import as two separate steps:
1. **Extractors** fetch public source content and emit normalized JSON snapshots.
2. **Importer** (`import_snapshots`) validates and upserts those snapshots into PostgreSQL.
Extractor pipeline:
- `fetch` (public endpoint/page requests with conservative HTTP behavior)
- `parse` (source-specific structure)
- `normalize` (map to HoopScout snapshot schema)
- `emit` (write JSON file to incoming directory or custom path)
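
The four stages above can be sketched as plain functions. This is a minimal illustration only, not the project's actual extractor classes: the `players` key, the snapshot fields (`source_name`, `records`, `name`), and every function name here are hypothetical, and the real HoopScout snapshot schema is not shown in this README.

```python
import json
from pathlib import Path
from typing import Any, Callable


def parse(raw_text: str) -> list[dict[str, Any]]:
    """Source-specific step: pull the record list out of the raw payload."""
    # "players" is a hypothetical key used only for this sketch.
    return json.loads(raw_text)["players"]


def normalize(records: list[dict[str, Any]], source_name: str) -> dict[str, Any]:
    """Map parsed records onto a (hypothetical) snapshot structure."""
    return {
        "source_name": source_name,
        "records": [{"name": str(record["name"]).strip()} for record in records],
    }


def emit(snapshot: dict[str, Any], output_dir: Path) -> Path:
    """Write the snapshot as a JSON file and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{snapshot['source_name']}.json"
    path.write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
    return path


def run_pipeline(
    fetch: Callable[[], str],
    output_dir: Path,
    source_name: str,
    dry_run: bool = False,
) -> dict[str, Any]:
    """Run fetch -> parse -> normalize -> emit; skip the write on a dry run."""
    snapshot = normalize(parse(fetch()), source_name)
    if not dry_run:
        emit(snapshot, output_dir)
    return snapshot
```

Passing `fetch` in as a callable keeps the network step replaceable, so the later stages can be exercised against a fixed string in tests.
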
Built-in extractor in this phase:
- `public_json_snapshot` (generic JSON feed extractor for MVP usage)
Run extractor:
```bash
docker compose exec web python manage.py run_extractor public_json_snapshot
```
Run extractor with explicit output path (debugging):
```bash
docker compose exec web python manage.py run_extractor public_json_snapshot --output-path /app/snapshots/incoming
```
Dry-run validation (no file write):
```bash
docker compose exec web python manage.py run_extractor public_json_snapshot --dry-run
```
Extractor environment variables:
- `EXTRACTOR_USER_AGENT`
- `EXTRACTOR_HTTP_TIMEOUT_SECONDS`
- `EXTRACTOR_HTTP_RETRIES`
- `EXTRACTOR_RETRY_SLEEP_SECONDS`
- `EXTRACTOR_REQUEST_DELAY_SECONDS`
- `EXTRACTOR_PUBLIC_JSON_URL`
- `EXTRACTOR_PUBLIC_SOURCE_NAME`
- `EXTRACTOR_INCLUDE_RAW_PAYLOAD`
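
As a rough sketch of how these variables might be read into typed settings: the variable names come from the list above, but the default values, the `ExtractorSettings` class, and the `load_settings` helper are assumptions made for illustration and may differ from the project's actual settings code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ExtractorSettings:
    user_agent: str
    http_timeout_seconds: float
    http_retries: int
    retry_sleep_seconds: float
    request_delay_seconds: float
    public_json_url: str
    public_source_name: str
    include_raw_payload: bool


def _as_bool(value: str) -> bool:
    """Treat common truthy spellings as True; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> ExtractorSettings:
    """Build settings from an environment mapping (defaults are assumed)."""
    return ExtractorSettings(
        user_agent=env.get("EXTRACTOR_USER_AGENT", "example-extractor/0.1"),
        http_timeout_seconds=float(env.get("EXTRACTOR_HTTP_TIMEOUT_SECONDS", "15")),
        http_retries=int(env.get("EXTRACTOR_HTTP_RETRIES", "2")),
        retry_sleep_seconds=float(env.get("EXTRACTOR_RETRY_SLEEP_SECONDS", "1")),
        request_delay_seconds=float(env.get("EXTRACTOR_REQUEST_DELAY_SECONDS", "0.5")),
        public_json_url=env.get("EXTRACTOR_PUBLIC_JSON_URL", ""),
        public_source_name=env.get("EXTRACTOR_PUBLIC_SOURCE_NAME", "public_json"),
        include_raw_payload=_as_bool(env.get("EXTRACTOR_INCLUDE_RAW_PAYLOAD", "false")),
    )
```

Accepting the mapping as a parameter means a plain `dict` can stand in for `os.environ` when checking parsing behavior.
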
Notes:
- extraction is intentionally low-frequency, and retries are kept few and spaced out (see `EXTRACTOR_HTTP_RETRIES` and `EXTRACTOR_RETRY_SLEEP_SECONDS`)
- only public pages/endpoints should be targeted
- emitted snapshots must match the same schema consumed by `import_snapshots`
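
One way to picture conservative retry behavior is a bounded retry loop with a fixed pause between attempts. The sketch below is an assumption about the general shape, not the extractor's actual HTTP code: `fetch_with_retries`, its parameters, and the choice to retry only on `OSError` are all hypothetical.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    do_request: Callable[[], str],
    retries: int = 2,
    retry_sleep_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call do_request up to retries + 1 times, pausing between failures."""
    last_error: Optional[Exception] = None
    for attempt in range(retries + 1):
        try:
            return do_request()
        except OSError as exc:  # network-level failures subclass OSError
            last_error = exc
            if attempt < retries:
                # Fixed pause; no sleep after the final failed attempt.
                sleep(retry_sleep_seconds)
    raise RuntimeError(f"request failed after {retries + 1} attempts") from last_error
```

The injectable `sleep` argument exists only so the loop can be tested without real waiting.
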
## Migration and Superuser Commands
```bash