feat(v2): add snapshot extractor framework and run command
49
README.md
@@ -169,6 +169,55 @@ Import history is visible in Django admin:
- `ImportRun`
- `ImportFile`

## Extractor Framework (v2)

v2 keeps extraction and import as two separate steps:

1. **Extractors** fetch public source content and emit normalized JSON snapshots.
2. **Importer** (`import_snapshots`) validates and upserts those snapshots into PostgreSQL.

Extractor pipeline:
- `fetch` (public endpoint/page requests with conservative HTTP behavior)
- `parse` (source-specific structure)
- `normalize` (map to HoopScout snapshot schema)
- `emit` (write JSON file to incoming directory or custom path)
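
The four stages above can be pictured as one small class. This is a minimal sketch, not the project's actual code: the class name `SketchExtractor`, the `run` and `demo` helpers, the source name, and the snapshot fields (`source_name`, `records`, `full_name`, `height_cm`) are all assumptions made for illustration, and `fetch` returns canned text instead of making an HTTP request.

```python
import json
import tempfile
from pathlib import Path


class SketchExtractor:
    """Hypothetical extractor showing the fetch -> parse -> normalize -> emit order."""

    source_name = "example_source"  # assumed name, not taken from the project

    def fetch(self) -> str:
        # A real extractor would issue a conservative HTTP request here.
        return '{"players": [{"name": " Ada Example ", "height": "198"}]}'

    def parse(self, raw: str) -> list:
        # Source-specific structure: this fake feed nests rows under "players".
        return json.loads(raw)["players"]

    def normalize(self, rows: list) -> dict:
        # Map source fields to an assumed snapshot shape (not the real HoopScout schema).
        records = [
            {"full_name": row["name"].strip(), "height_cm": int(row["height"])}
            for row in rows
        ]
        return {"source_name": self.source_name, "records": records}

    def emit(self, snapshot: dict, output_dir: Path) -> Path:
        # Write the snapshot as a JSON file inside the target directory.
        output_dir.mkdir(parents=True, exist_ok=True)
        path = output_dir / f"{self.source_name}.json"
        path.write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
        return path

    def run(self, output_dir: Path) -> Path:
        # Chain the stages in pipeline order.
        return self.emit(self.normalize(self.parse(self.fetch())), output_dir)


def demo() -> dict:
    # Run the sketch against a temporary directory and read the file back.
    with tempfile.TemporaryDirectory() as tmp:
        written = SketchExtractor().run(Path(tmp) / "incoming")
        return json.loads(written.read_text(encoding="utf-8"))
```

Keeping each stage as its own method means a source-specific extractor would only need to override `fetch` and `parse`, while `normalize` and `emit` stay tied to the shared snapshot format.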

Built-in extractor in this phase:
- `public_json_snapshot` (generic JSON feed extractor for MVP usage)

Run extractor:

```bash
docker compose exec web python manage.py run_extractor public_json_snapshot
```

Run extractor with explicit output path (debugging):

```bash
docker compose exec web python manage.py run_extractor public_json_snapshot --output-path /app/snapshots/incoming
```

Dry-run validation (no file write):

```bash
docker compose exec web python manage.py run_extractor public_json_snapshot --dry-run
```
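
How `--output-path` and `--dry-run` might interact can be sketched with the standard library's `argparse`. This is an illustration only: the real command is a Django management command whose internals are not shown in this diff, and the `plan_run` helper, the `DEFAULT_INCOMING_DIR` value, and the returned dictionary are assumptions.

```python
import argparse

# Hypothetical default; the project's real incoming directory is not stated here.
DEFAULT_INCOMING_DIR = "snapshots/incoming"


def build_parser() -> argparse.ArgumentParser:
    # Mirror the documented invocation: a positional extractor name plus two flags.
    parser = argparse.ArgumentParser(prog="run_extractor")
    parser.add_argument("extractor_name")
    parser.add_argument("--output-path", default=None)
    parser.add_argument("--dry-run", action="store_true")
    return parser


def plan_run(argv: list) -> dict:
    """Decide where a snapshot would be written, or None for a dry run."""
    args = build_parser().parse_args(argv)
    if args.dry_run:
        target = None  # validate only, write nothing
    else:
        target = args.output_path or DEFAULT_INCOMING_DIR
    return {"extractor": args.extractor_name, "write_to": target}
```

In this sketch a dry run takes precedence over an explicit output path, which matches the "no file write" wording; whether the real command resolves the combination the same way is not confirmed by the README.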

Extractor environment variables:
- `EXTRACTOR_USER_AGENT`
- `EXTRACTOR_HTTP_TIMEOUT_SECONDS`
- `EXTRACTOR_HTTP_RETRIES`
- `EXTRACTOR_RETRY_SLEEP_SECONDS`
- `EXTRACTOR_REQUEST_DELAY_SECONDS`
- `EXTRACTOR_PUBLIC_JSON_URL`
- `EXTRACTOR_PUBLIC_SOURCE_NAME`
- `EXTRACTOR_INCLUDE_RAW_PAYLOAD`
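
One way to read these variables into a typed settings object is sketched below. Only the variable names come from the list above; the `ExtractorSettings` class, the `load_settings` function, the value types, and every fallback value are assumptions chosen for the example, not the project's documented defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    # Treat common truthy spellings as True; everything else is False.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ExtractorSettings:
    user_agent: str
    http_timeout_seconds: float
    http_retries: int
    retry_sleep_seconds: float
    request_delay_seconds: float
    public_json_url: str
    public_source_name: str
    include_raw_payload: bool


def load_settings(env: Mapping[str, str] = os.environ) -> ExtractorSettings:
    # Every fallback below is a placeholder for this sketch, not a project default.
    return ExtractorSettings(
        user_agent=env.get("EXTRACTOR_USER_AGENT", "example-extractor/0.0"),
        http_timeout_seconds=float(env.get("EXTRACTOR_HTTP_TIMEOUT_SECONDS", "10")),
        http_retries=int(env.get("EXTRACTOR_HTTP_RETRIES", "2")),
        retry_sleep_seconds=float(env.get("EXTRACTOR_RETRY_SLEEP_SECONDS", "5")),
        request_delay_seconds=float(env.get("EXTRACTOR_REQUEST_DELAY_SECONDS", "1")),
        public_json_url=env.get("EXTRACTOR_PUBLIC_JSON_URL", ""),
        public_source_name=env.get("EXTRACTOR_PUBLIC_SOURCE_NAME", ""),
        include_raw_payload=_as_bool(env.get("EXTRACTOR_INCLUDE_RAW_PAYLOAD", "false")),
    )
```

Passing the environment in as a mapping keeps the loader easy to exercise with a plain dictionary, without touching the process environment.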

Notes:
- extraction is intentionally low-frequency and uses retries conservatively
- only public pages/endpoints should be targeted
- emitted snapshots must match the same schema consumed by `import_snapshots`
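
The first note, conservative retries, could look like the sketch below. It assumes a fixed pause between a small number of attempts; the function name `fetch_with_retries`, its parameters, and the choice to retry only on `OSError` are illustrative assumptions, and the actual request is passed in as a callable so the example runs without network access.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch_once: Callable[[], str],
    retries: int = 2,
    retry_sleep_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_once, retrying up to `retries` extra times with a fixed pause."""
    last_error: Optional[Exception] = None
    for attempt in range(retries + 1):
        try:
            return fetch_once()
        except OSError as exc:  # network-style failures only; narrow on purpose
            last_error = exc
            if attempt < retries:
                # Pause before the next attempt instead of hammering the source.
                sleep(retry_sleep_seconds)
    raise RuntimeError(f"giving up after {retries + 1} attempts") from last_error
```

A fixed, small retry budget fails fast when a source is down, which suits a job that runs rarely and can simply try again on its next scheduled run.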

## Migration and Superuser Commands

```bash