feat(v2): add simple daily extraction-import orchestration

Alfredo Di Stasio
2026-03-13 14:37:17 +01:00
parent 5df973467d
commit 0ed4fc57b8
10 changed files with 336 additions and 1 deletions

@@ -19,6 +19,7 @@ Runtime services are intentionally small:
- `web` (Django/Gunicorn)
- `postgres` (primary DB)
- `nginx` (reverse proxy + static/media serving)
- optional `scheduler` profile service (runs daily extractor/import loop)
No Redis/Celery services are part of the v2 default runtime topology.
Legacy Celery/provider code is still in the repository history and codebase but is de-emphasized for v2.
@@ -60,6 +61,18 @@ docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build
docker compose -f docker-compose.yml -f docker-compose.release.yml up -d --build
```
### Start scheduler profile (optional)
```bash
docker compose --profile scheduler up -d scheduler
```
With the development override:
```bash
docker compose -f docker-compose.yml -f docker-compose.dev.yml --profile scheduler up -d scheduler
```
## Named Volumes
v2 runtime uses named volumes for persistence:
@@ -82,6 +95,7 @@ Core groups:
- image tag vars (`APP_IMAGE_TAG`, `NGINX_IMAGE_TAG`)
- snapshot directory vars (`STATIC_DATASET_*`)
- optional future scheduler vars (`SCHEDULER_*`)
- daily orchestration vars (`DAILY_ORCHESTRATION_*`)
## Snapshot Storage Convention
@@ -155,6 +169,12 @@ Run import:
docker compose exec web python manage.py import_snapshots
```
Run end-to-end daily orchestration manually (extractors -> import):
```bash
docker compose exec web python manage.py run_daily_orchestration
```
Command behavior:
- scans `STATIC_DATASET_INCOMING_DIR` for `.json` files
- validates strict schema
@@ -217,6 +237,14 @@ Run only the BCL extractor:
docker compose exec web python manage.py run_bcl_extractor
```
### Daily orchestration behavior
`run_daily_orchestration` performs these steps in order:
1. runs the extractors listed in `DAILY_ORCHESTRATION_EXTRACTORS`, in the order given
2. writes their snapshots to the incoming dir (`STATIC_DATASET_INCOMING_DIR`)
3. runs `import_snapshots`
4. logs an extractor/import summary
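
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual command: `run_orchestration`, the `extractors` registry, and the `import_snapshots` callable are hypothetical stand-ins for the real management commands, and the return values (counts) are assumed for the sake of the summary.

```python
from typing import Callable, Dict, List


def run_orchestration(
    extractor_names: List[str],
    extractors: Dict[str, Callable[[], int]],
    import_snapshots: Callable[[], int],
) -> dict:
    """Run extractors in the given order, then import, and return a summary.

    Hypothetical sketch: each extractor callable is assumed to write its
    snapshots to the incoming dir and return how many it wrote.
    """
    summary = {"extractors": {}, "imported": 0}
    for name in extractor_names:
        # Steps 1-2: run each configured extractor in order.
        # An unknown name raises KeyError in this sketch.
        summary["extractors"][name] = extractors[name]()
    # Step 3: the import runs once, after all extractors have finished.
    summary["imported"] = import_snapshots()
    # Step 4: the caller can log this summary.
    return summary
```

How the real command reacts to a failing extractor is not stated here, so the sketch simply lets exceptions propagate.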
Extractor environment variables:
- `EXTRACTOR_USER_AGENT`
- `EXTRACTOR_HTTP_TIMEOUT_SECONDS`
@@ -234,11 +262,26 @@ Extractor environment variables:
- `EXTRACTOR_BCL_SEASON_LABEL`
- `EXTRACTOR_BCL_COMPETITION_EXTERNAL_ID`
- `EXTRACTOR_BCL_COMPETITION_NAME`
- `DAILY_ORCHESTRATION_EXTRACTORS`
- `DAILY_ORCHESTRATION_INTERVAL_SECONDS`
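
As an illustration of how the two `DAILY_ORCHESTRATION_*` variables might be read, here is a small sketch. The comma-separated format for `DAILY_ORCHESTRATION_EXTRACTORS`, the helper name `read_orchestration_settings`, and the fallback interval of 86400 seconds are all assumptions made for this example, not documented behavior.

```python
import os
from typing import List, Mapping, Tuple


def read_orchestration_settings(
    env: Mapping[str, str] = os.environ,
) -> Tuple[List[str], int]:
    """Return (extractor names, interval seconds) from the environment.

    Assumptions for illustration: names are comma-separated, and the
    fallback interval of 86400 seconds (one day) is a made-up default.
    """
    raw_names = env.get("DAILY_ORCHESTRATION_EXTRACTORS", "")
    # Drop empty entries and surrounding whitespace, keep the given order.
    names = [part.strip() for part in raw_names.split(",") if part.strip()]
    interval = int(env.get("DAILY_ORCHESTRATION_INTERVAL_SECONDS", "86400"))
    if interval <= 0:
        raise ValueError("DAILY_ORCHESTRATION_INTERVAL_SECONDS must be positive")
    return names, interval
```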
Notes:
- extraction is intentionally low-frequency and uses retries conservatively
- only public pages/endpoints should be targeted
- emitted snapshots must match the same schema consumed by `import_snapshots`
- optional scheduler container runs `scripts/scheduler.sh` loop using:
  - image: `registry.younerd.org/hoopscout/scheduler:${APP_IMAGE_TAG:-latest}`
  - command: `/app/scripts/scheduler.sh`
  - interval: `DAILY_ORCHESTRATION_INTERVAL_SECONDS`
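
The real loop lives in `scripts/scheduler.sh`; the Python below only illustrates the run-then-sleep pattern such a loop typically follows. The function name `scheduler_loop` and the `max_runs` parameter are hypothetical, added so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Optional


def scheduler_loop(
    run_once: Callable[[], None],
    interval_seconds: int,
    sleep: Callable[[float], None] = time.sleep,
    max_runs: Optional[int] = None,
) -> int:
    """Call run_once, sleep for interval_seconds, and repeat.

    max_runs is a hypothetical knob so the loop can be exercised in tests;
    with None the loop runs until the process is stopped.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        run_once()
        runs += 1
        # Wait one full interval before the next orchestration run.
        sleep(interval_seconds)
    return runs
```

Injecting `sleep` keeps the sketch testable; in a container the default `time.sleep` would be used.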
### Scheduler entrypoint/runtime expectations
- scheduler uses the same app image and base `entrypoint.sh` as web
- scheduler requires database connectivity and snapshot volumes
- scheduler is disabled unless both of the following hold:
  - the compose `scheduler` profile is started
  - `SCHEDULER_ENABLED=1` is set
- this keeps default runtime simple while supporting daily automation
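
The `SCHEDULER_ENABLED` gate could be checked along these lines. This is a sketch under assumptions: the helper name `scheduler_enabled` is hypothetical, and treating any value other than `1` (including an unset variable) as disabled is a guess at the intended behavior. The compose profile condition is enforced outside the process and is not modeled here.

```python
import os
from typing import Mapping


def scheduler_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True only when SCHEDULER_ENABLED is exactly "1".

    Treating a missing or different value as disabled is an assumption
    of this sketch, not documented behavior.
    """
    return env.get("SCHEDULER_ENABLED", "0") == "1"
```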
### LBA extractor assumptions and limitations (MVP)