hoopscout/README.md
2026-03-20 15:08:20 +01:00


# HoopScout v2 (Foundation Reset)
HoopScout v2 is a controlled greenfield rebuild inside the existing repository.
Current v2 foundation scope in this branch:
- Django + HTMX server-rendered app
- PostgreSQL as the only primary database
- nginx reverse proxy
- management-command-driven runtime operations
- static snapshot directories persisted via Docker named volumes
- strict JSON snapshot schema + import management command
Out of scope in this step:
- extractor implementation
## Runtime Architecture (v2)
Runtime services are intentionally small:
- `web` (Django/Gunicorn)
- `postgres` (primary DB)
- `nginx` (reverse proxy + static/media serving)
- optional `scheduler` profile service (runs daily extractor/import loop)
No Redis/Celery services are part of the v2 default runtime topology.
Legacy Celery/provider code remains in the codebase and repository history but is de-emphasized for v2.
## Image Strategy
Compose builds and tags images as:
- `registry.younerd.org/hoopscout/web:${APP_IMAGE_TAG:-latest}`
- `registry.younerd.org/hoopscout/nginx:${NGINX_IMAGE_TAG:-latest}`
Reserved for future optional scheduler use:
- `registry.younerd.org/hoopscout/scheduler:${APP_IMAGE_TAG:-latest}`
## Entrypoint Strategy
- `web`: `entrypoint.sh`
  - waits for PostgreSQL
  - optionally runs migrations/collectstatic
  - ensures snapshot directories exist
- `nginx`: `nginx/entrypoint.sh`
  - simple runtime entrypoint wrapper
## Compose Files
- `docker-compose.yml`: production-minded baseline runtime (immutable image filesystem)
- `docker-compose.dev.yml`: development override with source bind mount for `web`
- `docker-compose.release.yml`: production settings override (`DJANGO_SETTINGS_MODULE=config.settings.production`)
### Start development runtime
```bash
cp .env.example .env
docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build
```
### Start release-style runtime
```bash
docker compose -f docker-compose.yml -f docker-compose.release.yml up -d --build
```
### Start scheduler profile (optional)
```bash
docker compose --profile scheduler up -d scheduler
```
For development override:
```bash
docker compose -f docker-compose.yml -f docker-compose.dev.yml --profile scheduler up -d scheduler
```
## Named Volumes
v2 runtime uses named volumes for persistence:
- `postgres_data`
- `static_data`
- `media_data`
- `snapshots_incoming`
- `snapshots_archive`
- `snapshots_failed`
Development override uses separate dev-prefixed volumes to avoid ownership collisions.
## Environment Variables
Use `.env.example` as the source of truth.
Core groups:
- Django runtime/security vars
- PostgreSQL connection vars
- image tag vars (`APP_IMAGE_TAG`, `NGINX_IMAGE_TAG`)
- snapshot directory vars (`STATIC_DATASET_*`)
- optional future scheduler vars (`SCHEDULER_*`)
- daily orchestration vars (`DAILY_ORCHESTRATION_*`)
## Snapshot Storage Convention
Snapshot files are expected under:
- incoming: `/app/snapshots/incoming`
- archive: `/app/snapshots/archive`
- failed: `/app/snapshots/failed`
Configured via environment:
- `STATIC_DATASET_INCOMING_DIR`
- `STATIC_DATASET_ARCHIVE_DIR`
- `STATIC_DATASET_FAILED_DIR`
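
As a rough illustration of the convention above, the three directories could be resolved from the environment and created on startup along these lines. This is a minimal sketch, not the project's entrypoint logic: the helper names are hypothetical, and falling back to the documented paths when a variable is unset is an assumption.

```python
import os
import tempfile
from pathlib import Path

# Defaults mirror the paths listed above. Falling back to them when a
# variable is unset is an assumption of this sketch.
DEFAULT_SNAPSHOT_DIRS = {
    "STATIC_DATASET_INCOMING_DIR": "/app/snapshots/incoming",
    "STATIC_DATASET_ARCHIVE_DIR": "/app/snapshots/archive",
    "STATIC_DATASET_FAILED_DIR": "/app/snapshots/failed",
}


def resolve_snapshot_dirs(environ=None):
    """Map each variable name to a Path, preferring the environment value."""
    env = os.environ if environ is None else environ
    return {
        name: Path(env.get(name, default))
        for name, default in DEFAULT_SNAPSHOT_DIRS.items()
    }


def ensure_snapshot_dirs(dirs):
    """Create any missing snapshot directory (safe to call repeatedly)."""
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)


# Demo against a throwaway root so nothing is created under /app.
with tempfile.TemporaryDirectory() as tmp:
    demo_env = {
        "STATIC_DATASET_INCOMING_DIR": f"{tmp}/incoming",
        "STATIC_DATASET_ARCHIVE_DIR": f"{tmp}/archive",
        "STATIC_DATASET_FAILED_DIR": f"{tmp}/failed",
    }
    demo_dirs = resolve_snapshot_dirs(demo_env)
    ensure_snapshot_dirs(demo_dirs)
    demo_created = all(path.is_dir() for path in demo_dirs.values())
```

Passing the environment as a plain dict keeps the helper easy to exercise in tests without touching the real process environment.
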
## Snapshot JSON Schema (MVP)
Each file must be a JSON object:
```json
{
  "source_name": "official_site_feed",
  "snapshot_date": "2026-03-13",
  "records": [
    {
      "competition_external_id": "comp-nba",
      "competition_name": "NBA",
      "season": "2025-2026",
      "team_external_id": "team-lal",
      "team_name": "Los Angeles Lakers",
      "player_external_id": "player-23",
      "full_name": "LeBron James",
      "first_name": "LeBron",
      "last_name": "James",
      "birth_date": "1984-12-30",
      "nationality": "US",
      "height_cm": 206,
      "weight_kg": 113,
      "position": "SF",
      "role": "Primary Creator",
      "games_played": 60,
      "minutes_per_game": 34.5,
      "points_per_game": 25.4,
      "rebounds_per_game": 7.2,
      "assists_per_game": 8.1,
      "steals_per_game": 1.3,
      "blocks_per_game": 0.7,
      "turnovers_per_game": 3.2,
      "fg_pct": 51.1,
      "three_pt_pct": 38.4,
      "ft_pct": 79.8,
      "source_metadata": {},
      "raw_payload": {}
    }
  ],
  "source_metadata": {},
  "raw_payload": {}
}
```
Validation is strict:
- unknown fields are rejected
- required fields must exist
- `snapshot_date` and `birth_date` must be `YYYY-MM-DD`
- numeric fields must be numeric
- invalid files are moved to failed directory
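
The strict rules above could look roughly like the following record-level check. It is a simplified sketch rather than the importer's actual validator: the function names are hypothetical, the field sets are a small illustrative subset of the schema, and which fields count as required is an assumption.

```python
import re
from datetime import date
from numbers import Real

# Illustrative subset of the schema; the required/optional split is assumed.
REQUIRED_FIELDS = {"player_external_id", "full_name", "team_external_id"}
OPTIONAL_FIELDS = {"birth_date", "height_cm", "points_per_game"}
DATE_FIELDS = {"birth_date"}
NUMERIC_FIELDS = {"height_cm", "points_per_game"}
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value):
    """True only for strings shaped YYYY-MM-DD that name a real calendar day."""
    if not isinstance(value, str) or not DATE_PATTERN.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def validate_record(record):
    """Return a list of error strings; an empty list means the record passed."""
    errors = []
    allowed = REQUIRED_FIELDS | OPTIONAL_FIELDS
    for name in sorted(set(record) - allowed):
        errors.append(f"unknown field: {name}")
    for name in sorted(REQUIRED_FIELDS - set(record)):
        errors.append(f"missing required field: {name}")
    for name in sorted(DATE_FIELDS & set(record)):
        if not is_iso_date(record[name]):
            errors.append(f"invalid date: {name}")
    for name in sorted(NUMERIC_FIELDS & set(record)):
        value = record[name]
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            errors.append(f"not numeric: {name}")
    return errors


sample_record = {
    "player_external_id": "player-23",
    "full_name": "LeBron James",
    "team_external_id": "team-lal",
    "birth_date": "1984-12-30",
    "height_cm": 206,
}
sample_errors = validate_record(sample_record)
```

Collecting every error instead of stopping at the first one makes it easier to see why a file ended up in the failed directory.
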
## Import Command
Run import:
```bash
docker compose exec web python manage.py import_snapshots
```
Run end-to-end daily orchestration manually (extractors -> import):
```bash
docker compose exec web python manage.py run_daily_orchestration
```
Command behavior:
- scans `STATIC_DATASET_INCOMING_DIR` for `.json` files
- validates strict schema
- computes SHA-256 checksum
- creates `ImportRun` + `ImportFile` records
- upserts relational entities (`Competition`, `Season`, `Team`, `Player`, `PlayerSeason`, `PlayerSeasonStats`)
- skips duplicate content using checksum
- moves valid files to archive
- moves invalid files to failed
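
The file-handling part of this flow (validate, checksum, skip duplicates, move) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the `import_snapshots` implementation: the function names are hypothetical, validation is reduced to "is a JSON object", database writes are omitted, and archiving duplicate files without re-importing them is an assumption.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash file contents in chunks so large snapshots are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_snapshot(path):
    """Stand-in for the strict schema check: accept any JSON object."""
    try:
        return isinstance(json.loads(path.read_text(encoding="utf-8")), dict)
    except ValueError:  # covers JSON and text decoding errors
        return False


def process_incoming(incoming, archive, failed, seen_checksums):
    """Classify each .json file, move it, and return {file name: status}."""
    outcome = {}
    for path in sorted(incoming.glob("*.json")):
        if not is_valid_snapshot(path):
            status, target = "failed", failed
        else:
            checksum = sha256_of(path)
            if checksum in seen_checksums:
                # Assumption: duplicates are archived without being re-imported.
                status, target = "duplicate", archive
            else:
                seen_checksums.add(checksum)
                status, target = "imported", archive
        shutil.move(str(path), str(target / path.name))
        outcome[path.name] = status
    return outcome


# Demo in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    incoming, archive, failed = (root / n for n in ("incoming", "archive", "failed"))
    for directory in (incoming, archive, failed):
        directory.mkdir()
    (incoming / "a.json").write_text('{"source_name": "lba"}', encoding="utf-8")
    (incoming / "b.json").write_text('{"source_name": "lba"}', encoding="utf-8")
    (incoming / "c.json").write_text("not json", encoding="utf-8")
    demo_outcome = process_incoming(incoming, archive, failed, set())
    demo_archived = sorted(p.name for p in archive.iterdir())
    demo_failed = sorted(p.name for p in failed.iterdir())
```

In the demo, `b.json` has the same bytes as `a.json`, so it is reported as a duplicate even though its file name differs.
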
### Source Identity Namespacing
Raw external IDs are **not globally unique** across basketball data sources. HoopScout v2 uses a namespaced identity for imported entities:
- `Competition`: unique key is `(source_name, source_uid)`
- `Team`: unique key is `(source_name, source_uid)`
- `Player`: unique key is `(source_name, source_uid)`
`source_uid` values from different sources (for example `lba` and `bcl`) can safely overlap without overwriting each other.
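
The effect of the composite key can be shown with an in-memory stand-in for a table. This is only a sketch: `upsert_player` and the `players` dict are hypothetical, and the real models presumably enforce the same rule at the database level (in Django, typically through a `UniqueConstraint` on the two fields).

```python
# In-memory stand-in for a table whose unique key is (source_name, source_uid).
players = {}


def upsert_player(source_name, source_uid, full_name):
    """Insert or update one player; return True when a new row was created."""
    key = (source_name, source_uid)
    created = key not in players
    players[key] = {"full_name": full_name}
    return created


# The same raw ID "23" arrives from two different sources.
created_lba = upsert_player("lba", "23", "Player From LBA")
created_bcl = upsert_player("bcl", "23", "Player From BCL")
# A second import from the same source updates the existing row instead.
created_again = upsert_player("lba", "23", "Player From LBA (updated)")
```

Keying on the pair rather than on the raw ID is what keeps the `bcl` row from overwriting the `lba` row in this example.
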
Import history is visible in Django admin:
- `ImportRun`
- `ImportFile`
## Extractor Framework (v2)
v2 keeps extraction and import as two separate steps:
1. **Extractors** fetch public source content and emit normalized JSON snapshots.
2. **Importer** (`import_snapshots`) validates and upserts those snapshots into PostgreSQL.
Extractor pipeline:
- `fetch` (public endpoint/page requests with conservative HTTP behavior)
- `parse` (source-specific structure)
- `normalize` (map to HoopScout snapshot schema)
- `emit` (write JSON file to incoming directory or custom path)
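
The four stages can be pictured as small functions chained together. The sketch below is illustrative only: every function name, the payload shape, the `example_source` name, and the output file naming are assumptions, and `fetch` is stubbed so that no HTTP request is made.

```python
import json
import tempfile
from pathlib import Path


def fetch():
    """Stub for the HTTP request: returns a canned payload instead of calling a URL."""
    return json.dumps({"players": [{"id": "p-1", "name": "Ada Example", "ppg": 12.5}]})


def parse(raw_text):
    """Pull the source-specific row list out of the raw response body."""
    return json.loads(raw_text)["players"]


def normalize(rows, source_name, snapshot_date):
    """Map source rows onto (a small part of) the snapshot schema."""
    return {
        "source_name": source_name,
        "snapshot_date": snapshot_date,
        "records": [
            {
                "player_external_id": row["id"],
                "full_name": row["name"],
                "points_per_game": row["ppg"],
            }
            for row in rows
        ],
    }


def emit(snapshot, output_dir):
    """Write the snapshot as JSON; the file naming here is an assumption."""
    path = output_dir / f"{snapshot['source_name']}-{snapshot['snapshot_date']}.json"
    path.write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
    return path


def run_pipeline(output_dir, dry_run=False):
    """Run fetch -> parse -> normalize, and emit unless dry_run is set."""
    snapshot = normalize(parse(fetch()), "example_source", "2026-03-13")
    written = None if dry_run else emit(snapshot, output_dir)
    return snapshot, written


with tempfile.TemporaryDirectory() as tmp:
    demo_snapshot, demo_path = run_pipeline(Path(tmp))
    demo_file_name = demo_path.name
    demo_round_trip = json.loads(demo_path.read_text(encoding="utf-8")) == demo_snapshot
    _, demo_dry_path = run_pipeline(Path(tmp), dry_run=True)
```

Keeping `emit` as the only stage with side effects is what makes a dry run cheap: the earlier stages still execute and can be validated without writing a file.
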
Built-in extractors in this phase:
- `public_json_snapshot` (generic JSON feed extractor for MVP usage)
- `lba` (Lega Basket Serie A MVP extractor)
- `bcl` (Basketball Champions League MVP extractor)
Run extractor:
```bash
docker compose exec web python manage.py run_extractor public_json_snapshot
```
Run extractor with explicit output path (debugging):
```bash
docker compose exec web python manage.py run_extractor public_json_snapshot --output-path /app/snapshots/incoming
```
Dry-run validation (no file write):
```bash
docker compose exec web python manage.py run_extractor public_json_snapshot --dry-run
```
Run only the LBA extractor:
```bash
docker compose exec web python manage.py run_lba_extractor
```
Run only the BCL extractor:
```bash
docker compose exec web python manage.py run_bcl_extractor
```
### Daily orchestration behavior
`run_daily_orchestration` performs:
1. run configured extractors in order from `DAILY_ORCHESTRATION_EXTRACTORS`
2. write snapshots to incoming dir
3. run `import_snapshots`
4. log extractor/import summary
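
The sequence above might be modeled as follows. This is a hedged sketch, not the management command itself: the function names are hypothetical, the comma-separated format of `DAILY_ORCHESTRATION_EXTRACTORS` is an assumption, and so is the choice to continue with the remaining extractors after one fails.

```python
import os


def configured_extractors(environ=None):
    """Parse extractor names; a comma-separated value is assumed."""
    env = os.environ if environ is None else environ
    raw = env.get("DAILY_ORCHESTRATION_EXTRACTORS", "")
    return [name.strip() for name in raw.split(",") if name.strip()]


def run_daily_orchestration(extractors, run_import, environ=None):
    """Run extractors in configured order, then the import; return a summary."""
    summary = {"extractors": {}, "import": None}
    for name in configured_extractors(environ):
        runner = extractors.get(name)
        if runner is None:
            summary["extractors"][name] = "unknown"
            continue
        try:
            runner()
            summary["extractors"][name] = "ok"
        except Exception as exc:
            # Assumption: one failing extractor does not stop the others.
            summary["extractors"][name] = f"failed: {exc}"
    summary["import"] = run_import()
    return summary


demo_calls = []


def demo_lba():
    demo_calls.append("lba")


def demo_bcl():
    demo_calls.append("bcl")
    raise RuntimeError("timeout")


def demo_import():
    demo_calls.append("import")
    return "import ok"


demo_summary = run_daily_orchestration(
    extractors={"lba": demo_lba, "bcl": demo_bcl},
    run_import=demo_import,
    environ={"DAILY_ORCHESTRATION_EXTRACTORS": "lba, bcl, missing"},
)
```

The returned dictionary plays the role of the logged extractor/import summary in this sketch.
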
Extractor environment variables:
- `EXTRACTOR_USER_AGENT`
- `EXTRACTOR_HTTP_TIMEOUT_SECONDS`
- `EXTRACTOR_HTTP_RETRIES`
- `EXTRACTOR_RETRY_SLEEP_SECONDS`
- `EXTRACTOR_REQUEST_DELAY_SECONDS`
- `EXTRACTOR_PUBLIC_JSON_URL`
- `EXTRACTOR_PUBLIC_SOURCE_NAME`
- `EXTRACTOR_INCLUDE_RAW_PAYLOAD`
- `EXTRACTOR_LBA_STATS_URL`
- `EXTRACTOR_LBA_SEASON_LABEL`
- `EXTRACTOR_LBA_COMPETITION_EXTERNAL_ID`
- `EXTRACTOR_LBA_COMPETITION_NAME`
- `EXTRACTOR_BCL_STATS_URL`
- `EXTRACTOR_BCL_SEASON_LABEL`
- `EXTRACTOR_BCL_COMPETITION_EXTERNAL_ID`
- `EXTRACTOR_BCL_COMPETITION_NAME`
- `DAILY_ORCHESTRATION_EXTRACTORS`
- `DAILY_ORCHESTRATION_INTERVAL_SECONDS`
Notes:
- extraction is intentionally low-frequency and uses retries conservatively
- only public pages/endpoints should be targeted
- emitted snapshots must match the same schema consumed by `import_snapshots`
- optional scheduler container runs `scripts/scheduler.sh` loop using:
  - image: `registry.younerd.org/hoopscout/scheduler:${APP_IMAGE_TAG:-latest}`
  - command: `/app/scripts/scheduler.sh`
  - interval: `DAILY_ORCHESTRATION_INTERVAL_SECONDS`
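
To make the conservative retry behavior concrete, here is one way the retry settings could drive a fetch helper. It is a sketch under assumptions: the helper names and the default values are hypothetical, `EXTRACTOR_HTTP_RETRIES` is read as "extra attempts after the first", and the network call is injected as a callable so the example never touches a real endpoint.

```python
import os
import time


def retry_settings(environ=None):
    """Read retry count and sleep seconds; the defaults here are assumptions."""
    env = os.environ if environ is None else environ
    retries = int(env.get("EXTRACTOR_HTTP_RETRIES", "3"))
    sleep_seconds = float(env.get("EXTRACTOR_RETRY_SLEEP_SECONDS", "1"))
    return retries, sleep_seconds


def fetch_with_retries(fetch, retries, retry_sleep_seconds, sleep=time.sleep):
    """Call fetch(); on OSError wait and try again, up to `retries` extra times."""
    attempts = max(1, retries + 1)
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            last_error = exc
            if attempt < attempts - 1:
                sleep(retry_sleep_seconds)
    raise last_error


demo_attempts = []
demo_sleeps = []


def flaky_fetch():
    """Fail twice, then succeed, to exercise the retry path."""
    demo_attempts.append(1)
    if len(demo_attempts) < 3:
        raise OSError("temporary failure")
    return b'{"players": []}'


demo_body = fetch_with_retries(
    flaky_fetch, retries=3, retry_sleep_seconds=0.5, sleep=demo_sleeps.append
)
```

Injecting `sleep` lets tests record the waits instead of actually pausing, which fits the note that tests make no live HTTP calls.
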
### Scheduler entrypoint/runtime expectations
- scheduler uses the same app image and base `entrypoint.sh` as web
- scheduler requires database connectivity and snapshot volumes
- scheduler is disabled unless:
  - compose `scheduler` profile is started
  - `SCHEDULER_ENABLED=1`
- this keeps default runtime simple while supporting daily automation
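
The real loop lives in `scripts/scheduler.sh`, whose contents are not shown here. Purely to illustrate the gating and interval logic, a Python rendering might look like the sketch below. The function name, the default interval, the `max_cycles` test hook, and the choice to return immediately when disabled are all assumptions.

```python
import time


def run_scheduler(run_once, environ, sleep=time.sleep, max_cycles=None):
    """Run the orchestration repeatedly when enabled; return cycles completed."""
    if environ.get("SCHEDULER_ENABLED") != "1":
        # Assumption: a disabled scheduler does nothing and returns at once.
        return 0
    # Assumed default of one day when the interval variable is unset.
    interval = float(environ.get("DAILY_ORCHESTRATION_INTERVAL_SECONDS", "86400"))
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        run_once()
        cycles += 1
        sleep(interval)
    return cycles


demo_runs = []
demo_sleeps = []

demo_cycles = run_scheduler(
    lambda: demo_runs.append("run"),
    {"SCHEDULER_ENABLED": "1", "DAILY_ORCHESTRATION_INTERVAL_SECONDS": "60"},
    sleep=demo_sleeps.append,
    max_cycles=2,
)
demo_disabled = run_scheduler(
    lambda: demo_runs.append("run"), {}, sleep=demo_sleeps.append, max_cycles=2
)
```

Here `run_once` stands in for invoking `run_daily_orchestration`, and `max_cycles` exists only so the loop can terminate in a test.
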
### LBA extractor assumptions and limitations (MVP)
- `source_name` is fixed to `lba`
- the extractor expects one stable public JSON payload that includes player/team/stat rows
- competition is configured by environment and emitted as:
  - `competition_external_id` from `EXTRACTOR_LBA_COMPETITION_EXTERNAL_ID`
  - `competition_name` from `EXTRACTOR_LBA_COMPETITION_NAME`
- season is configured by `EXTRACTOR_LBA_SEASON_LABEL`
- parser supports payload keys: `records`, `data`, `players`, `items`
- normalization supports nested `player` and `team` objects with common stat aliases (`gp/mpg/ppg/rpg/apg/spg/bpg/tov`)
- no live HTTP calls in tests; tests use fixtures/mocked responses only
### BCL extractor assumptions and limitations (MVP)
- `source_name` is fixed to `bcl`
- the extractor expects one stable public JSON payload that includes player/team/stat rows
- competition is configured by environment and emitted as:
  - `competition_external_id` from `EXTRACTOR_BCL_COMPETITION_EXTERNAL_ID`
  - `competition_name` from `EXTRACTOR_BCL_COMPETITION_NAME`
- season is configured by `EXTRACTOR_BCL_SEASON_LABEL`
- parser supports payload keys: `records`, `data`, `players`, `items`
- normalization supports nested `player` and `team` objects with common stat aliases (`gp/mpg/ppg/rpg/apg/spg/bpg/tov`)
- no live HTTP calls in tests; tests use fixtures/mocked responses only
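
Both the LBA and BCL notes describe the same parsing and normalization shape, which can be sketched as below. This is an approximation, not either extractor's code: the function names are hypothetical, the lookup order of the payload keys is an assumption, the `id`/`name` keys inside the nested objects are hypothetical, and the alias-to-field mapping is inferred from the stat names in the snapshot schema.

```python
# Payload keys the parser is documented to support; the lookup order is assumed.
ROW_KEYS = ("records", "data", "players", "items")

# Assumed mapping from the documented stat aliases to snapshot schema fields.
STAT_ALIASES = {
    "gp": "games_played",
    "mpg": "minutes_per_game",
    "ppg": "points_per_game",
    "rpg": "rebounds_per_game",
    "apg": "assists_per_game",
    "spg": "steals_per_game",
    "bpg": "blocks_per_game",
    "tov": "turnovers_per_game",
}


def extract_rows(payload):
    """Return the first list found under one of the supported keys."""
    for key in ROW_KEYS:
        rows = payload.get(key)
        if isinstance(rows, list):
            return rows
    raise ValueError("payload has none of the supported row keys")


def normalize_row(row):
    """Flatten nested player/team objects and rename stat aliases."""
    player = row["player"]  # hypothetical nested shape: {"id": ..., "name": ...}
    team = row["team"]
    record = {
        "player_external_id": str(player["id"]),
        "full_name": player["name"],
        "team_external_id": str(team["id"]),
        "team_name": team["name"],
    }
    for alias, field in STAT_ALIASES.items():
        if alias in row:
            record[field] = row[alias]
    return record


demo_payload = {
    "data": [
        {
            "player": {"id": 23, "name": "Ada Example"},
            "team": {"id": "t-1", "name": "Example BC"},
            "gp": 20,
            "ppg": 14.2,
            "tov": 2.1,
        }
    ]
}
demo_records = [normalize_row(row) for row in extract_rows(demo_payload)]
```

Stats that the source does not provide are simply left out of the record in this sketch; how the real extractors treat missing stats is not specified above.
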
## Migration and Superuser Commands
```bash
docker compose exec web python manage.py migrate
docker compose exec web python manage.py createsuperuser
```
## Health Endpoints
- app health: `/health/`
- nginx healthcheck proxies `/health/` to `web`
## Player Search (v2)
Public player search is server-rendered (Django templates) with HTMX partial updates.
Supported filters:
- free text name search
- nominal position, inferred role
- competition, season, team
- nationality
- age, height, weight ranges
- stats thresholds: games, MPG, PPG, RPG, APG, SPG, BPG, TOV, FG%, 3P%, FT%
Search correctness:
- combined team/competition/season/stat filters are applied to the same `PlayerSeason` context (no cross-row false positives)
- filtering happens at database level with Django ORM
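
The "same `PlayerSeason` context" rule is easiest to see with a toy data set. The sketch below uses plain Python lists rather than the ORM, and the rows and function names are made up for illustration. In the Django ORM, the same distinction usually corresponds to putting conditions that span a multi-valued relation in a single `filter()` call rather than in chained calls.

```python
# Each dict stands for one player-season row (hypothetical data).
player_seasons = [
    {"player": "A", "team": "Team X", "season": "2024-2025", "ppg": 8.0},
    {"player": "A", "team": "Team Y", "season": "2025-2026", "ppg": 21.0},
    {"player": "B", "team": "Team X", "season": "2025-2026", "ppg": 20.5},
]


def same_row_match(rows, team, min_ppg):
    """Correct: one row must satisfy every condition at once."""
    return sorted(
        {r["player"] for r in rows if r["team"] == team and r["ppg"] >= min_ppg}
    )


def cross_row_match(rows, team, min_ppg):
    """Incorrect: each condition may be satisfied by a different row."""
    on_team = {r["player"] for r in rows if r["team"] == team}
    scorers = {r["player"] for r in rows if r["ppg"] >= min_ppg}
    return sorted(on_team & scorers)


correct_result = same_row_match(player_seasons, "Team X", 20)
false_positive_result = cross_row_match(player_seasons, "Team X", 20)
```

Player A played for Team X and, in a different season, averaged over 20 points for Team Y. Only the cross-row version returns A, which is the false positive the search is designed to avoid.
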
Search metric semantics:
- result columns are labeled as **Best Eligible**
- each displayed metric is `MAX` over eligible player-season rows for that metric in the current filter context
- different metric columns for one player may come from different eligible seasons
- when no eligible value exists for a metric in the current context, the UI shows `-`
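
A small sketch of the **Best Eligible** semantics, assuming the eligible rows for one player have already been selected by the filters. The function name and the data are hypothetical; the real implementation computes this at the database level.

```python
def best_eligible(eligible_rows, metrics):
    """Per metric, take the maximum over eligible rows, or "-" if none has a value."""
    best = {}
    for metric in metrics:
        values = [row[metric] for row in eligible_rows if row.get(metric) is not None]
        best[metric] = max(values) if values else "-"
    return best


# Two eligible seasons for one player (hypothetical numbers).
eligible_rows = [
    {"season": "2024-2025", "ppg": 25.4, "apg": 6.0, "bpg": None},
    {"season": "2025-2026", "ppg": 22.0, "apg": 8.1, "bpg": None},
]
best_row = best_eligible(eligible_rows, ["ppg", "apg", "bpg"])
```

In this example the PPG value comes from 2024-2025 and the APG value from 2025-2026, which shows how columns for one player can originate from different seasons.
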
Pagination and sorting:
- querystring is preserved
- HTMX navigation keeps URL state in sync with current filters/page/sort
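
Preserving the querystring across pages can be done by rebuilding it with only the `page` parameter replaced. The helper below is a hypothetical sketch: the parameter names (`q`, `position`, `sort`, `page`) are assumptions and are not taken from the actual views.

```python
from urllib.parse import parse_qsl, urlencode


def with_page(querystring, page):
    """Return the querystring with `page` replaced and every other filter kept."""
    params = [
        (key, value)
        for key, value in parse_qsl(querystring, keep_blank_values=True)
        if key != "page"
    ]
    params.append(("page", str(page)))
    return urlencode(params)


next_page_query = with_page("q=james&position=SF&sort=ppg&page=2", 3)
```

Because filters and sort order stay in the URL, a paginated link built this way can be bookmarked or shared and reproduces the same result set.
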
## Saved Searches and Watchlist (v2)
Authenticated users can:
- save current search filters from the player search page
- re-run saved searches from scouting pages
- rename/update/delete saved searches
- update saved search filters via structured JSON in the edit screen
- add/remove favorite players inline (HTMX-friendly) and browse watchlist
## GitFlow
Required branch model:
- `main`: production
- `develop`: integration
- `feature/*`, `release/*`, `hotfix/*`
This v2 work branch is:
- `feature/hoopscout-v2-static-architecture`
## Notes on Legacy Layers
Legacy provider/Celery ingestion layers are not the default runtime path for v2 foundation.
They are intentionally isolated until replaced by v2 snapshot ingestion commands in later tasks.