# HoopScout v2 (Foundation Reset)

HoopScout v2 is a controlled greenfield rebuild inside the existing repository.

Current v2 foundation scope in this branch:
- Django + HTMX server-rendered app
- PostgreSQL as the only primary database
- nginx reverse proxy
- management-command-driven runtime operations
- static snapshot directories persisted via Docker named volumes
- strict JSON snapshot schema + import management command

Out of scope in this step:
- extractor implementation

## Runtime Architecture (v2)

Runtime services are intentionally small:
- `web` (Django/Gunicorn)
- `postgres` (primary DB)
- `nginx` (reverse proxy + static/media serving)
- optional `scheduler` profile service (runs daily extractor/import loop)

No Redis/Celery services are part of the v2 default runtime topology.

Legacy Celery/provider code remains in-repo but is isolated behind `LEGACY_PROVIDER_STACK_ENABLED=1`.
Default v2 runtime keeps that stack disabled.

## Image Strategy

Compose builds and tags images as:
- `registry.younerd.org/hoopscout/web:${APP_IMAGE_TAG:-latest}`
- `registry.younerd.org/hoopscout/nginx:${NGINX_IMAGE_TAG:-latest}`

Reserved for future optional scheduler use:
- `registry.younerd.org/hoopscout/scheduler:${APP_IMAGE_TAG:-latest}`

## Entrypoint Strategy

- `web`: `entrypoint.sh`
  - waits for PostgreSQL
  - optionally runs migrations/collectstatic
  - ensures snapshot directories exist
- `nginx`: `nginx/entrypoint.sh`
  - simple runtime entrypoint wrapper

## Compose Files

- `docker-compose.yml`: production-minded baseline runtime (immutable image filesystem)
- `docker-compose.dev.yml`: development override with source bind mount for `web`
- `docker-compose.release.yml`: production settings override (`DJANGO_SETTINGS_MODULE=config.settings.production`)

### Start development runtime

```bash
cp .env.example .env
docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build
```

### Start release-style runtime

```bash
docker compose -f docker-compose.yml -f docker-compose.release.yml up -d --build
```

### Start scheduler profile (optional)

```bash
docker compose --profile scheduler up -d scheduler
```

For development override:

```bash
docker compose -f docker-compose.yml -f docker-compose.dev.yml --profile scheduler up -d scheduler
```

## Named Volumes

v2 runtime uses named volumes for persistence:
- `postgres_data`
- `static_data`
- `media_data`
- `snapshots_incoming`
- `snapshots_archive`
- `snapshots_failed`

Development override uses separate dev-prefixed volumes to avoid ownership collisions.

## Environment Variables

Use `.env.example` as the source of truth.

Core groups:
- Django runtime/security vars
- PostgreSQL connection vars
- image tag vars (`APP_IMAGE_TAG`, `NGINX_IMAGE_TAG`)
- snapshot directory vars (`STATIC_DATASET_*`)
- optional future scheduler vars (`SCHEDULER_*`)
- daily orchestration vars (`DAILY_ORCHESTRATION_*`)
- optional legacy provider-sync toggle (`LEGACY_PROVIDER_STACK_ENABLED`)

## Snapshot Storage Convention

Snapshot files are expected under:
- incoming: `/app/snapshots/incoming`
- archive: `/app/snapshots/archive`
- failed: `/app/snapshots/failed`

Configured via environment:
- `STATIC_DATASET_INCOMING_DIR`
- `STATIC_DATASET_ARCHIVE_DIR`
- `STATIC_DATASET_FAILED_DIR`

## Snapshot JSON Schema (MVP)

Each file must be a JSON object:

```json
{
  "source_name": "official_site_feed",
  "snapshot_date": "2026-03-13",
  "records": [
    {
      "competition_external_id": "comp-nba",
      "competition_name": "NBA",
      "season": "2025-2026",
      "team_external_id": "team-lal",
      "team_name": "Los Angeles Lakers",
      "player_external_id": "player-23",
      "full_name": "LeBron James",
      "first_name": "LeBron",
      "last_name": "James",
      "birth_date": "1984-12-30",
      "nationality": "US",
      "height_cm": 206,
      "weight_kg": 113,
      "position": "SF",
      "role": "Primary Creator",
      "games_played": 60,
      "minutes_per_game": 34.5,
      "points_per_game": 25.4,
      "rebounds_per_game": 7.2,
      "assists_per_game": 8.1,
      "steals_per_game": 1.3,
      "blocks_per_game": 0.7,
      "turnovers_per_game": 3.2,
      "fg_pct": 51.1,
      "three_pt_pct": 38.4,
      "ft_pct": 79.8,
      "source_metadata": {},
      "raw_payload": {}
    }
  ],
  "source_metadata": {},
  "raw_payload": {}
}
```

Validation is strict:
- unknown fields are rejected
- required fields must exist:
  - `competition_external_id`, `competition_name`, `season`
  - `team_external_id`, `team_name`
  - `player_external_id`, `full_name`
  - core stats (`games_played`, `minutes_per_game`, `points_per_game`, `rebounds_per_game`, `assists_per_game`, `steals_per_game`, `blocks_per_game`, `turnovers_per_game`, `fg_pct`, `three_pt_pct`, `ft_pct`)
- optional player bio/physical fields:
  - `first_name`, `last_name`, `birth_date`, `nationality`, `height_cm`, `weight_kg`, `position`, `role`
- when `birth_date` is provided it must be `YYYY-MM-DD`
- numeric fields must be numeric
- invalid files are moved to failed directory

Importer enrichment note:
- `full_name` is source truth for identity display
- `first_name` / `last_name` are optional and may be absent in public snapshots
- when both are missing, importer may derive them from `full_name` as a best-effort enrichment step
- this enrichment is convenience-only and does not override source truth semantics

## Import Command

Run import:

```bash
docker compose exec web python manage.py import_snapshots
```

Run end-to-end daily orchestration manually (extractors -> import):

```bash
docker compose exec web python manage.py run_daily_orchestration
```

Command behavior:
- scans `STATIC_DATASET_INCOMING_DIR` for `.json` files
- validates strict schema
- computes SHA-256 checksum
- creates `ImportRun` + `ImportFile` records
- upserts relational entities (`Competition`, `Season`, `Team`, `Player`, `PlayerSeason`, `PlayerSeasonStats`)
- skips duplicate content using checksum
- moves valid files to archive
- moves invalid files to failed

### Source Identity Namespacing

Raw external IDs are **not globally unique** across basketball data sources.
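As a rough illustration of the collision risk, the sketch below uses two hypothetical rows (the player names and the reused raw ID are made up for the example) and compares a lookup keyed by raw ID alone with one keyed by source and raw ID together. It is not the importer's code, only a minimal demonstration of the problem.

```python
# Hypothetical rows from two sources that happen to reuse the same raw ID.
rows = [
    {"source_name": "lba", "player_external_id": "player-23", "full_name": "Player A"},
    {"source_name": "bcl", "player_external_id": "player-23", "full_name": "Player B"},
]

# Keyed by raw ID only: the second row silently replaces the first.
by_raw_id = {row["player_external_id"]: row for row in rows}

# Keyed by (source, raw ID): both rows survive.
by_source_and_id = {
    (row["source_name"], row["player_external_id"]): row for row in rows
}

print(len(by_raw_id))         # 1
print(len(by_source_and_id))  # 2
```
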
HoopScout v2 uses a namespaced identity for imported entities:
- `Competition`: unique key is `(source_name, source_uid)`
- `Team`: unique key is `(source_name, source_uid)`
- `Player`: unique key is `(source_name, source_uid)`

`source_uid` values from different sources (for example `lba` and `bcl`) can safely overlap without overwriting each other.

Import history is visible in Django admin:
- `ImportRun`
- `ImportFile`

## Extractor Framework (v2)

v2 keeps extraction and import as two separate steps:

1. **Extractors** fetch public source content and emit normalized JSON snapshots.
2. **Importer** (`import_snapshots`) validates and upserts those snapshots into PostgreSQL.

Extractor pipeline:
- `fetch` (public endpoint/page requests with conservative HTTP behavior)
- `parse` (source-specific structure)
- `normalize` (map to HoopScout snapshot schema)
- `emit` (write JSON file to incoming directory or custom path)

Built-in extractors in this phase:
- `public_json_snapshot` (generic JSON feed extractor for MVP usage)
- `lba` (Lega Basket Serie A MVP extractor)
- `bcl` (Basketball Champions League MVP extractor)

Run extractor:

```bash
docker compose exec web python manage.py run_extractor public_json_snapshot
```

Run extractor with explicit output path (debugging):

```bash
docker compose exec web python manage.py run_extractor public_json_snapshot --output-path /app/snapshots/incoming
```

Dry-run validation (no file write):

```bash
docker compose exec web python manage.py run_extractor public_json_snapshot --dry-run
```

Run only the LBA extractor:

```bash
docker compose exec web python manage.py run_lba_extractor
```

Run only the BCL extractor:

```bash
docker compose exec web python manage.py run_bcl_extractor
```

### Daily orchestration behavior

`run_daily_orchestration` performs:

1. run configured extractors in order from `DAILY_ORCHESTRATION_EXTRACTORS`
2. write snapshots to incoming dir
3. run `import_snapshots`
4. log extractor/import summary

Extractor environment variables:
- `EXTRACTOR_USER_AGENT`
- `EXTRACTOR_HTTP_TIMEOUT_SECONDS`
- `EXTRACTOR_HTTP_RETRIES`
- `EXTRACTOR_RETRY_SLEEP_SECONDS`
- `EXTRACTOR_REQUEST_DELAY_SECONDS`
- `EXTRACTOR_PUBLIC_JSON_URL`
- `EXTRACTOR_PUBLIC_SOURCE_NAME`
- `EXTRACTOR_INCLUDE_RAW_PAYLOAD`
- `EXTRACTOR_LBA_STATS_URL`
- `EXTRACTOR_LBA_SEASON_LABEL`
- `EXTRACTOR_LBA_COMPETITION_EXTERNAL_ID`
- `EXTRACTOR_LBA_COMPETITION_NAME`
- `EXTRACTOR_BCL_STATS_URL`
- `EXTRACTOR_BCL_SEASON_LABEL`
- `EXTRACTOR_BCL_COMPETITION_EXTERNAL_ID`
- `EXTRACTOR_BCL_COMPETITION_NAME`
- `DAILY_ORCHESTRATION_EXTRACTORS`
- `DAILY_ORCHESTRATION_INTERVAL_SECONDS`

Notes:
- extraction is intentionally low-frequency and uses retries conservatively
- only public pages/endpoints should be targeted
- emitted snapshots must match the same schema consumed by `import_snapshots`
- `public_json_snapshot` uses the same required-vs-optional field contract as `SnapshotSchemaValidator` (no stricter extractor-only required bio/physical fields)
- optional scheduler container runs `scripts/scheduler.sh` loop using:
  - image: `registry.younerd.org/hoopscout/scheduler:${APP_IMAGE_TAG:-latest}`
  - command: `/app/scripts/scheduler.sh`
  - interval: `DAILY_ORCHESTRATION_INTERVAL_SECONDS`
  - disabled idle interval: `SCHEDULER_DISABLED_SLEEP_SECONDS`

### Scheduler entrypoint/runtime expectations

- scheduler uses the same app image and base `entrypoint.sh` as web
- scheduler requires database connectivity and snapshot volumes
- scheduler is disabled unless:
  - compose `scheduler` profile is started
  - `SCHEDULER_ENABLED=1`
- if scheduler service is started while disabled (`SCHEDULER_ENABLED=0`), it does not exit; it enters idle sleep mode to avoid restart loops with `restart: unless-stopped`
- this keeps default runtime simple while supporting daily automation

### LBA extractor assumptions and limitations (MVP)

- `source_name` is fixed to `lba`
- the extractor expects one stable public JSON payload that includes player/team/stat rows
- competition is configured by environment and emitted as:
  - `competition_external_id` from `EXTRACTOR_LBA_COMPETITION_EXTERNAL_ID`
  - `competition_name` from `EXTRACTOR_LBA_COMPETITION_NAME`
- season is configured by `EXTRACTOR_LBA_SEASON_LABEL`
- parser supports payload keys: `records`, `data`, `players`, `items`
- normalization supports nested `player` and `team` objects with common stat aliases (`gp/mpg/ppg/rpg/apg/spg/bpg/tov`)
- public-source player bio/physical fields are often incomplete; extractor allows them to be missing and emits `null` for optional fields
- no live HTTP calls in tests; tests use fixtures/mocked responses only

### BCL extractor assumptions and limitations (MVP)

- `source_name` is fixed to `bcl`
- the extractor expects one stable public JSON payload that includes player/team/stat rows
- competition is configured by environment and emitted as:
  - `competition_external_id` from `EXTRACTOR_BCL_COMPETITION_EXTERNAL_ID`
  - `competition_name` from `EXTRACTOR_BCL_COMPETITION_NAME`
- season is configured by `EXTRACTOR_BCL_SEASON_LABEL`
- parser supports payload keys: `records`, `data`, `players`, `items`
- normalization supports nested `player` and `team` objects with common stat aliases (`gp/mpg/ppg/rpg/apg/spg/bpg/tov`)
- public-source player bio/physical fields are often incomplete; extractor allows them to be missing and emits `null` for optional fields
- no live HTTP calls in tests; tests use fixtures/mocked responses only

## Testing

- runtime `web` image stays lean and may not include `pytest` tooling
- run tests with the development compose stack (or a dedicated test image/profile) and install dev dependencies first
- local example (one-off):

```bash
docker compose -f docker-compose.yml -f docker-compose.dev.yml run --rm web sh -lc "export PYTHONUSERBASE=/tmp/pyuser && python -m pip install --user -r requirements/dev.txt && python -m pytest -q"
```

## Migration and Superuser Commands

```bash
docker compose exec web python manage.py migrate
docker compose exec web python manage.py createsuperuser
```

## Health Endpoints

- app health: `/health/`
- nginx healthcheck proxies `/health/` to `web`

## Player Search (v2)

Public player search is server-rendered (Django templates) with HTMX partial updates.

Supported filters:
- free text name search
- nominal position, inferred role
- competition, season, team
- nationality
- age, height, weight ranges
- stats thresholds: games, MPG, PPG, RPG, APG, SPG, BPG, TOV, FG%, 3P%, FT%

Search correctness:
- combined team/competition/season/stat filters are applied to the same `PlayerSeason` context (no cross-row false positives)
- filtering happens at database level with Django ORM

Search metric semantics:
- result columns are labeled as **Best Eligible**
- each displayed metric is `MAX` over eligible player-season rows for that metric in the current filter context
- different metric columns for one player may come from different eligible seasons
- when no eligible value exists for a metric in the current context, the UI shows `-`

Pagination and sorting:
- querystring is preserved
- HTMX navigation keeps URL state in sync with current filters/page/sort

## Saved Searches and Watchlist (v2)

Authenticated users can:
- save current search filters from the player search page
- re-run saved searches from scouting pages
- rename/update/delete saved searches
- update saved search filters via structured JSON in the edit screen
- add/remove favorite players inline (HTMX-friendly) and browse watchlist

## GitFlow

Required branch model:
- `main`: production
- `develop`: integration
- `feature/*`, `release/*`, `hotfix/*`

This v2 work branch is:
- `feature/hoopscout-v2-static-architecture`

## Notes on Legacy Layers

Legacy provider/Celery ingestion layers are not the default runtime path for v2 foundation.
They are intentionally isolated until replaced by v2 snapshot ingestion commands in later tasks.

By default:
- `apps.providers` is not installed
- `/providers/` routes are not mounted
- legacy provider-specific settings are not required
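
The defaults above could be driven by a single environment toggle. The following is a minimal sketch under stated assumptions, not the project's actual settings code: the helper names `legacy_stack_enabled` and `build_installed_apps` and the `BASE_APPS` list are hypothetical, and only the `LEGACY_PROVIDER_STACK_ENABLED=1` convention and the `apps.providers` app name come from this README.

```python
import os


def legacy_stack_enabled(environ=os.environ) -> bool:
    """Return True only when the toggle is explicitly set to "1"."""
    return environ.get("LEGACY_PROVIDER_STACK_ENABLED", "0") == "1"


def build_installed_apps(base_apps: list[str], legacy_enabled: bool) -> list[str]:
    """Copy the base app list and append the legacy app only when enabled."""
    apps = list(base_apps)
    if legacy_enabled:
        apps.append("apps.providers")
    return apps


# Hypothetical base list; the real settings module will differ.
BASE_APPS = ["django.contrib.auth", "django.contrib.contenttypes"]

INSTALLED_APPS = build_installed_apps(BASE_APPS, legacy_stack_enabled())
```

With the variable unset or set to `0`, `apps.providers` stays out of `INSTALLED_APPS`. The same boolean could plausibly guard the `/providers/` URL include so that both defaults flip together.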