hoopscout/README.md
2026-03-20 15:23:43 +01:00

HoopScout v2 (Foundation Reset)

HoopScout v2 is a controlled greenfield rebuild inside the existing repository.

Current v2 foundation scope in this branch:

  • Django + HTMX server-rendered app
  • PostgreSQL as the only primary database
  • nginx reverse proxy
  • management-command-driven runtime operations
  • static snapshot directories persisted via Docker named volumes
  • strict JSON snapshot schema + import management command

Out of scope in this step:

  • extractor implementation

Runtime Architecture (v2)

Runtime services are intentionally small:

  • web (Django/Gunicorn)
  • postgres (primary DB)
  • nginx (reverse proxy + static/media serving)
  • optional scheduler profile service (runs daily extractor/import loop)

No Redis/Celery services are part of the default v2 runtime topology. Legacy Celery/provider code remains in the codebase and repository history but is de-emphasized for v2.

Image Strategy

Compose builds and tags images as:

  • registry.younerd.org/hoopscout/web:${APP_IMAGE_TAG:-latest}
  • registry.younerd.org/hoopscout/nginx:${NGINX_IMAGE_TAG:-latest}

Reserved for future optional scheduler use:

  • registry.younerd.org/hoopscout/scheduler:${APP_IMAGE_TAG:-latest}

Entrypoint Strategy

  • web: entrypoint.sh
    • waits for PostgreSQL
    • optionally runs migrations/collectstatic
    • ensures snapshot directories exist
  • nginx: nginx/entrypoint.sh
    • simple runtime entrypoint wrapper

Compose Files

  • docker-compose.yml: production-minded baseline runtime (immutable image filesystem)
  • docker-compose.dev.yml: development override with source bind mount for web
  • docker-compose.release.yml: production settings override (DJANGO_SETTINGS_MODULE=config.settings.production)

Start development runtime

cp .env.example .env
docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build

Start release-style runtime

docker compose -f docker-compose.yml -f docker-compose.release.yml up -d --build

Start scheduler profile (optional)

docker compose --profile scheduler up -d scheduler

For development override:

docker compose -f docker-compose.yml -f docker-compose.dev.yml --profile scheduler up -d scheduler

Named Volumes

v2 runtime uses named volumes for persistence:

  • postgres_data
  • static_data
  • media_data
  • snapshots_incoming
  • snapshots_archive
  • snapshots_failed

Development override uses separate dev-prefixed volumes to avoid ownership collisions.

Environment Variables

Use .env.example as the source of truth.

Core groups:

  • Django runtime/security vars
  • PostgreSQL connection vars
  • image tag vars (APP_IMAGE_TAG, NGINX_IMAGE_TAG)
  • snapshot directory vars (STATIC_DATASET_*)
  • optional future scheduler vars (SCHEDULER_*)
  • daily orchestration vars (DAILY_ORCHESTRATION_*)

Snapshot Storage Convention

Snapshot files are expected under:

  • incoming: /app/snapshots/incoming
  • archive: /app/snapshots/archive
  • failed: /app/snapshots/failed

Configured via environment:

  • STATIC_DATASET_INCOMING_DIR
  • STATIC_DATASET_ARCHIVE_DIR
  • STATIC_DATASET_FAILED_DIR
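
As an illustration of the convention above, the following minimal sketch resolves the three directories from the environment and falls back to the documented container paths. The helper name `resolve_snapshot_dirs` is hypothetical and is not part of the HoopScout codebase; the actual directory setup happens in entrypoint.sh.

```python
import os
from pathlib import Path

# Defaults mirror the documented container paths.
DEFAULT_DIRS = {
    "STATIC_DATASET_INCOMING_DIR": "/app/snapshots/incoming",
    "STATIC_DATASET_ARCHIVE_DIR": "/app/snapshots/archive",
    "STATIC_DATASET_FAILED_DIR": "/app/snapshots/failed",
}


def resolve_snapshot_dirs(env=None, create=False):
    """Return {variable name: Path}, preferring env values over the defaults.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    dirs = {name: Path(env.get(name) or default)
            for name, default in DEFAULT_DIRS.items()}
    if create:
        # Equivalent in spirit to "ensures snapshot directories exist".
        for path in dirs.values():
            path.mkdir(parents=True, exist_ok=True)
    return dirs
```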

Snapshot JSON Schema (MVP)

Each file must be a JSON object:

{
  "source_name": "official_site_feed",
  "snapshot_date": "2026-03-13",
  "records": [
    {
      "competition_external_id": "comp-nba",
      "competition_name": "NBA",
      "season": "2025-2026",
      "team_external_id": "team-lal",
      "team_name": "Los Angeles Lakers",
      "player_external_id": "player-23",
      "full_name": "LeBron James",
      "first_name": "LeBron",
      "last_name": "James",
      "birth_date": "1984-12-30",
      "nationality": "US",
      "height_cm": 206,
      "weight_kg": 113,
      "position": "SF",
      "role": "Primary Creator",
      "games_played": 60,
      "minutes_per_game": 34.5,
      "points_per_game": 25.4,
      "rebounds_per_game": 7.2,
      "assists_per_game": 8.1,
      "steals_per_game": 1.3,
      "blocks_per_game": 0.7,
      "turnovers_per_game": 3.2,
      "fg_pct": 51.1,
      "three_pt_pct": 38.4,
      "ft_pct": 79.8,
      "source_metadata": {},
      "raw_payload": {}
    }
  ],
  "source_metadata": {},
  "raw_payload": {}
}

Validation is strict:

  • unknown fields are rejected
  • required fields must exist:
    • competition_external_id, competition_name, season
    • team_external_id, team_name
    • player_external_id, full_name
    • core stats (games_played, minutes_per_game, points_per_game, rebounds_per_game, assists_per_game, steals_per_game, blocks_per_game, turnovers_per_game, fg_pct, three_pt_pct, ft_pct)
  • optional player bio/physical fields:
    • first_name, last_name, birth_date, nationality, height_cm, weight_kg, position, role
  • when birth_date is provided it must be YYYY-MM-DD
  • numeric fields must be numeric
  • invalid files are moved to failed directory
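
The record-level rules above can be sketched in plain Python as follows. This is not the project's SnapshotSchemaValidator; the function names and error strings are assumptions, and the sketch only covers per-record checks (not the top-level object or file handling). Treating null as acceptable for optional numeric fields is also an assumption.

```python
import re
from datetime import datetime

REQUIRED_TEXT = ("competition_external_id", "competition_name", "season",
                 "team_external_id", "team_name",
                 "player_external_id", "full_name")
REQUIRED_NUMERIC = ("games_played", "minutes_per_game", "points_per_game",
                    "rebounds_per_game", "assists_per_game", "steals_per_game",
                    "blocks_per_game", "turnovers_per_game",
                    "fg_pct", "three_pt_pct", "ft_pct")
OPTIONAL_NUMERIC = ("height_cm", "weight_kg")
OPTIONAL_OTHER = ("first_name", "last_name", "birth_date", "nationality",
                  "position", "role", "source_metadata", "raw_payload")
ALLOWED = set(REQUIRED_TEXT + REQUIRED_NUMERIC + OPTIONAL_NUMERIC + OPTIONAL_OTHER)


def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_iso_date(value):
    # Require the exact YYYY-MM-DD shape, then check it is a real calendar date.
    if not isinstance(value, str) or not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def validate_record(record):
    """Return a list of error strings; an empty list means the record passed."""
    errors = [f"unknown field: {key}" for key in sorted(set(record) - ALLOWED)]
    errors += [f"missing required field: {key}"
               for key in REQUIRED_TEXT + REQUIRED_NUMERIC if key not in record]
    for key in REQUIRED_NUMERIC:
        if key in record and not is_number(record[key]):
            errors.append(f"not numeric: {key}")
    for key in OPTIONAL_NUMERIC:
        # Assumption: optional numeric fields may be null.
        if record.get(key) is not None and not is_number(record[key]):
            errors.append(f"not numeric: {key}")
    birth = record.get("birth_date")
    if birth is not None and not is_iso_date(birth):
        errors.append("birth_date must be YYYY-MM-DD")
    return errors
```

A file whose records produce any error would then be treated as invalid and routed to the failed directory.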

Importer enrichment note:

  • full_name is source truth for identity display
  • first_name / last_name are optional and may be absent in public snapshots
  • when both are missing, importer may derive them from full_name as a best-effort enrichment step
  • this enrichment is convenience-only and does not override source truth semantics
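
A minimal sketch of such a best-effort derivation is shown below. The splitting rule (first token as given name, remainder as family name) is an assumption for illustration; the importer's actual heuristic is not documented here, and the function name `enrich_names` is hypothetical.

```python
def enrich_names(record):
    """Return a copy with first_name/last_name derived only when both are missing.

    full_name is never modified, so it stays the source truth.
    """
    enriched = dict(record)
    if enriched.get("first_name") or enriched.get("last_name"):
        return enriched  # the source supplied at least one part; leave it alone
    parts = (enriched.get("full_name") or "").split()
    if len(parts) >= 2:
        # Assumption: first token is the given name, the rest is the family name.
        enriched["first_name"] = parts[0]
        enriched["last_name"] = " ".join(parts[1:])
    return enriched
```

Single-token names are left without derived parts, since any split would be a guess.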

Import Command

Run import:

docker compose exec web python manage.py import_snapshots

Run end-to-end daily orchestration manually (extractors -> import):

docker compose exec web python manage.py run_daily_orchestration

Command behavior:

  • scans STATIC_DATASET_INCOMING_DIR for .json files
  • validates strict schema
  • computes SHA-256 checksum
  • creates ImportRun + ImportFile records
  • upserts relational entities (Competition, Season, Team, Player, PlayerSeason, PlayerSeasonStats)
  • skips duplicate content using checksum
  • moves valid files to archive
  • moves invalid files to failed
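
The checksum and file-routing steps can be sketched as follows. This is an illustration under stated assumptions, not the import_snapshots implementation: the function names are hypothetical, validation is passed in as a boolean, the set of known checksums stands in for the ImportFile records, and leaving a skipped duplicate in place is an assumption because the behavior above does not say where duplicates end up.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def route_snapshot(path, is_valid, seen_checksums, archive_dir, failed_dir):
    """Return (status, checksum) and move the file to archive or failed.

    seen_checksums is a plain set standing in for persisted import records.
    """
    path = Path(path)
    checksum = sha256_of(path)
    if checksum in seen_checksums:
        # Assumption: duplicates are skipped and left where they are.
        return "duplicate", checksum
    target_dir = Path(archive_dir if is_valid else failed_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(target_dir / path.name))
    if is_valid:
        seen_checksums.add(checksum)
    return ("imported" if is_valid else "failed"), checksum
```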

Source Identity Namespacing

Raw external IDs are not globally unique across basketball data sources. HoopScout v2 uses a namespaced identity for imported entities:

  • Competition: unique key is (source_name, source_uid)
  • Team: unique key is (source_name, source_uid)
  • Player: unique key is (source_name, source_uid)

source_uid values from different sources (for example lba and bcl) can safely overlap without overwriting each other.
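
To make the overlap case concrete, here is a small in-memory sketch of an upsert keyed by (source_name, source_uid). The `EntityRegistry` class is hypothetical and only stands in for a database table with that unique key; it does not reflect the actual Django models.

```python
class EntityRegistry:
    """In-memory stand-in for a table with a unique (source_name, source_uid) key."""

    def __init__(self):
        self._rows = {}

    def upsert(self, source_name, source_uid, **fields):
        # Same key: update in place. New key: create a new row.
        row = self._rows.setdefault((source_name, source_uid), {})
        row.update(fields)
        return row

    def __len__(self):
        return len(self._rows)


players = EntityRegistry()
players.upsert("lba", "42", full_name="Player From LBA")
players.upsert("bcl", "42", full_name="Player From BCL")  # same uid, other source
players.upsert("lba", "42", full_name="Player From LBA (updated)")
```

After these three calls the registry holds two rows: the bcl row is untouched and only the lba row was updated.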

Import history is visible in Django admin:

  • ImportRun
  • ImportFile

Extractor Framework (v2)

v2 keeps extraction and import as two separate steps:

  1. Extractors fetch public source content and emit normalized JSON snapshots.
  2. Importer (import_snapshots) validates and upserts those snapshots into PostgreSQL.

Extractor pipeline:

  • fetch (public endpoint/page requests with conservative HTTP behavior)
  • parse (source-specific structure)
  • normalize (map to HoopScout snapshot schema)
  • emit (write JSON file to incoming directory or custom path)
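
The four stages above can be sketched as one small function that takes each stage as a callable. This is a simplified illustration, not the project's extractor base class: `run_pipeline`, its parameters, and the output file naming are assumptions, and fetch is injected so the example makes no HTTP calls.

```python
import json
from pathlib import Path


def run_pipeline(fetch, parse, normalize, source_name, snapshot_date, output_dir=None):
    """Run fetch -> parse -> normalize -> emit and return (snapshot, output path).

    When output_dir is None nothing is written, similar in spirit to a dry run.
    """
    raw = fetch()                      # fetch: obtain raw source content
    rows = parse(raw)                  # parse: source-specific structure
    snapshot = {
        "source_name": source_name,
        "snapshot_date": snapshot_date,
        "records": [normalize(row) for row in rows],  # normalize: map to schema
    }
    if output_dir is None:
        return snapshot, None
    # emit: the file name pattern here is hypothetical.
    out = Path(output_dir) / f"{source_name}-{snapshot_date}.json"
    out.write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
    return snapshot, out
```

Keeping the stages as separate callables means parse and normalize can be tested against fixtures without touching the network.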

Built-in extractors in this phase:

  • public_json_snapshot (generic JSON feed extractor for MVP usage)
  • lba (Lega Basket Serie A MVP extractor)
  • bcl (Basketball Champions League MVP extractor)

Run extractor:

docker compose exec web python manage.py run_extractor public_json_snapshot

Run extractor with explicit output path (debugging):

docker compose exec web python manage.py run_extractor public_json_snapshot --output-path /app/snapshots/incoming

Dry-run validation (no file write):

docker compose exec web python manage.py run_extractor public_json_snapshot --dry-run

Run only the LBA extractor:

docker compose exec web python manage.py run_lba_extractor

Run only the BCL extractor:

docker compose exec web python manage.py run_bcl_extractor

Daily orchestration behavior

run_daily_orchestration performs:

  1. run configured extractors in order from DAILY_ORCHESTRATION_EXTRACTORS
  2. write snapshots to incoming dir
  3. run import_snapshots
  4. log extractor/import summary
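
A rough sketch of that sequence is shown below. It assumes DAILY_ORCHESTRATION_EXTRACTORS is a comma-separated list of extractor names (see .env.example for the real format), uses hypothetical function names, and omits error handling, so it does not show what the real command does when an extractor fails.

```python
def parse_extractor_list(raw):
    """Split a comma-separated value into names, preserving the configured order."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def run_daily(extractor_names, extractors, run_import, log=print):
    """Run extractors in order, then the import step, then log a summary.

    extractors maps a name to a zero-argument callable (hypothetical registry).
    """
    summary = {"extractors": {}, "import": None}
    for name in extractor_names:
        summary["extractors"][name] = extractors[name]()
    summary["import"] = run_import()
    log(summary)
    return summary
```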

Extractor environment variables:

  • EXTRACTOR_USER_AGENT
  • EXTRACTOR_HTTP_TIMEOUT_SECONDS
  • EXTRACTOR_HTTP_RETRIES
  • EXTRACTOR_RETRY_SLEEP_SECONDS
  • EXTRACTOR_REQUEST_DELAY_SECONDS
  • EXTRACTOR_PUBLIC_JSON_URL
  • EXTRACTOR_PUBLIC_SOURCE_NAME
  • EXTRACTOR_INCLUDE_RAW_PAYLOAD
  • EXTRACTOR_LBA_STATS_URL
  • EXTRACTOR_LBA_SEASON_LABEL
  • EXTRACTOR_LBA_COMPETITION_EXTERNAL_ID
  • EXTRACTOR_LBA_COMPETITION_NAME
  • EXTRACTOR_BCL_STATS_URL
  • EXTRACTOR_BCL_SEASON_LABEL
  • EXTRACTOR_BCL_COMPETITION_EXTERNAL_ID
  • EXTRACTOR_BCL_COMPETITION_NAME
  • DAILY_ORCHESTRATION_EXTRACTORS
  • DAILY_ORCHESTRATION_INTERVAL_SECONDS

Notes:

  • extraction is intentionally low-frequency and uses retries conservatively
  • only public pages/endpoints should be targeted
  • emitted snapshots must match the same schema consumed by import_snapshots
  • public_json_snapshot uses the same required-vs-optional field contract as SnapshotSchemaValidator (no stricter extractor-only required bio/physical fields)
  • optional scheduler container runs scripts/scheduler.sh loop using:
    • image: registry.younerd.org/hoopscout/scheduler:${APP_IMAGE_TAG:-latest}
    • command: /app/scripts/scheduler.sh
    • interval: DAILY_ORCHESTRATION_INTERVAL_SECONDS
    • disabled idle interval: SCHEDULER_DISABLED_SLEEP_SECONDS

Scheduler entrypoint/runtime expectations

  • scheduler uses the same app image and base entrypoint.sh as web
  • scheduler requires database connectivity and snapshot volumes
  • scheduler is disabled unless both of the following hold:
    • the compose scheduler profile is started
    • SCHEDULER_ENABLED=1
  • if scheduler service is started while disabled (SCHEDULER_ENABLED=0), it does not exit; it enters idle sleep mode to avoid restart loops with restart: unless-stopped
  • this keeps default runtime simple while supporting daily automation
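
The idle-instead-of-exit behavior can be illustrated with a short Python sketch. The real loop lives in scripts/scheduler.sh; the function names and the default values used here (300 and 86400 seconds) are assumptions for illustration only.

```python
def next_action(env):
    """Decide what one scheduler cycle does: ("run", interval) or ("idle", sleep)."""
    if env.get("SCHEDULER_ENABLED", "0") != "1":
        # Disabled: sleep instead of exiting, so the container is not restarted.
        return "idle", int(env.get("SCHEDULER_DISABLED_SLEEP_SECONDS", "300"))
    return "run", int(env.get("DAILY_ORCHESTRATION_INTERVAL_SECONDS", "86400"))


def scheduler_loop(env, run_once, sleep, max_cycles=None):
    """Loop forever by default; max_cycles exists only to make the sketch testable."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        action, seconds = next_action(env)
        if action == "run":
            run_once()  # e.g. invoke run_daily_orchestration
        sleep(seconds)
        cycles += 1
```

Passing `time.sleep` as `sleep` gives the blocking behavior; tests can pass a recording stub instead.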

LBA extractor assumptions and limitations (MVP)

  • source_name is fixed to lba
  • the extractor expects one stable public JSON payload that includes player/team/stat rows
  • competition is configured by environment and emitted as:
    • competition_external_id from EXTRACTOR_LBA_COMPETITION_EXTERNAL_ID
    • competition_name from EXTRACTOR_LBA_COMPETITION_NAME
  • season is configured by EXTRACTOR_LBA_SEASON_LABEL
  • parser supports payload keys: records, data, players, items
  • normalization supports nested player and team objects with common stat aliases (gp/mpg/ppg/rpg/apg/spg/bpg/tov)
  • public-source player bio/physical fields are often incomplete; extractor allows them to be missing and emits null for optional fields
  • no live HTTP calls in tests; tests use fixtures/mocked responses only

BCL extractor assumptions and limitations (MVP)

  • source_name is fixed to bcl
  • the extractor expects one stable public JSON payload that includes player/team/stat rows
  • competition is configured by environment and emitted as:
    • competition_external_id from EXTRACTOR_BCL_COMPETITION_EXTERNAL_ID
    • competition_name from EXTRACTOR_BCL_COMPETITION_NAME
  • season is configured by EXTRACTOR_BCL_SEASON_LABEL
  • parser supports payload keys: records, data, players, items
  • normalization supports nested player and team objects with common stat aliases (gp/mpg/ppg/rpg/apg/spg/bpg/tov)
  • public-source player bio/physical fields are often incomplete; extractor allows them to be missing and emits null for optional fields
  • no live HTTP calls in tests; tests use fixtures/mocked responses only
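
The parsing and normalization assumptions shared by the LBA and BCL extractors can be sketched as below. This is not the extractors' code: the function names, the lookup order of payload keys, and the nested field names (`id`, `name`) are assumptions chosen for illustration.

```python
PAYLOAD_KEYS = ("records", "data", "players", "items")
STAT_ALIASES = {
    "gp": "games_played", "mpg": "minutes_per_game", "ppg": "points_per_game",
    "rpg": "rebounds_per_game", "apg": "assists_per_game",
    "spg": "steals_per_game", "bpg": "blocks_per_game",
    "tov": "turnovers_per_game",
}
OPTIONAL_BIO = ("first_name", "last_name", "birth_date", "nationality",
                "height_cm", "weight_kg", "position", "role")


def extract_rows(payload):
    """Return the row list found under the first supported payload key."""
    for key in PAYLOAD_KEYS:
        rows = payload.get(key)
        if isinstance(rows, list):
            return rows
    raise ValueError(f"payload has none of the supported keys: {PAYLOAD_KEYS}")


def normalize_row(row, competition_external_id, competition_name, season):
    """Map one source row to snapshot field names; optional bio fields default to None."""
    player = row.get("player") or {}
    team = row.get("team") or {}
    record = {
        "competition_external_id": competition_external_id,
        "competition_name": competition_name,
        "season": season,
        "team_external_id": team.get("id"),      # nested key names are assumed
        "team_name": team.get("name"),
        "player_external_id": player.get("id"),
        "full_name": player.get("name"),
    }
    for field in OPTIONAL_BIO:
        record[field] = player.get(field)        # emitted as null when absent
    for alias, canonical in STAT_ALIASES.items():
        # Prefer the canonical name, fall back to the short alias.
        record[canonical] = row.get(canonical, row.get(alias))
    return record
```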

Testing

  • runtime web image stays lean and may not include pytest tooling
  • run tests with the development compose stack (or a dedicated test image/profile) where test dependencies are installed
  • local example:
docker compose -f docker-compose.yml -f docker-compose.dev.yml run --rm web pytest -q

Migration and Superuser Commands

docker compose exec web python manage.py migrate
docker compose exec web python manage.py createsuperuser

Health Endpoints

  • app health: /health/
  • nginx healthcheck proxies /health/ to web

Player Search (v2)

Public player search is server-rendered (Django templates) with HTMX partial updates.

Supported filters:

  • free text name search
  • nominal position, inferred role
  • competition, season, team
  • nationality
  • age, height, weight ranges
  • stats thresholds: games, MPG, PPG, RPG, APG, SPG, BPG, TOV, FG%, 3P%, FT%

Search correctness:

  • combined team/competition/season/stat filters are applied to the same PlayerSeason context (no cross-row false positives)
  • filtering happens at database level with Django ORM
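
The difference between same-row and cross-row matching is easiest to see on plain data. The sketch below uses dictionaries instead of the ORM, and the field names are hypothetical. In Django's ORM the analogous distinction is that conditions on a multi-valued relation given in a single filter() call must be satisfied by the same related row, while chained filter() calls may be satisfied by different rows.

```python
def matches_same_row(seasons, team, min_ppg):
    """Correct: one player-season row must satisfy both conditions."""
    return any(s["team"] == team and s["ppg"] >= min_ppg for s in seasons)


def matches_cross_row(seasons, team, min_ppg):
    """Incorrect for search: each condition may be met by a different row."""
    return (any(s["team"] == team for s in seasons)
            and any(s["ppg"] >= min_ppg for s in seasons))


# One player: 20.0 PPG with team A in one season, 8.0 PPG with team B in another.
seasons = [
    {"season": "2024-2025", "team": "A", "ppg": 20.0},
    {"season": "2025-2026", "team": "B", "ppg": 8.0},
]
```

For the filter "team B and at least 15 PPG", `matches_same_row` rejects this player while `matches_cross_row` would wrongly accept them.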

Search metric semantics:

  • result columns are labeled as Best Eligible
  • each displayed metric is MAX over eligible player-season rows for that metric in the current filter context
  • different metric columns for one player may come from different eligible seasons
  • when no eligible value exists for a metric in the current context, the UI shows -
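
The Best Eligible semantics can be sketched with plain Python over already-filtered rows. The function names and short metric keys are hypothetical; the application computes these values in the database, not in Python.

```python
def best_eligible(rows, metrics):
    """Return the per-metric maximum over eligible rows, or None when no value exists."""
    best = {}
    for metric in metrics:
        values = [row[metric] for row in rows if row.get(metric) is not None]
        best[metric] = max(values) if values else None
    return best


def display(value):
    """Render a metric the way the result table would, with '-' for no value."""
    return "-" if value is None else str(value)


eligible = [
    {"season": "2024-2025", "ppg": 20.1, "rpg": 5.0, "bpg": None},
    {"season": "2025-2026", "ppg": 18.0, "rpg": 7.5, "bpg": None},
]
best = best_eligible(eligible, ["ppg", "rpg", "bpg"])
```

Here the best PPG comes from 2024-2025 and the best RPG from 2025-2026, which is why columns for one player may come from different seasons.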

Pagination and sorting:

  • querystring is preserved
  • HTMX navigation keeps URL state in sync with current filters/page/sort
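
Preserving the querystring while changing only the page or sort can be sketched with the standard library. The helper name and the parameter names (`position`, `min_ppg`, `page`) are hypothetical and may not match the actual view.

```python
from urllib.parse import parse_qsl, urlencode


def with_params(querystring, **updates):
    """Return the querystring with the given keys replaced and all others kept."""
    params = [(key, value)
              for key, value in parse_qsl(querystring, keep_blank_values=True)
              if key not in updates]
    params += [(key, str(value)) for key, value in updates.items()]
    return urlencode(params)


next_page = with_params("position=SF&min_ppg=10&page=2", page=3)
# next_page == "position=SF&min_ppg=10&page=3"
```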

Saved Searches and Watchlist (v2)

Authenticated users can:

  • save current search filters from the player search page
  • re-run saved searches from scouting pages
  • rename/update/delete saved searches
  • update saved search filters via structured JSON in the edit screen
  • add/remove favorite players inline (HTMX-friendly) and browse watchlist

GitFlow

Required branch model:

  • main: production
  • develop: integration
  • feature/*, release/*, hotfix/*

This v2 work branch is:

  • feature/hoopscout-v2-static-architecture

Notes on Legacy Layers

Legacy provider/Celery ingestion layers are not the default runtime path for the v2 foundation. They are intentionally isolated until they are replaced by v2 snapshot ingestion commands in later tasks.