# HoopScout v2 (Foundation Reset)

HoopScout v2 is a controlled greenfield rebuild inside the existing repository.

Current v2 foundation scope in this branch:
- Django + HTMX server-rendered app
- PostgreSQL as the only primary database
- nginx reverse proxy
- management-command-driven runtime operations
- static snapshot directories persisted via Docker named volumes
- strict JSON snapshot schema + import management command

Out of scope in this step:
- extractor implementation

## Runtime Architecture (v2)

Runtime services are intentionally small:
- `web` (Django/Gunicorn)
- `postgres` (primary DB)
- `nginx` (reverse proxy + static/media serving)
- optional `scheduler` profile service (runs daily extractor/import loop)

No Redis/Celery services are part of the v2 default runtime topology.
Legacy Celery/provider code remains in the codebase (and repository history) but is de-emphasized for v2.

## Image Strategy

Compose builds and tags images as:
- `registry.younerd.org/hoopscout/web:${APP_IMAGE_TAG:-latest}`
- `registry.younerd.org/hoopscout/nginx:${NGINX_IMAGE_TAG:-latest}`

Reserved for future optional scheduler use:
- `registry.younerd.org/hoopscout/scheduler:${APP_IMAGE_TAG:-latest}`

## Entrypoint Strategy

- `web`: `entrypoint.sh`
  - waits for PostgreSQL
  - optionally runs migrations/collectstatic
  - ensures snapshot directories exist
- `nginx`: `nginx/entrypoint.sh`
  - simple runtime entrypoint wrapper

## Compose Files

- `docker-compose.yml`: production-minded baseline runtime (immutable image filesystem)
- `docker-compose.dev.yml`: development override with source bind mount for `web`
- `docker-compose.release.yml`: production settings override (`DJANGO_SETTINGS_MODULE=config.settings.production`)

### Start development runtime

```bash
cp .env.example .env
docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build
```

### Start release-style runtime

```bash
docker compose -f docker-compose.yml -f docker-compose.release.yml up -d --build
```

### Start scheduler profile (optional)

```bash
docker compose --profile scheduler up -d scheduler
```

For development override:

```bash
docker compose -f docker-compose.yml -f docker-compose.dev.yml --profile scheduler up -d scheduler
```

## Named Volumes

v2 runtime uses named volumes for persistence:
- `postgres_data`
- `static_data`
- `media_data`
- `snapshots_incoming`
- `snapshots_archive`
- `snapshots_failed`

Development override uses separate dev-prefixed volumes to avoid ownership collisions.

## Environment Variables

Use `.env.example` as the source of truth.

Core groups:
- Django runtime/security vars
- PostgreSQL connection vars
- image tag vars (`APP_IMAGE_TAG`, `NGINX_IMAGE_TAG`)
- snapshot directory vars (`STATIC_DATASET_*`)
- optional future scheduler vars (`SCHEDULER_*`)
- daily orchestration vars (`DAILY_ORCHESTRATION_*`)

## Snapshot Storage Convention

Snapshot files are expected under:
- incoming: `/app/snapshots/incoming`
- archive: `/app/snapshots/archive`
- failed: `/app/snapshots/failed`

Configured via environment:
- `STATIC_DATASET_INCOMING_DIR`
- `STATIC_DATASET_ARCHIVE_DIR`
- `STATIC_DATASET_FAILED_DIR`
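
A minimal sketch of how these variables might be resolved and the directories created. The fallback to the paths listed above and the helper names `snapshot_dirs` and `ensure_dirs` are assumptions made for illustration, not the project's actual settings code.

```python
import os
from pathlib import Path

# Assumption: when a variable is unset, fall back to the conventional paths above.
DEFAULTS = {
    "STATIC_DATASET_INCOMING_DIR": "/app/snapshots/incoming",
    "STATIC_DATASET_ARCHIVE_DIR": "/app/snapshots/archive",
    "STATIC_DATASET_FAILED_DIR": "/app/snapshots/failed",
}


def snapshot_dirs(env=None):
    """Map each variable name to the directory it resolves to."""
    env = os.environ if env is None else env
    return {name: Path(env.get(name, default)) for name, default in DEFAULTS.items()}


def ensure_dirs(dirs):
    """Create every directory (and missing parents); existing ones are left alone."""
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
```

Passing a plain dict as `env` keeps the helper easy to exercise without touching the real process environment.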

## Snapshot JSON Schema (MVP)

Each file must be a JSON object:

```json
{
  "source_name": "official_site_feed",
  "snapshot_date": "2026-03-13",
  "records": [
    {
      "competition_external_id": "comp-nba",
      "competition_name": "NBA",
      "season": "2025-2026",
      "team_external_id": "team-lal",
      "team_name": "Los Angeles Lakers",
      "player_external_id": "player-23",
      "full_name": "LeBron James",
      "first_name": "LeBron",
      "last_name": "James",
      "birth_date": "1984-12-30",
      "nationality": "US",
      "height_cm": 206,
      "weight_kg": 113,
      "position": "SF",
      "role": "Primary Creator",
      "games_played": 60,
      "minutes_per_game": 34.5,
      "points_per_game": 25.4,
      "rebounds_per_game": 7.2,
      "assists_per_game": 8.1,
      "steals_per_game": 1.3,
      "blocks_per_game": 0.7,
      "turnovers_per_game": 3.2,
      "fg_pct": 51.1,
      "three_pt_pct": 38.4,
      "ft_pct": 79.8,
      "source_metadata": {},
      "raw_payload": {}
    }
  ],
  "source_metadata": {},
  "raw_payload": {}
}
```

Validation is strict:
- unknown fields are rejected
- required fields must exist
- `snapshot_date` and `birth_date` must be `YYYY-MM-DD`
- numeric fields must be numeric
- invalid files are moved to failed directory
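
The rules above can be sketched for a single record as follows. This is an illustration only: the field set is trimmed, every listed field is treated as required, and the names `ALLOWED_FIELDS` and `validate_record` are hypothetical rather than taken from the importer.

```python
import re
from datetime import date

# Assumption: a trimmed field set; the real schema lists many more fields,
# and which of them are optional is not stated in this README.
ALLOWED_FIELDS = {"player_external_id", "full_name", "birth_date",
                  "games_played", "points_per_game"}
NUMERIC_FIELDS = {"games_played", "points_per_game"}
DATE_FIELDS = {"birth_date"}
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value) -> bool:
    """True only for zero-padded YYYY-MM-DD strings naming a real calendar day."""
    if not isinstance(value, str) or not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def is_number(value) -> bool:
    """Accept int and float but not bool, which is an int subclass in Python."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_record(record: dict) -> list:
    """Return error strings; an empty list means the record passed."""
    errors = [f"unknown field: {n}" for n in sorted(set(record) - ALLOWED_FIELDS)]
    errors += [f"missing field: {n}" for n in sorted(ALLOWED_FIELDS - set(record))]
    for name in sorted(DATE_FIELDS & set(record)):
        if not is_iso_date(record[name]):
            errors.append(f"bad date: {name}")
    for name in sorted(NUMERIC_FIELDS & set(record)):
        if not is_number(record[name]):
            errors.append(f"not numeric: {name}")
    return errors
```

Collecting all errors instead of stopping at the first one makes a failed file easier to diagnose from a single import log entry.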

## Import Command

Run import:

```bash
docker compose exec web python manage.py import_snapshots
```

Run end-to-end daily orchestration manually (extractors -> import):

```bash
docker compose exec web python manage.py run_daily_orchestration
```

Command behavior:
- scans `STATIC_DATASET_INCOMING_DIR` for `.json` files
- validates strict schema
- computes SHA-256 checksum
- creates `ImportRun` + `ImportFile` records
- upserts relational entities (`Competition`, `Season`, `Team`, `Player`, `PlayerSeason`, `PlayerSeasonStats`)
- skips duplicate content using checksum
- moves valid files to archive
- moves invalid files to failed
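
The file-handling part of this flow (checksum, duplicate skip, archive/failed moves) might look roughly like the sketch below. It omits the `ImportRun`/`ImportFile` bookkeeping and the upserts; `process_incoming`, `seen_checksums`, and the `is_valid` callback are hypothetical names, and leaving duplicates in place is an assumption, since the README does not say where skipped files go.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large snapshots are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def process_incoming(incoming, archive, failed, seen_checksums, is_valid):
    """Walk incoming/*.json once and return counts per outcome."""
    summary = {"imported": 0, "skipped": 0, "failed": 0}
    for path in sorted(incoming.glob("*.json")):
        if not is_valid(path):
            shutil.move(str(path), str(failed / path.name))
            summary["failed"] += 1
            continue
        checksum = sha256_of(path)
        if checksum in seen_checksums:
            # Assumption: duplicate content is counted and left where it is.
            summary["skipped"] += 1
            continue
        # The real command would upsert relational records at this point.
        seen_checksums.add(checksum)
        shutil.move(str(path), str(archive / path.name))
        summary["imported"] += 1
    return summary
```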

### Source Identity Namespacing

Raw external IDs are **not globally unique** across basketball data sources. HoopScout v2 uses a namespaced identity for imported entities:
- `Competition`: unique key is `(source_name, source_uid)`
- `Team`: unique key is `(source_name, source_uid)`
- `Player`: unique key is `(source_name, source_uid)`

`source_uid` values from different sources (for example `lba` and `bcl`) can safely overlap without overwriting each other.
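
To illustrate why the composite key matters, here is a small in-memory sketch. It stands in for the database constraint and is not the actual Django model; `PlayerStore` and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Player:
    source_name: str
    source_uid: str
    full_name: str


class PlayerStore:
    """In-memory stand-in for a table with a unique (source_name, source_uid) key."""

    def __init__(self):
        self._by_key = {}

    def upsert(self, source_name, source_uid, full_name):
        key = (source_name, source_uid)
        player = self._by_key.get(key)
        if player is None:
            # New identity within this source: insert.
            player = Player(source_name, source_uid, full_name)
            self._by_key[key] = player
        else:
            # Same source and uid: update in place.
            player.full_name = full_name
        return player

    def __len__(self):
        return len(self._by_key)
```

With this key, `upsert("lba", "23", ...)` and `upsert("bcl", "23", ...)` produce two distinct players, while repeating the `lba` call only updates the existing one.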

Import history is visible in Django admin:
- `ImportRun`
- `ImportFile`

## Extractor Framework (v2)

v2 keeps extraction and import as two separate steps:

1. **Extractors** fetch public source content and emit normalized JSON snapshots.
2. **Importer** (`import_snapshots`) validates and upserts those snapshots into PostgreSQL.

Extractor pipeline:
- `fetch` (public endpoint/page requests with conservative HTTP behavior)
- `parse` (source-specific structure)
- `normalize` (map to HoopScout snapshot schema)
- `emit` (write JSON file to incoming directory or custom path)
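
The four stages can be sketched as one function that takes the source-specific steps as callables. This is an assumption-laden outline, not the framework's real base class: `run_pipeline`, its parameters, and the output file naming are invented for illustration.

```python
import json
from pathlib import Path


def run_pipeline(fetch, parse, normalize, source_name, snapshot_date, output_dir=None):
    """Run fetch -> parse -> normalize, then emit only if output_dir is given."""
    raw = fetch()                                # fetch: get the raw source content
    rows = parse(raw)                            # parse: source-specific structure -> rows
    records = [normalize(row) for row in rows]   # normalize: rows -> snapshot records
    snapshot = {
        "source_name": source_name,
        "snapshot_date": snapshot_date,
        "records": records,
        "source_metadata": {},
        "raw_payload": {},
    }
    if output_dir is not None:
        # Assumption: this file naming scheme exists only in this sketch.
        target = Path(output_dir) / f"{source_name}-{snapshot_date}.json"
        target.write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
    return snapshot
```

Calling it without `output_dir` returns the snapshot without writing a file, which loosely mirrors a dry run.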

Built-in extractors in this phase:
- `public_json_snapshot` (generic JSON feed extractor for MVP usage)
- `lba` (Lega Basket Serie A MVP extractor)
- `bcl` (Basketball Champions League MVP extractor)

Run extractor:

```bash
docker compose exec web python manage.py run_extractor public_json_snapshot
```

Run extractor with explicit output path (debugging):

```bash
docker compose exec web python manage.py run_extractor public_json_snapshot --output-path /app/snapshots/incoming
```

Dry-run validation (no file write):

```bash
docker compose exec web python manage.py run_extractor public_json_snapshot --dry-run
```

Run only the LBA extractor:

```bash
docker compose exec web python manage.py run_lba_extractor
```

Run only the BCL extractor:

```bash
docker compose exec web python manage.py run_bcl_extractor
```

### Daily orchestration behavior

`run_daily_orchestration` performs these steps in order:
1. run configured extractors in order from `DAILY_ORCHESTRATION_EXTRACTORS`
2. write snapshots to incoming dir
3. run `import_snapshots`
4. log extractor/import summary
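
A rough sketch of that sequence follows. Two things here are assumptions rather than documented behavior: that `DAILY_ORCHESTRATION_EXTRACTORS` is a comma-separated list, and that a failing extractor is recorded without aborting the run. The function names are hypothetical.

```python
def parse_extractor_names(raw: str) -> list:
    """Split a comma-separated value, dropping blanks and surrounding whitespace."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def run_daily(extractor_names, run_extractor, run_import):
    """Run each extractor in the configured order, then the import step once."""
    summary = {"extractors": {}, "import": None}
    for name in extractor_names:
        try:
            run_extractor(name)
            summary["extractors"][name] = "ok"
        except Exception as exc:
            # Assumption: one failing source should not block the others.
            summary["extractors"][name] = f"failed: {exc}"
    summary["import"] = run_import()
    return summary
```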

Extractor environment variables:
- `EXTRACTOR_USER_AGENT`
- `EXTRACTOR_HTTP_TIMEOUT_SECONDS`
- `EXTRACTOR_HTTP_RETRIES`
- `EXTRACTOR_RETRY_SLEEP_SECONDS`
- `EXTRACTOR_REQUEST_DELAY_SECONDS`
- `EXTRACTOR_PUBLIC_JSON_URL`
- `EXTRACTOR_PUBLIC_SOURCE_NAME`
- `EXTRACTOR_INCLUDE_RAW_PAYLOAD`
- `EXTRACTOR_LBA_STATS_URL`
- `EXTRACTOR_LBA_SEASON_LABEL`
- `EXTRACTOR_LBA_COMPETITION_EXTERNAL_ID`
- `EXTRACTOR_LBA_COMPETITION_NAME`
- `EXTRACTOR_BCL_STATS_URL`
- `EXTRACTOR_BCL_SEASON_LABEL`
- `EXTRACTOR_BCL_COMPETITION_EXTERNAL_ID`
- `EXTRACTOR_BCL_COMPETITION_NAME`
- `DAILY_ORCHESTRATION_EXTRACTORS`
- `DAILY_ORCHESTRATION_INTERVAL_SECONDS`

Notes:
- extraction is intentionally low-frequency and uses retries conservatively
- only public pages/endpoints should be targeted
- emitted snapshots must match the same schema consumed by `import_snapshots`
- the optional scheduler container runs the `scripts/scheduler.sh` loop with:
  - image: `registry.younerd.org/hoopscout/scheduler:${APP_IMAGE_TAG:-latest}`
  - command: `/app/scripts/scheduler.sh`
  - interval: `DAILY_ORCHESTRATION_INTERVAL_SECONDS`
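
For the conservative retry behavior mentioned in the first note, a minimal sketch is shown below. It assumes `EXTRACTOR_HTTP_RETRIES` counts extra attempts after the first and `EXTRACTOR_RETRY_SLEEP_SECONDS` is a fixed pause between attempts; neither meaning is confirmed here, and `fetch_with_retries` is a hypothetical helper.

```python
import time


def fetch_with_retries(fetch, retries, sleep_seconds, sleep=time.sleep):
    """Call fetch() up to retries + 1 times, pausing between failed attempts."""
    attempts = retries + 1
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as exc:
            # urllib's URLError and socket timeouts are OSError subclasses.
            last_error = exc
            if attempt < attempts - 1:
                sleep(sleep_seconds)
    raise last_error
```

The `sleep` parameter is injectable so tests can run without real delays.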

### Scheduler entrypoint/runtime expectations

- scheduler uses the same app image and base `entrypoint.sh` as web
- scheduler requires database connectivity and snapshot volumes
- scheduler stays disabled unless both conditions hold:
  - compose `scheduler` profile is started
  - `SCHEDULER_ENABLED=1`
- this keeps default runtime simple while supporting daily automation

### LBA extractor assumptions and limitations (MVP)

- `source_name` is fixed to `lba`
- the extractor expects one stable public JSON payload that includes player/team/stat rows
- competition is configured by environment and emitted as:
  - `competition_external_id` from `EXTRACTOR_LBA_COMPETITION_EXTERNAL_ID`
  - `competition_name` from `EXTRACTOR_LBA_COMPETITION_NAME`
- season is configured by `EXTRACTOR_LBA_SEASON_LABEL`
- parser supports payload keys: `records`, `data`, `players`, `items`
- normalization supports nested `player` and `team` objects with common stat aliases (`gp/mpg/ppg/rpg/apg/spg/bpg/tov`)
- no live HTTP calls in tests; tests use fixtures/mocked responses only
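
The payload-key lookup and stat aliasing described above could be sketched like this. The key precedence order and the `id`/`name` fields inside the nested `player` and `team` objects are assumptions; the real payload layout is not documented in this README.

```python
PAYLOAD_KEYS = ("records", "data", "players", "items")
STAT_ALIASES = {
    "gp": "games_played",
    "mpg": "minutes_per_game",
    "ppg": "points_per_game",
    "rpg": "rebounds_per_game",
    "apg": "assists_per_game",
    "spg": "steals_per_game",
    "bpg": "blocks_per_game",
    "tov": "turnovers_per_game",
}


def extract_rows(payload: dict) -> list:
    """Return the first list found under one of the supported keys."""
    for key in PAYLOAD_KEYS:  # assumption: checked in this order
        rows = payload.get(key)
        if isinstance(rows, list):
            return rows
    raise ValueError("no supported row container found in payload")


def normalize_row(row: dict) -> dict:
    """Flatten nested player/team objects and map stat aliases to schema names."""
    record = {
        # Assumption: nested objects expose "id" and "name".
        "player_external_id": str(row["player"]["id"]),
        "full_name": row["player"]["name"],
        "team_external_id": str(row["team"]["id"]),
        "team_name": row["team"]["name"],
    }
    for alias, field in STAT_ALIASES.items():
        if field in row:
            record[field] = row[field]
        elif alias in row:
            record[field] = row[alias]
    return record
```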

### BCL extractor assumptions and limitations (MVP)

- `source_name` is fixed to `bcl`
- the extractor expects one stable public JSON payload that includes player/team/stat rows
- competition is configured by environment and emitted as:
  - `competition_external_id` from `EXTRACTOR_BCL_COMPETITION_EXTERNAL_ID`
  - `competition_name` from `EXTRACTOR_BCL_COMPETITION_NAME`
- season is configured by `EXTRACTOR_BCL_SEASON_LABEL`
- parser supports payload keys: `records`, `data`, `players`, `items`
- normalization supports nested `player` and `team` objects with common stat aliases (`gp/mpg/ppg/rpg/apg/spg/bpg/tov`)
- no live HTTP calls in tests; tests use fixtures/mocked responses only
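
The "no live HTTP calls" rule (shared by the LBA and BCL tests) can be illustrated with `unittest.mock`. The sketch assumes the fetch step goes through `urllib.request.urlopen`, which may not match the project's HTTP client; `fetch_json` and the URL are hypothetical.

```python
import json
import urllib.request
from unittest import mock


def fetch_json(url, timeout=10.0):
    """Hypothetical fetch helper: GET a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def test_fetch_json_uses_fixture():
    """Replace urlopen with a mock that serves fixture bytes, so no request is sent."""
    fixture = b'{"items": [{"player": {"id": 1}}]}'
    with mock.patch("urllib.request.urlopen") as fake_urlopen:
        # urlopen(...) is used as a context manager, so configure __enter__.
        fake_urlopen.return_value.__enter__.return_value.read.return_value = fixture
        payload = fetch_json("https://example.invalid/bcl.json")
    fake_urlopen.assert_called_once_with("https://example.invalid/bcl.json", timeout=10.0)
    return payload
```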

## Migration and Superuser Commands

```bash
docker compose exec web python manage.py migrate
docker compose exec web python manage.py createsuperuser
```

## Health Endpoints

- app health: `/health/`
- nginx healthcheck proxies `/health/` to `web`

## Player Search (v2)

Public player search is server-rendered (Django templates) with HTMX partial updates.

Supported filters:
- free text name search
- nominal position, inferred role
- competition, season, team
- nationality
- age, height, weight ranges
- stats thresholds: games, MPG, PPG, RPG, APG, SPG, BPG, TOV, FG%, 3P%, FT%

Search correctness:
- combined team/competition/season/stat filters are applied to the same `PlayerSeason` context (no cross-row false positives)
- filtering happens at database level with Django ORM
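
To show what a cross-row false positive looks like, the plain-Python sketch below compares the two behaviors on made-up rows. It is an illustration of the semantics, not the ORM query itself.

```python
# Made-up player-season rows for illustration only.
ROWS = [
    {"player": "A", "team": "Milano", "season": "2024-2025", "ppg": 9.0},
    {"player": "A", "team": "Bologna", "season": "2025-2026", "ppg": 18.0},
    {"player": "B", "team": "Milano", "season": "2025-2026", "ppg": 16.5},
]


def same_row_match(rows, team, min_ppg):
    """Both conditions must hold on the same player-season row."""
    return sorted({r["player"] for r in rows
                   if r["team"] == team and r["ppg"] >= min_ppg})


def cross_row_match(rows, team, min_ppg):
    """Flawed variant: each condition may be satisfied by a different row."""
    on_team = {r["player"] for r in rows if r["team"] == team}
    scorers = {r["player"] for r in rows if r["ppg"] >= min_ppg}
    return sorted(on_team & scorers)
```

For `team="Milano"` and `min_ppg=15`, the same-row version returns only `B`, while the cross-row version also returns `A`, who scored 18 PPG for a different team. In the Django ORM, conditions passed to a single `filter()` call across a multi-valued relation apply to the same related row, whereas chained `filter()` calls may each match a different one.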

Search metric semantics:
- result columns are labeled as **Best Eligible**
- each displayed metric is `MAX` over eligible player-season rows for that metric in the current filter context
- different metric columns for one player may come from different eligible seasons
- when no eligible value exists for a metric in the current context, the UI shows `-`
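
The per-metric `MAX` semantics can be sketched in plain Python as below. The rows are assumed to be already filtered to the eligible player-seasons of one player; `best_eligible` and `display` are hypothetical helpers.

```python
def best_eligible(rows, metrics):
    """For each metric, take the maximum over rows that have a value for it."""
    best = {}
    for metric in metrics:
        values = [row[metric] for row in rows if row.get(metric) is not None]
        best[metric] = max(values) if values else None
    return best


def display(value):
    """Render a metric for the results table; missing values show as '-'."""
    return "-" if value is None else f"{value:.1f}"
```

With two eligible seasons of `{"ppg": 18.0, "rpg": 4.0}` and `{"ppg": 12.0, "rpg": 7.5}`, the result is 18.0 PPG and 7.5 RPG, each taken from a different season.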

Pagination and sorting:
- querystring is preserved
- HTMX navigation keeps URL state in sync with current filters/page/sort
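
Preserving the querystring across page or sort changes might be done with a small helper like the one below. The parameter names (`position`, `min_ppg`, `page`) are hypothetical, and the sketch collapses repeated keys into one value, which may not suit multi-select filters.

```python
from urllib.parse import parse_qsl, urlencode


def with_params(querystring: str, **updates) -> str:
    """Return the querystring with the given parameters added or replaced."""
    params = dict(parse_qsl(querystring, keep_blank_values=True))
    params.update({key: str(value) for key, value in updates.items()})
    return urlencode(params)
```

For example, `with_params("position=SF&min_ppg=10&page=1", page=2)` keeps both filters and changes only the page.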

## Saved Searches and Watchlist (v2)

Authenticated users can:
- save current search filters from the player search page
- re-run saved searches from scouting pages
- rename/update/delete saved searches
- update saved search filters via structured JSON in the edit screen
- add/remove favorite players inline (HTMX-friendly) and browse watchlist

## GitFlow

Required branch model:
- `main`: production
- `develop`: integration
- `feature/*`, `release/*`, `hotfix/*`

This v2 work branch is:
- `feature/hoopscout-v2-static-architecture`

## Notes on Legacy Layers

Legacy provider/Celery ingestion layers are not the default runtime path for v2 foundation.
They are intentionally isolated until replaced by v2 snapshot ingestion commands in later tasks.