bisco/hoopscout

Alfredo Di Stasio 48a82e812a fix(v2-ingestion): align public schema realism follow-ups

2026-03-20 15:23:43 +01:00

HoopScout v2 (Foundation Reset)

HoopScout v2 is a controlled greenfield rebuild inside the existing repository.

Current v2 foundation scope in this branch:

Django + HTMX server-rendered app
PostgreSQL as the only primary database
nginx reverse proxy
management-command-driven runtime operations
static snapshot directories persisted via Docker named volumes
strict JSON snapshot schema + import management command

Out of scope in this step:

extractor implementation

Runtime Architecture (v2)

Runtime services are intentionally small:

web (Django/Gunicorn)
postgres (primary DB)
nginx (reverse proxy + static/media serving)
optional scheduler profile service (runs daily extractor/import loop)

No Redis or Celery services are part of the default v2 runtime topology. Legacy Celery/provider code still exists in the repository (history and codebase) but is de-emphasized for v2.

Image Strategy

Compose builds and tags images as:

registry.younerd.org/hoopscout/web:${APP_IMAGE_TAG:-latest}
registry.younerd.org/hoopscout/nginx:${NGINX_IMAGE_TAG:-latest}

Reserved for future optional scheduler use:

registry.younerd.org/hoopscout/scheduler:${APP_IMAGE_TAG:-latest}

Entrypoint Strategy

web: entrypoint.sh
- waits for PostgreSQL
- optionally runs migrations/collectstatic
- ensures snapshot directories exist
nginx: nginx/entrypoint.sh
- simple runtime entrypoint wrapper

Compose Files

docker-compose.yml: production-minded baseline runtime (immutable image filesystem)
docker-compose.dev.yml: development override with source bind mount for web
docker-compose.release.yml: production settings override (DJANGO_SETTINGS_MODULE=config.settings.production)

Start development runtime

cp .env.example .env
docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build

Start release-style runtime

docker compose -f docker-compose.yml -f docker-compose.release.yml up -d --build

Start scheduler profile (optional)

docker compose --profile scheduler up -d scheduler

For development override:

docker compose -f docker-compose.yml -f docker-compose.dev.yml --profile scheduler up -d scheduler

Named Volumes

v2 runtime uses named volumes for persistence:

postgres_data
static_data
media_data
snapshots_incoming
snapshots_archive
snapshots_failed

Development override uses separate dev-prefixed volumes to avoid ownership collisions.

Environment Variables

Use .env.example as the source of truth.

Core groups:

Django runtime/security vars
PostgreSQL connection vars
image tag vars (APP_IMAGE_TAG, NGINX_IMAGE_TAG)
snapshot directory vars (STATIC_DATASET_*)
optional future scheduler vars (SCHEDULER_*)
daily orchestration vars (DAILY_ORCHESTRATION_*)

Snapshot Storage Convention

Snapshot files are expected under:

incoming: /app/snapshots/incoming
archive: /app/snapshots/archive
failed: /app/snapshots/failed

Configured via environment:

STATIC_DATASET_INCOMING_DIR
STATIC_DATASET_ARCHIVE_DIR
STATIC_DATASET_FAILED_DIR
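
A minimal sketch of how these variables could be resolved at runtime. The helper name `resolve_snapshot_dirs` is hypothetical, and falling back to the paths listed above when a variable is unset is an assumption; .env.example remains the source of truth.

```python
import os
from pathlib import Path

# Fallbacks mirror the paths documented above (assumed defaults).
DEFAULT_DIRS = {
    "STATIC_DATASET_INCOMING_DIR": "/app/snapshots/incoming",
    "STATIC_DATASET_ARCHIVE_DIR": "/app/snapshots/archive",
    "STATIC_DATASET_FAILED_DIR": "/app/snapshots/failed",
}


def resolve_snapshot_dirs(env=None):
    """Return {variable name: Path}, using the default when a variable is unset or empty."""
    env = os.environ if env is None else env
    return {
        name: Path(env.get(name) or default)
        for name, default in DEFAULT_DIRS.items()
    }
```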

Snapshot JSON Schema (MVP)

Each file must be a JSON object:

{
  "source_name": "official_site_feed",
  "snapshot_date": "2026-03-13",
  "records": [
    {
      "competition_external_id": "comp-nba",
      "competition_name": "NBA",
      "season": "2025-2026",
      "team_external_id": "team-lal",
      "team_name": "Los Angeles Lakers",
      "player_external_id": "player-23",
      "full_name": "LeBron James",
      "first_name": "LeBron",
      "last_name": "James",
      "birth_date": "1984-12-30",
      "nationality": "US",
      "height_cm": 206,
      "weight_kg": 113,
      "position": "SF",
      "role": "Primary Creator",
      "games_played": 60,
      "minutes_per_game": 34.5,
      "points_per_game": 25.4,
      "rebounds_per_game": 7.2,
      "assists_per_game": 8.1,
      "steals_per_game": 1.3,
      "blocks_per_game": 0.7,
      "turnovers_per_game": 3.2,
      "fg_pct": 51.1,
      "three_pt_pct": 38.4,
      "ft_pct": 79.8,
      "source_metadata": {},
      "raw_payload": {}
    }
  ],
  "source_metadata": {},
  "raw_payload": {}
}

Validation is strict:

unknown fields are rejected
required fields must exist:
- competition_external_id, competition_name, season
- team_external_id, team_name
- player_external_id, full_name
- core stats (games_played, minutes_per_game, points_per_game, rebounds_per_game, assists_per_game, steals_per_game, blocks_per_game, turnovers_per_game, fg_pct, three_pt_pct, ft_pct)
optional player bio/physical fields:
- first_name, last_name, birth_date, nationality, height_cm, weight_kg, position, role
when birth_date is provided it must be YYYY-MM-DD
numeric fields must be numeric
invalid files are moved to failed directory
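
The record-level rules above can be sketched in plain Python. This is an illustration only, not the project's SnapshotSchemaValidator. The function name `validate_record` is hypothetical, and treating `source_metadata` / `raw_payload` as allowed optional keys and `null` as acceptable for optional fields are assumptions.

```python
import re
from datetime import datetime

STATS = {
    "games_played", "minutes_per_game", "points_per_game", "rebounds_per_game",
    "assists_per_game", "steals_per_game", "blocks_per_game",
    "turnovers_per_game", "fg_pct", "three_pt_pct", "ft_pct",
}
REQUIRED = STATS | {
    "competition_external_id", "competition_name", "season",
    "team_external_id", "team_name", "player_external_id", "full_name",
}
OPTIONAL = {
    "first_name", "last_name", "birth_date", "nationality",
    "height_cm", "weight_kg", "position", "role",
    "source_metadata", "raw_payload",  # assumed to be allowed per record
}
NUMERIC = STATS | {"height_cm", "weight_kg"}
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def _is_iso_date(value):
    """True only for a zero-padded, calendar-valid YYYY-MM-DD string."""
    if not isinstance(value, str) or not DATE_RE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def validate_record(record):
    """Return a list of error strings; an empty list means the record passes."""
    errors = []
    fields = set(record)
    unknown = fields - REQUIRED - OPTIONAL
    if unknown:
        errors.append(f"unknown fields: {sorted(unknown)}")
    missing = REQUIRED - fields
    if missing:
        errors.append(f"missing required fields: {sorted(missing)}")
    for name in sorted(NUMERIC & fields):
        value = record[name]
        if value is None and name in OPTIONAL:
            continue  # optional numeric fields may be null (assumption)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{name} must be numeric")
    birth_date = record.get("birth_date")
    if birth_date is not None and not _is_iso_date(birth_date):
        errors.append("birth_date must be YYYY-MM-DD")
    return errors
```

A file whose records produce any error would then be routed to the failed directory by the import command.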

Importer enrichment note:

full_name is source truth for identity display
first_name / last_name are optional and may be absent in public snapshots
when both are missing, importer may derive them from full_name as a best-effort enrichment step
this enrichment is convenience-only and does not override source truth semantics
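
A minimal sketch of that best-effort derivation. The function name `derive_names` and the split rule (first token as first name, remainder as last name) are assumptions; the README does not state how the importer splits names.

```python
def derive_names(full_name, first_name=None, last_name=None):
    """Derive (first_name, last_name) only when the source supplied neither."""
    if first_name or last_name:
        # The source provided at least one part: keep its values untouched.
        return first_name, last_name
    parts = full_name.split()
    if len(parts) < 2:
        # A single-token name cannot be split reliably; leave both empty.
        return None, None
    return parts[0], " ".join(parts[1:])
```

In this sketch `full_name` is never modified, which matches the rule that it stays the source truth for display.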

Import Command

Run import:

docker compose exec web python manage.py import_snapshots

Run end-to-end daily orchestration manually (extractors -> import):

docker compose exec web python manage.py run_daily_orchestration

Command behavior:

scans STATIC_DATASET_INCOMING_DIR for .json files
validates strict schema
computes SHA-256 checksum
creates ImportRun + ImportFile records
upserts relational entities (Competition, Season, Team, Player, PlayerSeason, PlayerSeasonStats)
skips duplicate content using checksum
moves valid files to archive
moves invalid files to failed
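
The file-handling part of this flow (scan, checksum, dedupe, move) can be sketched as follows. This is not the actual import_snapshots implementation: the function names are hypothetical, the database steps (ImportRun/ImportFile creation and upserts) are omitted, and leaving duplicate files in place is an assumption because the README does not say where they go.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path):
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def process_incoming(incoming, archive, failed, seen_checksums, validate):
    """Return {file name: outcome}; `validate` maps a parsed payload to a list of errors."""
    outcomes = {}
    for path in sorted(Path(incoming).glob("*.json")):
        checksum = sha256_of(path)
        if checksum in seen_checksums:
            outcomes[path.name] = "duplicate"  # left in place (assumption)
            continue
        try:
            payload = json.loads(path.read_text(encoding="utf-8"))
            errors = validate(payload)
        except ValueError as exc:  # covers malformed JSON and bad encoding
            errors = [str(exc)]
        target = Path(failed) if errors else Path(archive)
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target / path.name))
        if not errors:
            seen_checksums.add(checksum)
        outcomes[path.name] = "failed" if errors else "archived"
    return outcomes
```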

Source Identity Namespacing

Raw external IDs are not globally unique across basketball data sources. HoopScout v2 uses a namespaced identity for imported entities:

Competition: unique key is (source_name, source_uid)
Team: unique key is (source_name, source_uid)
Player: unique key is (source_name, source_uid)

source_uid values from different sources (for example lba and bcl) can safely overlap without overwriting each other.
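
To illustrate why the composite key matters, here is a small in-memory sketch. The real uniqueness is enforced on the Django models in PostgreSQL; the dictionary store and the `upsert_player` helper below are hypothetical stand-ins.

```python
def upsert_player(store, source_name, source_uid, full_name):
    """Insert or update a player keyed by (source_name, source_uid); return True if created."""
    key = (source_name, source_uid)
    created = key not in store
    store[key] = {"full_name": full_name}
    return created


players = {}
upsert_player(players, "lba", "101", "Player From LBA")
upsert_player(players, "bcl", "101", "Player From BCL")  # same uid, different source
# Both rows survive because the source name is part of the key.
```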

Import history is visible in Django admin:

ImportRun
ImportFile

Extractor Framework (v2)

v2 keeps extraction and import as two separate steps:

Extractors fetch public source content and emit normalized JSON snapshots.
Importer (import_snapshots) validates and upserts those snapshots into PostgreSQL.

Extractor pipeline:

fetch (public endpoint/page requests with conservative HTTP behavior)
parse (source-specific structure)
normalize (map to HoopScout snapshot schema)
emit (write JSON file to incoming directory or custom path)
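
The four stages can be sketched as one small class. This is an illustrative outline, not the project's extractor base class: `SketchExtractor`, the injected `fetch` callable, the input keys (`players`, `id`, `name`), and the output file name are all assumptions, and the emitted records are abbreviated to two fields for brevity.

```python
import json
from datetime import date
from pathlib import Path


class SketchExtractor:
    source_name = "example_source"  # hypothetical source name

    def __init__(self, fetch):
        # `fetch` is any callable returning raw text; injecting it keeps
        # the sketch free of live HTTP calls.
        self._fetch = fetch

    def fetch(self):
        return self._fetch()

    def parse(self, raw):
        # Source-specific structure: this sketch expects {"players": [...]}.
        return json.loads(raw)["players"]

    def normalize(self, rows):
        # Map source rows to snapshot field names (abbreviated record).
        return [
            {"player_external_id": str(row["id"]), "full_name": row["name"]}
            for row in rows
        ]

    def emit(self, records, output_dir, snapshot_date):
        snapshot = {
            "source_name": self.source_name,
            "snapshot_date": snapshot_date.isoformat(),
            "records": records,
        }
        path = Path(output_dir) / f"{self.source_name}-{snapshot_date.isoformat()}.json"
        path.write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
        return path

    def run(self, output_dir, snapshot_date, dry_run=False):
        records = self.normalize(self.parse(self.fetch()))
        if dry_run:
            return None  # fetch, parse, and normalize ran; nothing is written
        return self.emit(records, output_dir, snapshot_date)
```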

Built-in extractors in this phase:

public_json_snapshot (generic JSON feed extractor for MVP usage)
lba (Lega Basket Serie A MVP extractor)
bcl (Basketball Champions League MVP extractor)

Run extractor:

docker compose exec web python manage.py run_extractor public_json_snapshot

Run extractor with explicit output path (debugging):

docker compose exec web python manage.py run_extractor public_json_snapshot --output-path /app/snapshots/incoming

Dry-run validation (no file write):

docker compose exec web python manage.py run_extractor public_json_snapshot --dry-run

Run only the LBA extractor:

docker compose exec web python manage.py run_lba_extractor

Run only the BCL extractor:

docker compose exec web python manage.py run_bcl_extractor

Daily orchestration behavior

run_daily_orchestration performs:

run configured extractors in order from DAILY_ORCHESTRATION_EXTRACTORS
write snapshots to incoming dir
run import_snapshots
log extractor/import summary
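
A minimal sketch of that sequence. It is not the actual management command: the function names, the `registry` mapping of extractor names to callables, and the comma-separated format assumed for DAILY_ORCHESTRATION_EXTRACTORS are all assumptions, and failure handling is left out.

```python
def parse_extractor_list(value):
    """Split a comma-separated setting into an ordered list of extractor names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def run_daily_orchestration(extractor_names, registry, import_snapshots, log=print):
    """Run extractors in the configured order, then the import, then log a summary."""
    summary = {"extractors": {}, "import": None}
    for name in extractor_names:
        # Each registry entry is a callable that writes a snapshot to the
        # incoming directory and returns a short result.
        summary["extractors"][name] = registry[name]()
    summary["import"] = import_snapshots()
    log(summary)
    return summary
```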

Extractor environment variables:

EXTRACTOR_USER_AGENT
EXTRACTOR_HTTP_TIMEOUT_SECONDS
EXTRACTOR_HTTP_RETRIES
EXTRACTOR_RETRY_SLEEP_SECONDS
EXTRACTOR_REQUEST_DELAY_SECONDS
EXTRACTOR_PUBLIC_JSON_URL
EXTRACTOR_PUBLIC_SOURCE_NAME
EXTRACTOR_INCLUDE_RAW_PAYLOAD
EXTRACTOR_LBA_STATS_URL
EXTRACTOR_LBA_SEASON_LABEL
EXTRACTOR_LBA_COMPETITION_EXTERNAL_ID
EXTRACTOR_LBA_COMPETITION_NAME
EXTRACTOR_BCL_STATS_URL
EXTRACTOR_BCL_SEASON_LABEL
EXTRACTOR_BCL_COMPETITION_EXTERNAL_ID
EXTRACTOR_BCL_COMPETITION_NAME
DAILY_ORCHESTRATION_EXTRACTORS
DAILY_ORCHESTRATION_INTERVAL_SECONDS

Notes:

extraction is intentionally low-frequency and uses retries conservatively
only public pages/endpoints should be targeted
emitted snapshots must match the same schema consumed by import_snapshots
public_json_snapshot uses the same required-vs-optional field contract as SnapshotSchemaValidator (no stricter extractor-only required bio/physical fields)
optional scheduler container runs scripts/scheduler.sh loop using:
- image: registry.younerd.org/hoopscout/scheduler:${APP_IMAGE_TAG:-latest}
- command: /app/scripts/scheduler.sh
- interval: DAILY_ORCHESTRATION_INTERVAL_SECONDS
- disabled idle interval: SCHEDULER_DISABLED_SLEEP_SECONDS

Scheduler entrypoint/runtime expectations

scheduler uses the same app image and base entrypoint.sh as web
scheduler requires database connectivity and snapshot volumes
scheduler runs only when both conditions hold:
- compose scheduler profile is started
- SCHEDULER_ENABLED=1
if scheduler service is started while disabled (SCHEDULER_ENABLED=0), it does not exit; it enters idle sleep mode to avoid restart loops with restart: unless-stopped
this keeps default runtime simple while supporting daily automation
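
The enable/idle decision can be restated as a small function. The real loop lives in scripts/scheduler.sh (a shell script); this Python version only mirrors the logic described above, and the function name and both default durations are assumptions.

```python
def next_action(env):
    """Return (action, sleep_seconds) for one iteration of the scheduler loop."""
    if env.get("SCHEDULER_ENABLED", "0") != "1":
        # Disabled: stay alive and sleep so the container does not exit and
        # trigger a restart loop under `restart: unless-stopped`.
        return "idle", int(env.get("SCHEDULER_DISABLED_SLEEP_SECONDS", "300"))
    # Enabled: run the daily orchestration, then wait one full interval.
    return "run", int(env.get("DAILY_ORCHESTRATION_INTERVAL_SECONDS", "86400"))
```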

LBA extractor assumptions and limitations (MVP)

source_name is fixed to lba
the extractor expects one stable public JSON payload that includes player/team/stat rows
competition is configured by environment and emitted as:
- competition_external_id from EXTRACTOR_LBA_COMPETITION_EXTERNAL_ID
- competition_name from EXTRACTOR_LBA_COMPETITION_NAME
season is configured by EXTRACTOR_LBA_SEASON_LABEL
parser supports payload keys: records, data, players, items
normalization supports nested player and team objects with common stat aliases (gp/mpg/ppg/rpg/apg/spg/bpg/tov)
public-source player bio/physical fields are often incomplete; extractor allows them to be missing and emits null for optional fields
no live HTTP calls in tests; tests use fixtures/mocked responses only

BCL extractor assumptions and limitations (MVP)

source_name is fixed to bcl
the extractor expects one stable public JSON payload that includes player/team/stat rows
competition is configured by environment and emitted as:
- competition_external_id from EXTRACTOR_BCL_COMPETITION_EXTERNAL_ID
- competition_name from EXTRACTOR_BCL_COMPETITION_NAME
season is configured by EXTRACTOR_BCL_SEASON_LABEL
parser supports payload keys: records, data, players, items
normalization supports nested player and team objects with common stat aliases (gp/mpg/ppg/rpg/apg/spg/bpg/tov)
public-source player bio/physical fields are often incomplete; extractor allows them to be missing and emits null for optional fields
no live HTTP calls in tests; tests use fixtures/mocked responses only
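
The payload-key lookup and stat-alias mapping shared by the LBA and BCL extractors can be sketched as below. The payload keys and aliases come from the lists above; the function names, the key precedence order, and the nested field names (`id`, `name`, `height_cm`) are assumptions, and only a subset of snapshot fields is emitted.

```python
PAYLOAD_KEYS = ("records", "data", "players", "items")
STAT_ALIASES = {
    "gp": "games_played",
    "mpg": "minutes_per_game",
    "ppg": "points_per_game",
    "rpg": "rebounds_per_game",
    "apg": "assists_per_game",
    "spg": "steals_per_game",
    "bpg": "blocks_per_game",
    "tov": "turnovers_per_game",
}


def extract_rows(payload):
    """Return the first list found under one of the supported payload keys."""
    for key in PAYLOAD_KEYS:
        rows = payload.get(key)
        if isinstance(rows, list):
            return rows
    raise ValueError(f"payload has none of the supported keys: {PAYLOAD_KEYS}")


def normalize_row(row):
    """Flatten nested player/team objects and resolve stat aliases."""
    player = row.get("player") or {}
    team = row.get("team") or {}
    record = {
        "player_external_id": str(player["id"]),
        "full_name": player["name"],
        "team_external_id": str(team["id"]),
        "team_name": team["name"],
        # Optional bio field: emitted as None (JSON null) when absent.
        "height_cm": player.get("height_cm"),
    }
    for alias, field in STAT_ALIASES.items():
        # Prefer the canonical field name, fall back to the short alias.
        record[field] = row[field] if field in row else row.get(alias)
    return record
```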

Testing

runtime web image stays lean and may not include pytest tooling
run tests with the development compose stack (or a dedicated test image/profile) where test dependencies are installed
local example:

docker compose -f docker-compose.yml -f docker-compose.dev.yml run --rm web pytest -q

Migration and Superuser Commands

docker compose exec web python manage.py migrate
docker compose exec web python manage.py createsuperuser

Health Endpoints

app health: /health/
nginx healthcheck proxies /health/ to web

Player Search (v2)

Public player search is server-rendered (Django templates) with HTMX partial updates.

Supported filters:

free text name search
nominal position, inferred role
competition, season, team
nationality
age, height, weight ranges
stats thresholds: games, MPG, PPG, RPG, APG, SPG, BPG, TOV, FG%, 3P%, FT%

Search correctness:

combined team/competition/season/stat filters are applied to the same PlayerSeason context (no cross-row false positives)
filtering happens at database level with Django ORM
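
The difference between same-row and cross-row matching is easiest to see with in-memory data. The application does this filtering in the database through the Django ORM; the two hypothetical functions below only illustrate the semantics.

```python
def same_row_matches(rows, team, min_ppg):
    """Correct: team and PPG threshold must hold on the same player-season row."""
    return {r["player"] for r in rows if r["team"] == team and r["ppg"] >= min_ppg}


def cross_row_matches(rows, team, min_ppg):
    """Incorrect: each condition may be satisfied by a different row."""
    on_team = {r["player"] for r in rows if r["team"] == team}
    scorers = {r["player"] for r in rows if r["ppg"] >= min_ppg}
    return on_team & scorers


rows = [
    {"player": "P1", "team": "A", "ppg": 5.0},
    {"player": "P1", "team": "B", "ppg": 20.0},
    {"player": "P2", "team": "A", "ppg": 18.0},
]
# P1 never averaged 15+ PPG while on team A, so only P2 should match.
```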

Search metric semantics:

result columns are labeled as Best Eligible
each displayed metric is MAX over eligible player-season rows for that metric in the current filter context
different metric columns for one player may come from different eligible seasons
when no eligible value exists for a metric in the current context, the UI shows -
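
A small in-memory sketch of the Best Eligible semantics. The function names are hypothetical and the rows are assumed to be already filtered to the eligible player-season rows; the application computes this in the database.

```python
def best_eligible(rows, metrics):
    """Per player, take the MAX of each metric independently across eligible rows."""
    result = {}
    for row in rows:
        best = result.setdefault(row["player"], {m: None for m in metrics})
        for metric in metrics:
            value = row.get(metric)
            if value is not None and (best[metric] is None or value > best[metric]):
                best[metric] = value
    return result


def display(value):
    """Render a metric cell; a missing value is shown as '-'."""
    return "-" if value is None else str(value)
```

Because each metric is maximized on its own, a player's displayed PPG and RPG can originate from two different seasons.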

Pagination and sorting:

querystring is preserved
HTMX navigation keeps URL state in sync with current filters/page/sort
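
One way to preserve the querystring when building a pagination link is sketched below with the standard library. The helper name `page_url`, the `/players/` path, and the parameter names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode


def page_url(path, current_query, page):
    """Build a link to `page` that keeps every other filter/sort parameter."""
    params = [
        (key, value)
        for key, value in parse_qsl(current_query, keep_blank_values=True)
        if key != "page"  # drop the old page number before appending the new one
    ]
    params.append(("page", str(page)))
    return f"{path}?{urlencode(params)}"
```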

Saved Searches and Watchlist (v2)

Authenticated users can:

save current search filters from the player search page
re-run saved searches from scouting pages
rename/update/delete saved searches
update saved search filters via structured JSON in the edit screen
add/remove favorite players inline (HTMX-friendly) and browse watchlist

GitFlow

Required branch model:

main: production
develop: integration
feature/*, release/*, hotfix/*

This v2 work branch is:

feature/hoopscout-v2-static-architecture

Notes on Legacy Layers

Legacy provider/Celery ingestion layers are not the default runtime path for v2 foundation. They are intentionally isolated until replaced by v2 snapshot ingestion commands in later tasks.
