hoopscout-v2/docs/adr/0009-real-data-ingestion-baseline.md

# ADR-0009: Real-Data Ingestion Baseline

## Status
Accepted

## Context
The scouting MVP is working with seeded data, but there is no accepted baseline for how real external data should enter the system. Before building the first importer, we need one predictable ingestion shape, clear identity rules, and explicit ownership boundaries so repeated imports are safe and do not damage scouting or user-owned data.

## Decision

### 1. Ingestion strategy baseline
Use a command-oriented, source-specific baseline.

- Each source gets its own Django management command under `scouting`.
- The first importer is single-source and single-competition scoped.
- Command behavior for MVP:
  - read a structured input payload for one source (no scraping in this phase)
  - normalize to HoopScout field conventions
  - validate required identifiers and required MVP fields
  - upsert supported imported models in deterministic order inside a transaction
- Keep execution synchronous and local-command driven for now.

This is the smallest safe baseline: simple to run, testable, and repeatable without introducing an ingestion platform.
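
The command flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's importer: the field names in `REQUIRED_FIELDS`, the `ImportReport` type, and the dict-backed `store` are all hypothetical. In an actual Django management command, the upsert loop would typically run inside `django.db.transaction.atomic()`.

```python
from dataclasses import dataclass, field

# Hypothetical required fields; the real list belongs to the importer's spec.
REQUIRED_FIELDS = ("player_id", "team_id", "competition_id", "full_name")


@dataclass
class ImportReport:
    upserted: int = 0
    errors: list = field(default_factory=list)  # (row_index, [messages])


def normalize(raw: dict) -> dict:
    """Lower-case keys, trim strings, and turn blank strings into None."""
    row = {}
    for key, value in raw.items():
        if isinstance(value, str):
            value = value.strip() or None
        row[key.strip().lower()] = value
    return row


def validate(row: dict) -> list:
    """Return one message per missing required field."""
    return [f"missing {name}" for name in REQUIRED_FIELDS if row.get(name) is None]


def run_import(payload: list, store: dict) -> ImportReport:
    """Normalize, validate, then upsert rows in payload order."""
    report = ImportReport()
    for index, raw in enumerate(payload):
        row = normalize(raw)
        problems = validate(row)
        if problems:
            report.errors.append((index, problems))  # skip the row, never guess
            continue
        store[row["player_id"]] = row  # same key on re-run: update, not duplicate
        report.upserted += 1
    return report
```

Running `run_import` twice with the same payload leaves `store` with the same set of keys, which is the repeatability property this baseline asks for.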

### 2. Source identity and external identifiers
Imports must use deterministic matching keys. Fuzzy matching is out of scope for MVP imports.

External identifier policy:
- `Player`: require source external player ID.
- `Competition`: require source external competition ID.
- `Team`: require source external team ID.
- `Season`: use internal canonical season identity (`name`, `start_year`, `end_year`) as the MVP match key; source season ID may be added later if needed.
- `PlayerSeason`: require a deterministic context identity built from source IDs for player + season + team + competition.
- `PlayerSeasonStats`: match strictly by the resolved `PlayerSeason` (the current model holds one stats row per player-season context).
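
One possible way to build the deterministic `PlayerSeason` context identity is a composite key. The sketch below is an assumption, not a decided format: the function name, the `|` separator, and the use of the canonical season name as the season component are all hypothetical choices made for illustration.

```python
SEPARATOR = "|"  # assumed separator; any value that cannot appear in an ID works


def player_season_key(source_name, player_id, season_name, team_id, competition_id):
    """Build a stable PlayerSeason context key, or raise if a part is unusable."""
    parts = {
        "source_name": source_name,
        "player_id": player_id,
        "season_name": season_name,
        "team_id": team_id,
        "competition_id": competition_id,
    }
    cleaned = []
    for label, value in parts.items():
        text = "" if value is None else str(value).strip()
        if not text:
            # Missing identity is an error, never a reason to guess.
            raise ValueError(f"missing {label}")
        if SEPARATOR in text:
            # Reject values that would make the joined key ambiguous.
            raise ValueError(f"{label} contains reserved separator")
        cleaned.append(text)
    return SEPARATOR.join(cleaned)
```

Because every component is required and trimmed, the same source row always yields the same key, and a row with any missing component fails loudly instead of matching something by accident.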

Model-shape guidance for implementation:
- Add source-aware external identity mapping support before the first importer write path is built.
- Preferred MVP shape: small generic mapping table keyed by (`source_name`, `entity_type`, `external_id`) -> internal object reference.
- Avoid source-specific columns spread across core domain tables in MVP.
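
The preferred mapping shape can be illustrated with the standard-library `sqlite3` module. This is a sketch only: the table name `external_id_mapping`, the integer `internal_id` column, and the helper functions are assumptions, and the actual object reference strategy is still an open follow-up decision. In Django the same uniqueness rule would usually be expressed as a model-level unique constraint.

```python
import sqlite3

# Hypothetical schema: one row per (source, entity type, external ID).
SCHEMA = """
CREATE TABLE external_id_mapping (
    id INTEGER PRIMARY KEY,
    source_name TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    external_id TEXT NOT NULL,
    internal_id INTEGER NOT NULL,
    UNIQUE (source_name, entity_type, external_id)
)
"""


def open_mapping_db() -> sqlite3.Connection:
    """Create an in-memory database holding only the mapping table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def add_mapping(conn, source_name, entity_type, external_id, internal_id):
    """Insert one mapping; the UNIQUE constraint rejects a second row for the key."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO external_id_mapping "
            "(source_name, entity_type, external_id, internal_id) "
            "VALUES (?, ?, ?, ?)",
            (source_name, entity_type, external_id, internal_id),
        )


def lookup(conn, source_name, entity_type, external_id):
    """Return the mapped internal ID, or None when no mapping exists."""
    row = conn.execute(
        "SELECT internal_id FROM external_id_mapping "
        "WHERE source_name = ? AND entity_type = ? AND external_id = ?",
        (source_name, entity_type, external_id),
    ).fetchone()
    return row[0] if row else None
```

Keeping the key in one generic table means adding a second source requires new rows, not new columns on `Player`, `Team`, or `Competition`.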

Duplication/merge safety rules:
- Never merge two internal entities without deterministic ID evidence.
- If deterministic identity is missing, skip the row and report a validation error (do not guess).
- Re-running the same source payload must update existing mapped records, not create duplicates.
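
These safety rules amount to a resolve-or-create lookup that never guesses. The class below is a hypothetical in-memory stand-in for the mapping table, written only to show the expected behavior; `IdentityResolver` and its method name are not part of the codebase.

```python
import itertools


class IdentityResolver:
    """In-memory stand-in for the mapping table (illustrative only)."""

    def __init__(self):
        self._mapping = {}
        self._next_id = itertools.count(1)

    def resolve_or_create(self, source_name, entity_type, external_id):
        """Return (internal_id, created); raise ValueError if the ID is missing."""
        if external_id is None or not str(external_id).strip():
            # No deterministic evidence: the caller should skip and report the row.
            raise ValueError(f"{entity_type}: missing external ID")
        key = (source_name, entity_type, str(external_id).strip())
        if key in self._mapping:
            return self._mapping[key], False  # re-run: reuse the existing record
        internal_id = next(self._next_id)
        self._mapping[key] = internal_id
        return internal_id, True
```

A second call with the same triple returns the same internal ID with `created` set to `False`, so repeated imports update rather than duplicate.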

### 3. Data ownership boundary
Three ownership zones are explicit:

- Imported source data (updatable by importer):
  - objective player profile fields provided by source
  - competition/team/season context fields provided by source
  - player-season and stats payload fields provided by source
- Internal scouting enrichment (not overwritten by importer by default):
  - roles
  - specialties
  - future internal scouting metadata unless explicitly marked importer-owned in a later ADR
- User-scoped product data (never touched by importer):
  - favorites
  - notes
  - saved searches

Importer code must not overwrite user-owned or internal enrichment data unless a later decision explicitly allows it.
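
One way to enforce this boundary in code is an explicit allowlist of importer-owned fields per model. The sketch below assumes hypothetical field names (`full_name`, `birth_date`, `height_cm`, `games_played`, `points_per_game`); the real field-by-field ownership matrix is still a follow-up decision.

```python
# Hypothetical ownership allowlist; anything not listed is off-limits to the importer.
IMPORTER_OWNED = {
    "Player": {"full_name", "birth_date", "height_cm"},
    "PlayerSeasonStats": {"games_played", "points_per_game"},
}


def importer_updates(model_name: str, incoming: dict) -> dict:
    """Keep only the fields the importer is allowed to write for this model."""
    allowed = IMPORTER_OWNED.get(model_name, set())
    return {name: value for name, value in incoming.items() if name in allowed}


def apply_import(record: dict, model_name: str, incoming: dict) -> dict:
    """Return a copy of record with importer-owned fields updated, others untouched."""
    updated = dict(record)
    updated.update(importer_updates(model_name, incoming))
    return updated
```

Because unknown models map to an empty allowlist, user-scoped tables such as favorites or notes receive no writes even if a payload happens to carry matching keys.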

### 4. Update semantics baseline
Repeated imports are expected and must be idempotent.

- Upsert allowed:
  - `Player`, `Competition`, `Team`, `Season`, `PlayerSeason`, `PlayerSeasonStats` for importer-owned fields only.
- Immutable/append-only assumptions for MVP:
  - user-owned tables remain untouched
  - internal enrichment remains untouched
  - import-run audit rows (if added) are append-only
- Optional fields:
  - missing optional source fields must not block ingestion; store null/empty where allowed
- Conflict behavior:
  - a deterministic key mismatch or a missing required ID is a row-level error, never grounds for a heuristic merge.
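
The update semantics can be illustrated with a small upsert over a dict keyed by the resolved `PlayerSeason` ID. The field names and the `upsert_stats` helper are hypothetical. The sketch also assumes that an optional field absent from a later payload is reset to `None`, which is one possible reading of the rule above, not a settled behavior.

```python
# Hypothetical field split for a stats row.
REQUIRED = ("games_played",)
OPTIONAL = ("points_per_game", "rebounds_per_game")


def upsert_stats(table: dict, player_season_id: int, incoming: dict) -> str:
    """Upsert one stats row; return 'created', 'updated', or 'unchanged'."""
    missing = [name for name in REQUIRED if incoming.get(name) is None]
    if missing:
        # Row-level error: the caller reports it and moves to the next row.
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    values = {name: incoming[name] for name in REQUIRED}
    # Missing optional fields do not block ingestion; they are stored as None.
    values.update({name: incoming.get(name) for name in OPTIONAL})
    existing = table.get(player_season_id)
    if existing is None:
        table[player_season_id] = values
        return "created"
    if existing == values:
        return "unchanged"  # identical re-run leaves the row as it was
    existing.update(values)
    return "updated"
```

Returning a status per row makes idempotency easy to check in tests: a second run over the same payload should report only `unchanged`.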

### 5. MVP ingestion scope
The first ingestion implementation must stay narrow:

- one source
- one competition
- one command-oriented flow
- minimal useful fields only (core player identity/profile + player-season context + stats already represented by current model)

No scraping framework, no async worker orchestration, and no multi-source reconciliation engine in this phase.

### 6. Model impact guidance
Before first importer implementation, introduce only the minimum schema support needed for deterministic identity and repeatability:

- required: source/external ID mapping support
- recommended: lightweight import-run tracking for observability and replay confidence
- not required now:
  - broad domain redesign
  - full provenance graph
  - multi-source conflict resolution model
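
If lightweight import-run tracking is adopted, an append-only record per run could look like the sketch below. The `ImportRun` fields and the `ImportRunLog` class are assumptions for illustration; whether tracking is part of the MVP at all is still listed as a follow-up decision.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a recorded run cannot be edited afterwards
class ImportRun:
    source_name: str
    started_at: datetime
    rows_upserted: int
    rows_skipped: int


class ImportRunLog:
    """Append-only log: runs can be recorded and read, never changed or removed."""

    def __init__(self):
        self._runs = []

    def record(self, source_name: str, rows_upserted: int, rows_skipped: int) -> ImportRun:
        run = ImportRun(source_name, datetime.now(timezone.utc), rows_upserted, rows_skipped)
        self._runs.append(run)
        return run

    def history(self) -> tuple:
        """Return an immutable snapshot of all recorded runs, oldest first."""
        return tuple(self._runs)
```

Comparing the counts of two consecutive runs over the same payload gives a cheap signal that a replay behaved the same way as the original import.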

### 7. Implementation guidance for the next prompt
The next ingestion implementation task should assume:

- first importer shape:
  - source-specific management command
  - structured input -> normalize -> validate -> transactional upsert
- identity matching:
  - external ID mapping required for player/team/competition/player-season contexts
  - season matched by canonical internal season identity in MVP
- repeatability/idempotency:
  - safe to run same payload multiple times with stable results
- data touch boundaries:
  - importer updates only importer-owned objective fields
  - importer does not modify roles/specialties/favorites/notes/saved searches

## Alternatives considered

### A. Direct ad-hoc scripts that write straight to tables
Rejected. Too hard to verify, repeat, and maintain safely across contributors.

### B. Full ingestion platform first (queues, orchestrator, multi-source reconciliation)
Rejected. Too much complexity for the current phase and for the scope of the first real importer.

### C. Natural-key-only matching without external identifiers
Rejected. High risk of duplicate records and ambiguous merges across repeated imports.

### D. Source-specific external ID columns on each core model
Not chosen for MVP baseline. It couples core schema to individual sources and scales poorly when adding sources.

## Trade-offs
- Pros:
  - deterministic identity and safer repeat imports
  - clear ownership boundaries that protect scouting/user data
  - minimal implementation surface for first real importer
- Cons:
  - requires adding mapping support before importer write path
  - strict ID requirements may reject incomplete rows instead of ingesting partial guesses

## Consequences
- Future importer implementation can proceed without re-deciding baseline ingestion shape.
- First importer work will prioritize deterministic mapping and idempotent upserts over source breadth.
- Search/product layers remain stable while objective source data becomes replaceable and repeatable.

## Follow-up decisions needed
1. Exact schema design for source/external ID mapping table and object reference strategy.
2. Whether import-run tracking is mandatory in MVP or phase-2.1.
3. Field-by-field importer ownership matrix (which columns are importer-owned vs internal-only).
4. Error-report output format for skipped/invalid rows.
5. When to add source season external IDs if natural season identity becomes insufficient.