wifi-ruview/docs/adr/ADR-147-nvidia-cosmos-world-foundation-model-integration.md
rUv c7ddb2d7d1 feat(worldmodel): ADR-147 — OccWorld world model integration, wifi-densepose-worldmodel v0.3.0 (#856)
* feat(worldmodel): ADR-147 — OccWorld integration, wifi-densepose-worldmodel v0.3.0 (#854)

- New crate `wifi-densepose-worldmodel` v0.3.0: async Unix-socket bridge
  to OccWorld Python inference server; `OccWorldBridge`, `OccupancyGrid3D`,
  `TrajectoryPrior`, `worldgraph_to_occupancy` encoder (14/14 tests pass)
- `scripts/occworld_server.py`: long-lived Python inference server for
  OccWorld TransVQVAE (72.4M params); applies API-bug patches; dummy mode
  for CI testing; graceful SIGTERM shutdown
- `pose_tracker.rs`: `trajectory_prior` soft-blend injection (80/20
  Kalman/prior) on torso keypoint; `set_trajectory_prior()` public method
- CI: added `Run ADR-147 worldmodel tests` step
- ADR-147: accepted — OccWorld primary (209 ms, 3.37 GB VRAM, RTX 5080);
  Cosmos deferred to ADR-148 (32.54 GB VRAM exceeds hardware)
- Benchmark proof: 208.7 ms P50, 3.37 GB peak VRAM, 12.1 GB headroom

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: update ruvector.db state

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: ruvector.db sync

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(cli): add missing min_frames field to CalibrateArgs test helper

E0063 in calibrate.rs:448 — CalibrateArgs gained min_frames in ADR-135
but the default_args() test helper was not updated. min_frames=0 means
'use tier default', matching the existing runtime behaviour.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-29 16:53:51 -04:00


ADR-147: Occupancy World Model Integration (OccWorld / RoboOccWorld)

| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-05-29 |
| Deciders | ruv |
| Relates to | ADR-136, ADR-139, ADR-140, ADR-141, ADR-143, ADR-145, ADR-146 |

Previously titled "NVIDIA Cosmos WFM Integration". Decision revised after hardware analysis confirmed RTX 5080 (16 GB VRAM) cannot run Cosmos-Transfer2.5-2B (requires 32.54 GB). OccWorld runs in 1.65 GB VRAM at 375 ms/inference — validated locally.

1. Context

RuView's WorldGraph (ADR-139) produces a current-state environmental digital twin; the RF encoder (ADR-146) predicts present-frame pose/presence/count at ~20 Hz. There is no future-state prediction: no trajectory priors beyond the Kalman tracker's 5–10 frame horizon, and no physics-aware validation of SemanticState updates.

Two world-model families were evaluated:

1.1 NVIDIA Cosmos (deferred)

Cosmos-Transfer2.5-2B requires 32.54 GB VRAM, but ruvultra's RTX 5080 has 15.5 GB, so the model cannot run locally. It is deferred to ADR-148, to be revisited when H100/A100 access is available or used for offline training data generation only.

1.2 OccWorld / RoboOccWorld (this ADR)

| Model | Domain | Input | VRAM (inf) | Status |
|---|---|---|---|---|
| OccWorld (wzzheng/OccWorld, ECCV 2024) | Outdoor AV (nuScenes) | 3D semantic voxel seq | 1.65 GB validated | Code available, Apache-2.0 |
| RoboOccWorld (arXiv 2505.05512) | Indoor robotics | 3D voxel seq, camera poses | ~2–4 GB estimated | Code not yet released (~Q3 2025) |

Both operate natively in 3D occupancy space — the same representation RuView produces from WiFi CSI. No video rendering intermediate is needed (unlike Cosmos).

OccWorld architecture: VQVAE tokenizer (72.4M params) encodes 3D semantic occupancy to discrete latent tokens → PlanUAutoRegTransformer predicts future tokens → VQVAE decoder reconstructs future 3D occupancy. Input: (B, F, H, W, D) voxel grid with integer class labels. Output: predicted occupancy for the next F−1 timesteps (15 for the 16-frame input validated in §3.3).

RoboOccWorld (once released): identical paradigm but trained on indoor scenes (60×60×36 voxels at 0.08 m/voxel, 4.8×4.8×2.88 m space, 12 indoor semantic classes) — near-perfect match for RuView's room-scale CSI occupancy.
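
The grid arithmetic above can be checked with a small sketch. The function names `grid_extent_m` and `point_to_voxel` are hypothetical and do not come from the RuView crates; the sketch also assumes points are expressed in metres relative to the grid's minimum corner.

```rust
/// Side lengths in metres of a voxel grid with `dims` cells per axis.
/// (Hypothetical helper, not part of the RuView crates.)
pub fn grid_extent_m(dims: [usize; 3], voxel_size_m: f64) -> [f64; 3] {
    dims.map(|n| n as f64 * voxel_size_m)
}

/// Maps a point (metres, relative to the grid's minimum corner) to a voxel
/// index, or `None` if the point lies outside the grid or is not a number.
pub fn point_to_voxel(
    p: [f64; 3],
    dims: [usize; 3],
    voxel_size_m: f64,
) -> Option<[usize; 3]> {
    let mut idx = [0usize; 3];
    for axis in 0..3 {
        // Written this way so that NaN coordinates are rejected as well.
        if !(p[axis] >= 0.0) {
            return None;
        }
        let i = (p[axis] / voxel_size_m).floor() as usize;
        if i >= dims[axis] {
            return None;
        }
        idx[axis] = i;
    }
    Some(idx)
}
```

With `dims = [60, 60, 36]` and `voxel_size_m = 0.08`, `grid_extent_m` gives roughly 4.8 × 4.8 × 2.88 m, matching the RoboOccWorld figures quoted above.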

2. Decision

Phase A (now): Use OccWorld as the integration scaffold. Run inference from a Python subprocess. Adapt its dataset loader to accept RuView's custom occupancy format. Remap semantic classes from nuScenes outdoor (18 classes) to RuView indoor (floor, wall, ceiling, person, furniture, free).

Phase B (Q3–Q4 2025): Swap in RoboOccWorld when its code releases. The Rust OccupancyWorldModel interface (§4.2) is designed for a clean backend swap.

Cosmos: Deferred. Revisit as an offline training data generator if H100 becomes available (ADR-148).

3. Validated Installation (ruvultra, 2026-05-29)

3.1 Environment

| Component | Version | Notes |
|---|---|---|
| GPU | RTX 5080, 15.5 GB VRAM | sm_120 (Blackwell) |
| PyTorch | 2.10.0+cu128 | ml-env, Python 3.12 |
| CUDA toolkit | 12.8 | /usr/local/cuda-12.8 |
| mmcv | 2.0.1 (Python-only, no CUDA ops) | Built from source with pkg_resources patch |
| mmdet | 3.0.0 | pip install |
| mmdet3d | 1.1.1 | Built from source with --no-deps |
| mmengine | 0.10.7 | pip install via mmcv |
| OccWorld | commit HEAD | ~/projects/OccWorld |

3.2 Build Notes

Issue 1 — sccache compiler wrapping: the system-wide settings CC=sccache clang and CXX=sccache clang++ break PyTorch CUDA extension builds (clang is injected as a positional argument to the build command). Fix: unset CC CXX before every pip install.

Issue 2 — pkg_resources in mmcv setup.py: setuptools ≥72 removed the legacy pkg_resources top-level import. Fix: patch line 5 of setup.py to use importlib.metadata and packaging.version.

Issue 3 — CUDA version mismatch: host nvcc is CUDA 13.0; PyTorch was built with 12.8. Fix: CUDA_HOME=/usr/local/cuda-12.8 for all builds.

Issue 4 — mmcv 2.0.1 CUDA ops incompatible with PyTorch 2.10 ATen headers: c10::Type::TypePtr dereference operator changed. Fix: build MMCV_WITH_OPS=0 (Python-only build, mmcv-lite). OccWorld's inference path does not use mmcv CUDA ops.

Issue 5 — OccWorld API bug: TransVQVAE.forward_inference calls self.transformer(..., hidden=hidden) but PlanUAutoRegTransformer.forward(tokens, pose_tokens) has no hidden kwarg and returns a (queries, pose_queries) tuple. Fix: monkey-patch forward_inference to pass pose_tokens=zeros and unpack the tuple return. Applied in the Python subprocess at startup.

3.3 Validation Results

Input:  torch.Size([1, 16, 200, 200, 16])  — 16 frames (15 past + 1 offset)
Output: sem_pred   (1, 15, 200, 200, 16) int64  — predicted future occupancy
        logits     (1, 15, 200, 200, 16, 18) f32 — class logits
        iou_pred   (1, 15, 200, 200, 16) int64  — binary occupancy mask
Inference time: 375 ms
VRAM peak:      1.65 GB
Parameters:     72.4M

OccWorld produces 15 predicted future frames from 15 past frames of 3D semantic occupancy at 200×200×16 resolution with 18 classes — fully validated on RTX 5080.
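
As a sanity check on the shapes listed above, the following sketch computes element counts and raw buffer sizes. The helper names are hypothetical, and the byte counts are plain arithmetic on the reported shapes and dtypes, not measurements. How these buffers contribute to the reported VRAM peak is not established by this calculation.

```rust
/// Number of elements in a dense tensor of the given shape.
/// (Hypothetical helper for back-of-envelope sizing.)
pub fn element_count(shape: &[usize]) -> usize {
    shape.iter().product()
}

/// Raw size in bytes of a dense tensor, ignoring allocator overhead.
pub fn tensor_bytes(shape: &[usize], bytes_per_element: usize) -> usize {
    element_count(shape) * bytes_per_element
}

/// Shapes copied from the validation output above.
pub const INPUT_SHAPE: [usize; 5] = [1, 16, 200, 200, 16];
pub const SEM_PRED_SHAPE: [usize; 5] = [1, 15, 200, 200, 16];
pub const LOGITS_SHAPE: [usize; 6] = [1, 15, 200, 200, 16, 18];
```

For example, `tensor_bytes(&LOGITS_SHAPE, 4)` is 691,200,000 bytes (about 0.69 GB for f32 logits), and `tensor_bytes(&SEM_PRED_SHAPE, 8)` is 76,800,000 bytes for the int64 prediction.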

4. Integration Architecture

4.1 Data Flow

ESP32-S3 CSI (20 Hz)
    │
    ▼
[ruvsense signal pipeline]  ── ADR-136 frame contracts
    │
    ▼
[RfEncoder / MultiTaskOutput]  ── ADR-146 pose + presence + count
    │  (sub-Hz WorldGraph update rate)
    ▼
[WorldGraph]  ── PersonTrack, ObjectAnchor, SemanticState  ── ADR-139/140
    │
    │  On semantic event (motion, activity change, fall-risk query)
    ▼
[BFLD Privacy Gate]  ── ADR-141: "occworld_inference" action
    │  PRIVATE/HOME → bridge NOT called
    │  MONITORING/AWAY → local inference permitted
    ▼
[wifi-densepose-worldmodel] ── Rust thin client (Unix socket)
    │
    ▼
[OccWorld Inference Server]  ── Python subprocess (~/projects/OccWorld)
    │  WorldGraph PersonTrack history → (B, F, H, W, D) occupancy tensor
    │  OccWorld forward_inference → sem_pred (15 future frames)
    │  Decode future voxels → TrajectoryPrior per PersonTrack
    │
    ▼
[Trajectory priors injected into ruvsense/pose_tracker.rs Kalman filter]
[WorldGraph::upsert_node(Event { predicted_movement, ... })]
    SemanticProvenance { model_version, calibration_id, privacy_decision }
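
The last step of the flow, injecting a trajectory prior into the tracker, could take the form of a soft blend between the Kalman estimate and the predicted position. The sketch below is an assumption about what such a blend might look like, not the code in `pose_tracker.rs`. The function name and the `prior_weight` parameter are hypothetical; the commit summary mentions an 80/20 Kalman/prior split, which corresponds to `prior_weight = 0.2` here.

```rust
/// Linearly blends a Kalman position estimate with a world-model prior.
/// `prior_weight` is clamped to [0, 1]; 0.0 keeps the Kalman estimate
/// unchanged and 1.0 replaces it with the prior.
/// (Hypothetical helper; the real tracker API may differ.)
pub fn blend_with_prior(kalman: [f32; 3], prior: [f32; 3], prior_weight: f32) -> [f32; 3] {
    let w = prior_weight.clamp(0.0, 1.0);
    let mut out = [0.0f32; 3];
    for axis in 0..3 {
        out[axis] = kalman[axis] * (1.0 - w) + prior[axis] * w;
    }
    out
}
```

A small weight keeps the measurement-driven estimate dominant, so a poor prediction (for example from an untrained indoor model) can only nudge the track rather than override it.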

4.2 Rust Interface (wifi-densepose-worldmodel crate — to be created)

Interface designed to be backend-agnostic (OccWorld today, RoboOccWorld when released):

pub struct OccupancyWorldModelRequest {
    pub past_frames: Vec<OccupancyGrid3D>,    // N frames of history
    pub voxel_resolution: f32,                // metres/voxel
    pub scene_bounds: AabbEnu,                // room extent in ENU
    pub prediction_steps: u32,                // how many future steps
}

pub struct OccupancyWorldModelResponse {
    pub future_frames: Vec<OccupancyGrid3D>,  // predicted future occupancy
    pub confidence: f32,
    pub model_id: String,                     // checkpoint hash for provenance
}

pub struct OccWorldBridge {
    socket_path: PathBuf,
    client: reqwest::Client,
}

impl OccWorldBridge {
    pub async fn predict(
        &self,
        request: OccupancyWorldModelRequest,
    ) -> Result<OccupancyWorldModelResponse, WorldModelError>;
}
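
One way to keep the backend swappable is to put the prediction call behind a trait, so callers never name a concrete model. The sketch below is illustrative only. `Grid`, `WorldModelBackend`, `PersistenceBackend`, and `forecast` are hypothetical names, the types are simplified stand-ins for the structs above, and the call is synchronous to stay within the standard library (the ADR's `predict` is async).

```rust
/// Simplified stand-in for `OccupancyGrid3D`: dimensions plus class labels.
#[derive(Clone, Debug, PartialEq)]
pub struct Grid {
    pub dims: [usize; 3],
    pub labels: Vec<u8>,
}

/// Hypothetical backend abstraction: OccWorld today, another model later.
pub trait WorldModelBackend {
    /// Identifier recorded for provenance (e.g. a checkpoint hash).
    fn model_id(&self) -> String;
    /// Predicts `steps` future frames from the history in `past`.
    fn predict(&self, past: &[Grid], steps: usize) -> Result<Vec<Grid>, String>;
}

/// Trivial baseline that repeats the most recent frame. Useful as a test
/// double; it performs no learned prediction.
pub struct PersistenceBackend;

impl WorldModelBackend for PersistenceBackend {
    fn model_id(&self) -> String {
        "persistence-baseline".to_string()
    }

    fn predict(&self, past: &[Grid], steps: usize) -> Result<Vec<Grid>, String> {
        let last = past.last().ok_or_else(|| "no history frames".to_string())?;
        Ok(vec![last.clone(); steps])
    }
}

/// Caller code depends only on the trait, so swapping backends needs no
/// changes here. Returns the model id alongside the predicted frames.
pub fn forecast(
    backend: &dyn WorldModelBackend,
    past: &[Grid],
    steps: usize,
) -> Result<(String, Vec<Grid>), String> {
    let frames = backend.predict(past, steps)?;
    Ok((backend.model_id(), frames))
}
```

A persistence baseline like this can also serve as a reference point when judging whether a learned model's predictions add value.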

4.3 RuView → OccWorld Adaptation (required before production use)

OccWorld was trained on nuScenes outdoor driving (200×200×16 at 0.4 m/voxel, 80×80×6.4 m, 18 outdoor classes). RuView uses indoor room-scale occupancy (~10×10×3 m at finer resolution). Required adaptations:

  1. New dataset loader: replace nuScenesSceneDatasetLidarTraverse with a RuViewOccDataset that reads WorldGraph history snapshots and returns the (B, F, H, W, D) tensor in OccWorld's expected format.
  2. Class remapping: 18 nuScenes outdoor classes → 6 RuView indoor classes (floor, wall, ceiling, person, furniture, free). Remap during tensor construction.
  3. Ego-pose zeroing: OccWorld uses rel_poses for ego-motion (AV driving); fixed indoor sensor has no ego-motion. Pass zero poses in forward_inference_with_plan.
  4. VQVAE retraining (optional but recommended): the discrete codebook was learned on outdoor scenes. Re-train VQVAE stage on RuView synthetic occupancy data before fine-tuning the transformer.
  5. Resolution rescaling: if indoor occupancy uses finer voxels (e.g. 0.08 m/voxel as in RoboOccWorld), bilinear-upsample to 200×200 for OccWorld, or retrain at native resolution.
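
The class remapping in item 2 amounts to a lookup table applied per voxel. The sketch below shows the idea under loud assumptions: the source indices used in `example_table` (7 for pedestrian, 11 for driveable surface, 15 for man-made structure, 17 for free space) and the choice of `Furniture` as the default are hypothetical and must be checked against the label configuration OccWorld actually uses. No outdoor class maps to `Ceiling` in this example.

```rust
/// The six RuView indoor classes named in the list above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RuViewClass {
    Floor,
    Wall,
    Ceiling,
    Person,
    Furniture,
    Free,
}

/// Number of source (nuScenes-style) classes.
pub const SOURCE_CLASSES: usize = 18;

/// ASSUMPTION: illustrative mapping only. The source indices below are not
/// taken from the ADR and must be verified against OccWorld's label config.
pub fn example_table() -> [RuViewClass; SOURCE_CLASSES] {
    let mut table = [RuViewClass::Furniture; SOURCE_CLASSES];
    table[7] = RuViewClass::Person; // assumed: pedestrian
    table[11] = RuViewClass::Floor; // assumed: driveable surface
    table[15] = RuViewClass::Wall; // assumed: man-made structure
    table[17] = RuViewClass::Free; // assumed: free space
    table
}

/// Remaps every source label through `table`. Returns `Err(label)` for the
/// first label that is outside the source class range.
pub fn remap(
    labels: &[u8],
    table: &[RuViewClass; SOURCE_CLASSES],
) -> Result<Vec<RuViewClass>, u8> {
    labels
        .iter()
        .map(|&l| table.get(l as usize).copied().ok_or(l))
        .collect()
}
```

Failing on out-of-range labels, rather than silently mapping them to a default, makes a mismatched label configuration visible during tensor construction.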

4.4 Privacy Compliance (ADR-141)

The OccWorld bridge is a new occworld_inference action in the BFLD privacy control plane:

| Action | PRIVATE | HOME | MONITORING | AWAY |
|---|---|---|---|---|
| occworld_inference (local) | Not permitted | Not permitted | Permitted | Permitted |
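
The gating rule from the data flow in §4.1 (PRIVATE/HOME: bridge not called; MONITORING/AWAY: local inference permitted) can be expressed as a small check that wraps the bridge call. This is a minimal sketch; `PrivacyMode`, `occworld_inference_permitted`, and `gated` are hypothetical names, not the BFLD control-plane API from ADR-141.

```rust
/// The four privacy modes referenced by this ADR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PrivacyMode {
    Private,
    Home,
    Monitoring,
    Away,
}

/// Whether the `occworld_inference` action may run in the given mode.
pub fn occworld_inference_permitted(mode: PrivacyMode) -> bool {
    matches!(mode, PrivacyMode::Monitoring | PrivacyMode::Away)
}

/// Runs `call` only when the mode permits inference. The closure is never
/// invoked otherwise, so no occupancy data reaches the bridge.
pub fn gated<T>(mode: PrivacyMode, call: impl FnOnce() -> T) -> Option<T> {
    if occworld_inference_permitted(mode) {
        Some(call())
    } else {
        None
    }
}
```

Taking the bridge call as a closure means the decision is made before any request is built, which keeps the denied path free of side effects.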

All SemanticState nodes derived from predictions carry SemanticProvenance:

privacy_decision: PrivacyDecisionRef { mode, action: "occworld_inference", timestamp }
model_version: <OccWorld checkpoint hash>
calibration_id: <active baseline from ADR-135>

5. Consequences

5.1 Positive

  • Validated locally: 375 ms inference, 1.65 GB VRAM — fits comfortably on RTX 5080
  • 15-frame prediction horizon (~7.5 s at 2 Hz, or up to ~30 s at a slower custom frame rate such as 0.5 Hz)
  • Native occupancy format: no video rendering intermediate unlike Cosmos
  • Clean swap boundary: the backend behind OccWorldBridge can change to RoboOccWorld without changing the Rust request/response interface (§4.2)
  • 72.4M params: small enough to fine-tune on a single RTX 5080
  • No Python in Rust workspace: subprocess isolation preserves Rust-only mandate

5.2 Negative

  • Domain gap: nuScenes outdoor training vs indoor WiFi sensing — VQVAE codebook and transformer weights encode outdoor semantics; retraining required for quality results
  • No ego-pose equivalent in fixed indoor sensors — rel_poses must be zeroed
  • Pre-trained weights predict outdoor scene evolution; uncalibrated predictions for indoor scenes are semantically meaningless without retraining
  • RoboOccWorld (indoor-native, 0.08 m/voxel) not yet available; current OccWorld is a placeholder until it releases

5.3 Risks

| Risk | Likelihood | Mitigation |
|---|---|---|
| RoboOccWorld delayed past Q4 2025 | Medium | OccWorld retrained on synthetic RuView data as fallback |
| VQVAE codebook quality low on indoor after retraining | Low | RoboOccWorld swap; OccWorld still useful for coarse occupancy |
| OccWorld API drift (unmaintained repo) | Low | Local fork at ~/projects/OccWorld; patches documented above |
| WorldGraph update rate too low for meaningful sequences | Medium | Log WorldGraph snapshots at configurable rate for inference |

6. Implementation Phases

| Phase | Scope | Status |
|---|---|---|
| 1 | Install OccWorld; validate forward pass with synthetic data | Done (2026-05-29) |
| 2 | wifi-densepose-worldmodel Rust thin client crate (Unix socket bridge) | Next |
| 3 | RuViewOccDataset loader + class remapping + ego-pose zeroing | Pending |
| 4 | Trajectory prior injection into pose_tracker.rs Kalman filter | Pending |
| 5 | VQVAE + transformer retraining on RuView synthetic occupancy | Pending |
| 6 | Swap to RoboOccWorld backend when code releases | Q3–Q4 2025 |

7. Cosmos Path (Deferred — ADR-148)

NVIDIA Cosmos-Transfer2.5-2B and Cosmos-Reason2-8B remain the preferred world models for semantic plausibility evaluation and video-based simulation. They are deferred to ADR-148, which will cover:

  • H100/A100 access (cloud or co-lo) for Cosmos inference
  • Offline synthetic training data generation for ADR-146 RF encoder heads
  • Cosmos-Reason2-8B as a physics plausibility gate for SemanticState commits

8. References