ReVo - A Cross-Layer Reliable Volumetric Videoconferencing System

Under Review

Abstract

The adoption of volumetric video for immersive applications is growing, driven by its support for six-degree-of-freedom (6-DoF) interaction. A major obstacle to its widespread use—particularly for real-time telepresence—is the challenge of streaming large data payloads over lossy networks without degrading the Quality of Experience (QoE).

To address this, we propose ReVo, a novel framework for loss-resilient volumetric video streaming. Our system integrates a transport-layer strategy with an application-layer recovery engine. We first partition frames into critical and non-critical sets; critical frames are sent via a reliable protocol, while non-critical frames are sent unreliably to minimize latency. Any resulting packet loss in the non-critical stream is handled by our Vision Transformer (ViT)-based recovery module, trained to restore both RGB and depth information.

The Challenge

Volumetric content (point clouds, RGB-D video, NeRFs) demands high storage and bandwidth. Existing work typically focuses on:

  1. Optimization: Improving encoder-decoder frameworks, an approach that often neglects storage costs.
  2. Compression: Designing neural compression frameworks that assume perfect network conditions.

However, networks are inherently lossy. Standard recovery methods such as retransmission (too slow for real-time use) and forward error correction (FEC, which adds bandwidth overhead) are ill-suited to volumetric data. Furthermore, existing neural recovery methods (e.g., GRACE, REPARO) focus exclusively on 2D RGB video, failing to address the interdependency of depth frames required for 3D reconstruction.
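The bandwidth cost of FEC is easy to see with the simplest scheme, single XOR parity: protecting every k data packets requires one extra parity packet, i.e., 1/k additional bandwidth, and only one loss per group can be repaired. The sketch below (illustrative only; the helper names are not from the paper) makes this concrete.

```python
def xor_parity(packets):
    """Compute one XOR parity packet over k equal-length data packets.
    Overhead is 1/k extra bandwidth (e.g., k=4 -> 25%)."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover_single_loss(received, parity):
    """Reconstruct exactly one missing packet (marked None):
    XOR-ing the parity with all surviving packets yields the lost one."""
    missing = bytearray(parity)
    for pkt in received:
        if pkt is not None:
            for i, b in enumerate(pkt):
                missing[i] ^= b
    return bytes(missing)
```

With k = 4, a burst of two lost packets in one group is already unrecoverable, which is why stronger (and costlier) codes are needed as loss rates rise.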

The ReVo Approach

We propose a unified, hybrid framework that leverages the strengths of both transport and application layers.

  1. Hybrid Transport: We classify frames as “critical” or “non-critical.” Critical frames are sent over a reliable transport stream (TCP), while non-critical frames use an unreliable stream (QUIC datagrams over UDP) to minimize latency.
  2. Neural Recovery: A novel ViT-based loss recovery module operates at the client side. It is specifically trained to reconstruct lost non-critical RGB and Depth frames, ensuring high fidelity even under packet loss.
Figure: Comparison of (a) RGB reconstruction and (b) depth reconstruction under lossy network conditions.
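The sender-side half of the hybrid transport reduces to a routing decision per frame. A minimal sketch of that dispatch logic follows; the class and the keyframe-based criticality rule are illustrative assumptions, not the paper's actual classifier, and the lists stand in for real reliable/unreliable sockets.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Frame:
    index: int
    is_keyframe: bool  # e.g., an I-frame that later frames depend on
    payload: bytes

@dataclass
class HybridDispatcher:
    """Route critical frames to a reliable stream and the rest to an
    unreliable, low-latency stream (both represented here as lists)."""
    reliable: List[Frame] = field(default_factory=list)
    unreliable: List[Frame] = field(default_factory=list)

    def send(self, frame: Frame) -> str:
        # One simple criticality rule: frames that others depend on
        # must arrive intact for 3D reconstruction.
        if frame.is_keyframe:
            self.reliable.append(frame)   # stand-in for a TCP send
            return "reliable"
        self.unreliable.append(frame)     # stand-in for a QUIC datagram send
        return "unreliable"
```

Any non-critical frames lost in transit are then handed to the client-side neural recovery module rather than retransmitted.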

System Architecture

Our prototype is built on WebRTC and achieves real-time performance (over 30 FPS) for videoconferencing applications. The architecture ensures that critical dependencies for 3D reconstruction are preserved while non-essential data is recovered neurally.
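Sustaining 30 FPS means the whole per-frame pipeline must complete within a 1000/30 ≈ 33.3 ms budget. The helper below checks a set of stage latencies against that budget; the stage timings in the example are hypothetical, not measurements from the paper.

```python
def fits_realtime(stage_latencies_ms, target_fps=30):
    """Return (fits, slack_ms): whether the summed per-frame pipeline
    latency fits the per-frame budget implied by the target frame rate."""
    budget_ms = 1000.0 / target_fps  # ~33.3 ms at 30 FPS
    total = sum(stage_latencies_ms)
    return total <= budget_ms, budget_ms - total

# Hypothetical per-stage timings in ms:
# capture, encode, network, neural recovery, render
ok, slack = fits_realtime([5.0, 8.0, 10.0, 6.0, 3.0])
```

A practical consequence is that the neural recovery module's inference time is bounded by whatever slack the rest of the pipeline leaves, which motivates a lightweight ViT design.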

Key Contributions

  • First End-to-End System: To the best of our knowledge, ReVo is the first end-to-end system designed specifically for reliable volumetric video streaming.
  • Codec-Agnostic Recovery: A novel client-side neural framework that recovers corrupted RGB and Depth frames, compatible with existing compression codecs.
  • Real-Time Performance: A prototype WebRTC implementation demonstrating >30 FPS performance suitable for interactive videoconferencing.