Objective
WristP2 is a wrist-worn system with a wide-FOV fisheye camera that reconstructs 3D hand pose and per-vertex pressure in real time.
WristP2 achieves:
· Mean per-joint position error (MPJPE): 2.88 mm
· Mean joint angle error (MJAE): 3.15°
· Mean pressure error: 10.3 g
· Contact accuracy: 97%
· Battery life at full power: 3 h
· Contact IoU: 72%
*For comparison, across previous works from 2012–2025 the average reported performance is roughly MPJPE ≈ 10.5 mm, MJAE ≈ 7.1°, and contact accuracy ≈ 92%.
Applications
Using the detailed hand pose and pressure information produced by WristP2, we have applied the system across a wide range of interaction scenarios.
Mid-air gesture input in XR
WristP2 enables mid-air gestures for natural interaction.
The video demonstrates how mid-air hand gestures can be used to control video playback in XR environments.
Custom actions for media control
WristP2 understands the meaning of your custom actions. The video shows how WristP2 supports large-screen interaction in mobile contexts, such as controlling slide presentations during a talk.
Virtual touchpad input on a mobile device
WristP2 enables planar input on a virtual touchpad. The video illustrates how a user browses web pages on a virtual display using planar input on the touchpad.
Replacing traditional interaction tools
WristP2 can replace traditional interaction tools. The video demonstrates how it reproduces left and right mouse clicks, which can in principle be performed on any surface.
Dataset:
We built a synchronized multi-sensor system to create a large-scale dataset of 93,000 frames from 15 participants.
· Synced Sensors: We combined professional motion capture (for precise hand tracking) with a high-resolution pressure pad (to record touch force), allowing us to capture 3D hand poses and physical pressure simultaneously; a minimal synchronization sketch follows this list.
· Diverse Scenarios: The dataset covers 48 surface interactions (like clicking and dragging) and 28 mid-air gestures, recorded under various lighting conditions and backgrounds to ensure the system works reliably in the real world.
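As an illustration of what this synchronization involves, the sketch below resamples pressure-pad frames onto the motion-capture clock by nearest-timestamp matching. It is not the actual capture code, and the names (`align_pressure_to_mocap`, `mocap_times`, `pad_times`, `pad_frames`) are hypothetical.

```python
import numpy as np

def align_pressure_to_mocap(mocap_times, pad_times, pad_frames):
    """Resample pressure-pad frames onto mocap timestamps (nearest neighbor).

    mocap_times: (N,) seconds on the shared clock.
    pad_times:   (M,) seconds, monotonically increasing.
    pad_frames:  (M, H, W) raw pressure images from the pad.
    Returns (N, H, W) pressure frames aligned to the mocap frames.
    """
    idx = np.searchsorted(pad_times, mocap_times)
    idx = np.clip(idx, 1, len(pad_times) - 1)
    left, right = pad_times[idx - 1], pad_times[idx]
    # Step back one index wherever the earlier pad frame is actually closer.
    idx -= (mocap_times - left) < (right - mocap_times)
    return pad_frames[idx]
```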
We developed an automated pipeline that aligns a 3D hand model with sensor data to generate high-fidelity meshes and per-point pressure labels without manual annotation.
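The pipeline internals are not spelled out here; one plausible reading of the per-point labeling step is that each active pressure-pad cell is accumulated onto the nearest vertex of the fitted hand mesh. A minimal sketch under that assumption (all names are illustrative, not the paper's code):

```python
import numpy as np

def pressure_to_vertices(mesh_vertices, cell_centers, cell_pressures, max_dist=0.01):
    """Accumulate each pressure-pad cell reading onto the nearest hand-mesh vertex.

    mesh_vertices:  (V, 3) fitted hand mesh in the pad's coordinate frame (meters).
    cell_centers:   (C, 3) 3D centers of active pad cells (meters).
    cell_pressures: (C,)   pressure values of those cells.
    Returns (V,) per-vertex pressure labels; cells farther than max_dist
    from every vertex are discarded.
    """
    labels = np.zeros(len(mesh_vertices))
    for center, p in zip(cell_centers, cell_pressures):
        d = np.linalg.norm(mesh_vertices - center, axis=1)
        v = int(np.argmin(d))
        if d[v] <= max_dist:
            labels[v] += p
    return labels
```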
Tech Stack:
Our model uses a Vision Transformer combined with a VQ-VAE: it reconstructs hands by selecting from a learned "library" of realistic poses, which keeps the predictions anatomically plausible. From a single image it simultaneously predicts the 3D hand shape, per-vertex pressure, and camera position.
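A rough sketch of that architecture is shown below. It is illustrative only: the backbone depth, codebook size, vertex count (778, as in MANO-style hand meshes), and the 3-parameter weak-perspective camera are assumptions, not details of the released model.

```python
import torch
import torch.nn as nn

class HandPosePressureNet(nn.Module):
    """Sketch: image patches -> transformer encoder -> VQ codebook lookup ->
    separate heads for mesh vertices, per-vertex pressure, and camera."""

    def __init__(self, num_vertices=778, codebook_size=512, dim=256):
        super().__init__()
        self.num_vertices = num_vertices
        # Stand-in for a ViT backbone: 16x16 patch embedding + encoder blocks.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=4)
        # Learned "library" of pose codes (VQ-VAE-style codebook).
        self.codebook = nn.Embedding(codebook_size, dim)
        # Output heads: 3D vertices, per-vertex pressure, weak-perspective camera.
        self.mesh_head = nn.Linear(dim, num_vertices * 3)
        self.pressure_head = nn.Linear(dim, num_vertices)
        self.camera_head = nn.Linear(dim, 3)  # scale + 2D translation (assumed)

    def forward(self, img):  # img: (B, 3, 224, 224)
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, 196, dim)
        feat = self.encoder(tokens).mean(dim=1)                    # (B, dim)
        # Quantize: snap the pooled feature to its nearest codebook entry,
        # with a straight-through estimator so gradients reach the encoder.
        idx = torch.cdist(feat, self.codebook.weight).argmin(dim=-1)
        quant = feat + (self.codebook(idx) - feat).detach()
        verts = self.mesh_head(quant).view(-1, self.num_vertices, 3)
        pressure = self.pressure_head(quant)
        camera = self.camera_head(quant)
        return verts, pressure, camera
```

Calling `HandPosePressureNet()(torch.randn(1, 3, 224, 224))` returns vertex positions of shape (1, 778, 3), per-vertex pressures (1, 778), and camera parameters (1, 3).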
* Example of a wrist-mounted camera image with a randomly replaced background
To ensure robustness, we pre-trained the model on general hand data and then fine-tuned it with randomized backgrounds, teaching it to ignore environmental distractions.
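That background randomization can be approximated as compositing the segmented hand onto images drawn from a background pool; a minimal sketch, assuming a foreground hand mask is available (names are illustrative):

```python
import numpy as np

def randomize_background(image, hand_mask, background_pool, rng=None):
    """Replace everything outside the hand with a randomly chosen background.

    image:           (H, W, 3) uint8 wrist-camera frame.
    hand_mask:       (H, W) bool, True on hand (foreground) pixels.
    background_pool: list of (H, W, 3) uint8 background images.
    """
    rng = rng or np.random.default_rng()
    background = background_pool[rng.integers(len(background_pool))]
    composite = np.where(hand_mask[..., None], image, background)
    return composite.astype(np.uint8)
```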