ConTact: Contrastive Tactile Alignment for Sim-to-Real Robotic Manipulation

Yanlin Lai1,2,*, Yinzhao Dong1,*,†, Chun Yuan2,†, Cheng Zhou1
1Tencent Robotics X, 2Tsinghua University
*Equal Contribution

†Corresponding author

Abstract

Deep reinforcement learning (DRL) has achieved remarkable success in robot control. However, DRL with tactile feedback still faces challenges in contact-rich tasks involving visual occlusion or high-speed dynamics. These challenges stem from the complexity of real-world tactile sensors and the computational intensity of high-fidelity simulators.

To address this, we design a high-speed tactile simulation model, enabling efficient, large-scale DRL training on GPUs. We then propose the Contrastive Tactile (ConTact) framework, which leverages contrastive learning to align tactile features for sim-to-real transfer. ConTact employs a dedicated spatiotemporal encoder that explicitly models temporal changes to capture the dynamic features of contact events.

We validate ConTact on two manipulation tasks, Single and Composite Object Tracking (SOT/COT), which rely solely on tactile information. Policies trained with ConTact in simulation are deployed directly in the real world without fine-tuning, achieving zero-shot transfer.

ConTact Framework


(a) Contrastive Learning Pre-training. (b) Spatio-temporal Encoder Architecture. (c) Downstream RL Task & Deployment.

The ConTact framework addresses the sim-to-real challenge by aligning tactile features between simulation and reality. We first design a computationally efficient ray-casting tactile simulation model in MuJoCo MJX. Building on this simulator, we propose the Contrastive Tactile (ConTact) pre-training framework.
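The paper's MJX implementation is not reproduced here, but the core idea of a ray-casting tactile model can be sketched in plain NumPy: each taxel casts a ray toward the object, and penetration into the sensing layer is mapped to a contact force through a linear stiffness. The grid size, pitch, sensing depth, and stiffness below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def ray_sphere_depth(origins, direction, center, radius):
    """Distance along `direction` from each ray origin to a sphere; inf on a miss."""
    oc = origins - center                          # (N, 3) origin-to-center offsets
    b = oc @ direction                             # (N,) projection onto the ray
    c = np.sum(oc * oc, axis=1) - radius ** 2
    disc = b * b - c                               # quadratic discriminant
    t = -b - np.sqrt(np.maximum(disc, 0.0))        # nearest intersection distance
    return np.where((disc >= 0.0) & (t >= 0.0), t, np.inf)

def tactile_array(center, radius, grid=8, pitch=0.01, max_depth=0.005, k=200.0):
    """Simulate a grid x grid tactile array: rays cast upward (+z) from taxel sites.

    An object surface closer than `max_depth` intrudes into the sensing layer;
    the penetration depth is converted to a force via linear stiffness `k`.
    """
    xs = (np.arange(grid) - (grid - 1) / 2) * pitch
    gx, gy = np.meshgrid(xs, xs, indexing="ij")
    origins = np.stack([gx.ravel(), gy.ravel(), np.zeros(grid * grid)], axis=1)
    t = ray_sphere_depth(origins, np.array([0.0, 0.0, 1.0]), center, radius)
    depth = np.clip(max_depth - t, 0.0, max_depth)  # penetration into the layer
    force = k * depth                               # linear contact model
    return force.reshape(grid, grid)
```

Because every taxel reduces to a batched ray-primitive intersection, the whole array evaluates in a few vectorized operations, which is what makes this kind of model cheap enough for large-scale GPU training.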

As depicted above, ConTact leverages contrastive learning (using a symmetric contrastive loss \(\mathcal{L}_{CTA}\)) and a spatio-temporal encoder to align tactile features across simulated and real domains. This allows us to extract unified representations for downstream DRL tasks, eliminating the need for real-world fine-tuning.
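The exact form of \(\mathcal{L}_{CTA}\) is not given here; a standard symmetric (CLIP-style) InfoNCE loss over paired sim/real tactile embeddings, consistent with the description above, can be sketched as follows. The temperature value is an assumption.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(z_sim, z_real, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired sim/real tactile embeddings.

    Row i of z_sim and row i of z_real come from the same contact event
    (positive pair); all other rows in the batch serve as negatives.
    """
    z_sim = z_sim / np.linalg.norm(z_sim, axis=1, keepdims=True)
    z_real = z_real / np.linalg.norm(z_real, axis=1, keepdims=True)
    logits = z_sim @ z_real.T / temperature        # (B, B) cosine similarities
    idx = np.arange(logits.shape[0])
    loss_s2r = -log_softmax(logits, axis=1)[idx, idx].mean()  # sim -> real
    loss_r2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # real -> sim
    return 0.5 * (loss_s2r + loss_r2s)
```

Symmetrizing over both directions pulls matched sim/real pairs together in the shared latent space while pushing apart mismatched pairs, which is what lets the downstream policy consume either domain's features interchangeably.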

Tactile Array Simulation

Tactile Sensor

Real-world vs Simulated Tactile Array

Manipulation Tasks

SOT and COT Tasks

(a) Single Object Tracking (SOT). (b) Composite Object Tracking (COT).

We introduce two kinds of manipulation tasks to evaluate our framework:

  • Single Object Tracking (SOT): Requires the robot to keep a spherical object centered on the tactile tray. This serves as a baseline to evaluate the fundamental capability of inferring object position.
  • Composite Object Tracking (COT): A more advanced task that manipulates a composite toy car (a box body with two capsule wheels). The controller must detect asymmetric loading and pressure imbalances to prevent structural collapse, i.e., separation of the wheels from the body.

For both tasks, the policy relies solely on tactile features and proprioception, without any visual input.
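As a concrete illustration of what inferring object position from tactile signals alone can mean, a center-of-pressure estimate over the force array yields the contact centroid. This is a generic technique shown for intuition, not necessarily the representation ConTact's encoder learns; the grid pitch is an assumed value.

```python
import numpy as np

def center_of_pressure(force, pitch=0.01):
    """Estimate the contact centroid (meters) from a 2-D tactile force array.

    Returns None when no taxel registers force (no contact).
    """
    g = force.shape[0]
    xs = (np.arange(g) - (g - 1) / 2) * pitch   # taxel coordinates along one axis
    total = force.sum()
    if total <= 0.0:
        return None
    cx = (force.sum(axis=1) * xs).sum() / total  # force-weighted mean over rows
    cy = (force.sum(axis=0) * xs).sum() / total  # force-weighted mean over cols
    return np.array([cx, cy])
```

For SOT, keeping this centroid near the origin corresponds to keeping the sphere centered on the tray; for COT, a single centroid is insufficient, which is one way to see why the richer force distribution matters.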

Reconstruction & Feature Alignment

Reconstruction

Real-world Input vs Reconstructed Signal

Latent Space

t-SNE visualization of aligned latent features

Zero-Shot Real-World Deployment

Real World Results

Visual results from zero-shot policy deployment on a Kinova Gen3 arm.

We evaluate the policy's ability to stabilize objects during dynamic movement across four distinct reference trajectories: linear, circular, lemniscate, and ascending helical. The policy is robust to dynamic challenges, including real-time object switching, varied initial positions, and external perturbations (e.g., being tapped with a hammer).
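The four reference trajectories can be written as simple parametric curves. The radius, speed, and climb rate below are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

def reference_trajectory(kind, t, radius=0.1, speed=0.25):
    """Parametric reference positions (x, y, z) for the four evaluation paths.

    `t` is an array of times in seconds; returns an (len(t), 3) array.
    """
    s = speed * t
    zero = np.zeros_like(t)
    if kind == "linear":
        return np.stack([s, zero, zero], axis=1)
    if kind == "circular":
        return np.stack([radius * np.cos(s), radius * np.sin(s), zero], axis=1)
    if kind == "lemniscate":   # figure-eight (Gerono lemniscate)
        return np.stack([radius * np.cos(s),
                         radius * np.sin(s) * np.cos(s), zero], axis=1)
    if kind == "helical":      # ascending helix: circular path with linear climb
        return np.stack([radius * np.cos(s), radius * np.sin(s), 0.01 * t], axis=1)
    raise ValueError(f"unknown trajectory kind: {kind}")
```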

Our force-based tactile model significantly outperforms binary-signal baselines. As shown in our ablation studies, binary signals are ambiguous and unable to differentiate unstable states in the Composite Object Tracking task.

Performance Demos

Single Object Tracking (SOT)

Composite Object Tracking (COT)

BibTeX

@article{contact2025,
  title={ConTact: Contrastive Tactile Alignment for Sim-to-Real Robotic Manipulation},
  author={Lai, Yanlin and Dong, Yinzhao and Yuan, Chun and Zhou, Cheng},
  journal={Submission},
  year={2025}
}