ConTact: Contrastive Tactile Alignment for Sim-to-Real Robotic Manipulation

Yanlin Lai1,2,*, Yinzhao Dong1,*,†, Chun Yuan2,†, Cheng Zhou1
1Tencent Robotics X, 2Tsinghua University
*Equal Contribution

†Corresponding author

Abstract

Deep reinforcement learning (DRL) has achieved remarkable success in robot control. However, DRL with tactile feedback still faces challenges in contact-rich tasks involving visual occlusion or high-speed dynamics. These challenges stem from the complexity of real-world tactile sensors and the computational intensity of high-fidelity simulators.

To address this, we design a high-speed tactile simulation model, enabling efficient, large-scale DRL training on GPUs. We then propose the Contrastive Tactile (ConTact) framework, which leverages contrastive learning to align tactile features for sim-to-real transfer. ConTact employs a dedicated spatiotemporal encoder that explicitly models temporal changes to capture the dynamic features of contact events.

We validate ConTact on two manipulation tasks, Single and Composite Object Tracking (SOT/COT), which rely solely on tactile information. Policies trained with ConTact in simulation are deployed directly in the real world without fine-tuning, achieving zero-shot transfer.

ConTact Framework


(a) Contrastive Learning Pre-training. (b) Spatio-temporal Encoder Architecture. (c) Downstream RL Task & Deployment.

The ConTact framework addresses the sim-to-real challenge by aligning tactile features between simulation and reality. We first design a computationally efficient ray-casting tactile simulation model in MuJoCo MJX. Building on this simulator, we propose the Contrastive Tactile (ConTact) pre-training framework.
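The paper's MJX implementation is not reproduced here, but the core idea of a ray-casting tactile model can be sketched in plain NumPy: each taxel casts a ray toward the object, and penetration into the sensing layer is mapped to a contact force through a linear stiffness. The grid size, pitch, sensing depth, and stiffness below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def ray_sphere_depth(origins, direction, center, radius):
    """Distance along `direction` from each ray origin to a sphere; inf on a miss."""
    oc = origins - center                          # (N, 3) origin-to-center offsets
    b = oc @ direction                             # (N,) projection onto the ray
    c = np.sum(oc * oc, axis=1) - radius ** 2
    disc = b * b - c                               # quadratic discriminant
    t = -b - np.sqrt(np.maximum(disc, 0.0))        # nearest intersection distance
    return np.where((disc >= 0.0) & (t >= 0.0), t, np.inf)

def tactile_array(center, radius, grid=8, pitch=0.01, max_depth=0.005, k=200.0):
    """Simulate a grid x grid tactile array: rays cast upward (+z) from taxel sites.

    An object surface closer than `max_depth` intrudes into the sensing layer;
    the penetration depth is converted to a force via linear stiffness `k`.
    """
    xs = (np.arange(grid) - (grid - 1) / 2) * pitch
    gx, gy = np.meshgrid(xs, xs, indexing="ij")
    origins = np.stack([gx.ravel(), gy.ravel(), np.zeros(grid * grid)], axis=1)
    t = ray_sphere_depth(origins, np.array([0.0, 0.0, 1.0]), center, radius)
    depth = np.clip(max_depth - t, 0.0, max_depth)  # penetration into the layer
    force = k * depth                               # linear contact model
    return force.reshape(grid, grid)
```

Because every taxel reduces to a batched ray-primitive intersection, the whole array evaluates in a few vectorized operations, which is what makes this kind of model cheap enough for large-scale GPU training.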

As depicted above, ConTact leverages contrastive learning (using a symmetric contrastive loss \(\mathcal{L}_{CTA}\)) and a spatio-temporal encoder to align tactile features across simulated and real domains. This allows us to extract unified representations for downstream DRL tasks, eliminating the need for real-world fine-tuning.
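The exact form of \(\mathcal{L}_{CTA}\) is not given here; a standard symmetric (CLIP-style) InfoNCE loss over paired sim/real tactile embeddings, consistent with the description above, can be sketched as follows. The temperature value is an assumption.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(z_sim, z_real, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired sim/real tactile embeddings.

    Row i of z_sim and row i of z_real come from the same contact event
    (positive pair); all other rows in the batch serve as negatives.
    """
    z_sim = z_sim / np.linalg.norm(z_sim, axis=1, keepdims=True)
    z_real = z_real / np.linalg.norm(z_real, axis=1, keepdims=True)
    logits = z_sim @ z_real.T / temperature        # (B, B) cosine similarities
    idx = np.arange(logits.shape[0])
    loss_s2r = -log_softmax(logits, axis=1)[idx, idx].mean()  # sim -> real
    loss_r2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # real -> sim
    return 0.5 * (loss_s2r + loss_r2s)
```

Symmetrizing over both directions pulls matched sim/real pairs together in the shared latent space while pushing apart mismatched pairs, which is what lets the downstream policy consume either domain's features interchangeably.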

Tactile Array Simulation

Tactile Sensor

Real-world vs Simulated Tactile Array

Manipulation Tasks

SOT and COT Tasks

(a) Single Object Tracking (SOT). (b) Composite Object Tracking (COT).

We introduce two kinds of manipulation tasks to evaluate our framework:

  • Single Object Tracking (SOT): Requires the robot to keep a spherical object centered on the tactile tray. This serves as a baseline to evaluate the fundamental capability of inferring object position.
  • Composite Object Tracking (COT): A more advanced task that manipulates a composite toy car (a box body with two capsule wheels). The controller must detect asymmetric loading and pressure imbalances to prevent structural collapse, i.e., separation of the wheels from the body.

For both tasks, the policy relies solely on tactile features and proprioception, without any visual input.
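As a concrete illustration of what inferring object position from tactile signals alone can mean, a center-of-pressure estimate over the force array yields the contact centroid. This is a generic technique shown for intuition, not necessarily the representation ConTact's encoder learns; the grid pitch is an assumed value.

```python
import numpy as np

def center_of_pressure(force, pitch=0.01):
    """Estimate the contact centroid (meters) from a 2-D tactile force array.

    Returns None when no taxel registers force (no contact).
    """
    g = force.shape[0]
    xs = (np.arange(g) - (g - 1) / 2) * pitch   # taxel coordinates along one axis
    total = force.sum()
    if total <= 0.0:
        return None
    cx = (force.sum(axis=1) * xs).sum() / total  # force-weighted mean over rows
    cy = (force.sum(axis=0) * xs).sum() / total  # force-weighted mean over cols
    return np.array([cx, cy])
```

For SOT, keeping this centroid near the origin corresponds to keeping the sphere centered on the tray; for COT, a single centroid is insufficient, which is one way to see why the richer force distribution matters.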

Reconstruction & Feature Alignment

Reconstruction

Real-world Input vs Reconstructed Signal

Latent Space

t-SNE visualization of aligned latent features

Zero-Shot Real-World Deployment

Real World Results

Visual results from zero-shot policy deployment on a Kinova Gen3 arm.

We evaluate the policy's ability to stabilize objects during dynamic movement across four distinct reference trajectories: linear, circular, lemniscate, and ascending helical. The policy is robust to dynamic challenges, including real-time object switching, varied initial positions, and external perturbations (e.g., being tapped with a hammer).
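The four reference trajectories can be written as simple parametric curves. The radius, speed, and climb rate below are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

def reference_trajectory(kind, t, radius=0.1, speed=0.25):
    """Parametric reference positions (x, y, z) for the four evaluation paths.

    `t` is an array of times in seconds; returns an (len(t), 3) array.
    """
    s = speed * t
    zero = np.zeros_like(t)
    if kind == "linear":
        return np.stack([s, zero, zero], axis=1)
    if kind == "circular":
        return np.stack([radius * np.cos(s), radius * np.sin(s), zero], axis=1)
    if kind == "lemniscate":   # figure-eight (Gerono lemniscate)
        return np.stack([radius * np.cos(s),
                         radius * np.sin(s) * np.cos(s), zero], axis=1)
    if kind == "helical":      # ascending helix: circular path with linear climb
        return np.stack([radius * np.cos(s), radius * np.sin(s), 0.01 * t], axis=1)
    raise ValueError(f"unknown trajectory kind: {kind}")
```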

Our force-based tactile model significantly outperforms binary-signal baselines. As shown in our ablation studies, binary signals are ambiguous and unable to differentiate unstable states in the Composite Object Tracking task.

Performance Demos

Single Object Tracking (SOT)

Composite Object Tracking (COT)

BibTeX

@article{contact2025,
  title={ConTact: Contrastive Tactile Alignment for Sim-to-Real Robotic Manipulation},
  author={Lai, Yanlin and Dong, Yinzhao and Yuan, Chun and Zhou, Cheng},
  journal={Submission},
  year={2025}
}