Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions

1 School of Electrical Engineering, KAIST

2 Division of Language & AI, HUFS

3 Graduate School of Artificial Intelligence, UNIST

CVPR 2026

Abstract

We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local visuo-tactile alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.


Overview

Tactile Localization

We introduce tactile localization, a task where a model is given a tactile input and must identify all regions in an image that share the same material properties. Unlike prior visuo-tactile methods that determine whether an image and a touch correspond to the same material, our model localizes where in a visual scene a given tactile sensation exists — producing dense saliency maps that highlight material regions matching the touch input. This cross-modal task requires fine-grained spatial reasoning beyond the coarse semantic alignment addressed by existing approaches.
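For intuition, the tactile saliency map can be read as a per-location similarity between a single tactile embedding and a dense visual feature map. The sketch below is a minimal illustration of that output format, not the exact STT inference path; the tensor names and the use of cosine similarity are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def tactile_saliency(tactile_feat, visual_feats):
    """Per-location cosine similarity between one tactile embedding and
    a dense visual feature map (illustrative sketch).

    tactile_feat: (D,)      pooled tactile embedding
    visual_feats: (D, H, W) dense visual feature map
    returns:      (H, W)    saliency map in [0, 1]
    """
    t = F.normalize(tactile_feat, dim=0)    # unit-norm tactile vector
    v = F.normalize(visual_feats, dim=0)    # unit-norm visual features per location
    sim = torch.einsum('d,dhw->hw', t, v)   # cosine similarity at every spatial location
    return (sim + 1) / 2                    # rescale [-1, 1] -> [0, 1]
```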


Framework

STT Framework Pipeline

Seeing Through Touch (STT) encodes paired tactile and visual inputs into a shared feature space and learns fine-grained local cross-modal alignment via contrastive learning. Rather than comparing global pooled representations, STT computes dense similarity maps between aggregated tactile features and local visual features, enabling spatially precise material localization.
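A minimal sketch of what such locally contrastive alignment can look like in PyTorch. The function name, the max-over-patches aggregation of the dense similarity map, and the temperature value are illustrative assumptions rather than the exact STT objective.

```python
import torch
import torch.nn.functional as F

def local_alignment_loss(tactile, visual, tau=0.07):
    """Contrastive loss over tactile-to-local-patch similarities (sketch).

    tactile: (B, D)    one pooled tactile embedding per pair
    visual:  (B, N, D) N local visual tokens per image
    For each tactile query, the positive score is its strongest local match
    in the paired image; negatives are the strongest matches in the other
    images of the batch.
    """
    t = F.normalize(tactile, dim=-1)            # (B, D)
    v = F.normalize(visual, dim=-1)             # (B, N, D)
    sim = torch.einsum('bd,cnd->bcn', t, v)     # tactile b vs. every patch of image c
    logits = sim.max(dim=-1).values / tau       # (B, B) best local match per image
    targets = torch.arange(t.size(0), device=t.device)
    return F.cross_entropy(logits, targets)     # diagonal pairs are positives
```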


Web-Material Dataset

Web-Material dataset samples
32K images · 18 categories · multi-material scenes · pixel-level annotations

Existing visuo-tactile datasets consist mostly of close-up, single-material images that lack scene-level diversity. We introduce Web-Material, a new dataset comprising web-crawled scene-level images and curated samples from prior material recognition datasets, together covering 32K images across 18 categories. Each image captures materials in diverse real-world contexts and is paired with tactile signals from Touch-and-Go[1] via material-diversity pairing. Web-Material serves both as a training source and as an evaluation benchmark with pixel-level material segmentation annotations.
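One way such a pairing strategy can be implemented is to index images by material label and sample several visually distinct scenes per tactile clip. The sketch below is a simplified illustration under that assumption; identifiers such as tactile_samples and pairs_per_touch are hypothetical, and the actual Web-Material pairing may apply additional diversity criteria.

```python
import random
from collections import defaultdict

def diversity_pairs(tactile_samples, images, pairs_per_touch=4, seed=0):
    """Pair each tactile sample with several visually different images of
    the same material category (illustrative material-diversity pairing).

    tactile_samples: list of (tactile_id, material_label)
    images:          list of (image_id, material_label)
    returns:         list of (tactile_id, image_id) training pairs
    """
    rng = random.Random(seed)
    by_material = defaultdict(list)
    for image_id, label in images:
        by_material[label].append(image_id)

    pairs = []
    for tactile_id, label in tactile_samples:
        candidates = by_material.get(label, [])
        if not candidates:
            continue
        k = min(pairs_per_touch, len(candidates))
        # Sampling without replacement spreads each touch over varied scenes.
        for image_id in rng.sample(candidates, k):
            pairs.append((tactile_id, image_id))
    return pairs
```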


Results

Tactile Localization

Given a tactile input, the model identifies all image regions that share the same material properties.

Model                          TG-Test[1]        Web-Material      OpenSurfaces[2]
                               mAP     mIoU      mAP     mIoU      mAP     mIoU
Visual Heatmap
  DINOv3 Att. Map              83.74   74.27     62.73   47.12     18.91   19.04
Global Alignment
  TVL w/o Language[3]          70.61   68.12     32.16   32.16     17.93   18.61
  STT-CLS                      73.63   73.49     39.35   34.74     17.98   19.07
Local Alignment
  STT-Local                    85.12   76.79     67.72   52.34     37.25   29.47
  STT-Indomain                 86.95   77.58     71.33   55.73     42.54   34.10
  SeeingThroughTouch (STT)     87.56   76.82     77.43   60.94     48.06   36.73
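The table reports mAP and mIoU between predicted saliency maps and ground-truth material masks. A minimal sketch of the IoU/mIoU part, assuming a fixed saliency threshold of 0.5 (the paper's exact thresholding and the mAP protocol may differ):

```python
import numpy as np

def binary_iou(saliency, gt_mask, threshold=0.5):
    """IoU between a thresholded saliency map and a ground-truth mask.

    saliency: (H, W) float map in [0, 1]
    gt_mask:  (H, W) boolean ground-truth material mask
    """
    pred = saliency >= threshold
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return inter / union if union > 0 else 1.0  # empty-vs-empty counts as a perfect match

def mean_iou(saliencies, gt_masks, threshold=0.5):
    """Mean IoU over a set of tactile-image evaluation pairs."""
    scores = [binary_iou(s, g, threshold) for s, g in zip(saliencies, gt_masks)]
    return float(np.mean(scores))
```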

TG-Test

TG-Test Qualitative Results

Web-Material

Web-Material Qualitative Results

OpenSurfaces

OpenSurfaces Qualitative Results

Figure 1. Qualitative Tactile Localization Results on TG-Test, Web-Material, and OpenSurfaces.

Interactive Localization

The model correctly shifts its predicted region when a different tactile signal is provided for the same scene.

Model                        IIoU
Baselines
  TVL                         1.0
Ours
  STT-CLS                     4.0
  STT-Local                  30.0
  STT-Indomain               32.0
  SeeingThroughTouch (STT)   37.0
Interactive Localization Qualitative Results

Figure 2. Qualitative Results on Interactive Localization.

Additional Scenarios

Some regions in the scene are replaced with different materials. The model correctly deactivates its prediction for replaced regions and reactivates them when the tactile signal is updated to match the new material.

Material Change Scenario 1
Material Change Scenario 2

Figure 3. The model interactively updates localization in response to changes in material and corresponding tactile inputs.


BibTeX

@inproceedings{kim2026seeingthroughtouch,
  author    = {Seongyu Kim and Seungwoo Lee and Hyeonggon Ryu and Joon Son Chung and Arda Senocak},
  title     = {Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}

References

  1. Fengyu Yang et al., "Touch and Go: Learning from Human-Collected Vision and Touch," NeurIPS Datasets and Benchmarks Track, 2022.
  2. Sean Bell et al., "OpenSurfaces: A Richly Annotated Catalog of Surface Appearance," SIGGRAPH, 2013.
  3. Letian Fu et al., "A Touch, Vision, and Language Dataset for Multimodal Alignment," ICML, 2024.