1School of Electrical Engineering, KAIST
2Division of Language & AI, HUFS
3Graduate School of Artificial Intelligence, UNIST
We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local visuo-tactile alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.
We introduce tactile localization, a task where a model is given a tactile input and must identify all regions in an image that share the same material properties. Unlike prior visuo-tactile methods that determine whether an image and a touch correspond to the same material, our model localizes where in a visual scene a given tactile sensation exists — producing dense saliency maps that highlight material regions matching the touch input. This cross-modal task requires fine-grained spatial reasoning beyond the coarse semantic alignment addressed by existing approaches.
Seeing Through Touch (STT) encodes paired tactile and visual inputs into a shared feature space and learns fine-grained local cross-modal alignment via contrastive learning. Rather than comparing global pooled representations, STT computes dense similarity maps between aggregated tactile features and local visual features, enabling spatially precise material localization.
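The dense cross-modal similarity described above can be sketched as follows. This is a minimal illustration, not the released implementation: the aggregation (mean over tactile frames), the feature shapes, and the temperature value are assumptions; only the core idea — comparing one aggregated tactile query against every local visual patch feature — comes from the text.

```python
import torch
import torch.nn.functional as F

def tactile_saliency_map(tactile_feats, visual_patches, temperature=0.07):
    """Dense cross-modal similarity: an aggregated tactile query vector
    scored against every local visual patch feature.

    tactile_feats:  (B, T, D) per-frame tactile embeddings (assumed shape)
    visual_patches: (B, N, D) patch embeddings from the visual encoder
    Returns (B, N) similarity logits over image patches.
    """
    # Aggregate the tactile sequence into a single query vector
    # (mean pooling here is an illustrative choice).
    tactile_query = F.normalize(tactile_feats.mean(dim=1), dim=-1)   # (B, D)
    visual_patches = F.normalize(visual_patches, dim=-1)             # (B, N, D)
    # Cosine similarity of the tactile query against each patch.
    sim = torch.einsum("bd,bnd->bn", tactile_query, visual_patches)  # (B, N)
    return sim / temperature

# Toy example: 1 sample, 4 tactile frames, 196 patches (14x14 grid), dim 64.
tac = torch.randn(1, 4, 64)
vis = torch.randn(1, 196, 64)
sal = tactile_saliency_map(tac, vis)
heatmap = sal.softmax(dim=-1).reshape(1, 14, 14)  # patch-grid saliency
```

Reshaping the per-patch scores back onto the patch grid yields the tactile saliency map used for touch-conditioned segmentation.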
Existing visuo-tactile datasets consist mostly of close-up, single-material images that lack scene-level diversity. We introduce Web-Material, a new dataset comprising web-crawled scene-level images and curated samples from prior material recognition datasets, together covering 32K images across 18 categories. Each image captures materials in diverse real-world contexts and is paired with tactile signals from Touch-and-Go[1] via material-diversity pairing. Web-Material serves as both a training source and an evaluation benchmark with pixel-level material segmentation annotations.
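The material-diversity pairing strategy can be sketched as below. This is an illustrative reconstruction under stated assumptions: the record fields (`id`, `material`, `scene`) and the one-image-per-scene heuristic are hypothetical; the source only specifies that each tactile sample is paired with visually varied yet tactilely consistent images.

```python
import random
from collections import defaultdict

def diversity_pairs(tactile_samples, images, pairs_per_sample=4, seed=0):
    """Material-diversity pairing (sketch): match each tactile sample
    with images of the SAME material label drawn from DIFFERENT source
    scenes, so visual context varies while the material stays consistent.

    tactile_samples: list of dicts with "id" and "material"
    images:          list of dicts with "id", "material", "scene"
    """
    rng = random.Random(seed)
    by_material = defaultdict(list)
    for img in images:
        by_material[img["material"]].append(img)

    pairs = []
    for t in tactile_samples:
        candidates = by_material[t["material"]]
        # Keep at most one image per distinct scene to maximize diversity.
        seen_scenes, diverse = set(), []
        for img in rng.sample(candidates, len(candidates)):
            if img["scene"] not in seen_scenes:
                diverse.append(img)
                seen_scenes.add(img["scene"])
        for img in diverse[:pairs_per_sample]:
            pairs.append((t["id"], img["id"]))
    return pairs

# Toy example: one "wood" tactile sample, wood images from two scenes.
tacs = [{"id": "t0", "material": "wood"}]
imgs = [
    {"id": "i0", "material": "wood",  "scene": "kitchen"},
    {"id": "i1", "material": "wood",  "scene": "kitchen"},
    {"id": "i2", "material": "wood",  "scene": "park"},
    {"id": "i3", "material": "metal", "scene": "park"},
]
pairs = diversity_pairs(tacs, imgs)  # one wood image per distinct scene
```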
Given a tactile input, the model identifies all image regions that share the same material properties.
| Model | TG-Test[1] |  | Web-Material |  | OpenSurfaces[2] |  |
|---|---|---|---|---|---|---|
|  | mAP | mIoU | mAP | mIoU | mAP | mIoU |
| *Visual Heatmap* |  |  |  |  |  |  |
| DINOv3 Att. Map | 83.74 | 74.27 | 62.73 | 47.12 | 18.91 | 19.04 |
| *Global Alignment* |  |  |  |  |  |  |
| TVL w/o Language[3] | 70.61 | 68.12 | 32.16 | 32.16 | 17.93 | 18.61 |
| STT-CLS | 73.63 | 73.49 | 39.35 | 34.74 | 17.98 | 19.07 |
| *Local Alignment* |  |  |  |  |  |  |
| STT-Local | 85.12 | 76.79 | 67.72 | 52.34 | 37.25 | 29.47 |
| STT-Indomain | 86.95 | 77.58 | 71.33 | 55.73 | 42.54 | 34.10 |
| SeeingThroughTouch (STT) | 87.56 | 76.82 | 77.43 | 60.94 | 48.06 | 36.73 |
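The per-region IoU underlying the mIoU scores above can be computed as in the sketch below. This is an illustrative metric implementation; the paper's exact evaluation protocol (thresholding scheme, per-class averaging) is not specified here, so the 0.5 threshold is an assumption.

```python
import numpy as np

def binary_iou(pred_mask, gt_mask):
    """IoU between a thresholded saliency mask and a ground-truth
    material mask (empty union counts as a perfect match)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 1.0

# Threshold a toy 2x2 saliency map into a binary prediction.
saliency = np.array([[0.9, 0.2], [0.8, 0.1]])
gt = np.array([[1, 0], [1, 1]], dtype=bool)
pred = saliency > 0.5
iou = binary_iou(pred, gt)  # intersection = 2, union = 3 -> 0.666...
```

mIoU then averages this quantity over the evaluation set (and, typically, over material classes).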
Figure 1. Qualitative Tactile Localization Results on TG-Test, Web-Material, and OpenSurfaces.
The model correctly shifts its predicted region when a different tactile signal is provided for the same scene.
| Model | IIoU |
|---|---|
| *Baselines* |  |
| TVL | 1.0 |
| *Ours* |  |
| STT-CLS | 4.0 |
| STT-Local | 30.0 |
| STT-Indomain | 32.0 |
| SeeingThroughTouch (STT) | 37.0 |
Figure 2. Qualitative Results on Interactive Localization.
Some regions in the scene are replaced with different materials. The model correctly deactivates its prediction for replaced regions and reactivates them when the tactile signal is updated to match the new material.
Figure 3. The model interactively updates localization in response to changes in material and corresponding tactile inputs.