1School of Electrical Engineering, KAIST
2Division of Language & AI, HUFS
3Graduate School of Artificial Intelligence, UNIST
We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local visuo-tactile alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.
We introduce tactile localization, a task where a model is given a tactile input and must identify all regions in an image that share the same material properties. Unlike prior visuo-tactile methods that determine whether an image and a touch correspond to the same material, our model localizes where in a visual scene a given tactile sensation exists — producing dense saliency maps that highlight material regions matching the touch input. This cross-modal task requires fine-grained spatial reasoning beyond the coarse semantic alignment addressed by existing approaches.
Seeing Through Touch (STT) encodes paired tactile and visual inputs into a shared feature space and learns fine-grained local cross-modal alignment via contrastive learning. Rather than comparing global pooled representations, STT computes dense similarity maps between aggregated tactile features and local visual features, enabling spatially precise material localization.
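The dense cross-modal similarity described above can be sketched as follows. This is a minimal illustration, not the released implementation: the aggregation (mean over tactile frames), the feature shapes, and the temperature value are assumptions; only the core idea — comparing one aggregated tactile query against every local visual patch feature — comes from the text.

```python
import torch
import torch.nn.functional as F

def tactile_saliency_map(tactile_feats, visual_patches, temperature=0.07):
    """Dense cross-modal similarity: an aggregated tactile query vector
    scored against every local visual patch feature.

    tactile_feats:  (B, T, D) per-frame tactile embeddings (assumed shape)
    visual_patches: (B, N, D) patch embeddings from the visual encoder
    Returns (B, N) similarity logits over image patches.
    """
    # Aggregate the tactile sequence into a single query vector
    # (mean pooling here is an illustrative choice).
    tactile_query = F.normalize(tactile_feats.mean(dim=1), dim=-1)   # (B, D)
    visual_patches = F.normalize(visual_patches, dim=-1)             # (B, N, D)
    # Cosine similarity of the tactile query against each patch.
    sim = torch.einsum("bd,bnd->bn", tactile_query, visual_patches)  # (B, N)
    return sim / temperature

# Toy example: 1 sample, 4 tactile frames, 196 patches (14x14 grid), dim 64.
tac = torch.randn(1, 4, 64)
vis = torch.randn(1, 196, 64)
sal = tactile_saliency_map(tac, vis)
heatmap = sal.softmax(dim=-1).reshape(1, 14, 14)  # patch-grid saliency
```

Reshaping the per-patch scores back onto the patch grid yields the tactile saliency map used for touch-conditioned segmentation.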
Existing visuo-tactile datasets consist mostly of close-up, single-material images that lack scene-level diversity. We introduce Web-Material, a new dataset comprising web-crawled scene-level images and curated samples from prior material recognition datasets, together covering 32K images across 18 categories. Each image captures materials in diverse real-world contexts and is paired with tactile signals from Touch-and-Go[1] via material-diversity pairing. Web-Material serves as both a training source and an evaluation benchmark with pixel-level material segmentation annotations.
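The material-diversity pairing strategy can be sketched as below. This is an illustrative reconstruction under stated assumptions: the record fields (`id`, `material`, `scene`) and the one-image-per-scene heuristic are hypothetical; the source only specifies that each tactile sample is paired with visually varied yet tactilely consistent images.

```python
import random
from collections import defaultdict

def diversity_pairs(tactile_samples, images, pairs_per_sample=4, seed=0):
    """Material-diversity pairing (sketch): match each tactile sample
    with images of the SAME material label drawn from DIFFERENT source
    scenes, so visual context varies while the material stays consistent.

    tactile_samples: list of dicts with "id" and "material"
    images:          list of dicts with "id", "material", "scene"
    """
    rng = random.Random(seed)
    by_material = defaultdict(list)
    for img in images:
        by_material[img["material"]].append(img)

    pairs = []
    for t in tactile_samples:
        candidates = by_material[t["material"]]
        # Keep at most one image per distinct scene to maximize diversity.
        seen_scenes, diverse = set(), []
        for img in rng.sample(candidates, len(candidates)):
            if img["scene"] not in seen_scenes:
                diverse.append(img)
                seen_scenes.add(img["scene"])
        for img in diverse[:pairs_per_sample]:
            pairs.append((t["id"], img["id"]))
    return pairs

# Toy example: one "wood" tactile sample, wood images from two scenes.
tacs = [{"id": "t0", "material": "wood"}]
imgs = [
    {"id": "i0", "material": "wood",  "scene": "kitchen"},
    {"id": "i1", "material": "wood",  "scene": "kitchen"},
    {"id": "i2", "material": "wood",  "scene": "park"},
    {"id": "i3", "material": "metal", "scene": "park"},
]
pairs = diversity_pairs(tacs, imgs)  # one wood image per distinct scene
```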
Given a tactile input, the model identifies all image regions that share the same material properties.
| Model | TG-Test[1] |  | Web-Material |  | OpenSurfaces[2] |  |
|---|---|---|---|---|---|---|
|  | mAP | mIoU | mAP | mIoU | mAP | mIoU |
| *Visual Heatmap* |  |  |  |  |  |  |
| DINOv3 Att. Map | 83.74 | 74.27 | 62.73 | 47.12 | 18.91 | 19.04 |
| *Global Alignment* |  |  |  |  |  |  |
| TVL w/o Language[3] | 70.61 | 68.12 | 32.16 | 32.16 | 17.93 | 18.61 |
| STT-CLS | 73.63 | 73.49 | 39.35 | 34.74 | 17.98 | 19.07 |
| *Local Alignment* |  |  |  |  |  |  |
| STT-Local | 85.12 | 76.79 | 67.72 | 52.34 | 37.25 | 29.47 |
| STT-Indomain | 86.95 | 77.58 | 71.33 | 55.73 | 42.54 | 34.10 |
| SeeingThroughTouch (STT) | 87.56 | 76.82 | 77.43 | 60.94 | 48.06 | 36.73 |
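The per-region IoU underlying the mIoU scores above can be computed as in the sketch below. This is an illustrative metric implementation; the paper's exact evaluation protocol (thresholding scheme, per-class averaging) is not specified here, so the 0.5 threshold is an assumption.

```python
import numpy as np

def binary_iou(pred_mask, gt_mask):
    """IoU between a thresholded saliency mask and a ground-truth
    material mask (empty union counts as a perfect match)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 1.0

# Threshold a toy 2x2 saliency map into a binary prediction.
saliency = np.array([[0.9, 0.2], [0.8, 0.1]])
gt = np.array([[1, 0], [1, 1]], dtype=bool)
pred = saliency > 0.5
iou = binary_iou(pred, gt)  # intersection = 2, union = 3 -> 0.666...
```

mIoU then averages this quantity over the evaluation set (and, typically, over material classes).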
Figure 1. Qualitative Tactile Localization Results on TG-Test, Web-Material, and OpenSurfaces.
The model correctly shifts its predicted region when a different tactile signal is provided for the same scene.
| Model | IIoU |
|---|---|
| *Baselines* |  |
| TVL | 1.0 |
| *Ours* |  |
| STT-CLS | 4.0 |
| STT-Local | 30.0 |
| STT-Indomain | 32.0 |
| SeeingThroughTouch (STT) | 37.0 |
Figure 2. Qualitative Results on Interactive Localization.
Some regions in the scene are replaced with different materials. The model correctly deactivates its prediction for replaced regions and reactivates them when the tactile signal is updated to match the new material.
Figure 3. The model interactively updates localization in response to changes in material and corresponding tactile inputs.