BinEgo‑360°: Binocular Egocentric-360° Multi-modal Scene Understanding in the Wild
Welcome to the BinEgo‑360° Workshop & Challenge at ICCV 2025. We bring together researchers working on 360° panoramic and binocular egocentric vision to explore human‑like perception across video, audio, and geo‑spatial modalities.
Overview
This half-day workshop focuses on multi-modal scene understanding and perception in a human-like way. Specifically, we concentrate on binocular/stereo egocentric and 360° panoramic perspectives, which capture both first-person views and third-person panoptic views, mimicking a human observer in the scene, combined with multi‑modal cues such as spatial audio, textual descriptions, and geo‑metadata. The workshop covers, but is not limited to, the following topics:
- Embodied 360° scene understanding & egocentric visual reasoning
- Multi-modal scene understanding
- Stereo vision
- Open‑world learning & domain adaptation
Keynote Speakers
Workshop Programme (Half‑day)
Time | Session |
---|---|
09:00 – 09:30 | Opening Remarks |
09:30 – 10:05 | Keynote Talk 1 |
10:05 – 10:40 | Keynote Talk 2 |
10:40 – 11:00 | Break & Poster Session |
11:00 – 11:45 | Invited Paper Presentations (×3) |
11:45 – 12:20 | Keynote Talk 3 |
12:20 – 12:35 | Awards Ceremony & Concluding Remarks |
Call for Papers
We invite papers from the ICCV 2025 main conference. Papers should address research topics related to this workshop and will be reviewed to assess their suitability and relevance. If you are interested in presenting your work at our workshop, please fill in this form (submission deadline: July 13th):
BinEgo‑360° Challenge
The challenge uses our public 360+x dataset for training and validation, and a held-out test set for evaluation.
- Winners of the challenge will be invited to jointly write a paper about the workshop & challenge to be included in the ICCV proceedings.
Dataset Overview

- 2,152 videos – 8.579 M frames / 67.78 h.
- Viewpoints: 360° panoramic, binocular & monocular egocentric, third‑person front.
- Modalities: RGB video, 6‑channel spatial audio, GPS + weather, text scene description.
- Annotations: 38 action classes with temporal segments; object bounding boxes.
- Resolution: 5K originals (5760 × 2880 panoramic).
- License: CC BY‑NC‑SA 4.0. All faces auto‑blurred.
Challenge Tracks & Baselines
1 · Classification
Predict the scene label for a whole clip. We follow the scene categories provided in the dataset.
- Input: 360° RGB + egocentric RGB + audio/binaural delay.
- Output: The scene label.
- Metric: Top‑1 accuracy on the held-out test set.
Baseline (all views and modalities used)
Top‑1 Acc: 80.62 %
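For reference, a minimal sketch of the Top‑1 accuracy computation, assuming per-clip predictions and ground truth stored as clip-ID-to-label dictionaries (this layout is illustrative only, not the official Kaggle submission format):

```python
# Minimal sketch of the Track 1 metric (Top-1 accuracy).
# The {clip_id: scene_label} dictionary layout is illustrative only,
# not the official submission format.

def top1_accuracy(predictions: dict, ground_truth: dict) -> float:
    """Fraction of clips whose predicted scene label matches the ground truth."""
    correct = sum(
        1 for clip_id, gt_label in ground_truth.items()
        if predictions.get(clip_id) == gt_label
    )
    return correct / len(ground_truth)

# Example with placeholder clip IDs and labels: 2 of 3 correct -> 0.667
print(top1_accuracy(
    {"clip_001": "street", "clip_002": "stadium", "clip_003": "cafe"},
    {"clip_001": "street", "clip_002": "stadium", "clip_003": "park"},
))
```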
2 · Temporal Action Localization
Detect the start and end time of every action instance inside a clip.
- Input: Same modalities as Track 1.
- Output: One JSON record per detection (see the sketch after this list):
{"video_id": ..., "t_start": ..., "t_end": ..., "label": ...}
- Metric: mAP averaged over IoU ∈ {0.5, 0.75, 0.95}.
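A minimal sketch of serialising detections into records with the fields above, assuming one flat list written to a single file; the exact file layout expected by the Kaggle evaluator may differ, and all IDs, timestamps, and labels below are placeholders:

```python
# Sketch of writing Track 2 detections with the required fields.
# Values are placeholders; check the Kaggle page for the exact file layout.
import json

detections = [
    {"video_id": "clip_001", "t_start": 3.2, "t_end": 8.7, "label": "drinking"},
    {"video_id": "clip_001", "t_start": 10.0, "t_end": 14.5, "label": "walking"},
]

with open("predictions.json", "w") as f:
    json.dump(detections, f, indent=2)
```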
Baseline (TriDet + VAD)
Metric | Score |
---|---|
mAP@0.5 | 27.1 |
mAP@0.75 | 18.7 |
mAP@0.95 | 7.0 |
Average | 17.6 |
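For clarity, a minimal sketch of the temporal IoU underlying the mAP figures above; the true-positive matching rule in the comment reflects common temporal action localization practice and is an assumption about the exact evaluation code:

```python
# Temporal IoU between a predicted and a ground-truth (t_start, t_end) segment.
# mAP is computed at each IoU threshold in {0.5, 0.75, 0.95} and then averaged.

def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU of two time intervals given as (t_start, t_end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A detection typically counts as a true positive at threshold t if it matches
# an unclaimed ground-truth segment of the same label with IoU >= t.
print(temporal_iou((2.0, 7.5), (3.0, 8.0)))  # 0.75
```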
Timeline (Anywhere on Earth)
- 1 Jun 2025 – Dataset & baselines release; Kaggle opens
- 6 Jul 2025 – Submission deadline
- Sep 2025 – Winner slides/posters due
- 19-20 Oct 2025 – Awards & talks at ICCV 2025 workshop
Submission Rules
- Teams (≤ 5 members) register on Kaggle and fill in the team form.
- Up to 5 submissions per track per team – the last one counts.
- Winners must submit a technical report and a poster to be presented at the workshop.
- No external data that overlaps with the hidden test clips.
- Any submission after the deadline will not be considered.
Prizes & Sponsors
- Hardware: Insta360 X5 panoramic camera
- Gift: Amazon / Taobao vouchers
Ethics & Broader Impact
All videos were recorded in public or non‑sensitive areas with informed participant consent. Faces are automatically blurred, and the dataset is released for non‑commercial research under CC BY‑NC‑SA 4.0. We prohibit any re‑identification, surveillance or commercial use. By advancing robust multi‑modal perception, we aim to benefit robotics, AR/VR and assistive tech while upholding fairness and privacy.
Organisers
Technical Committee:
Contact: j.jiao@bham.ac.uk
Sponsors
We gratefully acknowledge the generous support of our sponsors.
Publication(s)
If you use the 360+x dataset or participate in the challenge, please cite:
@inproceedings{chen2024x360,
  title     = {360+x: A Panoptic Multi-modal Scene Understanding Dataset},
  author    = {Chen, Hao and Hou, Yuqi and Qu, Chenyuan and Testini, Irene and Hong, Xiaohan and Jiao, Jianbo},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2024}
}