ICCV 2025 Workshop · Hawaii Convention Center, Honolulu, HI, USA · 19 October 2025 · j.jiao@bham.ac.uk

BinEgo‑360°: Binocular Egocentric-360° Multi-modal Scene Understanding in the Wild

Welcome to the BinEgo‑360° Workshop & Challenge at ICCV 2025. We bring together researchers working on 360° panoramic and binocular egocentric vision to explore human‑like perception across video, audio, and geo‑spatial modalities.

Overview

This half-day workshop focuses on multi-modal scene understanding and perception in a human-like way. Specifically, we concentrate on binocular/stereo egocentric and 360° panoramic perspectives, which capture both first-person views and third-person panoptic views, mimicking a human in the scene, combined with multi-modal cues such as spatial audio, textual descriptions, and geo-metadata. The workshop covers, but is not limited to, the following topics:

Keynote Speakers

Addison Lin Wang

Nanyang Technological University

Dima Damen

University of Bristol

Bernard Ghanem

King Abdullah University of Science and Technology

Workshop Programme (Half‑day)

Time | Session
09:00 – 09:30 | Opening Remarks
09:30 – 10:05 | Keynote Talk 1
10:05 – 10:40 | Keynote Talk 2
10:40 – 11:00 | Break & Poster Session
11:00 – 11:45 | Invited Paper Presentations (×3)
11:45 – 12:20 | Keynote Talk 3
12:20 – 12:35 | Awards Ceremony & Concluding Remarks

Call for Papers

We invite papers from the ICCV 2025 main conference whose research topics relate to this workshop; submissions will be reviewed for suitability and relevance. If you are interested in presenting your work at our workshop, please fill in this form (submission deadline: 13 July).

BinEgo‑360° Challenge

The challenge uses our public 360+x dataset for training and validation, and a held-out test set for evaluation.

For more details about the dataset, tracks, timeline, and submission rules, please see below:

Dataset Overview

Dataset montage
  • 2,152 videos – 8.579 M frames / 67.78 h.
  • Viewpoints: 360° panoramic, binocular & monocular egocentric, third‑person front.
  • Modalities: RGB video, 6‑channel spatial audio, GPS + weather, text scene description.
  • Annotations: 38 action classes, temporal segments; object bounding boxes.
  • Resolution: 5K originals (5760 × 2880 panoramic).
  • License: CC BY‑NC‑SA 4.0. All faces auto‑blurred.
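
For participants, the sketch below shows one way the views, modalities, and annotations of a single clip could be grouped in code. The class and field names are hypothetical and do not reflect the dataset's actual loading API.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ActionSegment:
    """One annotated action instance (times in seconds)."""
    label: str        # one of the 38 action classes
    t_start: float
    t_end: float

@dataclass
class Clip360x:
    """Hypothetical container for one 360+x clip; not the dataset's official API."""
    video_id: str
    panoramic_rgb: str                    # path to the 5760 × 2880 360° video
    egocentric_rgb: str                   # path to the binocular egocentric video
    spatial_audio: str                    # path to the 6-channel spatial audio
    scene_label: Optional[str] = None     # Track 1 target
    actions: list[ActionSegment] = field(default_factory=list)  # Track 2 targets
    gps: Optional[tuple[float, float]] = None
    weather: Optional[str] = None
    text_description: Optional[str] = None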

Challenge Tracks & Baselines

1 · Classification

Predict the scene label for a whole clip. We follow the scene categories provided in the dataset.

  • Input: 360° RGB + egocentric RGB + audio/binaural delay.
  • Output: The scene label.
  • Metric: Top-1 accuracy on the held-out test set (see the sketch below).
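
As a minimal sketch of the Track 1 scoring, the snippet below computes Top-1 accuracy from per-clip predictions; the CSV layout and column names are illustrative assumptions, not the official evaluation script.

import csv

def top1_accuracy(pred_csv: str, gt_csv: str) -> float:
    """Fraction of clips whose predicted scene label equals the ground truth."""
    def load(path: str) -> dict[str, str]:
        # Assumed columns: "video_id", "label" (illustrative, not the official schema).
        with open(path, newline="") as f:
            return {row["video_id"]: row["label"] for row in csv.DictReader(f)}

    preds, gts = load(pred_csv), load(gt_csv)
    correct = sum(preds.get(video_id) == label for video_id, label in gts.items())
    return correct / len(gts)

# e.g. print(f"Top-1 Acc: {top1_accuracy('predictions.csv', 'test_gt.csv'):.2%}")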

Baseline (all views and modalities used)

Top-1 Accuracy: 80.62%

2 · Temporal Action Localization

Detect the start and end time of every action instance inside a clip.

  • Input: Same modalities as Track 1
  • Output: one JSON record per detection: {"video_id": ..., "t_start": ..., "t_end": ..., "label": ...} (see the sketch below).
  • Metric: mAP averaged over IoU ∈ {0.5, 0.75, 0.95}.
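
Below is a minimal sketch of writing detections in the JSON form above and of the temporal IoU underlying the mAP metric; the video IDs, action labels, and file name are illustrative, and the snippet is not the official evaluation code.

import json

# Hypothetical detections; video IDs and action labels are illustrative only.
detections = [
    {"video_id": "clip_0001", "t_start": 3.2,  "t_end": 8.9,  "label": "walking"},
    {"video_id": "clip_0001", "t_start": 12.0, "t_end": 15.4, "label": "talking"},
]

with open("track2_submission.json", "w") as f:
    json.dump(detections, f, indent=2)

def temporal_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive at threshold t if it matches an unmatched
# ground-truth segment of the same label with temporal_iou >= t; AP is computed
# per class and mAP is averaged over t in {0.5, 0.75, 0.95}.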

Baseline (TriDet + VAD)

Metric | Score
mAP@0.5 | 27.1
mAP@0.75 | 18.7
mAP@0.95 | 7.0
Average | 17.6

Timeline (Anywhere on Earth)

  1. 1 Jun 2025 – Dataset & baselines release; Kaggle opens
  2. 6 Jul 2025 – Submission deadline
  3. Sep 2025 – Winner slides/posters due
  4. 19–20 Oct 2025 – Awards & talks at the ICCV 2025 workshop

Submission Rules

  1. Teams (≤ 5 members) register on Kaggle and fill in the team form.
  2. Up to 5 submissions per track per team; the last one counts.
  3. Winners must submit a technical report and a poster to be presented at the workshop.
  4. No external data that overlaps with the hidden test clips.
  5. Submissions after the deadline will not be considered.

Prizes & Sponsors

Sponsored by Insta360 · Allsee

Ethics & Broader Impact

All videos were recorded in public or non‑sensitive areas with informed participant consent. Faces are automatically blurred, and the dataset is released for non‑commercial research under CC BY‑NC‑SA 4.0. We prohibit any re‑identification, surveillance or commercial use. By advancing robust multi‑modal perception, we aim to benefit robotics, AR/VR and assistive tech while upholding fairness and privacy.

Organisers

Jianbo Jiao

University of Birmingham

Shangzhe Wu

University of Cambridge

Dylan Campbell

Australian National University

Yunchao Wei

Beijing Jiaotong University

Lu Qi

Insta360

Yasmine Mellah

Audioscenic

Aleš Leonardis

University of Birmingham

Technical Committee:

Chenyuan Qu

University of Birmingham

Han Hu

University of Birmingham

Qiming Huang

University of Birmingham

Hao Chen

University of Cambridge

Contact: j.jiao@bham.ac.uk

Sponsors

We gratefully acknowledge the generous support of our sponsors.

Insta360 · Allsee

Publication(s)

If you use the 360+x dataset or participate in the challenge, please cite:

@inproceedings{chen2024x360,
  title     = {360+x: A Panoptic Multi-modal Scene Understanding Dataset},
  author    = {Chen, Hao and Hou, Yuqi and Qu, Chenyuan and Testini, Irene and Hong, Xiaohan and Jiao, Jianbo},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2024}
}