360+x : A Panoptic Multi-modal Scene Understanding Dataset

The MIx Group, University of Birmingham

The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

(Oral Presentation)

What is 360+x Dataset?

The 360+x dataset introduces a unique panoptic perspective to scene understanding, differentiating itself from existing datasets by offering multiple viewpoints and modalities captured across a variety of scenes. Our dataset contains:

  • 2,152 multi-modal videos captured by 360° cameras and Spectacles cameras (8,579K frames in total)
  • Captured in 17 cities across 5 countries.
  • Captured in 28 scenes, from artistic spaces to natural landscapes.
  • Temporal activity localisation labels covering 38 action instances for each video.

Dataset Example

Panorama Preview

We have included a Panorama Preview to showcase the equirectangular projection applied to the panoramic video in the Dataset Example. The original spherical form of the video is shown below, providing a view of its unaltered state before the projection is applied.

The images displayed are downsampled to 2880×1440 due to the limitations of the website display.
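For readers unfamiliar with the projection, the mapping from the viewing sphere to an equirectangular frame is the standard one: longitude maps linearly to the horizontal axis and latitude to the vertical axis of a 2:1 image. The sketch below is illustrative only and is not part of the dataset's toolchain; the function name and default 2880×1440 size are our own choices.

```python
import math

def sphere_to_equirect(lon_rad, lat_rad, width=2880, height=1440):
    """Map a direction on the viewing sphere to a pixel in a
    width x height (2:1) equirectangular frame.
    lon_rad in [-pi, pi), lat_rad in [-pi/2, pi/2]."""
    x = (lon_rad / (2 * math.pi) + 0.5) * width   # longitude -> column
    y = (0.5 - lat_rad / math.pi) * height        # latitude  -> row (zenith at top)
    return int(x) % width, min(int(y), height - 1)

print(sphere_to_equirect(0.0, 0.0))          # forward direction -> (1440, 720), frame centre
print(sphere_to_equirect(-math.pi, 0.0))     # behind the camera -> (0, 720), left edge
print(sphere_to_equirect(0.0, math.pi / 2))  # zenith -> (1440, 0), top row
```

Note the ±180° seam at the left/right image edges: directions just either side of the backward view land at opposite ends of the frame, which is why the preview's spherical rendering looks continuous while the projected frame does not.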

Dataset Statistics

We provide more examples of the dataset, including 360° panoramic videos, third-person front view videos, egocentric monocular videos, egocentric binocular videos, location, textual scene descriptions and corresponding annotations. Our dataset comprises 28 scene categories, drawn from 15 indoor scenes and 13 outdoor scenes, totalling 2,152 videos that capture a wide range of environments. Among these, 464 videos were captured using the 360° cameras, while the remaining 1,688 were recorded with the Spectacles cameras. To facilitate analysis and accessibility, these extended videos have been segmented into 1,380 shorter clips, each spanning approximately 10 seconds. In total, these clips accumulate to a duration of approximately 244,000 seconds (around 67.78 hours), and the total number of frames is 8,579K.

The following figures provide data analysis of the dataset, including the distribution of scene categories, the distribution of action instances, the capture time of day, the number of action instances per video, the binaural delay per clip and the overall binaural delay histogram.

Download and Dataset Structure

Our dataset offers a comprehensive collection of panoramic videos, binocular videos, and third-person videos, with each set of videos accompanied by annotations. Additionally, it includes features extracted using I3D, VGGish, and ResNet-18. Given the high-resolution nature of our dataset (5760×2880 for panoramic and binocular videos, 1920×1080 for third-person front view videos), the overall size is considerably large. To accommodate diverse research needs and computational resources, we also provide a lower-resolution version of the dataset (640×320 for panoramic and binocular videos, 569×320 for third-person front view videos) for download.
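The low-resolution dimensions follow from scaling every stream to a 320-pixel height while preserving its aspect ratio, which is also where the slightly odd 569-pixel width comes from. A quick check (the helper function is ours, purely for illustration):

```python
def lowres_size(width, height, target_height=320):
    """Scale a frame to the target height, preserving aspect ratio
    (width rounded to the nearest whole pixel)."""
    scale = target_height / height
    return round(width * scale), target_height

print(lowres_size(5760, 2880))  # panoramic / binocular -> (640, 320)
print(lowres_size(1920, 1080))  # third-person front view -> (569, 320)
```

The 2:1 panoramic frames divide evenly (a uniform 9× reduction), while 1920×1080 does not, so its width rounds to 569.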

Our full dataset can be accessed by filling in this ACCESS FORM; further instructions will be sent to your email address shortly.

For easier access, the lower-resolution version is also available on Huggingface🤗

Dataset Structure
360x_dataset_original_resolution.zip
  • index.json
  • panoramic
    • 0000001.mp4
    • 0000002.mp4
    • ...
  • binocular
    • 0000001.mp4
    • 0000002.mp4
    • ...
  • third_person
    • 0000001.mp4
    • 0000002.mp4
    • ...
  • classes.json
  • activity_segmentation
    • 0000001.json
    • 0000002.json
    • ...
360x_dataset_low_resolution.zip
  • index.json
  • panoramic
    • 0000001.mp4
    • 0000002.mp4
    • ...
  • binocular
    • 0000001.mp4
    • 0000002.mp4
    • ...
  • third_person
    • 0000001.mp4
    • 0000002.mp4
    • ...

More Examples

We provide more examples of the dataset, one set for each scene. For website display, each video has been resized to 320p (640×320 for panoramic and binocular videos, 569×320 for third-person front view videos). The original resolution is up to 5K (5760×2880 for panoramic and binocular videos, 1920×1080 for third-person front view videos).

Citation

@inproceedings{chen2024x360,
  title={360+x: A Panoptic Multi-modal Scene Understanding Dataset},
  author={Chen, Hao and Hou, Yuqi and Qu, Chenyuan and Testini, Irene and Hong, Xiaohan and Jiao, Jianbo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}