360+x : A Panoptic Multi-modal Scene Understanding Dataset

The MIx Group, University of Birmingham

The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

(Oral Presentation)

What is 360+x Dataset?

The 360+x dataset introduces a unique panoptic perspective to scene understanding, differentiating itself from existing datasets by offering multiple viewpoints and modalities captured across a variety of scenes. Our dataset contains:

  • 2,152 multi-modal videos captured by 360° cameras and Spectacles cameras (8,579K frames in total)
  • Captured in 17 cities across 5 countries.
  • Captured in 28 scenes, from artistic spaces to natural landscapes.
  • Temporal activity localisation labels covering 38 action instances for each video.

Dataset Example

Panorama Preview

We have included a Panorama Preview to showcase the equirectangular projection applied to the panoramic video in the Dataset Example. The original spherical form of the video is shown below, providing a view of its unaltered state before the projection is applied.

The images displayed are downsampled to 2880×1440 due to the limitations of the website display.
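For readers unfamiliar with the projection, the mapping from the viewing sphere to an equirectangular frame is the standard one: longitude maps linearly to the horizontal axis and latitude to the vertical axis of a 2:1 image. The sketch below is illustrative only and is not part of the dataset's toolchain; the function name and default 2880×1440 size are our own choices.

```python
import math

def sphere_to_equirect(lon_rad, lat_rad, width=2880, height=1440):
    """Map a direction on the viewing sphere to a pixel in a
    width x height (2:1) equirectangular frame.
    lon_rad in [-pi, pi), lat_rad in [-pi/2, pi/2]."""
    x = (lon_rad / (2 * math.pi) + 0.5) * width   # longitude -> column
    y = (0.5 - lat_rad / math.pi) * height        # latitude  -> row (zenith at top)
    return int(x) % width, min(int(y), height - 1)

print(sphere_to_equirect(0.0, 0.0))          # forward direction -> (1440, 720), frame centre
print(sphere_to_equirect(-math.pi, 0.0))     # behind the camera -> (0, 720), left edge
print(sphere_to_equirect(0.0, math.pi / 2))  # zenith -> (1440, 0), top row
```

Note the ±180° seam at the left/right image edges: directions just either side of the backward view land at opposite ends of the frame, which is why the preview's spherical rendering looks continuous while the projected frame does not.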

Dataset Statistics

We provide more examples of the dataset, including 360° panoramic videos, third-person front view videos, egocentric monocular videos, egocentric binocular videos, location, textual scene descriptions and corresponding annotations. Our dataset comprises 28 scene categories, drawn from 15 indoor scenes and 13 outdoor scenes, totalling 2,152 videos that capture a wide range of environments. Among these, 464 videos were captured using the 360° cameras, while the remaining 1,688 were recorded with the Spectacles cameras. To facilitate analysis and accessibility, these extended videos have been segmented into 1,380 shorter clips, each spanning approximately 10 seconds. In total, these clips accumulate to a duration of approximately 244,000 seconds (around 67.78 hours), and the total number of frames is 8,579K.

The following figures provide data analysis of the dataset, including the distribution of scene categories, the distribution of action instances, the capture time of day, the number of action instances per video, the binaural delay per clip and the overall binaural delay histogram.

Download and Dataset Structure

Our dataset offers a comprehensive collection of panoramic videos, binocular videos, and third-person videos, with each set of videos accompanied by annotations. Additionally, it includes features extracted using I3D, VGGish, and ResNet-18. Given the high-resolution nature of our dataset (5760×2880 for panoramic and binocular videos, 1920×1080 for third-person front view videos), the overall size is considerably large. To accommodate diverse research needs and computational resources, we also provide a lower-resolution version of the dataset (640×320 for panoramic and binocular videos, 569×320 for third-person front view videos) for download.
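The low-resolution dimensions follow from scaling every stream to a 320-pixel height while preserving its aspect ratio, which is also where the slightly odd 569-pixel width comes from. A quick check (the helper function is ours, purely for illustration):

```python
def lowres_size(width, height, target_height=320):
    """Scale a frame to the target height, preserving aspect ratio
    (width rounded to the nearest whole pixel)."""
    scale = target_height / height
    return round(width * scale), target_height

print(lowres_size(5760, 2880))  # panoramic / binocular -> (640, 320)
print(lowres_size(1920, 1080))  # third-person front view -> (569, 320)
```

The 2:1 panoramic frames divide evenly (a uniform 9× reduction), while 1920×1080 does not, so its width rounds to 569.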

Our full dataset can be accessed by filling in this ACCESS FORM; further instructions will be sent to your email address shortly.

For easier access, the lower-resolution version is also available on Huggingface🤗

Dataset Structure
360x_dataset_original_resolution.zip
  • index.json
  • panoramic
    • 0000001.mp4
    • 0000002.mp4
    • ...
  • binocular
    • 0000001.mp4
    • 0000002.mp4
    • ...
  • third_person
    • 0000001.mp4
    • 0000002.mp4
    • ...
  • classes.json
  • activity_segmentation
    • 0000001.json
    • 0000002.json
    • ...
360x_dataset_low_resolution.zip
  • index.json
  • panoramic
    • 0000001.mp4
    • 0000002.mp4
    • ...
  • binocular
    • 0000001.mp4
    • 0000002.mp4
    • ...
  • third_person
    • 0000001.mp4
    • 0000002.mp4
    • ...

More Examples

We provide more examples of the dataset, one set for each scene. For website display, each video has been resized to 320p (640×320 for panoramic and binocular videos, 569×320 for third-person front view videos). The original resolution is up to 5K (5760×2880 for panoramic and binocular videos, 1920×1080 for third-person front view videos).

Citation

@inproceedings{chen2024x360,
  title={360+x: A Panoptic Multi-modal Scene Understanding Dataset},
  author={Chen, Hao and Hou, Yuqi and Qu, Chenyuan and Testini, Irene and Hong, Xiaohan and Jiao, Jianbo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}