Long-Term Temporally Consistent Unpaired Video Translation from Simulated Surgical 3D Data

Dominik Rivoir, Micha Pfeiffer, Reuben Docea, Fiona Kolbinger, Carina Riediger, Jürgen Weitz, Stefanie Speidel.

IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

[paper] [code] [data] [demo (data usage)] [article]

Contact: dominik.rivoir [at] nct-dresden.de

Description

We designed a method for rendering long-term temporally consistent, realistic video sequences from simulated surgical 3D scenes. By generating realistic data from simulations, we can provide rich ground truth that would otherwise be very challenging to obtain (e.g. camera poses, 3D geometry, point correspondences). This data is potentially useful for evaluating or training methods for tasks such as SLAM, visual servoing, or tracking in surgical settings. On this page you can find video examples, code, and our complete generated dataset.

Data

We rendered 21 000 randomly sampled views and 11 manually crafted sequences from 7 simulated surgical 3D scenes. In the downloadable 7-zip files, the generated images are separated by scene.

Liver meshes were obtained from the IRCAD 3D CT liver data set and deformed to resemble intra-operative settings.

Random Views

Download | Details | Comments | Size
Reference (Input) Images | a_ref: simple renderings of 21 000 random views (7 simulated scenes with 3000 views each). | These reference images were used as partial input to the translation module (alongside the projected texture features). | ca. 1.3 GB
Translations | b̂: realistic, view-consistent translation of each view. | The model was trained to render textures with consistent hue while allowing for varying brightness (view-dependent effects). | ca. 3.2 GB
Labels: Segmentation | Full segmentation masks. | Classes: liver (blue), fat/stomach (red), abdominal wall (green), gallbladder (cyan), ligament (magenta). | ca. 25 MB
Labels: Depth | Depth maps. | Stored as HDR gray-scale images. | ca. 1 GB
Labels: 3D-Coordinates | Corresponding 3D surface coordinates for each pixel. | Stored as HDR color images. Can be used to obtain dense pixel correspondences between views (e.g. for optical flow, tracking, SLAM). See the demo for usage and the sketch following this table. | ca. 25 GB
Labels: Camera Poses | Camera poses. | Stored as 7D vectors (3 for location, 4 for quaternion rotation). See the demo for usage. | ca. 800 kB
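
Below is a minimal sketch (not the official demo) of how the 3D-coordinate labels and camera poses could be combined into dense pixel correspondences between two views. The file paths, the intrinsic matrix K, the pose file layout, and the pose convention (world-to-camera, quaternion stored as qx, qy, qz, qw) are all assumptions for illustration; the demo documents the dataset's actual conventions.

```python
import numpy as np
import cv2

def pose_to_matrix(pose7d):
    """Turn a 7D pose vector (tx, ty, tz, qx, qy, qz, qw) into a 4x4
    world-to-camera matrix (ordering and convention are assumptions)."""
    t = np.asarray(pose7d[:3], dtype=np.float64)
    x, y, z, w = pose7d[3:]
    R = np.array([
        [1 - 2*(y*y + z*z),     2*(x*y - z*w),     2*(x*z + y*w)],
        [    2*(x*y + z*w), 1 - 2*(x*x + z*z),     2*(y*z - x*w)],
        [    2*(x*z - y*w),     2*(y*z + x*w), 1 - 2*(x*x + y*y)],
    ])
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Hypothetical paths and intrinsics -- adjust to the actual archive layout.
K = np.array([[500.0,   0.0, 160.0],
              [  0.0, 500.0, 120.0],
              [  0.0,   0.0,   1.0]])

# Per-pixel world coordinates of view A (OpenCV loads .hdr files as float32;
# channels come back in BGR order, which may swap the coordinate axes).
coords_a = cv2.imread("scene_1/3d_coords/00000.hdr", cv2.IMREAD_UNCHANGED)

# Pose of view B (the on-disk layout of the 7D vector is an assumption).
T_b = pose_to_matrix(np.loadtxt("scene_1/poses/00001.txt"))

# Project every surface point seen in view A onto view B's image plane.
pts = coords_a.reshape(-1, 3).T                 # 3 x N world points
cam = T_b[:3, :3] @ pts + T_b[:3, 3:4]          # points in camera-B frame
uv = (K @ cam).T                                # pinhole projection
uv = uv[:, :2] / uv[:, 2:3]                     # pixel location in B of each A pixel
uv_in_b = uv.reshape(*coords_a.shape[:2], 2)
```

Note that this maps each pixel of view A to a location in view B regardless of visibility; occluded points should be filtered, e.g. by comparing the projected depth against view B's depth map. Subtracting view A's pixel grid from uv_in_b yields dense optical flow between the two views.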

Sequences

Apart from the randomly sampled views, we also manually generated 11 sequences of 100 frames each at 5 fps. Seven of these sequences were used to obtain the quantitative results in the paper (directory names: scene_1, scene_2, ...); the other four were used only for qualitative analysis (scene_1_qual, ...).

Download | Details | Comments | Size
Reference Images (Sequences) | a_ref: simple renderings of 11 sequences (100 frames at 5 fps each). | These reference images were used as partial input to the translation module (alongside the projected texture features). | ca. 60 MB
Translations (Sequences) | b̂: realistic, view-consistent translation of each view. | The model was trained to render textures with consistent hue while allowing for varying brightness (view-dependent effects). | ca. 160 MB
Labels: Segmentation (Sequences) | Full segmentation masks. | Classes: liver (blue), fat/stomach (red), abdominal wall (green), gallbladder (cyan), ligament (magenta). See the sketch following this table for mapping colors to class ids. | ca. 1.5 MB
Labels: Depth (Sequences) | Depth maps. | Stored as HDR gray-scale images. See the sketch following this table. | ca. 50 MB
Labels: 3D-Coordinates (Sequences) | Corresponding 3D surface coordinates for each pixel. | Stored as HDR color images. Can be used to obtain dense pixel correspondences between views (e.g. for optical flow, tracking, SLAM). See the demo for usage. | ca. 1.3 GB
Labels: Camera Poses (Sequences) | Camera poses. | Stored as 7D vectors (3 for location, 4 for quaternion rotation). See the demo for usage. | ca. 35 kB
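
For a quick look at the label formats, here is a small sketch that normalizes an HDR depth map into an 8-bit preview and maps the color-coded segmentation masks to integer class ids. The paths and the exact (pure) mask colors are assumptions based on the class list above; check the demo for the precise conventions.

```python
import numpy as np
import cv2

# Depth: Radiance .hdr files load as float32 when IMREAD_UNCHANGED is passed.
depth = cv2.imread("scene_1/depth/00000.hdr", cv2.IMREAD_UNCHANGED)
if depth.ndim == 3:                 # gray values may be replicated over 3 channels
    depth = depth[..., 0]
preview = (255.0 * (depth - depth.min()) / (np.ptp(depth) + 1e-8)).astype(np.uint8)
cv2.imwrite("depth_preview.png", preview)

# Segmentation: map the color-coded masks to integer class ids.
# OpenCV returns images in BGR channel order.
palette = {                         # class name -> BGR color (assumed pure colors)
    "liver": (255, 0, 0),          # blue
    "fat/stomach": (0, 0, 255),    # red
    "abdominal wall": (0, 255, 0), # green
    "gallbladder": (255, 255, 0),  # cyan
    "ligament": (255, 0, 255),     # magenta
}
seg = cv2.imread("scene_1/segmentation/00000.png")
class_map = np.zeros(seg.shape[:2], dtype=np.uint8)
for cls_id, bgr in enumerate(palette.values(), start=1):
    class_map[np.all(seg == bgr, axis=-1)] = cls_id
```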

Comparison to State of the Art and Ablations

[Video comparisons with state-of-the-art methods and ablations on Simulated Scenes 1, 3, 5, and 6]

Publication

If you use our data or code, please cite the following paper:

"Long-Term Temporally Consistent Unpaired Video Translation from Simulated Surgical 3D Data".
Dominik Rivoir, Micha Pfeiffer, Reuben Docea, Fiona Kolbinger, Carina Riediger, Jürgen Weitz, Stefanie Speidel.
IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

Code

The code for this project can be found on GitLab.

It is based on Pfeiffer et al.'s framework for surgical image translation (Code, Paper) and the Multimodal UNsupervised Image-to-image Translation (MUNIT) framework (Code, Paper).

Contact

For questions and comments, please contact dominik.rivoir [at] nct-dresden.de

Links

OpenCAS: Open collection of datasets for computer-assisted surgery systems

NCT/UCC Dresden · Centre for Tactile Internet, TU Dresden · University Hospital Dresden · German Cancer Research Center