# Reference-based Video Super-Resolution Using Multi-Camera Video Triplets

POSTECH

## Abstract

We propose the first reference-based video super-resolution (RefVSR) approach that utilizes reference videos for high-fidelity results. We focus on RefVSR in a triple-camera setting, where we aim to super-resolve a low-resolution ultra-wide video utilizing wide-angle and telephoto videos. We introduce the first RefVSR network that recurrently aligns and propagates temporal reference features fused with features extracted from low-resolution frames. To facilitate the fusion and propagation of temporal reference features, we propose a propagative temporal fusion module. For learning and evaluation of our network, we present the first RefVSR dataset consisting of triplets of ultra-wide, wide-angle, and telephoto videos concurrently taken from the triple cameras of a smartphone. We also propose a two-stage training strategy fully utilizing video triplets in the proposed dataset for real-world 4x video super-resolution. We extensively evaluate our method, and the results show state-of-the-art performance in 4x super-resolution.

## Our RefVSR Framework

Our network adopts a bidirectional recurrent pipeline to recurrently align and propagate Ref features that are fused with the features of LR frames. Our network is efficient in terms of computation and memory consumption because the global matching needed for aligning Ref features is performed only between a pair of LR and corresponding Ref frames at each time step. Still, our network is capable of utilizing temporal Ref frames, as the aligned Ref features are continuously fused and propagated in the pipeline.
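The recurrent pipeline above can be illustrated with a minimal sketch. All function names, feature shapes, and the fusion rule here are simplified stand-ins, not the paper's actual network: global matching is approximated by a cosine-similarity argmax, and fusion by a confidence-weighted sum.

```python
import numpy as np

def global_match(lr_feat, ref_feat):
    """Match each LR feature vector to its most similar Ref feature
    (a hypothetical stand-in for the paper's global matching).
    Returns aligned Ref features and a per-position matching confidence."""
    lr_n = lr_feat / (np.linalg.norm(lr_feat, axis=1, keepdims=True) + 1e-8)
    ref_n = ref_feat / (np.linalg.norm(ref_feat, axis=1, keepdims=True) + 1e-8)
    sim = lr_n @ ref_n.T                 # similarity between all LR/Ref positions
    idx = sim.argmax(axis=1)             # best Ref position per LR position
    conf = sim.max(axis=1)               # matching confidence
    return ref_feat[idx], conf

def recurrent_pass(lr_feats, ref_feats, reverse=False):
    """One direction of the bidirectional recurrent pipeline: at each time
    step, Ref_t is globally matched only against LR_t, then the aligned Ref
    feature is fused with the LR feature and the propagated state."""
    order = list(range(len(lr_feats)))
    if reverse:
        order.reverse()
    state = np.zeros_like(lr_feats[0])   # propagated (temporal) feature state
    outputs = {}
    for t in order:
        aligned, conf = global_match(lr_feats[t], ref_feats[t])
        # confidence-weighted fusion (simplified stand-in for the network)
        fused = lr_feats[t] + conf[:, None] * aligned + 0.5 * state
        state = fused
        outputs[t] = fused
    return [outputs[t] for t in range(len(lr_feats))]
```

Because matching happens only between the current LR/Ref pair, each step stays cheap, yet the propagated state still carries aligned Ref information from all previous (or, in the reverse pass, future) frames.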

## Propagative Temporal Fusion

As a key component for managing Ref features in the pipeline, we propose a propagative temporal fusion module that fuses and propagates only well-matched Ref features. The module leverages the matching confidence computed during the global matching between LR and Ref features as the guidance to determine well-matched Ref features to be fused and propagated. The module also accumulates the matching confidence throughout the pipeline and uses the accumulated value as the guidance when fusing the propagated temporal Ref features.
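The confidence-guided fusion described above can be sketched as follows. The gating and accumulation rules here (confidence-weighted averaging, element-wise maximum for accumulation) are illustrative assumptions, not the module's actual formulation:

```python
import numpy as np

def propagative_temporal_fusion(lr_feat, aligned_ref, conf,
                                prev_ref_state, prev_conf):
    """Fuse the current aligned Ref feature with the propagated temporal Ref
    state, using matching confidence as guidance (simplified stand-in).

    conf       -- confidence from the current LR/Ref global matching
    prev_conf  -- confidence accumulated over earlier time steps
    """
    # weigh current vs. propagated Ref features by their confidences,
    # so poorly matched Ref features contribute little
    fused_ref = (conf[:, None] * aligned_ref
                 + prev_conf[:, None] * prev_ref_state)
    fused_ref = fused_ref / (conf[:, None] + prev_conf[:, None] + 1e-8)
    # accumulate confidence throughout the pipeline (max as a simple choice)
    acc_conf = np.maximum(prev_conf, conf)
    # use accumulated confidence as guidance when fusing into the LR feature
    out = lr_feat + acc_conf[:, None] * fused_ref
    return out, fused_ref, acc_conf
```

The returned `fused_ref` and `acc_conf` would be passed to the next time step, so only well-matched Ref information keeps propagating.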

## RealMCVSR Dataset

To train our network, we propose the RealMCVSR dataset, which consists of triplets of ultra-wide, wide-angle, and telephoto videos. Wide-angle and telephoto videos have the same frame size as ultra-wide videos, but their resolutions are 2x and 4x that of ultra-wide videos. Video triplets are concurrently recorded by an Apple iPhone 12 Pro Max equipped with triple cameras having fixed focal lengths. We also built an iOS app that concurrently captures video triplets and provides exposure syncing functionality.

## Two-stage Training Strategy for Real-World 4x VSR

Our training strategy consists of pre-training and adaptation stages. In the pre-training stage, we downsample ultra-wide and wide-angle videos by 4x. We then train the network to 4x super-resolve a downsampled ultra-wide video using a downsampled wide-angle video as a Ref. The training is done in a supervised manner using the original ultra-wide video as the ground truth. In the adaptation stage, we fine-tune the network to adapt it to real-world videos of the original sizes. This stage uses a telephoto video as supervision, training the network to recover the high-frequency details present in telephoto videos. Refer to the paper for details of our training strategy and the experiments that validate it.
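The pre-training data setup can be sketched as below. The 4x average-pooling downsampler and the function names are illustrative assumptions (bicubic downsampling is the more common choice in SR work), not the paper's exact preprocessing:

```python
import numpy as np

def downsample4x(video):
    """4x spatial downsampling by average pooling over 4x4 blocks
    (a simple stand-in for bicubic downsampling)."""
    t, h, w, c = video.shape
    return video.reshape(t, h // 4, 4, w // 4, 4, c).mean(axis=(2, 4))

def pretraining_pairs(ultra_wide, wide_angle):
    """Pre-training stage: the network super-resolves the downsampled
    ultra-wide video using the downsampled wide-angle video as Ref,
    supervised by the original ultra-wide video as ground truth."""
    lr_input = downsample4x(ultra_wide)      # LR input to super-resolve
    ref_input = downsample4x(wide_angle)     # Ref input
    ground_truth = ultra_wide                # supervision target
    return lr_input, ref_input, ground_truth
```

In the adaptation stage, by contrast, the network runs on videos of the original sizes, and the telephoto video provides the high-frequency supervision instead of a downsampled/original pair.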

## BibTeX

```bibtex
@InProceedings{Lee2022RefVSR,
  author    = {Junyong Lee and Myeonghee Lee and Sunghyun Cho and Seungyong Lee},
  title     = {Reference-based Video Super-Resolution Using Multi-Camera Video Triplets},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2022},
}
```