# Predicting Video with VQVAE

###### Abstract

In recent years, the task of video prediction—forecasting future video given past video frames—has attracted attention in the research community. In this paper we propose a novel approach to this problem with Vector Quantized Variational AutoEncoders (VQ-VAE). With VQ-VAE we compress high-resolution videos into a hierarchical set of multi-scale discrete latent variables. Compared to pixels, this compressed latent space has dramatically reduced dimensionality, allowing us to apply scalable autoregressive generative models to predict video. In contrast to previous work that has largely emphasized highly constrained datasets, we focus on very diverse, large-scale datasets such as Kinetics-600. We predict video at a higher resolution on unconstrained videos, , than any other previous method to our knowledge. We further validate our approach against prior work via a crowdsourced human evaluation.

neurips_2020.bib

## 1 Introduction

When it comes to real-world image data, deep generative models have made substantial progress. With advances in computational efficiency and improvements in architectures, it is now feasible to generate high resolution, realistic images from vast and highly diverse datasets BrockDS19; RazaviOV19; karras2017progressive. Apart from the domain of images, deep generative models have also shown promise in other data domains such as music DielemanOS18; jukebox, speech synthesis oord2016wavenet, 3D voxels LiuGO18; NashW17, and text radford2019language. One particular fledgling domain is video.

While some work in the area of video generation clark2020adversarial; VondrickPT16; saito2018 has explored video synthesis—generating videos with no prior frame information—many approaches actually focus on the task of video prediction conditioned on past frames RanzatoSBMCC14; SrivastavaMS15; patraucean2015spatiotemporal; MathieuCL15; lee2018savp; babaeizadeh2018stochastic; OliuSE18; XiongL00L18; Xue0BF16; Finn16; luc2020. It can be argued that video synthesis is a combination of image generation and video prediction. In other words, one could decouple the problem of video synthesis into unconditional image generation and conditional video prediction from a generated image. Therefore, we specifically focus on video prediction in this paper. Potential computer vision applications of video forecasting include interpolation, anomaly detection, and activity understanding. More generally, video prediction also has more general implications for intelligent systems—the ability to anticipate the dynamics of the environment. The problem is thus also relevant for robotics and reinforcement learning Finn16; Ebert17; OhGLLS15; ha2018world; RacaniereWRBGRB17.

Approaches toward video prediction have largely skewed toward variations of generative adversarial networks MathieuCL15; lee2018savp; clark2020adversarial; VondrickPT16; luc2020. In comparison, we are aware of only a relatively small number of approaches which propose variational autoencoders babaeizadeh2018stochastic; Xue0BF16; denton18a, autoregressive models KalchbrennerOSD17; WeissenbornTU20, or flow based approaches KumarBEFLDK20. There may be a number of reasons for this situation. One is the explosion in the dimensionality of the input space. A generative model of video needs to model not only one image but tens of them in a coherent fashion. This makes it difficult to scale up such models to large datasets or high resolutions. In addition, previous work clark2020adversarial suggests that video prediction may be fundamentally more difficult than video synthesis; a synthesis model can generate simple samples from the dataset while prediction potentially forces the model to forecast conditioned on videos that are outliers in the distribution. Furthermore, most prior work has focused on datasets with low scene diversity such as Moving MNIST SrivastavaMS15, KTH KTH, or robotic arm datasets Finn16; Ebert17.
While there have been attempts to *synthesize* video at a high resolution clark2020adversarial, we know of no attempt—excluding flow based approaches—to *predict* unconstrained video beyond resolutions of 64x64.

In this paper we address the large dimensionality of video data through compression. Using Vector Quantized Variational Autoencoders (VQ-VAE) OordVK17, we can compress video into a space requiring only 1.3% of the bits expressed in pixels. While this compressed encoding is lossy, we can still reconstruct the original video from the latent representation with a high degree of fidelity. Furthermore, we can leverage the modularity of VQ-VAE and decompose our latent representation into a hierarchy of encodings, separating high-level, global information from details such as fine texture or small motions. Instead of training a generative model directly on pixel space, we can instead model this much more tractable discrete representation, allowing us to train much more powerful models, use large diverse datasets, and generate at a high resolution. While most prior work has focused on GANs, this discrete representation can also be modeled by likelihood-based models. Likelihood models in concept do not suffer from mode-collapse, instability in training, and lack of diversity of samples often witnessed in GANs denton18a; babaeizadeh2018stochastic; RazaviOV19. In this paper, we propose a PixelCNN augmented with causal convolutions in time and spatiotemporal self-attention to model this space of latents. In addition, because the latent representation is decomposed into a hierarchy, we can exploit this decomposition and train separate specialized models at different levels of the hierarchy.

Our paper makes four contributions. First, we demonstrate the novel application of VQ-VAE to video data. Second, we propose a set of spatiotemporal PixelCNNs to predict video by utilizing the latent representation learned with VQ-VAE. Third, we explicitly predict video at a higher resolution on than ever before on real-world, unconstrained video. Finally, we demonstrate the competitive performance of our model with a crowdsourced human evaluation.

4th Frame | 9th Frame | 16th Frame |

## 2 Background

Top Layer | Top + Bottom Layer | Original Frame |

### 2.1 Vector Quantized Autoencoders

VQ-VAEs OordVK17 are autoencoders which learn a discrete latent encoding for input data . First, the output of non-linear encoder , implemented by a neural network, is passed through a discretization bottleneck. is mapped via nearest-neighbor into a quantized codebook where is the dimensionality of each vector and is the number of categories in the codebook. The discretized representation is thus given by:

(1) |

Equation 1 is not differentiable; however, OordVK17 notes that copying the gradient of to is a suitable approximation similar to the straight-through estimator BengioLC13. A decoder , also implemented by a neural network, then reconstructs the input from . The total loss function for the VQ-VAE is thus:

(2) |

Where sg is a stop gradient operator, and is a parameter which regulates the rate of code change. As in previous work OordVK17; RazaviOV19, we replace the second term in equation 2 and learn the codebook via an exponential moving average of previous values during training:

(3) |

Where is a decay parameter and is the numbers of vectors in in a batch that will map to .

### 2.2 PixelCNN Models

PixelCNN and related models have shown promise in modeling a wide variety of data domains OordKK16; oord2016wavenet; KalchbrennerOSD17; WeissenbornTU20. These autoregressive models are likelihood-based—they explicitly optimize negative log-likelihood. They exploit the fact that the joint probability distribution input data can be factored into a product of conditional distributions for each dimension of the data:

(4) |

Where is the full dimensionality of the data. This factorization is implemented by a neural network, and the exact set of conditional dependencies is determined by the data domain. Image pixels may depend on regions above and to the left of them OordKK16, while temporal dimensions may depend on past dimensions oord2016wavenet; KalchbrennerOSD17; WeissenbornTU20.

## 3 Method

VQ-VAE Model | Predicting Video |

Our approach consists of two main components. First, we compress video segments into a discrete latent representation using a hierarchical VQ-VAE. We then propose a multi-stage autoregressive model based on the PixelCNN architecture, exploiting the low dimensionality of the compressed latent space and the hierarchy of the representation.

### 3.1 Compressing Video with VQ-VAE

Similar to RazaviOV19, we use VQ-VAE to compress video in a hierarchical fashion. This multi-stage composition of the latent representation allows decomposition of global, high level information from low-level details such as edges or fine motion. For image information RazaviOV19 this approach confers a number of advantages. First, the decomposition allows latent codes to specialize at each level. High level information can be represented in an even more compressed manner, and the total reconstruction error is lower. In addition, this hierarchy leads to a naturally modular generative model. We can then develop a generative model that specializes in modeling the high-level, global information. We can then train a separate model, conditioned on global information, that fills in the details and models the low-level information further down the hierarchy. In this paper, we adopt the terminology of RazaviOV19 and call the set of high-level latents the *top* layer and the low-level latents the *bottom* layer.

Consistent with the experimental setup of previous work in video prediction, we deal with 16-frame videos. Most of the videos in our training dataset are 25 frames per second. We use frames at a resolution. The full video voxel is thus . Using residual blocks with 3D convolutions, we downsample the video spatiotemporally. At the bottom layer, the video is downsampled to a quantized latent space of , reducing the spatial dimension by 4 and the temporal dimension by 2. Another stack of blocks reduces all dimensions by 2, with a top layer of . Each of the voxels in the layer is quantized into 512 codes with a different codebook for both layers.

The decoder then concatenates the bottom layer and the top layer after upsampling using transposed convolutions. From this concatentation as input, the decoder deterministically outputs the full video. Overall, we reduce a space down to a space, a greater than 98% reduction in bits required. During training, we randomly mask out the bottom layer in the concatenated input to the decoder. Masking encourages the model to utilize the top latent layer and prevent codebook collapse.

### 3.2 Predicting Video with PixelCNNs

With VQ-VAE, our video is now decomposed into a hierarchy of quantized latents at and . Previous autoregressive approaches involved full pixel videos at WeissenbornTU20 or KalchbrennerOSD17. Our latent representation is thus well within the range of tractability for these models. Furthermore, given the hierarchy, we can factorize our generative model into a coarse-to-fine fashion. We denote the model of the top layer the *top prior* and model of the bottom layer the *bottom prior*. Because we are focusing on video prediction, we emphasize that both are still conditioned on a series of input frames. While previous work used 5 frames and predicted 11 WeissenbornTU20; clark2020adversarial, the power-of-two design of our architecture leads us to condition on 4 and predict 12. When conditioning our prior models on these frames, we need not use a large stack directly on the original images but save memory and computation by training a smaller stack of residual layers on their latent representation, compressing these 4 conditional frames into a small latent space of and .

We first model the top layer with a conditional prior model. Our prior model is based on a PixelCNN with multi-head self attention layers ChenMRA18. We adapt this architecture by extending the PixelCNN into time; instead of a convolutional stack over a square image, we use a comparable 3D convolutional stack over the cube representing the prior latents. The convolutions are masked in the same way as the original PixelCNN in space—at each location in space, the convolutions only have access to information to the left and above them. In time, present and future timesteps are masked out, and convolutions only have access to previous timesteps. While spatiotemporal convolutions can be resource intensive, we can implement most of this functionality with 2D convolutions separately in the plane and the plane. We take 1D convolutions in the horizontal and vertical stacks in the original PixelCNN OordKK16 and add the extra dimension of time to them, making them 2D. Our only true 3D convolution is at the first layer before the addition of gated stacks. We use multi-head attention layers analogous to RazaviOV19; this time the attention is applied to a 3D voxel instead of a 2D layer as in RazaviOV19. Attention is applied every five layers. During sampling we can generate voxels left-to-right, top-to-bottom within each temporal step as in the original PixelCNN. Once a final step is generated, we can generate the next step conditioned on the previous generated steps.

Once we have a set of latents from the top layer, we can condition our bottom conditional prior model and generate the final bottom layer. Because the bottom layer has a higher number of dimensions and relies on local information, we don’t necessarily need a 3D PixelCNN. Instead, we use a 2D PixelCNN with multi-head self attention every five layers analogous to RazaviOV19. We implement a 3D conditional stack, however, that takes in a window of time steps from the top layer as well as a window of past generated time steps in the bottom layer. The window sizes we used were 4 and 2 respectively. This conditional stack is used as conditioning to the 2D PixelCNN at the current timestep.

## 4 Related Work

Prefer Video VQ-VAE | Prefer luc2020 | Indifferent |

65.7% | 12.8% | 21.5% |

#### Video Prediction and Synthesis:

In the last few years, the research community has focused a spotlight on the topic of video generation—either in the form of video synthesis or prediction. Early approaches involved direct, deterministic pixel prediction RanzatoSBMCC14; SrivastavaMS15; OhGLLS15; patraucean2015spatiotemporal. Given the temporal nature of video, such approaches often incorporated LSTMs. These papers usually applied their deterministic models on datasets such as moving MNIST characters SrivastavaMS15; because of their deterministic nature, rarely were they successfully applied to more complex datasets. Given this situation, researchers started to adapt popular models for image generation to the problem starting with generative adversarial models MathieuCL15; VondrickPT16; lee2018savp; babaeizadeh2018stochastic; clark2020adversarial; saito2018; luc2020; XiongL00L18, variational autoencoders Xue0BF16, and autoregressive models KalchbrennerOSD17; WeissenbornTU20. Others stepped aside from the problem of full pixel prediction and instead predicted pixel motion Finn16; Walker16; JiaBTG16 or a decomposition of pixels and motion denton18a; Gao_2019_ICCV; pmlr-v80-jang18a; hao2018controllable; LiFYWLY18; Tulyakov0YK18; VillegasICLR17. Finally, some have proposed a hierarchical approach based on structured information—generating video conditioned on text LiMSCC18, semantic segments LucNCVL17; LucCLV18, or human pose Walker17; VillegasICML17.

#### Compressing Data with Latents:

The key element in our video prediction framework is compression—representing videos through lower dimensional latents. We apply the framework of VQ-VAE OordVK17; RazaviOV19 which has been successfully applied to compress image and sound data. Related to VQ-VAE, other researchers have explored hierarchies of latents for generation of images defauw19 and music DielemanOS18.

#### Autoregressive Models:

The foundation of our model is based on PixelCNN OordKK16. Distinct from implicit likelihood models such as GANs and approximate methods such as VAEs, the family of PixelCNN architectures have shown promise in modeling a variety of data domains including images OordKK16, sound oord2016wavenet, and video KalchbrennerOSD17; WeissenbornTU20. In line with our paper, recent work with these models has shifted toward decomposing autoregression through hierarchies MenickK19; ReedOKCWCBF17 and latent compression OordVK17; RazaviOV19; jukebox.

## 5 Experiments

Method | FVD Score () |
---|---|

Video Transformer (64 64) WeissenbornTU20 | 170 5 |

DVD-GAN-FP (64 64) clark2020adversarial | 69.15 1.16 |

TRIVD-GAN-FP (64 64) luc2020 | 25.74 0.66 |

Video VQ-VAE (64 64) | 64.30 2.04 |

Video VQ-VAE FVD* (64 64) | 54.30 3.49 |

Video VQ-VAE (256 256) | 129.85 1.64 |

Video VQ-VAE FVD* (256 256) | 82.45 1.16 |

Final Conditioning Frame | 9th Frame | 16th Frame |

Final Conditioning Frame | 9th Frame | 16th Frame |

In this section, we evaluate our model quantitively and qualitatively on the Kinetics-600 dataset Kinetics600. This dataset of videos is very large and highly diverse, consisting of hundreds of thousands of videos selected from YouTube across 600 actions. While most previous work has focused on more constrained datasets, only a few clark2020adversarial; luc2020; WeissenbornTU20 have attempted to scale to larger size and complexity. We train our top and bottom models for around 1000000 iterations with a total batch sizes of 512 and 32 respectively. Our VQ-VAE model was trained on a batch size of 16 for 1000000 iterations.

### 5.1 Qualitative Evaluation

While the Kinetics-600 dataset is publicly available for use, the individual videos in the dataset may not be licensed for display in an academic paper. Therefore, in this paper, we apply our model trained on Kinetics-600 to videos licensed under Creative Commons from the YFCC100m dataset yfc100m. In figures 4 and 5 we show some selected predictions. We find that our approach is able to model camera perspective, parallax, inpainting, deformable human motions, and even aspects of crowd motion across a variety of different visual contexts.

### 5.2 Quantitative Evaluation

Quantitative evaluation of generative models of images, especially across different classes of models, is an open problem TheisOB15. It is even less explored in the realm of video prediction. One proposed metric used on larger datasets such as Kinetics-600 is the Fréchet Video Distance Unterthiner18. As no previous approach has attempted resolution, we downscale our videos to for a proper comparison against prior work. We also compute FVD on the full resolution as a baseline for future work. We use 2-fold cross-validation over 39000 samples to compute FVD. We show our results in Table 2. We find performance exceeds clark2020adversarial but not necessarily luc2020. We also find that comparing the VQ-VAE samples to the reconstructions, not the original videos, leads to an even better score (shown by FVD*). This result is similar to the results on images for VQ-VAE RazaviOV19. As GAN-based approaches are explicitly trained on classifier (discriminator) based losses, FVD—a metric based on a neural-network classifier—may favor GAN-based approaches versus log-likelihood based models even if the quality of the samples are comparable. Given the possible flaws in this metric, we also conduct a human evaluation similar to VondrickPT16. We had 15 participants compare up to 30 side-by-side videos generated from our approach and that of luc2020. Each video had at least 13 judgements. For each comparison, both models used the exact set of conditioning frames and had a resolution of 64x64. Participants could choose a preference for either video, or they could choose indifference—meaning the difference in quality between two videos is too close to perceive. We show our results in Table 1. Out of a total of 405 judgements, participants preferred ours 65.7% of the time, luc2020 12.8%, and 21.5% were judged to be too close in quality. Even though luc2020 has a much lower FVD score, we find that our participants had stronger preference for samples generated from our model.

## 6 Conclusion

In this paper we have explored the application of VQ-VAE towards the task of video prediction. With this learned compressed space, we can utilize powerful autoregressive models to generate possible future events in video at higher resolutions. We show that we are also able to achieve a level of performance comparable to contemporary GANs.

## Broader Impact

Video prediction has the potential for broad, beneficial impact in a variety of situations. This includes superior video quality and speed via interpolation and compression, smoother human-computer interaction via implicit action forecasting, general improvements to model-based reinforcement learning, and cases in robotics where a model of the environment is needed. However, we are aware that this technology can be abused. Video generation could potentially be used to fabricate information or falsely portray individuals to deceptive or malicious social ends.

There is already ongoing successful work in combating malicious applications of image and video generation—specifically the automatic detection of computer generated media afchar2018; sabir2019; xinsheng2019. These types of detection efforts are particularly effective on the approach set out in this paper. By compressing videos through a bottleneck and removing high frequency details, our general videos are easily identifiable by an image classifier, reducing the risk of malicious use. As research in generative models progresses, it is essential that the field of visual forensics HuhLOE18—progresses in tandem.

Finally, as clark2020adversarial notes, showing video samples directly from the Kinetics dataset raises privacy concerns; the Kinetics dataset consists of videos uploaded from YouTube. Direct release of frames from this dataset raises questions of ethics, and generative models trained on this dataset could possibly memorize videos shown in the dataset. We address this issue and set a new precedent for future research by restricting qualitative results to videos licensed explicitly under Creative Commons by their authors.

## 7 Appendix

### 7.1 Architectures and Hyperparameters

Parameter | Value |

Input Size | 256256163 |

Latent layers | 32324, 64648 |

(commitment loss coefficient) | 0.25 |

Batch size | 16 |

Hidden units | 128 |

Residual units | 32 |

Layers | 2 |

Codebook size | 512 |

Codebook dimension | 64 |

First Stage Encoder twh-Conv Filter Size | 4 8 8 |

First Stage Encoder twh-Conv Filter Stride | 2 4 4 |

Second Stage Encoder twh-Conv Filter Size | 4 4 4 |

Second Stage Encoder twh-Conv Filter Stride | 2 2 2 |

Upsampling twh-Conv Filter Size | 4 8 8 |

Upsampling twh-Conv Filter Stride | 2 4 4 |

Training steps | 1000000 |

Parameter | Top Prior | Bottom Prior |
---|---|---|

Input size | 32323 | 6464 |

Batch size | 512 | 32 |

Hidden units | 512 | 512 |

Residual units | 1024 | 1024 |

Layers | 40 | 20 |

Attention layers | 8 | 4 |

Attention heads | 8 | 8 |

Conv Filter size | 3 | 5 |

Dropout | 0.5 | 0.0 |

Training steps | 1016000 | 950000 |

Parameter | Values |
---|---|

Input size | 3232 (upsampled to 64642), 64642 |

Hidden units | 512 |

Residual units | 128 |

Layers | 4 |

Parameter | Values |
---|---|

Input size | 32324, 64644 |

Hidden units | 1024 |

Residual Blocks | 20 |

### 7.2 Attributions

All videos shown in this paper are licensed under Creative Commons BY 2.0 which can be accessed here. All videos have been cropped and modified by our algorithm.

Figure 4:

“Cutting Shrimp” by jencu.
Accessible here

“New York City December 2010” by Duncan Currier.
Accessible here

“How to wrap beef currie pies” by Joy.
Accessible here

“dave skiing 2” by Leigh Blackall.
Accessible here

“Ashokan Farewell played as a trio” by Bryn Pinzaguer.
Accessible here

Figure 5

“Dan skiing.” by Deana Hunter:
Accessible here

“Every Saturday in Canada is Hockey night - watching this makes me want to play hockey
again!” by Roland Tanglao.
Accessible here

“The Cable Car” by Aaron Tait.
Accessible here