Unofficial PyTorch implementation of MoViNets: Mobile Video Networks for Efficient Video Recognition.
Authors: Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, Boqing Gong (Google Research)
[Authors' Implementation]
This fork is modified to support exporting the models to TorchScript and ONNX.
In causal mode, the stream buffer must be cleaned after all the clips of the same video have been processed:

```python
model.clean_activation_buffers()
```
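A minimal sketch of this pattern at inference time (the `video`, `n_clips`, and `n_clip_frames` variables are hypothetical placeholders, not part of the library):

```python
import torch

# Hypothetical input: one video as a (batch, channels, frames, height, width) tensor.
video = torch.randn(1, 3, 40, 172, 172)
n_clips, n_clip_frames = 5, 8

model.eval()
with torch.no_grad():
    for j in range(n_clips):
        # The stream buffer carries activations across consecutive clips.
        output = model(video[:, :, j * n_clip_frames:(j + 1) * n_clip_frames])
# Reset the buffer before processing the next video.
model.clean_activation_buffers()
```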
Click on "Open in Colab" to open an example of training on HMDB-51.

To install the package:

```bash
pip install git+https://github.com/Atze00/MoViNet-pytorch.git
```
Use `causal = True` to use the model with the stream buffer; `causal = False` will use standard convolutions:

```python
from movinets import MoViNet
from movinets.config import _C

MoViNetA0 = MoViNet(_C.MODEL.MoViNetA0, causal=True, pretrained=True)
MoViNetA1 = MoViNet(_C.MODEL.MoViNetA1, causal=True, pretrained=True)
```
Use `pretrained = True` to load the model with pretrained weights:
"""
If pretrained is True:
num_classes is set to 600,
conv_type is set to "3d" if causal is False, "2plus1d" if causal is True
tf_like is set to True
"""
model = MoViNet(_C.MODEL.MoViNetA0, causal = True, pretrained = True )
model = MoViNet(_C.MODEL.MoViNetA0, causal = False, pretrained = True )Training loop with stream buffer
```python
import torch
import torch.nn.functional as F

def train_iter(model, optimz, data_load, n_clips=5, n_clip_frames=8):
    """
    In causal mode with stream buffer, a single video is fed to the network
    using subclips of length n_clip_frames.
    n_clips * n_clip_frames should be equal to the total number of frames
    present in the video.

    n_clips : number of clips that are used
    n_clip_frames : number of frames contained in each clip
    """
    # clean the buffer of activations
    model.clean_activation_buffers()
    optimz.zero_grad()
    for i, (data, _, target) in enumerate(data_load):
        # backward pass for each clip
        for j in range(n_clips):
            out = torch.log(model(data[:, :, n_clip_frames * j:n_clip_frames * (j + 1)]))
            loss = F.nll_loss(out, target) / n_clips
            loss.backward()
        optimz.step()
        optimz.zero_grad()
        # clean the buffer of activations
        model.clean_activation_buffers()
```
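A matching evaluation loop in causal mode might look like the following hedged sketch (the loader unpacking and clip math mirror the training loop above; the accuracy bookkeeping is illustrative, not part of the library):

```python
def evaluate(model, data_load, n_clips=5, n_clip_frames=8):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for data, _, target in data_load:
            model.clean_activation_buffers()
            for j in range(n_clips):
                out = model(data[:, :, n_clip_frames * j:n_clip_frames * (j + 1)])
            # After the last clip, the buffer has seen the whole video.
            correct += (out.argmax(dim=1) == target).sum().item()
            total += target.size(0)
    return correct / total
```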
### Training loop with standard convolutions

```python
def train_iter(model, optimz, data_load):
    optimz.zero_grad()
    for i, (data, _, target) in enumerate(data_load):
        out = torch.log(model(data))
        loss = F.nll_loss(out, target)
        loss.backward()
        optimz.step()
        optimz.zero_grad()
```

The weights are loaded from the TensorFlow models released by the authors, trained on Kinetics 600.
Base models implement standard 3D convolutions without stream buffers.
| Model Name | Top-1 Accuracy* | Top-5 Accuracy* | Input Shape |
|---|---|---|---|
| MoViNet-A0-Base | 72.28 | 90.92 | 50 x 172 x 172 |
| MoViNet-A1-Base | 76.69 | 93.40 | 50 x 172 x 172 |
| MoViNet-A2-Base | 78.62 | 94.17 | 50 x 224 x 224 |
| MoViNet-A3-Base | 81.79 | 95.67 | 120 x 256 x 256 |
| MoViNet-A4-Base | 83.48 | 96.16 | 80 x 290 x 290 |
| MoViNet-A5-Base | 84.27 | 96.39 | 120 x 320 x 320 |

Stream models implement causal convolutions with a stream buffer.

| Model Name | Top-1 Accuracy* | Top-5 Accuracy* | Input Shape** |
|---|---|---|---|
| MoViNet-A0-Stream | 72.05 | 90.63 | 50 x 172 x 172 |
| MoViNet-A1-Stream | 76.45 | 93.25 | 50 x 172 x 172 |
| MoViNet-A2-Stream | 78.40 | 94.05 | 50 x 224 x 224 |
\*\*In streaming mode, the number of frames corresponds to the total accumulated duration of the 10-second clip.

\*Accuracy reported on the official repository for the Kinetics 600 dataset; it has not been tested by me. It should match, since the TF models and the reimplemented PyTorch models produce the same outputs [Test].
I currently haven't tested the speed of the streaming models; feel free to test and contribute.

Pretrained models are currently available for the following architectures:
- MoViNetA0-BASE
- MoViNetA0-STREAM
- MoViNetA1-BASE
- MoViNetA1-STREAM
- MoViNetA2-BASE
- MoViNetA2-STREAM
- MoViNetA3-BASE
- MoViNetA4-BASE
- MoViNetA5-BASE
I currently have no plans to include streaming versions of A3, A4, and A5: those models are too slow for most mobile applications.
For testing, I recommend creating a new environment and running the following command to install all the required packages:

```bash
pip install -r tests/test_requirements.txt
```
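The test suite can then be run with pytest (assuming the tests live under `tests/`, as the requirements path suggests):

```bash
pytest tests/
```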
You can export the trained models to TorchScript or ONNX.
```python
# Load a pretrained model or train your own
model = MoViNet(_C.MODEL.MoViNetA0, causal=False, pretrained=False)

model_scripted = torch.jit.script(model)  # Export to TorchScript
model_scripted.save('model_scripted.pt')  # Save
```
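The scripted model can later be reloaded without the original Python class definitions; a minimal sketch (the file name simply matches the save call above):

```python
import torch

model = torch.jit.load('model_scripted.pt')
model.eval()
```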
```python
# Input to the model
x = torch.randn(1, 3, 16, 172, 172, requires_grad=True)

# Export the model
torch.onnx.export(model,                     # model being run
                  x,                         # model input (or a tuple for multiple inputs)
                  "movinet_A0.onnx",         # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=15,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names=['videos'],    # the model's input names
                  output_names=['outputs'])  # the model's output names
```

You may now take the ONNX model and convert it to OpenVINO IR, TensorRT, or TensorFlow/TFLite.
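To sanity-check the exported graph, a hedged verification sketch with ONNX Runtime (`onnxruntime` is an extra dependency, not something this repository installs):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("movinet_A0.onnx", providers=["CPUExecutionProvider"])
dummy = np.random.randn(1, 3, 16, 172, 172).astype(np.float32)
outputs = sess.run(['outputs'], {'videos': dummy})
print(outputs[0].shape)  # expected to be (batch, 600) for the 600 Kinetics classes
```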
```bibtex
@article{kondratyuk2021movinets,
  title={MoViNets: Mobile Video Networks for Efficient Video Recognition},
  author={Kondratyuk, Dan and Yuan, Liangzhe and Li, Yandong and Zhang, Li and Tan, Mingxing and Brown, Matthew and Gong, Boqing},
  journal={arXiv preprint arXiv:2103.11511},
  year={2021}
}
```