This post is about a paper published at ICCV 2015, “Multi-view Convolutional Neural Networks for 3D Shape Recognition”. It describes a method for classifying 3D shape models with 2D image classification networks. While the authors have open-sourced their MATLAB implementation on GitHub, here I’ll try to implement the network with Caffe.
In this first half of a two-part blog, I’ll quickly explain the core idea of the paper and then implement a naive version of the network. In part II, I’ll go through the details of implementing MVCNN as the paper describes.
The complete code and scripts for this blog can be found in my GitHub repo.
(The network architecture of MVCNN)
Traditional 3D shape recognition algorithms are generally based on handcrafted descriptors such as Spherical Harmonics. More recent work like 3D ShapeNets voxelizes the model and trains a deep neural network on the volumetric grid. MVCNN, on the other hand, tries to leverage the power of image classification CNNs: public image datasets such as ImageNet are much larger than 3D model datasets, and state-of-the-art networks on ILSVRC have achieved very high classification accuracy.
So how about rendering a 3D shape model from different viewpoints and training a 2D CNN on the rendered images? To classify an unknown model, you then feed in its rendered views and decide its category from their predictions. This is exactly what we’ll implement in this post: we’ll use multiple rendered images as input and simply take a majority vote to decide the final label for the model.
However, MVCNN combines the multi-view representations in a slightly more sophisticated way. Each view is fed through its own copy of the first part of the network, and a view-pooling layer merges those branches into one, as shown in the figure above. View-pooling at its core is simply max-pooling: it takes the largest value at each position across all views. More on this next time, but let’s first go ahead and implement the simple one :).
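The view-pooling operation can be sketched in a few lines of NumPy: stack the per-view feature maps along a view axis and keep the element-wise maximum. This is just an illustration of the idea, not the authors’ code (the function name is mine):

```python
import numpy as np

def view_pool(features):
    """Element-wise max over the view axis.

    features: array of shape (num_views, C, H, W), one conv
    feature map per rendered view. Returns a single (C, H, W)
    map that keeps, at each position, the strongest response
    among all views.
    """
    return features.max(axis=0)

# Toy example: 3 views of a 2-channel 4x4 feature map.
views = np.random.rand(3, 2, 4, 4)
pooled = view_pool(views)
```

Because only the per-position maximum survives, the pooled map is invariant to the order of the views, which is exactly what you want when the viewpoints carry no canonical ordering.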
- Caffe :white_check_mark:
- Python 2.7 (Anaconda2) :white_check_mark:
- Scikit-Image :white_check_mark:
You can download the rendered images from their repository; here I’ll use the modelnet40v2 dataset. Alternatively, you can download the models from Princeton ModelNet and render them yourself if you want better rendering quality, although the authors argue that the rendering method has little effect on classification accuracy.
Please follow the steps described here to preprocess the dataset. Basically, it pads all images to 256x256, splits a validation set out of the training set at a 1:9 ratio, prepares label text files, and compiles the images into LevelDB files to feed into the network. You can subsample the inputs if the full dataset is too large for you. I use LevelDB rather than LMDB because LMDB seems to have issues with large amounts of data. If you’re not familiar with this procedure, check out this tutorial.
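The padding and splitting steps above can be sketched as follows. This is a simplified NumPy version for illustration; the actual script in the repo uses scikit-image and writes LevelDB, and the function names here are mine:

```python
import numpy as np

def pad_to_square(img, size=256, fill=255):
    """Center-pad an H x W grayscale image to size x size.

    Assumes each rendered view fits within `size` in both
    dimensions; `fill` is the white background value.
    """
    h, w = img.shape[:2]
    top = (size - h) // 2
    left = (size - w) // 2
    out = np.full((size, size), fill, dtype=img.dtype)
    out[top:top + h, left:left + w] = img
    return out

def split_train_val(paths, ratio=0.1, seed=0):
    """Hold out roughly 1 of every 10 training images as validation."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(paths))
    n_val = int(len(paths) * ratio)
    val = [paths[i] for i in idx[:n_val]]
    train = [paths[i] for i in idx[n_val:]]
    return train, val

img = np.zeros((197, 233), dtype=np.uint8)  # a toy rendered view
padded = pad_to_square(img)
train, val = split_train_val(['img%d.png' % i for i in range(100)])
```

Shuffling before the split matters here: the rendered views of one model are stored consecutively, so a naive head/tail split would put whole models into only one of the two sets.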
Here we’ll fine-tune the bvlc_reference_caffenet model that Caffe provides. The method is pretty much the same as the official tutorial on fine-tuning for Flickr style data. A few things to note: because we are using the 40-class version of ModelNet, the number of outputs of the last layer should be 40. I’ve fixed the learning rate for all conv layers, and the last fc layer’s learning rate is set 5 times higher. Check the prototxt in my repo for details.
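The key change relative to the stock CaffeNet prototxt looks roughly like this. It’s a sketch, not the exact file from my repo, and the layer name `fc8_modelnet` is illustrative; renaming the layer is what makes Caffe re-initialize its weights instead of copying them from the pretrained model:

```
layer {
  name: "fc8_modelnet"          # renamed so weights are re-initialized
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8_modelnet"
  param { lr_mult: 5  decay_mult: 1 }   # 5x learning rate for weights
  param { lr_mult: 10 decay_mult: 0 }   # biases conventionally get 2x that
  inner_product_param {
    num_output: 40              # 40 classes for ModelNet40
  }
}
```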
I’ve tried several parameter setups; the best gives around 72.5% test accuracy. Feel free to play with the parameters yourself.
We’ll now use 80 rendered views as input and take the majority of their predictions as the label for the model. The file classify_model_simple.py contains the source code for this method:
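The voting step at the heart of that script can be sketched as below. It assumes each of the 80 views has already been pushed through the fine-tuned net to get a per-view softmax vector; the function name is mine and not necessarily what classify_model_simple.py uses:

```python
import numpy as np

def majority_vote(view_probs):
    """Predict a model's label from its per-view softmax outputs.

    view_probs: (num_views, num_classes) array -- one softmax
    vector per rendered view (80 views, 40 classes here).
    Each view votes for its argmax class; ties break toward
    the lower class index.
    """
    per_view_labels = view_probs.argmax(axis=1)
    votes = np.bincount(per_view_labels, minlength=view_probs.shape[1])
    return votes.argmax()

# Toy check: 80 views, 40 classes, most views favoring class 7.
probs = np.full((80, 40), 1.0 / 40)
probs[:50, 7] = 1.0   # 50 views confidently predict class 7
probs[50:, 3] = 1.0   # 30 views predict class 3
label = majority_vote(probs)
```

An alternative would be to average the softmax vectors and take the argmax of the mean, which weights votes by confidence; plain vote counting is what this post uses.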
The output accuracy is about 78.3%, which is higher than the single-image classification accuracy.
Great, this simple network does what we expected, although the result is still well below the one reported in the paper. This isn’t a surprise, because it is the per-image loss rather than a per-model loss that is minimized during training. That also explains the high training error: some views look quite different from the others, which adds a lot of noise to our data.
I’ll implement the full MVCNN with view-pooling in the next post and see if it works better. Stay tuned.
Su H., Maji S., Kalogerakis E., et al. Multi-view Convolutional Neural Networks for 3D Shape Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015: 945–953.