The 34th British Machine Vision Conference (BMVC)

20th - 24th November 2023, Aberdeen

These are some of my notes and observations from the conference. I am sure they contain many errors and inaccuracies.

Keynote Digest

Maja Pantic (Meta, Imperial College London) - Faces, Avatars & Gen AI

The primary topic was the realistic behavior of avatars, particularly within the context of Facebook's digital-twin project. Maja's work centers on enhancing the authenticity of avatar behavior. Unlike existing commercial tools, their approach prioritizes emotional responses and model performance on mobile devices, which leads to a reliance on 2D imagery and GAN-based models. A key takeaway from the presentation was the exploration of various loss functions: specifically, their approach tracks head position, eye movements, and action units (indicative of facial muscle activity) across different emotional states.

Georgia Gkioxari (ex-Meta, Caltech) - The Future of Recognition is 3D

The first part of the talk emphasized efficient methods and small datasets, in contrast to the second part's focus on big data. Both, however, shared a common perspective on the critical role of loss functions in deep learning. This was particularly evident in tasks like 3D object detection, where the challenge is to identify an object and its orientation in a three-dimensional scene from a single 2D photo. Their approach includes innovative strategies in loss-function design. Notable among these is the integration of 2D object detection as an auxiliary task and the inclusion of a parameter, denoted \(\mu\), that accounts for uncertainty in the 3D predictions. The resulting loss function is formulated as \(L = L_{2D} + e^{-\mu} L_{3D} + \mu\), ingeniously combining these elements.
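
Since the formula is the heart of the trick, here is a minimal PyTorch sketch of how the weighting behaves (the loss values are toy numbers, and treating \(\mu\) as a directly learnable parameter is my assumption, not necessarily the paper's exact parameterization):

```python
import torch

# Toy stand-ins for the 2D and 3D task losses; in a real detector these
# would come from the box-regression and classification heads.
loss_2d = torch.tensor(0.8)
loss_3d = torch.tensor(2.5)

# mu acts as a log-uncertainty of the 3D prediction: the more uncertain
# the model is, the less the hard 3D loss is weighted, while the +mu
# term penalizes claiming unbounded uncertainty.
mu = torch.nn.Parameter(torch.zeros(()))

loss = loss_2d + torch.exp(-mu) * loss_3d + mu
loss.backward()  # mu receives gradients like any other parameter
print(loss.item(), mu.grad.item())
```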

Michael Pound (University of Nottingham) - Embracing Plant Research

Michael Pound is known for his contributions to the popularization of computer science, especially through the Computerphile YouTube channel. His talk focused on underappreciated but still important interdisciplinary research. The first part detailed methods for segmenting high-resolution CT images of plants. The second part focused on adaptive optics in microscopy, specifically the use of diffusion models to correct aberrations. This approach builds on a notable study entitled Cold Diffusion, which demonstrates that diffusion models generalize to various types of transformations beyond Gaussian noise, such as blur and masking. Accordingly, they model the aberration as a combination of Zernike polynomials estimated from three differently focused images.
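
A minimal sketch of the cold-diffusion recipe with blur as the degradation (my own toy formulation, not the paper's code; the time-conditioning of the restorer is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def degrade(x, t):
    """Deterministic degradation: blur x more strongly for larger t
    (this plays the role of Gaussian noising in standard diffusion)."""
    for _ in range(int(t)):
        x = F.avg_pool2d(x, 3, stride=1, padding=1)
    return x

# Restoration network R(x_t, t) -> x_0; any image-to-image model fits here.
class Restorer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(1, 1, 3, padding=1)

    def forward(self, x_t, t):
        return self.net(x_t)  # t-conditioning omitted in this toy version

model = Restorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x0 = torch.rand(8, 1, 32, 32)          # clean images
t = torch.randint(1, 10, (1,)).item()  # random degradation severity
loss = F.l1_loss(model(degrade(x0, t), t), x0)  # reconstruct the clean image
opt.zero_grad()
loss.backward()
opt.step()
```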

Daniel Cremers (TUM) - 3D CV for Dynamic Scene Understanding

This talk was also divided into two parts, both motivated by the challenges of autonomous driving. The first covered several papers and views on Simultaneous Localization and Mapping (SLAM). The speaker argued that deep learning should be used primarily for tasks where traditional methods are not accurate enough or are too computationally intensive. For example, in their paper Deep Virtual Stereo Odometry, a monocular camera setup needs a deep learning model to incorporate prior knowledge about the environment, such as the typical sizes of cars, roads, and houses. Or in their paper Behind the Scenes, they predict not just a depth map but an entire density field, meaning you estimate not only how far a car is from you but also where it ends.
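
To make the depth-map versus density-field distinction concrete, here is a toy alpha-compositing computation along a single ray (standard volumetric rendering, not code from the paper):

```python
import torch

# sigma[i] is the predicted density at depth z[i] along one camera ray.
z = torch.linspace(0.0, 10.0, 50)   # sample depths along the ray
sigma = torch.zeros(50)
sigma[20:25] = 5.0                  # an object occupying roughly z in [4, 5]

delta = z[1] - z[0]
alpha = 1.0 - torch.exp(-sigma * delta)  # per-sample opacity
trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), dim=0)
weights = trans * alpha                  # ray-termination probabilities

# A depth map only keeps this single expected value per pixel...
depth = (weights * z).sum()              # ~4.2 here
# ...while the density field also tells you where the object ends:
# sigma drops back to zero beyond z ~ 5.
```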

A few papers distantly related to my topic

Open-world Text-Specified Object Counting

How do you teach CLIP to count an arbitrary class? Just stack the CLIP embedding of the image with the embedding of a textual description of the object, mix them using an attention block followed by a decoder, and finally sum over the resulting density map. The model is trained in a supervised way on the FSC-147 dataset, but it can predict unseen classes.
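
A toy sketch of that pipeline as I understood it (module choices and sizes are my own, not the paper's): stack the patch and text embeddings, mix them with an attention block, decode a density map, and sum it.

```python
import torch
import torch.nn as nn

class TextSpecifiedCounter(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.mixer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(dim, 1), nn.Softplus())

    def forward(self, patch_emb, text_emb):
        # patch_emb: (B, P, D) CLIP patch features, text_emb: (B, 1, D)
        tokens = torch.cat([patch_emb, text_emb], dim=1)  # "stack" them
        mixed = self.mixer(tokens)[:, :-1]                # drop the text token
        density = self.decoder(mixed).squeeze(-1)         # (B, P) density map
        return density.sum(dim=1)                         # predicted count

model = TextSpecifiedCounter()
count = model(torch.randn(2, 196, 512), torch.randn(2, 1, 512))
```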

Infinite Class Mixup

The authors of this paper show that fusing targets is worth doing before the last linear layer of the neural network, rather than at the output, and in combination with contrastive learning.
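
My reading of that idea in a toy PyTorch sketch (a generic reconstruction, not the authors' code): the target of a mixed input is a mixed classifier vector, matched contrastively against the other mixed pairs in the batch.

```python
import torch
import torch.nn.functional as F

B, D, C = 16, 128, 10
feats = torch.randn(B, D)                  # penultimate-layer features
labels = torch.randint(0, C, (B,))
W = torch.randn(C, D, requires_grad=True)  # rows of the last linear layer

lam = torch.distributions.Beta(1.0, 1.0).sample().item()
perm = torch.randperm(B)
mixed_feats = lam * feats + (1 - lam) * feats[perm]        # mixed inputs
mixed_cls = lam * W[labels] + (1 - lam) * W[labels[perm]]  # fused targets

# Contrastive objective: each mixed input should score highest against
# its own fused classifier among all fused classifiers in the batch.
logits = mixed_feats @ mixed_cls.t()                       # (B, B)
loss = F.cross_entropy(logits, torch.arange(B))
loss.backward()
```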

Train ViT on Small Dataset With Translation Perceptibility

To improve the robustness of ViT to translation, the authors propose a self-supervised companion task: the input image is translated before being fed into the network, and the supervised classification loss is accompanied by a loss for predicting that translation.
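
A minimal sketch of such a companion task (illustrative, with a trivial stand-in backbone rather than a real ViT): shift the input by a random offset and let an auxiliary head regress that offset alongside the usual classification loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
cls_head = nn.Linear(128, 10)    # supervised classification head
shift_head = nn.Linear(128, 2)   # auxiliary head predicting (dx, dy)

x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
dx, dy = torch.randint(-4, 5, (2,))                        # random translation
x_shifted = torch.roll(x, shifts=(int(dy), int(dx)), dims=(2, 3))

feats = backbone(x_shifted)
target = torch.tensor([float(dx), float(dy)]).expand(8, 2)
loss = F.cross_entropy(cls_head(feats), y) + F.mse_loss(shift_head(feats), target)
loss.backward()
```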

GestSync: Determining who is speaking without a talking head

The goal of this work is to recognize the person speaking based on body movements alone, without seeing the lips. They also use this idea for audio-visual synchronization and for determining who is speaking in a crowd of masked faces.

Protecting Publicly Available Data With Machine Learning Shortcuts

Neural networks tend to overfit to unrelated information present in images, such as timestamps and logos, so-called shortcuts. The novel idea of this work is to exploit this behavior: add almost invisible shortcuts to the images you publish online, so they cannot be used as a training set for any task. If someone trains on them anyway, their network will learn the shortcuts instead of generalizing and won't work on real images.
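
A toy illustration of the trick (my own construction, not the paper's exact method): stamp a faint, label-correlated pattern into each image before publishing it, so that a network trained on the data latches onto the stamp instead of the content.

```python
import torch

def add_shortcut(img, label, eps=2 / 255):
    """Add a nearly invisible, class-specific sign pattern as a shortcut."""
    gen = torch.Generator().manual_seed(label)  # one fixed pattern per class
    pattern = torch.sign(torch.randn(img.shape, generator=gen))
    return (img + eps * pattern).clamp(0, 1)

imgs = torch.rand(4, 3, 32, 32)
labels = [0, 3, 3, 7]
protected = torch.stack([add_shortcut(im, lb) for im, lb in zip(imgs, labels)])
```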

Convolution kernel adaptation to calibrated fisheye

Classical convolutional kernels are not ideal for fisheye cameras because the images suffer from strong distortion. The authors propose a special type of deformable convolution that reflects the fisheye distortion, showing better domain adaptation and high performance on depth estimation and semantic segmentation.
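
As a rough sketch of the mechanism (my own toy construction using torchvision's deformable convolution, not the authors' model): precompute sampling offsets that grow with the distance from the image center, so the kernel samples along the distorted grid of a calibrated fisheye image.

```python
import torch
from torchvision.ops import deform_conv2d

N, C, H, W = 1, 8, 64, 64
x = torch.randn(N, C, H, W)
weight = torch.randn(16, C, 3, 3)  # a single 3x3 deformable conv layer

# Offset magnitude grows with the radius from the image center -- a crude
# stand-in for offsets derived from an actual fisheye calibration.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                        indexing="ij")
r = torch.sqrt(xs**2 + ys**2)
offset = (0.5 * r).repeat(1, 2 * 3 * 3, 1, 1)  # (N, 2*kh*kw, H, W)

out = deform_conv2d(x, offset, weight, padding=1)  # (N, 16, H, W)
```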

Links

  • ECCP: an EU collaboration to build a competitive big-tech cluster; it can also be used for research purposes.
  • GCPR (German Conference on Pattern Recognition): a smaller conference where already-published articles can be submitted.
  • Weight initialization: He et al. 2015 and Glorot and Bengio 2010.
  • DeepMind offers summer/winter student positions.

Summer schools