The M Tank compiled the report "A Year in Computer Vision", documenting research progress in computer vision across 2016 and 2017; it is a rare and detailed resource for developers and researchers. Although more than a year has passed since its publication, given the lag between research results and real-world deployment, much of its content still reads with fresh relevance today.
Super-resolution, Style Transfer & Colourisation
Not all research in Computer Vision serves to extend the pseudo-cognitive abilities of machines; often the fabled malleability of neural networks, and of other ML techniques, lends itself to a variety of other novel applications that spill into the public space. Last year's advancements in Super-resolution, Style Transfer & Colourisation occupied that space for us.
Super-resolution refers to the process of estimating a high-resolution image from a low-resolution counterpart, and also to the prediction of image features at different magnifications, something which the human brain can do almost effortlessly. Originally, super-resolution was performed by simple techniques like bicubic interpolation and nearest-neighbour upsampling (a minimal sketch of these classical baselines follows the list below). In terms of commercial applications, the desire to overcome low-resolution constraints stemming from source quality, and to realise 'CSI Miami' style image enhancement, has driven research in the field. Here are some of the year's advances and their potential impact:
- Neural Enhance is the brainchild of Alex J. Champandard and combines approaches from four different research papers to achieve its Super-resolution method.
- Real-Time Video Super-Resolution was also attempted in 2016 in two notable instances.
- RAISR: Rapid and Accurate Image Super-Resolution from Google avoids the costly memory and speed requirements of neural network approaches by training filters on pairs of low-resolution and high-resolution images. RAISR, as a learning-based framework, is two orders of magnitude faster than competing algorithms and has minimal memory requirements compared with neural network-based approaches. Hence super-resolution is extendable to personal devices. Google published an accompanying research blog post.
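To ground the comparison, the classical baselines mentioned above amount to only a few lines of code. A minimal sketch using Pillow; the file names are placeholders for illustration:

```python
# Classical super-resolution baselines: nearest-neighbour simply repeats
# pixels, while bicubic interpolation fits smooth polynomials through
# neighbouring pixels. Both upscale without recovering real detail.
from PIL import Image

def upscale_baselines(path, factor=4):
    low_res = Image.open(path)
    size = (low_res.width * factor, low_res.height * factor)
    nearest = low_res.resize(size, Image.NEAREST)
    bicubic = low_res.resize(size, Image.BICUBIC)
    return nearest, bicubic

# Placeholder file names, for illustration only.
nearest, bicubic = upscale_baselines("low_res_input.png", factor=4)
nearest.save("nearest_x4.png")
bicubic.save("bicubic_x4.png")
```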
Figure 7: Super-resolution SRGAN example
Note: From left to right: bicubic interpolation (the objective worst performer for focus), deep residual network (SRResNet) optimised for MSE, deep residual generative adversarial network (SRGAN) optimised for a loss more sensitive to human perception, and the original high-resolution (HR) image. Corresponding peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) values are shown in brackets. [4× upscaling] The reader may wish to zoom in on the middle two images (SRResNet and SRGAN) to see the difference between image smoothness and more realistic fine detail.
Source: Ledig et al. (2017)
The use of Generative Adversarial Networks (GANs) represents the current SOTA for Super-resolution:
- SRGAN provides photo-realistic textures from heavily downsampled images on public benchmarks, using a discriminator network trained to differentiate between super-resolved and original photo-realistic images.
Qualitatively, SRGAN performs best. While SRResNet scores highest on the peak signal-to-noise ratio (PSNR) metric, SRGAN recovers the finer texture details and achieves the best Mean Opinion Score (MOS); a reference implementation of PSNR is sketched after this list. “To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4× upscaling factors.” All previous approaches fail to recover the finer texture details at large upscaling factors.
- Amortised MAP Inference for Image Super-resolution proposes a method for calculating Maximum a Posteriori (MAP) inference using a Convolutional Neural Network. Their research presents three approaches for optimisation, with the GAN-based approach performing markedly better on real image data at present.
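For reference, PSNR, the metric on which SRResNet wins even though human raters prefer SRGAN, is simple to compute. A minimal sketch (not the authors' evaluation code):

```python
# PSNR in decibels between a ground-truth image and a reconstruction.
# Higher is "better", but as the SRGAN results show, PSNR rewards
# smoothness over the fine texture that human viewers actually prefer.
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```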
Figure 8: Style Transfer from Nikulin & Novak
Note: Transferring different styles to a photo of a cat (original top left).
Source: Nikulin & Novak (2016)
Undoubtedly, Style Transfer epitomises a novel use of neural networks that has ebbed into the public domain, specifically through last year's Facebook integrations and companies like Prisma (https://prisma-ai.com/) and Artomatix (https://services.artomatix.com/). Style transfer is an older technique, but it was converted to neural networks in 2015 with the publication of A Neural Algorithm of Artistic Style. Since then, the concept of style transfer was expanded upon by Nikulin and Novak and also applied to video, as is the common progression within Computer Vision.
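The core idea of the 2015 paper is compact enough to sketch: a style is summarised by the Gram matrix of a CNN layer's feature maps, i.e. the correlations between filter responses, with spatial layout discarded. A minimal NumPy sketch of that statistic (the CNN producing the features is assumed, and the normalisation is one common choice):

```python
import numpy as np

def gram_matrix(features):
    """features: (channels, height, width) activations from one CNN layer."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)      # one row per feature map
    return flat @ flat.T / (c * h * w)     # normalised channel correlations

def style_loss(generated_features, style_features):
    """Squared distance between the Gram matrices of the two images."""
    return np.mean((gram_matrix(generated_features) -
                    gram_matrix(style_features)) ** 2)
```

Minimising this loss (alongside a content loss on raw activations) nudges the generated image toward the style image's texture statistics while leaving its spatial arrangement free.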
Figure 9: Further examples of Style Transfer
Note: The top row (left to right) shows the artistic styles which are transposed onto the original images displayed in the first column (Woman, Golden Gate Bridge and Meadow Environment). Using conditional instance normalisation, a single style transfer network can capture 32 styles simultaneously, five of which are displayed here. The full suite of images is available in the source paper's appendix. This work will feature in the International Conference on Learning Representations (ICLR) 2017.
Source: Dumoulin et al. (2017, p. 2)
Style transfer as a topic is fairly intuitive once visualised: take an image and imagine it with the stylistic features of a different image, for example in the style of a famous painting or artist. This year Facebook released Caffe2Go, their deep learning system which integrates into mobile devices. Google also released some interesting work which sought to blend multiple styles to generate entirely unique image styles: see the research blog and full paper.
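The mechanism behind that multi-style network is the conditional instance normalisation shown in Figure 9's source: all styles share the convolutional weights, and each style owns only a per-channel scale and shift applied after normalisation. A rough sketch of the operation (the surrounding network is omitted, and the parameters here are trivially initialised rather than learned):

```python
import numpy as np

class ConditionalInstanceNorm:
    """One (gamma, beta) pair per style; learned in practice."""
    def __init__(self, num_styles, num_channels):
        self.gamma = np.ones((num_styles, num_channels))
        self.beta = np.zeros((num_styles, num_channels))

    def __call__(self, x, style_index, eps=1e-5):
        # x: (channels, height, width) feature maps for a single image.
        mean = x.mean(axis=(1, 2), keepdims=True)
        var = x.var(axis=(1, 2), keepdims=True)
        normalised = (x - mean) / np.sqrt(var + eps)
        g = self.gamma[style_index][:, None, None]  # select this style's scale
        b = self.beta[style_index][:, None, None]   # and its shift
        return g * normalised + b
```

Blending styles then amounts to interpolating between (gamma, beta) pairs, which is what makes the "entirely unique image styles" possible.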
Besides mobile integrations, style transfer has applications in the creation of game assets. Members of our team recently saw a presentation by the Founder and CTO of Artomatix, Eric Risser, who discussed the technique's novel application to content generation in games (texture mutation, etc.), which dramatically reduces the work of a conventional texture artist.
Colourisation is the process of changing monochrome images into new full-colour versions. Originally this was done manually by people who painstakingly selected colours to represent specific pixels in each image. In 2016, it became possible to automate this process while maintaining the appearance of realism indicative of the human-centric colourisation process. While humans may not accurately reproduce the true colours of a given scene, their real-world knowledge allows them to apply colours in a way which is consistent with the image, and which looks consistent to another person viewing said image.
The process of colourisation is interesting in that the network assigns the most likely colouring for images based on its understanding of object location, textures and environment, e.g. it learns that skin is pinkish and the sky is blueish.
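In code terms, the standard framing is to work in Lab colour space: the lightness channel L is the monochrome input, and the network's job is to predict the two chrominance channels. A conceptual sketch, where `predict_ab` is a hypothetical stand-in for a trained CNN:

```python
import numpy as np
from skimage import color  # assumes scikit-image is available

def colourise(rgb_image, predict_ab):
    lab = color.rgb2lab(rgb_image)       # (H, W, 3): channels L, a, b
    lightness = lab[:, :, :1]            # what a monochrome photo gives us
    ab = predict_ab(lightness)           # (H, W, 2) chrominance prediction
    return color.lab2rgb(np.concatenate([lightness, ab], axis=2))

# Trivial stand-in "model": zero chrominance reproduces the grey image.
grey_model = lambda lightness: np.zeros(lightness.shape[:2] + (2,))
```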
Three of the most influential works of the year are as follows:
- Zhang et al. produced a method that was able to successfully fool humans on 32% of their trials. Their methodology is comparable to a “colourisation Turing test.”
- Larsson et al. fully automate their image colourisation system, using deep learning for histogram estimation.
- Finally, Iizuka, Simo-Serra and Ishikawa demonstrate a colourisation model also based upon CNNs. The work outperformed the existing SOTA; we [the team] feel this work is qualitatively best as well, appearing to be the most realistic. Figure 10 provides comparisons; note, however, that the image is taken from Iizuka et al.
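One detail worth sketching is how Zhang et al. keep colours vivid: rather than regressing a single colour per pixel (which averages plausible colours into desaturated ones), the network predicts a distribution over quantised ab colour bins, decoded with an "annealed mean". A rough sketch of that decoding step; the temperature value follows their paper:

```python
import numpy as np

def annealed_mean(probs, bin_centres, temperature=0.38):
    """probs: (H, W, Q) per-pixel distribution over Q colour bins.
    bin_centres: (Q, 2) ab value of each bin.
    Sharpen the distribution, renormalise, then take its expectation."""
    sharpened = probs ** (1.0 / temperature)
    sharpened /= sharpened.sum(axis=2, keepdims=True)
    return sharpened @ bin_centres        # (H, W, 2) decoded ab channels
```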
Figure 10: Comparison of Colourisation Research
Note: From top to bottom, column one contains the original monochrome image input, which is subsequently colourised through various techniques. The remaining columns display the results generated by other prominent colourisation research in 2016. Viewed from left to right, these are Larsson et al. 2016 (column two), Zhang et al. 2016 (column three), and Iizuka, Simo-Serra and Ishikawa 2016, also referred to as "ours" by the authors (column four). The quality difference in colourisation is most evident in row three (from the top), which depicts a group of young boys. We believe Iizuka et al.'s work to be qualitatively superior (column four).
Source: Iizuka et al. (2016)
“Furthermore, our architecture can process images of any resolution, unlike most existing approaches based on CNN.”
In a test to see how natural their colourisation was, users were shown a random image from their model and asked, "does this image look natural to you?"
Their approach was judged natural 92.6% of the time, the baseline roughly 70% of the time, and the ground truth (the actual colour photos) 97.7% of the time.