[Paper Review] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

2020. 8. 13. 15:12ㆍDeepLearning

Title: The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
Authors: Richard Zhang(1), Phillip Isola(1,2), Alexei A. Efros(1), Eli Shechtman(3), Oliver Wang(3)
Affiliation: UC Berkeley(1), OpenAI(2), Adobe Research(3)
Official Code: Link
Summary
- PROBLEM STATEMENT: Evaluate how well metrics correspond with human perceptual judgments.
  - Collect a large-scale perceptual similarity dataset
  - Deep features across training objectives outperform widely-used perceptual metrics (e.g., SSIM)
    - Show that all types of deep features outperform traditional metrics
  - AlexNet was found to work best for perceptual similarity
  - Show network architecture alone doesn't account for the performance
- Train new metric (LPIPS) on perceptual judgments
- Improve performance by calibrating feature responses from a pre-trained network
  - Compare different levels of supervision (supervised, self-supervised, and unsupervised)
Useful references
- [https://www.youtube.com/watch?v=DglrYx9F3UU]
- [https://www.youtube.com/watch?v=VDeJFb5jt5M]

Introduction

What are "perceptual losses", and which elements are critical for their success?

The classical approaches are as follows.

Per-pixel measures
- L2 Euclidean distance
- Peak Signal-to-Noise Ratio (PSNR)
- In general, optimizing a per-pixel metric leads to blurry results.
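
As a rough illustration of the per-pixel measures listed above, here is a minimal sketch of mean squared error (the squared L2 distance averaged over pixels) and PSNR. It assumes images are given as flat lists of pixel values in the range 0 to 255; the function names `mse` and `psnr` are made up for this example and do not come from the paper.

```python
import math


def mse(img_a, img_b):
    """Mean squared error between two images given as flat lists of pixel values."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)


def psnr(img_a, img_b, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB; max_value is the assumed peak pixel value."""
    err = mse(img_a, img_b)
    if err == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / err)


if __name__ == "__main__":
    reference = [10, 20, 30, 40]
    distorted = [12, 18, 33, 40]
    print("MSE :", mse(reference, distorted))
    print("PSNR:", psnr(reference, distorted))
```

Both quantities compare each pixel only with the pixel at the same position, so they carry no notion of image structure.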

Perceptual distance
- Metrics developed to judge the similarity between two images in a way that resembles human perception
- Examples: SSIM, MSSIM, FSIM, and HDR-VDP
- SSIM and FSIM were not designed to handle situations where spatial ambiguities are a factor.

However, these metrics fail to capture the many nuances of human perception. Human perception is highly subjective and abstract. How do people judge the similarity of two images?

For example,

"Is a red circle closer (more similar) to 1) a red rectangle or 2) a blue circle?" The answer depends on context and cannot be expressed quantitatively, which makes the question intractable.

Recently, the deep learning community has found that the internal activations of deep convolutional networks, VGG in particular, can be used as a "perceptual loss" in a wide range of settings.
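
To make the idea of a distance between internal activations concrete, here is a minimal sketch in the spirit of the paper's distance: unit-normalize each layer's features along the channel dimension, scale the difference by per-channel weights, take the squared L2 norm, average over spatial positions, and sum over layers. It is not the official implementation. The feature maps are toy nested lists indexed as `feat[c][h][w]` rather than real network activations, and all function names are assumptions made for illustration.

```python
import math


def unit_normalize(feat, eps=1e-10):
    """Scale the channel vector at every spatial position to unit L2 norm."""
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for h in range(height):
        for w in range(width):
            norm = math.sqrt(sum(feat[c][h][w] ** 2 for c in range(channels)))
            for c in range(channels):
                out[c][h][w] = feat[c][h][w] / (norm + eps)
    return out


def layer_distance(feat_a, feat_b, channel_weights):
    """Weighted squared difference of normalized features, averaged over space."""
    norm_a, norm_b = unit_normalize(feat_a), unit_normalize(feat_b)
    channels, height, width = len(norm_a), len(norm_a[0]), len(norm_a[0][0])
    total = 0.0
    for c in range(channels):
        for h in range(height):
            for w in range(width):
                diff = norm_a[c][h][w] - norm_b[c][h][w]
                total += (channel_weights[c] * diff) ** 2
    return total / (height * width)


def deep_feature_distance(feats_a, feats_b, weights_per_layer):
    """Sum the per-layer distances over all layers."""
    return sum(
        layer_distance(fa, fb, w)
        for fa, fb, w in zip(feats_a, feats_b, weights_per_layer)
    )


if __name__ == "__main__":
    # Toy "activations": one layer, two channels, 1x1 spatial size.
    feat_x = [[[1.0]], [[0.0]]]
    feat_y = [[[0.0]], [[1.0]]]
    uniform_weights = [[1.0, 1.0]]  # all ones: an uncalibrated feature distance
    print(deep_feature_distance([feat_x], [feat_y], uniform_weights))
```

With all weights set to one this is a plain normalized feature distance. Learning the per-channel weights from human judgments is what the summary above calls calibrating the feature responses.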

In exploring perceptual losses, the paper seeks answers to the following questions.

Do "perceptual losses" actually agree with human visual perception?
How should they be compared with other metrics?
- A new dataset is proposed
When using deep features, does the network architecture matter?
- No!
  - Comparing VGG, AlexNet, and SqueezeNet shows that deep features agree with human visual perception regardless of architecture.
Which task should the network be trained on? And is training needed at all?
- An emergent property shared across networks, even across architectures and training signals
  - Many approaches other than ImageNet classification are possible.
  - Self-supervised networks also work well.
    - BiGAN, channel prediction, and puzzle solving
  - Even a simple unsupervised network initialization method (e.g., stacked k-means) works much better than the classical perceptual metrics.
- A randomly initialized network performs very poorly. (Isn't that obvious...?)

Goal: Collect a large-scale set of human perceptual judgments on distortions
Procedure: Sample a patch. Distort it twice. Ask a human which distortion is smaller, i.e., which distorted patch is closer to the original.
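
One way to score a metric against judgments collected with the procedure above is to give it credit equal to the fraction of human judges who agree with its choice. The sketch below assumes each record is a hypothetical tuple `(reference, patch0, patch1, human_pref_for_1)`, where `human_pref_for_1` is the fraction of judges who picked patch 1 as closer to the reference. The names and the toy squared-L2 metric are illustrative, not taken from the official code.

```python
def two_afc_score(d0, d1, human_pref_for_1):
    """Credit for one triplet: the fraction of humans who agree with the metric."""
    if d0 < d1:  # metric says patch 0 is closer to the reference
        return 1.0 - human_pref_for_1
    if d1 < d0:  # metric says patch 1 is closer to the reference
        return human_pref_for_1
    return 0.5  # tie: half credit


def mean_two_afc(triplets, metric):
    """Average agreement of `metric` with human judgments over all triplets."""
    scores = [
        two_afc_score(metric(ref, p0), metric(ref, p1), pref)
        for ref, p0, p1, pref in triplets
    ]
    return sum(scores) / len(scores)


def squared_l2(a, b):
    """Toy per-pixel metric on flat lists, used only to exercise the scoring."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


if __name__ == "__main__":
    triplets = [
        ([0, 0], [1, 0], [3, 0], 0.2),  # most judges preferred patch 0
        ([0], [2], [1], 1.0),           # all judges preferred patch 1
    ]
    print(mean_two_afc(triplets, squared_l2))
```

Any distance function with the signature `metric(reference, patch)` can be passed in place of `squared_l2`, which allows per-pixel and deep-feature metrics to be compared on the same judgments.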
Distortions for Train & Val:
(1) Traditional distortions: noise, photometric, blur, warps, compression
(2) CNN-based distortions: randomly generated denoising autoencoders, obtained by varying hyperparameters
Distortions for Val only:
(3) Real algorithms: outputs from super-resolution, frame interpolation, video deblurring, and colorization algorithms

Just Noticeable Differences
