[A Neural Algorithm of Artistic Style] presented an interesting way to disentangle the style of a picture from its content.
Using the convolutional layers of a VGG network trained for classification, both the style and the content of an image can already be extracted well.
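A minimal sketch of how the fixed features might be collected, using torchvision's pretrained VGG-19 (the layer indices and the `extract_features` helper are assumptions chosen to match the conv layers the paper mentions; they are reused in the sketches below):

```python
import torch
import torchvision.models as models

# Assumed layer choices: conv4_2 for content, conv1_1..conv5_1 for style,
# expressed as indices into torchvision's vgg19().features Sequential.
CONTENT_LAYERS = {21}
STYLE_LAYERS = {0, 5, 10, 19, 28}

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # the network stays fixed; only the image will be optimized

def extract_features(img, layers):
    """Run img through VGG and collect the feature maps at the requested layer indices."""
    feats, x = {}, img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            feats[i] = x
    return feats
```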
Def 1:
Define two images to have the same content if they have the same filter responses at every layer.
In short:
$$Loss_{def1}(img_1,img_2)=\sum_{l=0}^n{||F_{img_1}^l-F_{img_2}^l||_2}$$
where $F_{img}^l$ are the feature maps of $img$ at the $l^{th}$ layer.
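Def 1 translates almost directly into code. A sketch, reusing the hypothetical `extract_features` helper above (the norm is the Frobenius norm over all elements of each feature map):

```python
def content_loss(img_1, img_2, layers=CONTENT_LAYERS):
    """Loss_def1: sum over layers of the L2 distance between feature maps."""
    f1 = extract_features(img_1, layers)
    f2 = extract_features(img_2, layers)
    return sum(torch.norm(f1[l] - f2[l]) for l in layers)
```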
Def 2:
Define two images to have the same style if they have the same matrix of correlations among filter responses at every layer.
- Why define styles like this? Simply to remove spatial information?
In short:
$$Loss_{def2}(img_1,img_2)=\sum_{l=0}^n{||corr(F_{img_1}^l)-corr(F_{img_2}^l)||_2}$$
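In the paper the correlation matrix is the Gram matrix of the vectorized feature maps (inner products between all pairs of filter responses). A sketch of Def 2 under that reading, reusing the helpers above (the normalization constant is an assumption for numerical convenience, not part of the definition):

```python
def gram_matrix(feat):
    """Correlations among filter responses: inner products of vectorized feature maps."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(img_1, img_2, layers=STYLE_LAYERS):
    """Loss_def2: sum over layers of the L2 distance between Gram matrices."""
    f1 = extract_features(img_1, layers)
    f2 = extract_features(img_2, layers)
    return sum(torch.norm(gram_matrix(f1[l]) - gram_matrix(f2[l])) for l in layers)
```

Because the Gram matrix sums over all spatial positions, it keeps track of which filters fire together but discards where they fire, which is one answer to the question above.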
Goal:
Find an image $X_{gen}$ that minimizes $Loss_{def1}(X_{content}, X_{gen})$ by gradient descent, where $X_{content}$ is the image providing the layout of the result and $X_{gen}$ is the generated image, initialized from white noise.
At the same time, $X_{gen}$ needs to minimize $Loss_{def2}(X_{style}, X_{gen})$, where $X_{style}$ is the image of a painting of some sort used to provide only the style.
The two losses are combined as a weighted sum, and the converged $X_{gen}$ is our final generated image (a code sketch follows below).
- Why use white noise?
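Putting the two objectives together, a sketch of the optimization described above: $X_{gen}$ starts from white noise and is updated by gradient descent on a weighted sum of the two losses (the weights `alpha` and `beta`, the step count, and the use of Adam are assumptions made for illustration):

```python
def style_transfer(x_content, x_style, steps=500, alpha=1.0, beta=1e3):
    """Optimize a white-noise image to match x_content in content and x_style in style."""
    x_gen = torch.randn_like(x_content, requires_grad=True)  # start from white noise
    opt = torch.optim.Adam([x_gen], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss = alpha * content_loss(x_content, x_gen) + beta * style_loss(x_style, x_gen)
        loss.backward()
        opt.step()
    return x_gen.detach()
```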
Reference
Gatys, L. A., Ecker, A. S., and Bethge, M. (2015). A Neural Algorithm of Artistic Style.