Building Style Transfer Algorithms using the VGG-19 Neural Network

Frameworks and tools used: Python, TensorFlow, Keras, Git

Link to Github repo

Introduction

Ars longa, vita brevis. In this project we implement neural style transfer to transform landscape photographs so that they mimic the artistic styles of well-known impressionist painters Claude Monet and Erin Hanson.

Two approaches are investigated to accomplish this. The first follows Gatys et al.’s original approach, as outlined in their 2015 paper introducing neural style transfer: we build our own style transfer algorithm from scratch using VGG-19’s intermediate convolutional layers and TensorFlow’s linear algebra functions, which lets us dive deeper into the loss functions and gradient descent optimization underlying neural style transfer. The second leverages the convolutional layers of a pre-trained neural artistic stylization network originally proposed by Ghiasi et al. and made available through TensorFlow Hub.

To gain a deeper understanding of the inner workings of our neural style transfer system, it is important to formally define its inputs. We define three image types: a starting “content” image, the image (such as a photograph of a landscape) that we want to transform to mimic a certain style; a starting “style” image, representing the artistic style we want our content image to mimic, such as an original painting created by an artist; and an intermediate output “combination” image, a new image created at each stage of our gradient descent process that combines feature elements of the content and style images, with the final combination image being our output image.

In both approaches we take two different views of Central Park’s famous Bow Bridge as our content images, and impressionist artist Erin Hanson’s Layered Light and Claude Monet’s Bridge over a Pond of Water Lilies and Pathway in Garden at Giverny as our style images, as shown in Figures 1 and 2 respectively:


Methodology

One of the key insights underlying neural style transfer as proposed by Gatys et al. is the following: classic machine learning systems implement gradient descent to iteratively update model weights in the direction that minimizes a loss function, with the objective of (hopefully) finding a global optimum for those weights. Neural style transfer adapts this gradient descent process to instead update the input image’s pixel values. The image is passed through the early convolutional layers of a pre-trained model such as VGG-19 to extract content and style feature tensors (activations), which are then used at each iteration to recompute the system’s loss function and the gradients with respect to the pixels.
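Schematically, this amounts to treating the image itself as the trainable variable. The sketch below uses placeholder names (initial_image, combined_loss) purely for illustration; the actual loss and training step are implemented later in this post:

import tensorflow as tf

# Schematic sketch only: the "parameters" being optimized are the pixels themselves
image = tf.Variable(initial_image)                 # initial_image: a [1, H, W, 3] float tensor
optimizer = tf.optimizers.Adam(learning_rate=0.02)

with tf.GradientTape() as tape:
  loss = combined_loss(image)                      # placeholder for the style + content loss defined below
grads = tape.gradient(loss, image)                 # gradients w.r.t. the pixels; the VGG-19 weights stay frozen
optimizer.apply_gradients([(grads, image)])        # update the image, not the network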

There exists a style-content tradeoff at each step of our gradient descent process: updating our pixel values to more closely resemble the style image will, in doing so, lower their resemblance to the content image. The loss function must therefore appropriately define and weight both the style and content losses computed from our combination image in order to optimize this tradeoff. To calculate these losses, Gatys et al. creatively use a pre-trained convolutional neural net’s early layers to compute the content and style activations, since these first model layers act as complex feature extractors that have learned to recognize intermediate-level image patterns (edges, corners, etc.) across a large number of different objects.

In our case we use the highly performant VGG-19 classification model originally trained and tested on ImageNet’s 1.2M-image training set spanning 1,000 classes, the same model used by Gatys et al.; given its high generalizability across many object classes, it should be expected to produce relatively high-quality feature maps on a wide array of images passed to it. We can verify this generalizability by passing our first content image (Figure 1) to the full model: its top-5 predictions of castle, palace, church, monastery and lakeside correspond closely to the rather cathedral-like San Remo building in the background and the Central Park lake visible in the foreground.

Figure 3. Starting content image passed to VGG-19 full model
VGG-19 Predicted Object    Confidence Score
Castle                     28%
Palace                     22%
Church                      9%
Monastery                   5%
Lakeside                    4%

Table I. VGG-19 predicted object classes and confidence scores of content image
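A minimal sketch of how this top-5 check can be reproduced with the full (classification-head) VGG-19 model; the image path here is illustrative:

import numpy as np
import tensorflow as tf

# Hypothetical local path to the content photograph shown in Figure 3
img = tf.keras.utils.load_img('content/bow_bridge.jpg', target_size=(224, 224))
x = tf.keras.applications.vgg19.preprocess_input(
    np.expand_dims(tf.keras.utils.img_to_array(img), axis=0))

vgg_full = tf.keras.applications.VGG19(include_top=True, weights='imagenet')
preds = vgg_full.predict(x)
for _, label, score in tf.keras.applications.vgg19.decode_predictions(preds, top=5)[0]:
  print('{}: {:.0%}'.format(label, score))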

The question now becomes: which model layers should be used to return the most informative feature maps? Gatys et al. answer this question by testing different combinations of VGG-19’s layers 1-5 for producing both content and style feature maps, and empirically show that content feature maps can be well represented using the second convolution of VGG-19’s layer 5, while the first convolutions of layers 1-5 are most apt for producing style feature maps. These are therefore the layers our implementation uses for extracting content and style maps respectively.
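In the layer naming of the Keras VGG19 implementation used throughout this post, these choices correspond to the following lists, which the code below reuses (along with the resulting layer counts):

# First convolution of each of VGG-19's five blocks, used for style features
style_layers = ['block1_conv1',
                'block2_conv1',
                'block3_conv1',
                'block4_conv1',
                'block5_conv1']

# Second convolution of block 5, used for content features
content_layers = ['block5_conv2']

num_style_layers = len(style_layers)
num_content_layers = len(content_layers)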

To better visualize the various filter patterns contained in VGG-19’s early layers, Figures 4-7 display random filters taken from VGG-19’s layers 2-5, as provided by the highly educational Convnet Playground. As expected, these visualizations display increasing filter-pattern complexity as we move deeper into the VGG-19 architecture, with each layer learning progressively more complex patterns from the previous layer’s extracted feature maps:

As these early layers can therefore be expected to perform well in quantifying the degree of feature similarity between pixel arrays, Gatys et al. outline equations that reduce these style and content activations to single scalar values, which are then fed into a combined loss function, as detailed in the code shown below. It should be noted that while the content loss can be calculated directly from the unaltered activation array returned by passing our combination image through the second convolution of VGG-19’s layer 5, the style loss must capture the correlations between the different feature maps within each layer. This is accomplished by taking the Gram matrix of the style feature tensor in each layer, i.e. the dot products between pairs of vectorized feature maps, normalized by the number of spatial locations, which quantifies these style similarities:
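With F^l_{ijc} denoting the activation of feature channel c at spatial position (i, j) in layer l, and I·J the number of spatial positions in that layer, the normalized Gram matrix implemented by the gram_matrix() method below can be written as:

G^l_{cd} = (1 / (I·J)) · Σ_{i,j} F^l_{ijc} · F^l_{ijd}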

This Gram matrix equation can be implemented in the gram_matrix() method below using TensorFlow’s tf.linalg.einsum() function. The method is attached to a StyleContentModel class that implements our truncated VGG model and, via its call() method, transforms our content and style feature maps for use in the content and style loss functions:

import tensorflow as tf

def vgg_layers(layer_names):
  # Helper assumed consistent with the original implementation: builds a truncated,
  # frozen VGG-19 model that returns the requested intermediate layer activations
  vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
  vgg.trainable = False
  outputs = [vgg.get_layer(name).output for name in layer_names]
  return tf.keras.Model([vgg.input], outputs)

class StyleContentModel(tf.keras.models.Model):
  def __init__(self, style_layers, content_layers):
    super(StyleContentModel, self).__init__()
    self.vgg = vgg_layers(style_layers + content_layers)
    self.style_layers = style_layers
    self.content_layers = content_layers
    self.num_style_layers = len(style_layers)
    self.vgg.trainable = False

  def gram_matrix(self, input_tensor):
    result = tf.linalg.einsum('bijc,bijd->bcd', input_tensor, input_tensor)
    input_shape = tf.shape(input_tensor)
    num_locations = tf.cast(input_shape[1]*input_shape[2], tf.float32)
    return result/(num_locations)

  def call(self, inputs):
    "Expects float input in [0,1]"
    inputs = inputs*255.0
    preprocessed_input = tf.keras.applications.vgg19.preprocess_input(inputs)
    outputs = self.vgg(preprocessed_input)
    style_outputs, content_outputs = (outputs[:self.num_style_layers],
                                      outputs[self.num_style_layers:])
    # Style representations are the Gram matrices of each style layer's activations
    style_outputs = [self.gram_matrix(style_output)
                     for style_output in style_outputs]
    content_dict = {content_name: value
                    for content_name, value
                    in zip(self.content_layers, content_outputs)}
    style_dict = {style_name: value
                  for style_name, value
                  in zip(self.style_layers, style_outputs)}

    return {'content': content_dict, 'style': style_dict}

extractor = StyleContentModel(style_layers, content_layers)
results = extractor(tf.constant(content_image))
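The dictionary returned by the extractor can be inspected directly, which is a convenient way to confirm which representation each key holds:

# 'style' maps each style layer to its Gram matrix,
# 'content' maps each content layer to its raw activation tensor
for name, output in results['style'].items():
  print('style layer:', name, '- Gram matrix shape:', output.shape)
for name, output in results['content'].items():
  print('content layer:', name, '- activation shape:', output.shape)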

We can then leverage the content and style feature maps returned by calling our extractor object on our input image to compute the style and content losses using the style_content_loss() function below. For the style loss, we sum over each of our five style layers the mean squared error between the Gram matrices of the combination image and of the style image; for the content loss, we perform a single mean squared error computation on the activations returned by our extractor for the combination and content images.

def style_content_loss(outputs, content_targets, style_targets, style_weight = 1e-2, content_weight = 1e4):
    style_outputs = outputs['style']
    content_outputs = outputs['content']
    # MSE between the combination and style Gram matrices, summed over the five style layers
    style_loss = tf.add_n([tf.reduce_mean((style_outputs[name]-style_targets[name])**2)
                           for name in style_outputs.keys()])
    style_loss *= style_weight / num_style_layers

    # Single MSE between the combination and content activations
    content_loss = tf.add_n([tf.reduce_mean((content_outputs[name]-content_targets[name])**2)
                             for name in content_outputs.keys()])
    content_loss *= content_weight / num_content_layers
    loss = style_loss + content_loss
    return loss

style_targets = extractor(style_image)['style']
content_targets = extractor(content_image)['content']
image = tf.Variable(content_image)

One additional component of our total loss function limits the number of high-frequency artifacts produced in the combination image. We can penalize these with a standard total variation loss (“TVL”), which acts as a regularization term on the high-frequency components of an image:

def high_pass_x_y(image):
  x_var = image[:, :, 1:, :] - image[:, :, :-1, :]
  y_var = image[:, 1:, :, :] - image[:, :-1, :, :]
  return x_var, y_var

def total_variation_loss(image):
  x_deltas, y_deltas = high_pass_x_y(image)
  return tf.reduce_sum(tf.abs(x_deltas)) + tf.reduce_sum(tf.abs(y_deltas))
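For a single-image batch this hand-rolled loss agrees with TensorFlow’s built-in tf.image.total_variation (which returns one value per image in the batch), and it is the built-in that the training step below uses. A quick sanity check, assuming image is the [1, H, W, 3] float variable defined above:

manual = total_variation_loss(image)             # scalar sum over the whole batch
builtin = tf.image.total_variation(image)[0]     # built-in value for our single image
print(float(manual), float(builtin))             # the two numbers should match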

With this total variation loss implemented, we can then create our gradient descent function using a standard Adam optimizer. At each step this function computes our style and content losses as well as the total variation loss, and then calculates and applies the gradient updates to our pixel values:

total_variation_weight=30
opt = tf.optimizers.Adam(learning_rate=0.02, beta_1=0.99, epsilon=1e-1)

def clip_0_1(image):
  # Keep the optimized pixel values inside the valid [0, 1] range after each update
  return tf.clip_by_value(image, clip_value_min=0.0, clip_value_max=1.0)

@tf.function()
def train_step(image):
  with tf.GradientTape() as tape:
    outputs = extractor(image)
    loss = style_content_loss(outputs, content_targets, style_targets)
    loss += total_variation_weight*tf.image.total_variation(image)  # built-in equivalent of total_variation_loss()

  grad = tape.gradient(loss, image)
  opt.apply_gradients([(grad, image)])
  image.assign(clip_0_1(image))
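A minimal driver loop for this training step might then look as follows; the epoch and step counts here are illustrative rather than tuned values from the project:

import time

epochs = 10                         # illustrative values
steps_per_epoch = 100

start = time.time()
for epoch in range(epochs):
  for _ in range(steps_per_epoch):
    train_step(image)               # one gradient update of the combination image's pixels
  print('Epoch {} done, {:.1f}s elapsed'.format(epoch + 1, time.time() - start))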

Altogether this stepwise gradient descent process can therefore be summarized as:

  1. Style Loss: Pass the combination image to the first convolutions of layers 1-5 of our VGG-19 model to extract style feature maps
  2. Compute the Gram matrices of the style feature maps from (1) and compute the mean squared error versus the target style image’s Gram matrices to calculate the style loss
  3. Multiply the style loss by the style loss weight and add it to our total loss
  4. Content Loss: Pass the combination image to the second convolution of layer 5 of our VGG-19 model to extract the content feature map
  5. Compute the mean squared error of the feature map from (4) versus that of the original content image to calculate the content loss
  6. Multiply the content loss by the content loss weight and add it to our total loss
  7. Total Variation Loss: Compute the total variation loss of our image using a standard regularization term on its high-frequency artifacts
  8. Multiply the TVL by the TVL weight and add it to our total loss
  9. Gradient Updates: Calculate the gradients of our pixel values in the direction that reduces the total loss from (8) and update the pixel values
  10. Repeat steps (1)-(9) until convergence

Results

Applying this process using Central Park’s Bow Bridge and Erin Hanson’s Layered Light as our content and style images respectively produces the below output, interactively displaying our pixel gradient updates in real time:

For our second approach, we simply use the pre-trained neural artistic stylization network originally proposed by Ghiasi et al. and made available through TensorFlow Hub, in order to visualize the differences between the two models’ outputs, producing the results shown in Figures 9-11.

import tensorflow_hub as hub
from IPython.display import display  # for showing PIL images inside a notebook

# Load Ghiasi et al.'s arbitrary image stylization model from TensorFlow Hub
hub_model = hub.load('https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2')

# style_images and content_images are lists of image file paths;
# load_img() and tensor_to_image() are small helpers (a sketch of both is given below)
for style_image in style_images:
  print('{} style image used'.format(style_image))
  for content_image in content_images:
    content_image_ = load_img(content_image)
    style_image_ = load_img(style_image)
    start_time = time.time()
    stylized_image = hub_model(tf.constant(content_image_), tf.constant(style_image_))[0]
    print('Stylized in {:.1f}s'.format(time.time() - start_time))
    stylized_image = tensor_to_image(stylized_image)
    display(stylized_image)
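The load_img() and tensor_to_image() helpers above are small utilities rather than TensorFlow built-ins; a minimal sketch of both, consistent with how they are used in this post (float images in [0, 1] with a leading batch dimension), might look like:

import numpy as np
import PIL.Image

def load_img(path_to_img, max_dim=512):
  # Read, decode and rescale an image file into a [1, H, W, 3] float tensor in [0, 1],
  # resizing it so that its longest side is max_dim pixels
  img = tf.io.read_file(path_to_img)
  img = tf.image.decode_image(img, channels=3)
  img = tf.image.convert_image_dtype(img, tf.float32)
  shape = tf.cast(tf.shape(img)[:-1], tf.float32)
  scale = max_dim / tf.reduce_max(shape)
  new_shape = tf.cast(shape * scale, tf.int32)
  img = tf.image.resize(img, new_shape)
  return img[tf.newaxis, :]

def tensor_to_image(tensor):
  # Convert a [1, H, W, 3] float tensor in [0, 1] back into a PIL image
  tensor = np.array(tensor * 255, dtype=np.uint8)
  if np.ndim(tensor) > 3:
    tensor = tensor[0]
  return PIL.Image.fromarray(tensor)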

Given that Ghiasi et al. implement their solution with a different model backbone, training two separate style prediction and style transfer networks on a corpus of 80,000 images, the result of passing the same content and style images to their model is, as expected, quite different from that of our from-scratch implementation, as can be seen in Figure 9:

Conclusion

In this project we implemented neural style transfer following two distinct approaches: (i) building our algorithm from scratch with TensorFlow’s linear algebra functions, following Gatys et al.’s original approach and using convolutional layers of the VGG-19 network, and (ii) using TensorFlow Hub’s pre-trained neural artistic stylization network proposed by Ghiasi et al., in order to compare the outputs of the two implementations.

Our next steps would be to experiment with different VGG-19 convolutional layers for extracting our feature maps, as well as to test the layers of other classification models trained on ImageNet, such as ResNet-50 and EfficientNet-B5.

Thanks for reading!