A learnable visual grammar for generating paintings

Rudolf Arnheim was an art and film theorist whose work, such as “Art and Visual Perception: A Psychology of the Creative Eye” (1954) and “Visual Thinking” (1969), investigated art through the lens of science and sensory perception. Our computational creativity work uses computers as tools for generating visual 'art' in a way that is inspired by Arnheim's formalism.

There has been a recent burgeoning of open-source code for generating images guided by dual encoders, fueled by evaluations from OpenAI's CLIP model, which was trained on 400M images and their associated text (titles, descriptions or tags) from the internet. Here we provide an open-source version of our generative architecture, called 'Arnheim', which can be optimised to produce images in the tradition of painting or drawing (i.e. discrete macro-pixel mark making rather than pixel-level image generation) using a hierarchical neural rewrite system. As in other approaches to generating ‘art’, the generated paintings are evaluated using CLIP. Thanks to the grammar, we have greater access to and control over the generative process of producing a painting. We also gain insights into the generative capabilities of neural architectures: by watching videos of a particular architecture learning to produce a painting, one can appreciate its expressive capabilities (and learning biases). Further control can be achieved by adding computational aesthetic loss functions to the CLIP loss, which connects our work to a literature that stretches back to Birkhoff’s work on aesthetic measures in 1933. We hope that by providing two Colabs (the latest Arnheim_2 Colab and a legacy Arnheim 1.0 Colab), others will be able to easily invent their own generative architectures and paintings.

The Arnheim Method

An outline of the Arnheim architecture is shown below and described in our paper. It consists of a set of learnable input tokens called the input sequence (shown in yellow). Each token will end up describing a whole sequence of strokes. Tokens are independent of each other (they are processed in the batch dimension). Each token has 3 parts: a position specification, which determines where that sequence of strokes should appear; a controller part, which determines which of the downstream LSTMs (a kind of recurrent neural network) should interpret the current token; and an embedding part, which provides learned input to the ith LSTM at the 0th level, LSTM(0,i). Each token is copied num_steps times (6 times in the figure below) and is input into the first layer of LSTMs (layer 0). For each input token, the LSTMs then output their own sequence of tokens (shown in green). The original x,y “where” specification is added to that output, and the various parts of the output are scaled (to produce the blue tokens). In some cases the scales can also be learnable, but setting them by hand gives greater control over the style of the painting.
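The three-part token layout can be sketched in a few lines of plain Python. This is a minimal illustration, not the Colab's implementation: the sizes (`POS_DIM`, `NUM_LSTMS`, `EMBED_DIM`), the helper names, and the rule of picking the LSTM with the highest controller score are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sizes, chosen only for illustration.
POS_DIM = 2      # x, y "where" specification
NUM_LSTMS = 3    # number of level-0 LSTMs the controller can choose from
EMBED_DIM = 4    # learned embedding fed to the chosen LSTM


@dataclass
class Token:
    position: List[float]    # where the sequence of strokes should appear
    controller: List[float]  # one score per downstream LSTM
    embedding: List[float]   # learned input for the selected LSTM


def split_token(flat: List[float]) -> Token:
    """Split a flat parameter vector into its three parts."""
    assert len(flat) == POS_DIM + NUM_LSTMS + EMBED_DIM
    return Token(
        position=flat[:POS_DIM],
        controller=flat[POS_DIM:POS_DIM + NUM_LSTMS],
        embedding=flat[POS_DIM + NUM_LSTMS:],
    )


def select_lstm(token: Token) -> int:
    """Assumed routing rule: the highest controller score wins."""
    return max(range(NUM_LSTMS), key=lambda i: token.controller[i])


def repeat_embedding(token: Token, num_steps: int) -> List[List[float]]:
    """Copy the embedding num_steps times to form the LSTM input sequence."""
    return [list(token.embedding) for _ in range(num_steps)]


flat = [0.2, -0.5, 0.1, 0.9, 0.3, 1.0, 2.0, 3.0, 4.0]
token = split_token(flat)
lstm_index = select_lstm(token)
lstm_inputs = repeat_embedding(token, num_steps=6)
print(lstm_index, len(lstm_inputs))
```

In a differentiable implementation the hard choice made by `select_lstm` would likely need a soft or otherwise trainable counterpart; the sketch only shows how one token carries "where", "which interpreter" and "what" information side by side.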

In the batch dimension, each of the 6 blue tokens is copied to produce another set of e.g. 6 tokens each, and the same process takes place at the next level of LSTMs, see below…

This means that one token at the top level produces 6 tokens at the middle level, which in turn produce 6×6 = 36 tokens at the lowest level. Each of these 36 tokens is then interpreted as a stroke in one of two possible ways. In Arnheim 1.0 the stroke specifies a homomorphic transform to impose on the final line to be drawn (i.e. a displacement, rotation and scaling), as well as the thickness and colour of the line. In Arnheim 2.0 the stroke specifies the displacement, thickness and colour of a Bezier curve. In this way, a single top-level token and the parameters of the LSTMs encode the “what” and “where” properties of a sequence of strokes, which have a systematic relationship to each other determined by the parameters of the LSTM and the input token. This is the key to imposing structure on the painting.
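The expansion from one top-level token to 36 strokes can be made concrete with a toy rewrite system. The sketch below is an assumption-laden illustration: `toy_generator` is a fixed, made-up function standing in for a learned LSTM, the token is reduced to `(x, y, what)`, and the hand-set `scales` are arbitrary values.

```python
import math
from typing import List, Tuple

Token = Tuple[float, float, float]  # (x, y, what): position plus a 1-D "what" code


def toy_generator(token: Token, num_steps: int, level: int) -> List[Token]:
    """Stand-in for LSTM(level, i): emits num_steps (dx, dy, what) outputs.

    A real LSTM would be learned; this fixed function only makes the
    data flow concrete.
    """
    _, _, what = token
    outputs = []
    for step in range(num_steps):
        angle = 2.0 * math.pi * step / num_steps + what
        child_what = what + 0.1 * (step + 1) * (level + 1)
        outputs.append((math.cos(angle), math.sin(angle), child_what))
    return outputs


def expand(token: Token, num_steps: int, level: int, scale: float) -> List[Token]:
    """Rewrite one token into num_steps child tokens."""
    x, y, _ = token
    children = []
    for dx, dy, what in toy_generator(token, num_steps, level):
        # Scale the output, then add the parent's "where" specification back on.
        children.append((x + scale * dx, y + scale * dy, what))
    return children


def rewrite(top: Token, num_steps: int, scales: List[float]) -> List[Token]:
    """Apply one expansion per level; each level multiplies the token count."""
    tokens = [top]
    for level, scale in enumerate(scales):
        tokens = [child for t in tokens for child in expand(t, num_steps, level, scale)]
    return tokens


strokes = rewrite(top=(0.5, 0.5, 0.0), num_steps=6, scales=[0.3, 0.1])
print(len(strokes))  # 6 * 6 = 36 strokes from a single top-level token
```

Because each level adds a scaled offset to its parent's position, the strokes descended from one token stay clustered around it, which is one way to read the claim that the hierarchy imposes structure on the painting.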

Generative Architectures

The latest Arnheim_2 Colab provides a set of architectures, Arnheim2, Photographic, DPPN, and SEQTOSEQ, which all have different ways of producing the marks to be optimised by CLIP. “Photographic” uses a direct encoding of strokes as used in CLIPDraw, except that the thickness, the colour and a set of stroke modifiers can also come from one of 3 respective LSTMs, giving the painting mildly greater coherence (e.g. see the painting of Flemish still life with steak and tulips below). DPPN uses a simple feed-forward neural network with a residual connection directly from the input x,y position specification to the output position specification. This gives a clear “what”/“where” division in the stroke description, so that learning can more easily modify “what” without modifying “where”, and vice versa. SEQTOSEQ is a simple encoder and decoder LSTM: the top-level LSTM reads the sequence and outputs a final hidden state, which is sent to the decoder LSTM, which generates the stroke description tokens, again with a residual connection from input positions to output positions providing the crucial what/where division. These architectures fall along a spectrum from photorealistic to abstract, as you can see below: the apple on the right is produced by Arnheim2 with only 2 input tokens, and the left image is produced by a more direct encoding of strokes (Photographic). Note how the style changes with different encoding schemes.
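The effect of a residual connection from input position to output position can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not the DPPN code from the Colab: `tiny_mlp` uses fixed, made-up weights, and it sees only the embedding (the "what" part), whereas the real network may also receive the position as input.

```python
import math
from typing import List, Tuple


def tiny_mlp(embedding: List[float]) -> Tuple[float, float, float]:
    """Stand-in feed-forward network with fixed, made-up weights.

    Returns (dx, dy, thickness). A real network would be learned.
    """
    hidden = [math.tanh(0.5 * e + 0.1 * i) for i, e in enumerate(embedding)]
    dx = 0.1 * sum(hidden)
    dy = 0.1 * (hidden[0] - hidden[-1])
    thickness = 1.0 + abs(hidden[0])
    return dx, dy, thickness


def stroke_with_residual(position: Tuple[float, float],
                         embedding: List[float]) -> Tuple[float, float, float]:
    """Network output plus a skip connection from input to output position."""
    dx, dy, thickness = tiny_mlp(embedding)
    x, y = position
    return x + dx, y + dy, thickness


embedding = [0.3, -0.7, 1.2]
a = stroke_with_residual((0.2, 0.2), embedding)
b = stroke_with_residual((0.6, 0.1), embedding)

# Moving the input position moves the stroke by the same amount (up to
# floating-point rounding), while the "what" output (thickness) is unchanged.
shift = (b[0] - a[0], b[1] - a[1])
print(shift, a[2] == b[2])
```

In this simplified setting, changing the position parameters relocates a stroke without altering its appearance, and changing the embedding alters its appearance around the same anchor point.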

The legacy Colab shows a simplified version of Arnheim 1.0 which was described here: but which uses only one GPU and uses evolution instead of CLIP gradients to optimise the parameters. This is much less efficient than using gradients, but the Colab is included for completeness and allows you to produce much more general (non-differentiable) generators if you wish.
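To show why evolution permits non-differentiable generators, here is a minimal sketch of a (1 + λ) evolution strategy with Gaussian mutation. It is a generic textbook scheme, not the algorithm used in the legacy Colab, and the `fitness` function is a made-up step-like score standing in for a CLIP evaluation of a rendered painting.

```python
import random
from typing import Callable, List


def fitness(params: List[float]) -> float:
    """Toy non-differentiable score: count of parameters that round to 1."""
    return float(sum(1 for p in params if round(p) == 1))


def evolve(score: Callable[[List[float]], float], dim: int, generations: int,
           children_per_gen: int, sigma: float, seed: int) -> List[float]:
    """(1 + lambda) evolution strategy with Gaussian mutation and elitism."""
    rng = random.Random(seed)
    parent = [0.0] * dim
    parent_score = score(parent)
    for _ in range(generations):
        # Mutate the parent to create a population of children.
        children = [[p + rng.gauss(0.0, sigma) for p in parent]
                    for _ in range(children_per_gen)]
        best_child = max(children, key=score)
        best_score = score(best_child)
        # Replace the parent only if the best child is at least as good.
        if best_score >= parent_score:
            parent, parent_score = best_child, best_score
    return parent


best = evolve(fitness, dim=8, generations=200, children_per_gen=10,
              sigma=0.2, seed=0)
print(fitness(best))
```

The score here is flat almost everywhere, so its gradient carries no signal, yet selection can still make progress. The cost is many evaluations per improvement, which is consistent with the efficiency gap noted above.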

These two Colabs are examples of how AI can be used to augment human creativity by suggesting possible ways of forming a depiction. Alva Noë defines art as the process of reorganisation of experience, effectively as a kind of visual philosophy. Whilst we are still very far from an algorithmic understanding of this deeply human process, the Colabs here show that, to some extent, one minor aspect can be understood by synthesis: namely, how decisions are made about which ordered marks to make to produce a depiction efficiently. We hope that others will modify these algorithms in fascinating ways.


The following pictures show example outputs of each of the architectures.

Above: "Flemish still life with steak and tulips" generated with Arnheim 1.0 (500 GPU version); below: the same prompt generated with Arnheim 2.0 (single GPU version)
Above: "A chicken" with 200 input vectors; below: "A chicken" with only one input vector but a longer sequence length
A wild seascape in the style of late John Constable painting
A photorealistic chicken
A red coral

