Note: Neural Turing Machine

Most elegant design I’ve ever seen…

Mundane Turing Machine

A machine that, at every move, reads a symbol at the current position of its head on a tape, uses its state register and the symbol just read to look up the next instruction in its table, and then, based on that instruction, writes a symbol with the head, updates its state register, and moves left/right or halts.

Reference: See Wikipedia.
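
A minimal sketch of a single step of such a machine, just to make the description concrete (my own illustration; the table layout and move encoding are assumptions, not anything from the paper):

```python
# One step of a (mundane) Turing machine as described above.
# `table` maps (state, symbol) -> (symbol_to_write, move, next_state),
# where move is "L", "R", or "HALT". All names here are illustrative.
def step(table, tape, head, state):
    symbol = tape[head]                          # read under the head
    write, move, next_state = table[(state, symbol)]
    tape[head] = write                           # write with the head
    head += {"L": -1, "R": 1, "HALT": 0}[move]   # move left/right or stay
    return tape, head, next_state
```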

Question:
Wouldn’t a Turing machine in N-dimensional space be interesting…the head could move in an arbitrary direction…


Overall Architecture

Difference:
A neural network is used to replace the mundane instruction table, just as Deep Q-learning did. So this is basically a Q-learning model with a differentiable memory bank!


Memory Operations

To compensate for that, the memory needs to be differentiable so that it can be updated by gradient descent.

Define the memory to be an $M\times N$ matrix $M_t$. It is seen as a memory with $N$ entries, each with a capacity of $M$.

Read Operation:
The read head emits a softmaxed weighting vector $w_t$ of length $N$; the resulting $1 \times M$ read vector $r_t$ is
$$r_t=\sum_{i=1}^{N}{w_t(i)\, M_t(i)}$$
Very intuitive.
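
As a sanity check, here is a minimal NumPy sketch of the read (my own illustration, not the paper's code); `memory` is the $M \times N$ matrix and `w` the length-$N$ weighting:

```python
import numpy as np

def read(memory, w):
    # r_t = sum_i w_t(i) * M_t(i): a weighted sum of the memory columns.
    return memory @ w  # length-M read vector
```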

Write Operation:
The write head emits a softmaxed weighting vector $w_t$ to specify which entries of the matrix will be modified.

For erasing, the write head emits an erase vector $e_t$ of length $M$ with components in $(0,1)$, and erases contents by combining the two (componentwise within each selected entry):
$$\dot M_{t+1}(i) = M_t(i)\left[\mathbf{1}-w_t(i)\,e_t\right]$$

For adding, the write head emits an add vector $a_t$ of length $M$ to add to the selected contents:
$$M_{t+1}(i)=\dot M_{t+1}(i) + w_t(i)\,a_t$$
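
A minimal NumPy sketch of the erase-then-add write under the same conventions (my own illustration); `w` is the length-$N$ weighting, `e` and `a` the length-$M$ erase and add vectors:

```python
import numpy as np

def write(memory, w, e, a):
    # Erase: M_t(i) * (1 - w_t(i) e_t), componentwise for each column i.
    erased = memory * (1.0 - np.outer(e, w))
    # Add: + w_t(i) a_t for each column i.
    return erased + np.outer(a, w)
```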


Addressing

The aforementioned weighting vectors are used to operate on selected entries, but to select those entries in a differentiable way, we need two addressing mechanisms, namely Content-based Addressing and Location-based Addressing.

Overall Addressing Diagram

Content-based Addressing

It allows the network to select entries by stating their content, so that it can then read or update them. The head emits a key vector $k_t$ of length $M$, which is compared to each entry to get a content-similarity vector $w^c_t$ of length $N$:
$$w^c_t(i) = \frac{k_t \cdot M_t(:,i)}{\|k_t\|\cdot\|M_t(:,i)\|}$$
This weighting is then normalized by softmax to attenuate smaller values and amplify larger values.
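
A minimal NumPy sketch of content-based addressing (my own illustration; the small `eps` for numerical stability is my addition, and the key-strength scalar $\beta_t$ used in the paper before the softmax is left out since it is not discussed here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_address(memory, k, eps=1e-8):
    # Cosine similarity between the key k (length M) and every memory column.
    sims = (memory.T @ k) / (np.linalg.norm(memory, axis=0) * np.linalg.norm(k) + eps)
    # Softmax-normalize to attenuate small values and amplify large ones.
    return softmax(sims)
```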

Location-based Addressing

It allows the network to address a memory location directly, without knowing what is stored inside.

First, each head needs to provide a scalar interpolation gate $g_t \in (0,1)$ that blends the two addressing types to form the current weighting:
$$w^g_t = g_t w^c_t + (1-g_t) w_{t-1}$$
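
In code, the gate is just a convex blend of the two weightings (my own illustration):

```python
def interpolate(w_content, w_prev, g):
    # g near 1 favors content-based addressing; g near 0 keeps last step's weighting.
    return g * w_content + (1.0 - g) * w_prev
```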

For location-based addressing, there are three possible actions, namely to shift by $-1$, $0$, or $+1$ entry, just like a mundane Turing machine. This is controlled by a length-$3$ softmaxed shift vector $s_t$ provided by the head. Besides the softmax, the paper also experiments with the controller emitting a single shift scalar, interpreted as the lower bound of a width-one uniform distribution over shifts.

In order to perform the selected shift, a circular convolution is applied to the gated weighting $w^g_t$, essentially rolling the weighting vector $w^g_t$ by the shift distribution $s_t$:
$$\dot w_t(i) = \sum_{j=0}^{N-1}w^g_t(j)\, s_t(i-j)$$
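
A minimal NumPy sketch of that circular shift (my own illustration; I assume `s` is indexed over the shifts $(-1, 0, +1)$ in that order, and that `np.roll`'s sign convention matches the convolution above):

```python
import numpy as np

def shift(w_g, s, shifts=(-1, 0, 1)):
    # Circular convolution: each allowed shift k contributes a rolled copy
    # of w_g weighted by its probability s(k).
    return sum(s_k * np.roll(w_g, k) for s_k, k in zip(s, shifts))
```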

Sharpening

Inevitably, the rolled weighting vector $\dot w_t$ will be slightly blurred due to the differentiable nature of the roll, so a sharpening operation is applied, using a scalar $\gamma_t \geqslant 1$ emitted by the head:
$$w_t(i) = \frac{\dot{w}_t(i)^{\gamma_t}}{\sum_j{\dot{w}_t(j)^{\gamma_t}}}$$
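
And the corresponding sharpening step (my own illustration): raising each weight to the power $\gamma_t \geq 1$ and renormalizing concentrates the weighting back onto a few entries.

```python
def sharpen(w, gamma):
    # w: NumPy weighting vector; gamma >= 1 sharpens the distribution.
    w_pow = w ** gamma
    return w_pow / w_pow.sum()
```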


Controller Network

Both a feed-forward neural network and an LSTM are considered as the controller. However, since currently only one read/write head is used per time step, simple operations like multiplying two numbers stored in memory are impossible for a feed-forward controller. Thus units of a recurrent nature work better.


Reference

Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing Machines. arXiv:1410.5401.