All transformations learned by deep neural networks can be reduced to a handful of tensor operations that are applied to tensors of numeric data. In the first example, the network was built by stacking dense layers on top of each other. A layer instance looked like:
layer_dense(units = 512, activation = "relu")
This layer can be interpreted as a function that takes a 2D tensor as input and returns another 2D tensor, a new representation of the input tensor. Specifically, the function is as follows (where W is a 2D tensor and b is a vector, both attributes of the layer):
output = relu(dot(W, input) + b)
There are three tensor operations above:
- a dot product (dot) between the input tensor and a tensor named W
- an addition (+) between the resulting 2D tensor and a vector b
- a relu operation, where relu(x) is max(x, 0).
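The three operations above can be sketched in base R. This is only an illustration under stated assumptions, not how Keras implements the layer: the helper name naive_dense and the example values are hypothetical, and the input is assumed to hold one sample per row, so the dot product is written input %*% W.

```r
# Hypothetical helper: forward pass of a dense layer in base R.
# Assumes input is (samples, features), W is (features, units),
# and b is a vector with one entry per unit.
naive_dense <- function(input, W, b) {
  z <- input %*% W           # dot product, shape (samples, units)
  z <- sweep(z, 2, b, `+`)   # add b[j] to every entry in column j
  pmax(z, 0)                 # element-wise relu
}

input <- matrix(c(1, 2,
                  3, 4), nrow = 2, byrow = TRUE)
W <- matrix(c(1, -1,
              0,  1), nrow = 2, byrow = TRUE)
b <- c(0.5, -2)

output <- naive_dense(input, W, b)
# rows of output are (1.5, 0) and (3.5, 0)
```

The second unit outputs 0 for both samples because its pre-activation value (-1) is negative and relu clips it.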
Element-wise operations
The relu operation and addition are element-wise operations: operations that are applied independently to each entry in the tensors being considered. This means these operations are highly amenable to massively parallel implementations (vectorised implementations, a term that comes from the vector processor supercomputer architecture from the 1970-1990 period). A naive R implementation of an element-wise relu operation:
naive_relu <- function(x) {
  # visit every row index and every column index, not just the last one
  for (i in seq_len(nrow(x)))
    for (j in seq_len(ncol(x)))
      x[i, j] <- max(x[i, j], 0)
  x
}
In practice, when dealing with R arrays, these operations are available as well-optimised built-in R functions implemented in compiled code. Linear-algebra operations such as matrix multiplication additionally delegate the heavy lifting to a BLAS implementation (Basic Linear Algebra Subprograms) if you have one installed (which you should). BLAS are low-level, highly parallel, efficient tensor-manipulation routines typically implemented in Fortran or C. In R the following are native element-wise operations that are very fast:
z <- x + y
z <- pmax(z, 0)
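As a quick sanity check, the vectorised form can be compared with an explicit double loop. The small matrices below are made-up example values; the sketch only shows that both routes give the same result, not how much faster the built-ins are.

```r
x <- matrix(c(-1, 2, 3, -4), nrow = 2)
y <- matrix(c(2, -3, -1, 5), nrow = 2)

# Vectorised: one call per operation, no explicit loops.
z_fast <- pmax(x + y, 0)

# Loop-based equivalent, for comparison only.
z_slow <- matrix(0, nrow = nrow(x), ncol = ncol(x))
for (i in seq_len(nrow(x))) {
  for (j in seq_len(ncol(x))) {
    z_slow[i, j] <- max(x[i, j] + y[i, j], 0)
  }
}

same <- isTRUE(all.equal(z_fast, z_slow))
```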
Operations with tensors of different dimensions
The R sweep() function enables operations between higher-dimension tensors and lower-dimension tensors. A matrix plus vector addition can be performed as follows:
sweep(x, 2, y, `+`)
The second argument (2) specifies the dimensions of x over which to sweep y. The last argument (+) is the operation to perform during the sweep, which should be a function of two arguments: x and an array of the same dimensions generated from y by aperm().
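A small worked case may make the role of the second argument clearer. The values are arbitrary; the point is that sweeping over dimension 2 pairs each element of the vector with a column, while sweeping over dimension 1 pairs it with a row.

```r
x <- matrix(1:6, nrow = 2)        # rows are (1, 3, 5) and (2, 4, 6)

# Sweep over columns: the vector needs one value per column of x.
by_col <- sweep(x, 2, c(10, 20, 30), `+`)
# rows of by_col are (11, 23, 35) and (12, 24, 36)

# Sweep over rows: the vector needs one value per row of x.
by_row <- sweep(x, 1, c(100, 200), `+`)
# rows of by_row are (101, 103, 105) and (202, 204, 206)
```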
You can apply a sweep in any number of dimensions and can apply any function that implements a vectorised operation over two arrays. The following example sweeps a 2D tensor over the last two dimensions of a 4D tensor using the pmax() function:
# random values with shape (64, 3, 32, 10); the 1000 draws are recycled to fill the array
x <- array(round(runif(1000, 0, 9)), dim = c(64, 3, 32, 10))
# array of 5s with shape (32, 10)
y <- array(5, dim = c(32, 10))
# output has the same shape as x
z <- sweep(x, c(3, 4), y, pmax)
Tensor dot
The dot operation, also called a tensor product (not to be confused with an element-wise product), is the most common and useful tensor operation. Contrary to element-wise operations, it combines entries in the input tensors.
An element-wise product is performed with the * operator in R, whereas dot products use the %*% operator:
z <- x %*% y
In mathematical notation, you'd note the operation with a dot (.). The dot product of two vectors x and y is computed as follows:
naive_vector_dot <- function(x, y) {
  z <- 0
  # seq_along() also behaves correctly for zero-length vectors
  for (i in seq_along(x))
    z <- z + x[[i]] * y[[i]]
  z
}
Note that the dot product between two vectors is a scalar and that only vectors with the same number of elements are compatible.
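The loop above can be checked against two built-in routes to the same number. The vectors are arbitrary example values.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

# Element-wise product followed by a sum.
via_sum <- sum(x * y)

# %*% on two equal-length vectors returns a 1 x 1 matrix;
# drop() turns it into a plain scalar.
via_operator <- drop(x %*% y)

# both equal 1*4 + 2*5 + 3*6 = 32
```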
It is also possible to take the dot product between a matrix x and a vector y, which returns a vector where the elements are the dot products between y and the rows of x.
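naive_matrix_vector_dot <- function(x, y) {
  z <- rep(0, nrow(x))
  for (i in seq_len(nrow(x)))
    for (j in seq_len(ncol(x)))   # loop variable is j, so y is not overwritten
      z[[i]] <- z[[i]] + x[[i, j]] * y[[j]]
  z   # return the vector of row-wise dot products
}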
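The row-by-row description can be verified against the built-in operator. This is a sketch with made-up values; apply() is used here only to make the "one dot product per row" reading explicit.

```r
x <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)
y <- c(1, 0, -1)

# One dot product per row of x.
by_rows <- apply(x, 1, function(row) sum(row * y))

# Built-in operator; drop() turns the 2 x 1 result into a plain vector.
built_in <- drop(x %*% y)

# both are (-4, -4): 1 - 5 and 2 - 6
```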
Note that as soon as one of the two tensors has more than one dimension, %*% is no longer symmetric, which is to say that x %*% y isn't the same as y %*% x.
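Two small square matrices are enough to see the order dependence. The values are arbitrary; y is chosen as a swap matrix so the effect is easy to read off.

```r
x <- matrix(c(1, 2, 3, 4), nrow = 2)  # rows are (1, 3) and (2, 4)
y <- matrix(c(0, 1, 1, 0), nrow = 2)  # rows are (0, 1) and (1, 0)

xy <- x %*% y   # swaps the columns of x: rows are (3, 1) and (4, 2)
yx <- y %*% x   # swaps the rows of x:    rows are (2, 4) and (1, 3)

order_matters <- !isTRUE(all.equal(xy, yx))
```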
The dot product generalises to tensors with an arbitrary number of axes.
It is possible to take the dot product of two matrices x and y (x %*% y) if and only if ncol(x) == nrow(y). The result is a matrix with shape (nrow(x), ncol(y)), where the coefficients are the dot products between the rows of x and the columns of y.
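The shape rule and the row-times-column reading can both be checked on a small case. The matrices are arbitrary example values.

```r
x <- matrix(1:6, nrow = 2)    # shape (2, 3)
y <- matrix(1:12, nrow = 3)   # shape (3, 4); ncol(x) == nrow(y)

z <- x %*% y
result_shape <- dim(z)        # (nrow(x), ncol(y)), that is (2, 4)

# Entry [1, 2] is the dot product of row 1 of x and column 2 of y:
# (1, 3, 5) . (4, 5, 6) = 4 + 15 + 30 = 49
entry <- sum(x[1, ] * y[, 2])
```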
Tensor reshaping
Reshaping a tensor means rearranging its rows and columns to match a target shape. Naturally, the reshaped tensor has the same total number of coefficients as the initial tensor. Reshaping is best understood via simple examples:
x <- matrix(0:5, nrow = 3, byrow = TRUE)
array_reshape(x, dim = c(6, 1))
array_reshape(x, dim = c(2, 3))
A special case of reshaping that is commonly encountered is called transposition, which means exchanging its rows and columns, such that x[i, ] becomes x[, i]. The t() function can be used to transpose a matrix.
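Transposing and reshaping to the same target shape generally give different results, which the following base R sketch illustrates. It assumes array_reshape() fills arrays in row-major order, as its documentation describes, and emulates that with matrix(..., byrow = TRUE) so the example runs without extra packages.

```r
x <- matrix(0:5, nrow = 3, byrow = TRUE)  # rows are (0, 1), (2, 3), (4, 5)

# Transposition: row i of the result is column i of x.
xt <- t(x)                                # rows are (0, 2, 4) and (1, 3, 5)

# Row-major reshape to shape (2, 3), emulated in base R:
# read x row by row, then refill row by row.
reshaped <- matrix(as.vector(t(x)), nrow = 2, byrow = TRUE)
# rows are (0, 1, 2) and (3, 4, 5)
```

Both results have shape (2, 3), but only the transpose keeps each original column together as a row.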
Geometric interpretation of tensor operations
The contents of tensors can be interpreted as being coordinates of points in some geometric space, and thus all tensor operations have a geometric interpretation. For example, the vector A = [0.5, 1.0] is a point in 2D space. Now consider a new point B = [1, 0.25], which we will add to A. This can be done geometrically by chaining together the vectors, with the resulting location being the vector representing the sum of A and B.
In general, elementary geometric operations such as affine transformations, rotations, scaling, and so on can be expressed as tensor operations.
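A few of these operations can be written out for the point A from the example above. The rotation angle and scaling factors are arbitrary choices made for illustration.

```r
A <- c(0.5, 1.0)
B <- c(1, 0.25)

# Translation: adding B moves A to (1.5, 1.25).
translated <- A + B

# Rotation by 90 degrees counter-clockwise, as a dot product
# with the matrix whose rows are (cos, -sin) and (sin, cos).
theta <- pi / 2
rotation <- matrix(c(cos(theta), sin(theta),
                     -sin(theta), cos(theta)), nrow = 2)
rotated <- drop(rotation %*% A)           # approximately (-1, 0.5)

# Scaling: stretch the first axis by 2, shrink the second by half.
scaled <- drop(diag(c(2, 0.5)) %*% A)     # (1, 0.5)

# An affine transformation combines a linear map with a translation.
affine <- drop(rotation %*% A) + B        # approximately (0, 0.75)
```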
A geometric interpretation of deep learning
Neural networks consist entirely of chains of tensor operations, and all of these tensor operations are just geometric transformations of the input data. It follows that you can interpret a neural network as a very complex geometric transformation in a high-dimensional space, implemented via a long series of simple steps.
The following mental image may prove useful: imagine two sheets of coloured paper, one red and one blue, lying on top of each other. We crumple them together into a small ball. The crumpled paper ball represents our input data, and each sheet of paper is a class of data in a classification problem. What a neural network (or any other machine-learning model) is meant to do is figure out a transformation of the paper ball that would uncrumple it, so that we can separate the two classes. Deep learning performs this as a series of simple transformations.
Uncrumpling paper balls is what machine learning is about: finding neat representations for complex, highly folded data manifolds. Deep learning excels at this because it takes the approach of incrementally decomposing a complicated geometric transformation into a long chain of elementary ones, which is akin to the strategy a human would follow to uncrumple a paper ball. Each layer in a deep network applies a transformation that disentangles the data a little and a deep stack of layers makes tractable an extremely complicated disentanglement process.