## Automatic differentiation in pqR

*2019-07-06 at 5:39 pm* *
1 comment *

I’ve released a version of my pqR implementation of R that has extensions for automatic differentiation. This is not a stable release, but it can be downloaded from pqR-project.org — look for the test version at the bottom — and installed the same as other pqR versions (from source, so you’ll need C and Fortran compilers). |

Note that this version very likely has various bugs — mostly showing up only if you use automatic differentiation, I hope.

You can read about the automatic differentation facilities here, or with help(Gradient) after installing the test version. Below are a few examples to show a bit of what you can do.

You can get derivatives (gradients) using the new `with gradient` construct. Here’s a simple example:

> with gradient (abc=7) abc^2+5 [1] 54 attr(,"gradient") [1] 14

The `with gradient` construct creates a new local environment with a variable `abc`, which is initialized to 7. The expression `abc^2+5` is evaluated in this enviroment, and the result is returned along with its derivative with respect to `abc`, attached as a `gradient` attribute.

The expression whose gradient is found needn’t be so simple. Here’s another example:

> f <- function (x) { + s <- 0 + for (i in 1:length(x)) s <- s + i*x[i] + s + } > vec <- c(3,1,9) > with gradient (vec) f(vec) [1] 32 attr(,"gradient") [,1] [,2] [,3] [1,] 1 2 3 > with gradient (vec) f(vec^2) [1] 254 attr(,"gradient") [,1] [,2] [,3] [1,] 6 4 54

Here, `with gradient (vec)` initializes the `vec` variable to the value of the existing variable of that name (same as if `(vec=vec)` had been written).

The output of `with gradient` matches what the existing `numericDeriv` function gives:

> numericDeriv (quote (f(vec^2)), "vec") [1] 254 attr(,"gradient") [,1] [,2] [,3] [1,] 6 4 54

However, `with gradient` computes the derivatives exactly (apart from roundoff error), whereas `numericDeriv` computes an approximation using finite differences.

The output of both `with gradient` and `numericDeriv` in this example is a “Jacobian” matrix, with one row in this case, since the value of `f(vec^2)` is a scalar, and three columns, giving the derivatives with respect to the three elements of `vec`.

When finding the gradient with respect to many variables (or vector elements), `numericDeriv` may be very slow, since getting numerical derivatives with respect to N variables requires at least N+1 evaluations of the expression. In contrast, `with gradient` evaluates the expression once, either computing gradients as it goes (called “forward mode” differentiation) or recording how the value was computed and later finding the gradient by “backpropagation” (also called “reverse mode” diffferentiation). The pqR implementation tries to choose which of these modes is best, though at present forward mode is always used for some operations for which reverse mode hasn’t been implemented yet.

Neural network training with multiple hidden layers is one application where reverse mode differentiation is crucial. Here’s a function to compute the output of a neural network with two hidden layers, given an input vector `x` and network parameters in `L`:

nnet <- function (x, L) { h1 <- tanh (L$b1 + x %*% L$W1) h2 <- tanh (L$b2 + h1 %*% L$W2) as.vector (L$b3 + h2 %*% L$W3) }

If the two hidden layers both have 10 units, so there are 100 weights in L$W2, naive forward mode differentiation would record a 10-by-100 element Jacobian matrix associated with `h2`. But reverse mode differentiation starting from the final scalar output value only needs to compute Jacobian matrices with a single row (and up to 100 columns), as it works backwards.

The current automatic differentiation implementation in pqR manages to achieve this automatically (though it doesn’t yet when some other operations are done).

Here’s how this network function could be used for gradient descent training to minimize squared error when predicting responses in a training set, using a network with `n1` and `n2` units in the hidden layers:

train <- function (X, y, n1, n2, iters, step, sd=0.1) { # Initialize parameters randomly. L <- list() n0 <- ncol(X) L$b1 <- rnorm (n1,sd=sd) L$W1 <- matrix (rnorm(n0*n1,sd=sd), n0, n1) L$b2 <- rnorm (n2,sd=sd) L$W2 <- matrix (rnorm(n1*n2,sd=sd), n1, n2) L$b3 <- rnorm (1,sd=sd) L$W3 <- rnorm (n2,sd=sd) # Train for 'iters' iterations to minimize squared # error predicting y. for (i in 1:iters) { # Find gradient of squared error (summed over all # training examples) with respect to the parameters. r <- with gradient (L) { e <- 0 for (i in 1:nrow(X)) { o <- nnet (X[i,], L) e <- e + (y[i]-o)^2 } e } g <- attr(r,"gradient") if (i %% 100 == 0) cat ("Iteration", i, ": Error", round(r,4), "\n") # Update parameters to reduce squared error. L$b1 <- L$b1 - step * as.vector (g$b1) L$W1 <- L$W1 - step * as.vector (g$W1) L$b2 <- L$b2 - step * as.vector (g$b2) L$W2 <- L$W2 - step * as.vector (g$W2) L$b3 <- L$b3 - step * as.vector (g$b3) L$W3 <- L$W3 - step * as.vector (g$W3) } L }

Note that when the gradient is with respect to a list (`L` here), the gradient is a list of the corresponding Jacobian matrices (which here have just one row).

Here’s code to use this `train` function to learn an example function of two inputs, based on 100 slightly noisy examples:

set.seed(1) truef <- function (X) cos (2*sqrt(X[,1]*X[,2])) - 2*(0.4-X[,1])^2 N <- 100 X <- cbind (runif(N), runif(N)) y <- truef (X) + rnorm(N,sd=0.01) print (system.time (L <- train (X, y, 10, 10, 30000, 0.001)))

The 30000 training iterations (each looking at all 100 training cases) take 30 seconds on my computer.

The result is pretty good, as seen from the contour plots below:

You can get the source for this and other examples from a repository of pqR automatic differentiation examples.

I’ll be talking about automatic differentiation for pqR at the RIOT workshop held in Toulouse on July 11, in conjunction with UseR! 2019.

Entry filed under: Computing, Machine Learning, R Programming, Statistics, Statistics - Computing.

1.Liu Wucutin | 2019-08-06 at 7:12 amRecently I have learning your great work — Bayesian Neural Network, it’s so beautiful and powerful. When I get the result through Turing and Flux(two libraries in Julia language), I’m deeply obsessed. Thank you very much, Professor, best wishes.