
Source: google-ml-course

Neural networks

  • Goal: we want the model to learn the nonlinear features itself without us having to manually define the nonlinear features
  • More sophisticated version of feature crosses
  • Automatic learning of the feature crosses

Linear model

Each input (feature) is assigned a weight, and the weighted combination of all these features produces the output.
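
A minimal sketch of this (the feature values and weights below are made up for illustration):

```python
import numpy as np

# Linear model: the output is a weighted sum of the input features plus a bias.
x = np.array([1.0, 2.0, 3.0])      # input features (hypothetical values)
w = np.array([0.5, -1.0, 0.25])    # one learned weight per feature
b = 0.1                            # bias term

y = w @ x + b                      # weighted combination -> scalar output
print(y)                           # 0.5*1 - 1.0*2 + 0.25*3 + 0.1 = -0.65
```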

Incorporate nonlinearity

Add another layer of features?

  • still not nonlinear
  • a linear combination of linear functions is itself linear (see the sketch below)
  • solution: apply a nonlinear transform between the layers (an activation function)
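
A quick numerical sketch of why a nonlinearity is needed (toy sizes, random weights, purely illustrative):

```python
import numpy as np

# Two stacked linear (affine) layers with no activation collapse into one affine map.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2           # "deep" but still purely affine
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)     # equivalent single layer
print(np.allclose(two_layers, collapsed))      # True

# Inserting a nonlinearity (e.g. ReLU) between the layers breaks this equivalence.
relu = lambda z: np.maximum(0.0, z)
one_nonlinearity = W2 @ relu(W1 @ x + b1) + b2
```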

Activation function

Sigmoid function

$$ F(x) = \dfrac{1}{1 + e^{-x}} $$

Rectified Linear Unit (ReLU)

Clamps the value at 0 from below (negative inputs become 0).

$$ F(x) = \max (0, x)$$

  • Often works better than the sigmoid function (empirical finding)
  • This is likely due to ReLU's responsiveness: the sigmoid's gradient is nearly flat at its extreme ends, which slows learning, while ReLU keeps a constant gradient for positive inputs (see the sketch below)
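
A small sketch of both activation functions and their gradients (the input values are chosen only to show the effect):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

# The sigmoid's gradient vanishes for large |x|; ReLU keeps a gradient of 1 for x > 0.
x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))  # derivative of the sigmoid
relu_grad = (x > 0).astype(float)               # derivative of ReLU (taken as 0 at x <= 0)
print(sigmoid_grad)  # ~[4.5e-05, 0.20, 0.25, 0.20, 4.5e-05] -> flat at the extremes
print(relu_grad)     # [0. 0. 0. 1. 1.]
```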

Notes

(taken from class notes)

  • If no (nonlinear) activation functions are used:
    • in a fully connected network (FCN), every hidden layer is parametrised by weights and biases, so each layer maps its input to its output by an affine function $y = Wx + b$
    • if no nonlinearity is applied between these mappings, the whole network reduces to a single affine mapping between input and output (worked out below)
    • so, without nonlinear activations, the hidden layers are effectively useless and the neural network collapses to a one-layer perceptron $$ x_L = Wx_0 + b$$
  • Alternatively, if an activation function $\phi$ is available but there are no hidden layers: $$x_L = x_1 = \phi( W_1x_0 + b_1 )$$
    • the network is still not able to approximate a general nonlinear I/O mapping
    • the activation $\phi$ is usually monotone increasing, so applying it once does not change which inputs maximise the output; for classification it adds nothing over the underlying affine map
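
As a short worked step for the first point, two affine layers with no activation in between collapse into a single affine layer:

$$ x_2 = W_2 (W_1 x_0 + b_1) + b_2 = (W_2 W_1) x_0 + (W_2 b_1 + b_2) = W x_0 + b $$

Repeating the argument layer by layer gives the same result for any depth.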

Training the neural network

  • Because of the nonlinearities and stacked layers, the optimisation problem is non-convex
  • Initialisation may therefore be important
  • Train the weights with backpropagation (gradient descent where the gradients are computed layer by layer via the chain rule); a minimal sketch follows below
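
A minimal sketch of such a training loop, assuming a tiny one-hidden-layer ReLU network with a squared-error loss and made-up data (not the course's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))                  # 32 samples, 3 features (made up)
y = np.sin(X[:, :1])                          # some nonlinear target

W1, b1 = 0.5 * rng.normal(size=(3, 8)), np.zeros(8)
W2, b2 = 0.5 * rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.05

for step in range(500):
    # forward pass
    h_pre = X @ W1 + b1
    h = np.maximum(0.0, h_pre)                # ReLU activation
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # backward pass: backpropagation applies the chain rule layer by layer
    d_yhat = 2.0 * (y_hat - y) / len(X)
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(axis=0)
    d_hpre = (d_yhat @ W2.T) * (h_pre > 0)    # ReLU gradient mask
    dW1, db1 = X.T @ d_hpre, d_hpre.sum(axis=0)

    # gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)  # should be much lower than at the start
```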

Notes

  • The greater the number of neurons in a single hidden layer, the greater the redundancy
    • no redundancy: a neuron that is ‘lost’ during a run (e.g. it ends up learning nothing useful) cannot be compensated for
    • with redundancy: increased likelihood of converging to a good model
  • The more layers, the more complexity: extra layers allow more ‘crossings’ between intermediate features, which lets the model capture more complex topologies
  • It might make sense to have more nodes in the first layer to provide more base features for crossing in the next layer
  • A first hidden layer with only one neuron can never support a good model, no matter how deep the later layers are (see the sketch after this list)
    • The output of this first hidden layer varies along only one dimension
    • Later layers cannot compensate for this
  • Caution: models that are too complex or too deep are prone to overfitting
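
A sketch of the one-neuron bottleneck from the list above (random weights, purely illustrative): with a single neuron in the first hidden layer, everything downstream only sees the scalar $w_1^\top x + b_1$, so any two inputs with the same projection onto $w_1$ produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

w1, b1 = rng.normal(size=3), 0.2                 # first hidden layer: a single neuron
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=4)
W3, b3 = rng.normal(size=(1, 4)), rng.normal(size=1)

def net(x):
    h1 = relu(np.array([w1 @ x + b1]))           # scalar bottleneck
    h2 = relu(W2 @ h1 + b2)
    return W3 @ h2 + b3

x_a = rng.normal(size=3)
x_b = x_a + np.array([w1[1], -w1[0], 0.0])       # different input, same projection onto w1
print(np.isclose(w1 @ x_a, w1 @ x_b))            # True: same bottleneck activation
print(np.allclose(net(x_a), net(x_b)))           # True: the network cannot tell them apart
```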