Neural Networks

Contents

Slide 2

Pachshenko Galina Nikolaevna, Associate Professor of the Information System Department, Candidate of Technical Sciences

Slide 3

Week 7
Lecture 7

Slide 4

Topics
Types of Optimization Algorithms used in Neural Networks
Gradient descent

Slide 5

Have you ever wondered which optimization algorithm to use for your neural network model to produce slightly better and faster results by updating the model parameters, such as the weight and bias values?
Should we use Gradient Descent or Stochastic Gradient Descent?

Slide 6

What are Optimization Algorithms?

Slide 7

Optimization algorithms help us minimize (or maximize) an objective function (another name for the error function) E(x), which is simply a mathematical function of the model's internal learnable parameters that are used in computing the target values (Y) from the set of predictors (X) used in the model.
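
As a minimal sketch (the linear model and data below are assumed purely for illustration, not taken from the lecture), the mean squared error of a simple model y = w*x + b is one such objective function: once the data are fixed, it depends only on the learnable parameters w and b.

import numpy as np

def mse_objective(w, b, X, Y):
    # E(w, b): mean squared error between predictions and targets
    predictions = w * X + b
    return np.mean((predictions - Y) ** 2)

X = np.array([1.0, 2.0, 3.0])   # predictors
Y = np.array([2.0, 4.0, 6.0])   # targets
print(mse_objective(1.5, 0.0, X, Y))  # error for one particular choice of parameters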

Slide 8

For example, we call the weights (W) and bias (b) values of a neural network its internal learnable parameters. They are used in computing the output values, are learned and updated in the direction of the optimal solution, i.e. minimizing the loss through the network's training process, and play a major role in that training process.

Slide 9

The internal parameters of a model play a very important role in training it efficiently and effectively and in producing accurate results.

Slide 10

This is why we use various optimization strategies and algorithms to update and calculate appropriate, optimal values of these parameters, which influence the model's learning process and its output.

Slide 11

Optimization algorithms fall into two major categories.

Slide 12

First-Order Optimization Algorithms — these algorithms minimize or maximize a loss function E(x) using its gradient values with respect to the parameters. The most widely used first-order optimization algorithm is Gradient Descent.

Slide 13

The first-order derivative tells us whether the function is decreasing or increasing at a particular point. It gives us a line that is tangent to a point on the error surface.
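
A quick numerical illustration (the function f(x) = x**2 is assumed here just for the example): the sign of the first derivative at a point tells us whether the function is decreasing or increasing there.

def f(x):
    return x ** 2

def derivative(f, x, h=1e-6):
    # central-difference estimate of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

print(derivative(f, -2.0))  # negative: f is decreasing at x = -2
print(derivative(f, 3.0))   # positive: f is increasing at x = 3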

Slide 14

What is a Gradient of a function?

Slide 15

A gradient is simply a vector: the multi-variable generalization of a derivative (dy/dx), which is the instantaneous rate of change of y with respect to x.

Slide 16

The difference is that when a function depends on more than one variable, the gradient takes the place of the derivative, and it is calculated using partial derivatives. Another major difference between the gradient and a derivative is that the gradient of a function produces a vector field.
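
As an illustrative sketch (the function f(x, y) = x**2 + y**2 is assumed here, not taken from the slides), the gradient collects the partial derivative with respect to each variable into a vector:

import numpy as np

def f(v):
    x, y = v
    return x ** 2 + y ** 2

def gradient(f, v, h=1e-6):
    # estimate each partial derivative numerically and collect them into a vector
    grad = np.zeros_like(v)
    for i in range(len(v)):
        step = np.zeros_like(v)
        step[i] = h
        grad[i] = (f(v + step) - f(v - step)) / (2 * h)
    return grad

print(gradient(f, np.array([1.0, 2.0])))  # approximately [2.0, 4.0], i.e. (2x, 2y)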

Slide 17

A gradient is represented by a Jacobian matrix — simply a matrix consisting of first-order partial derivatives (gradients).
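
A small sketch of how the first-order partial derivatives of a vector-valued function form the Jacobian matrix (the example function F is assumed for illustration):

import numpy as np

def F(v):
    # example vector-valued function F: R^2 -> R^2 (assumed for illustration)
    x, y = v
    return np.array([x * y, x + y ** 2])

def jacobian(F, v, h=1e-6):
    # row i holds the partial derivatives of output i with respect to each input
    v = np.asarray(v, dtype=float)
    m = len(F(v))
    J = np.zeros((m, len(v)))
    for j in range(len(v)):
        step = np.zeros_like(v)
        step[j] = h
        J[:, j] = (F(v + step) - F(v - step)) / (2 * h)
    return J

print(jacobian(F, [1.0, 2.0]))  # approximately [[2, 1], [1, 4]]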

Slide 18

Summing up: a derivative is defined for a function of a single variable, whereas a gradient is defined for a function of multiple variables.

Slide 19

Second-Order Optimization Algorithms — second-order methods use the second-order derivative, also called the Hessian, to minimize or maximize the loss function.

Slide 20

The Hessian is a matrix of second-order partial derivatives. Since the second derivative is costly to compute, second-order methods are not used much.
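
One canonical second-order update is Newton's method, shown below only as a sketch on an assumed quadratic loss (the slides do not name a specific second-order algorithm): the step uses the Hessian, i.e. the curvature, rather than the gradient alone.

import numpy as np

def grad(v):
    # gradient of the example loss f(x, y) = x**2 + 3*y**2 + x*y (assumed)
    x, y = v
    return np.array([2 * x + y, 6 * y + x])

def hessian(v):
    # matrix of second-order partial derivatives (constant for a quadratic)
    return np.array([[2.0, 1.0],
                     [1.0, 6.0]])

v = np.array([4.0, -3.0])
# Newton step: move by -H^{-1} * gradient, using the curvature information
v = v - np.linalg.solve(hessian(v), grad(v))
print(v)  # for a quadratic loss, a single Newton step lands on the minimum at (0, 0)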

Slide 21

The second-order derivative tells us whether the first derivative is increasing or decreasing, which hints at the function's curvature. It provides a quadratic surface that touches the curvature of the error surface.

Slide 22

Some advantages of second-order optimization over first-order: although the second-order derivative may be costly to find and calculate, a second-order optimization technique does not neglect or ignore the curvature of the surface. Secondly, in terms of step-wise performance, second-order methods are better.

Slide 23

What are the different types of Optimization Algorithms used in Neural Networks?

Slide 24

Gradient Descent
Variants of Gradient Descent (sketched below): Batch Gradient Descent; Stochastic Gradient Descent; Mini-Batch Gradient Descent
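
A hedged sketch of how the three variants differ; compute_gradient(params, X, Y) is a hypothetical helper (not defined in the lecture) that returns the gradient of the loss over the given examples. Only the amount of data used per parameter update changes between variants.

import numpy as np

def train_one_epoch(params, X, Y, compute_gradient, lr=0.01, batch_size=32, mode="mini-batch"):
    n = len(X)
    if mode == "batch":                      # Batch GD: one update from the full dataset
        params = params - lr * compute_gradient(params, X, Y)
    elif mode == "stochastic":               # SGD: one update per single training example
        for i in np.random.permutation(n):
            params = params - lr * compute_gradient(params, X[i:i+1], Y[i:i+1])
    else:                                    # Mini-batch GD: one update per small batch
        for start in range(0, n, batch_size):
            xb, yb = X[start:start+batch_size], Y[start:start+batch_size]
            params = params - lr * compute_gradient(params, xb, yb)
    return params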

Slide 25

Gradient Descent is the most important technique and the foundation of how we train and optimize intelligent systems. What it does is:

Slide 26

“Gradient Descent — find the minima, control the variance, then update the model's parameters, and finally lead us to convergence.”

Slide 27

θ = θ − η⋅∇J(θ) — the formula of the parameter update, where η is the learning rate and ∇J(θ) is the gradient of the loss function J(θ) with respect to the parameters θ.
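
A direct sketch of this update rule, assuming we already have a function grad_J that returns ∇J(θ) (the example loss J(θ) = Σθ² is assumed for illustration):

import numpy as np

def gradient_descent_step(theta, grad_J, eta=0.1):
    # theta = theta - eta * gradient of J with respect to theta
    return theta - eta * grad_J(theta)

grad_J = lambda theta: 2 * theta      # gradient of J(theta) = sum(theta**2)
theta = np.array([1.0, -2.0])
for _ in range(50):
    theta = gradient_descent_step(theta, grad_J, eta=0.1)
print(theta)  # approaches the minimum at [0, 0]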

Slide 28

The parameter η is the training rate. This value can either be set to a fixed value or found by one-dimensional optimization along the training direction at each step. An optimal value for the training rate obtained by line minimization at each successive step is generally preferable. However, there are still many software tools that only use a fixed value for the training rate.
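
A minimal sketch of choosing the training rate by one-dimensional minimization along the descent direction (a simple grid over candidate rates, just to illustrate the idea; real implementations use proper line-search methods, and the example loss is assumed):

import numpy as np

def best_training_rate(J, theta, grad, candidates=np.logspace(-4, 0, 20)):
    # evaluate the loss along the descent direction and keep the rate
    # that gives the lowest value at this step
    losses = [J(theta - eta * grad) for eta in candidates]
    return candidates[int(np.argmin(losses))]

J = lambda t: np.sum(t ** 2)          # example loss (assumed)
theta = np.array([3.0, -1.0])
grad = 2 * theta
eta = best_training_rate(J, theta, grad)
theta = theta - eta * grad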

Slide 29

Gradient descent is the most popular optimization algorithm used in optimizing a neural network. It is mainly used to update the weights of a neural network model, i.e. to update and tune the model's parameters in a direction that minimizes the loss function (or cost function).

Slide 30

We all know that a neural network trains via a famous technique called Backpropagation, in which we first propagate forward, calculating the dot product of the input signals and their corresponding weights and then applying an activation function to that sum of products. The activation function transforms the input signal into an output signal and is important for modeling complex non-linear functions: it introduces non-linearities that enable the model to learn almost any arbitrary functional mapping.

Slide 31

After this we propagate backwards through the network, carrying error terms and updating weight values using Gradient Descent: we calculate the gradient of the error function E with respect to the weights W (the parameters) and update the parameters in the direction opposite to the gradient of the loss function with respect to them.
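
A tiny worked sketch of one forward pass and one gradient-based weight update for a single sigmoid neuron (the architecture, squared-error loss, and numbers are assumed for illustration, not taken from the lecture):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])        # input signals
w = np.array([0.8, 0.2])         # weights
b = 0.1                          # bias
y_true = 1.0
eta = 0.5

# Forward pass: weighted sum of inputs, then the activation function
z = np.dot(w, x) + b
y = sigmoid(z)

# Backward pass: gradient of the squared error E = 0.5 * (y - y_true)**2
# with respect to the weights, via the chain rule
dE_dy = y - y_true
dy_dz = y * (1.0 - y)
grad_w = dE_dy * dy_dz * x
grad_b = dE_dy * dy_dz

# Update the parameters in the direction opposite to the gradient
w = w - eta * grad_w
b = b - eta * grad_b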

Slide 33

The image above shows the process of weight updates in the direction opposite to the gradient vector of the error with respect to the weights of the network. The U-shaped curve is the error surface, and its slope at each point is the gradient.

Slide 34

As one can notice, if the weight values (W) are too small or too large then we have large errors, so we want to update and optimize the weights so that they are neither too small nor too large. We descend downwards, opposite to the gradients, until we find a local minimum.

Slide 35

Gradient Descent: we descend downwards, opposite to the gradients, until we find a local minimum.

Slide 36

1. Find the slope. 2. x = x − slope, repeated until the slope is 0 (see the runnable sketch after Slide 38).

Slide 38

1. Find the slope. 2. Set alpha = 0.1 (or any number from 0 to 1). 3. x = x − (alpha * slope), repeated until the slope is 0.
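
A runnable sketch of this recipe for a one-variable function (f(x) = (x − 3)**2 is assumed for the example); in practice the loop stops when the slope is close enough to zero rather than exactly zero:

def f_slope(x):
    # derivative (slope) of f(x) = (x - 3)**2
    return 2 * (x - 3)

alpha = 0.1          # any number from 0 to 1
x = 10.0             # starting point
while abs(f_slope(x)) > 1e-6:   # "until slope = 0"
    x = x - alpha * f_slope(x)
print(x)             # converges to 3, where the slope is zero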

Slide 41

Solving the problem

Slide 42

The next picture is an activity diagram of the training process with gradient descent. As we can see, the parameter vector is improved in two steps: first, the gradient descent training direction is computed; second, a suitable training rate is found.

Slide 43

The gradient descent training algorithm has the severe drawback of requiring many iterations for functions which have long, narrow valley structures. Indeed, the downhill gradient is the direction in which the loss function decreases most rapidly, but this does not necessarily produce the fastest convergence. The following picture illustrates this issue.

Slide 44

Gradient descent is the recommended algorithm when we have very big neural networks with many thousands of parameters, because this method only stores the gradient vector (size n) and does not store the Hessian matrix (size n²).

Slide 45

Optimization algorithms for a Neural Network Model:

Annealing
Stochastic Gradient Descent
AW-SGD
Momentum
Nesterov Momentum
AdaGrad
AdaDelta
ADAM
BFGS
LBFGS