#### World's Best AI Learning Platform with profoundly Demanding Certification Programs

Designed by IITian's, only for AI Learners.

Download our e-book of Introduction To Python

How to leave/exit/deactivate a Python virtualenvironment Exception Type: JSONDecodeError at /update/ Exception Value: Expecting value: line 1 column 1 (char 0) How to choosing the right estimator for the machine learning problem? Why not getting result of a as [4, 7,8,3,2] Getting error in user defined function How to implement Queue in python using list? 'collections.OrderedDict' object has no attribute 'status_code' What is use of Heat map ? How to plot heat map? Join Discussion

4 (4,001 Ratings)

220 Learners

Neha Kumawat

5 months ago

Global and Local Minima | insideaiml

In
my previous article "Optimizers in Machine Learning and Deep Learning".
I gave a brief introduction about Adagrad optimizers. In this article, I will
try to give an in-depth explanation of the optimizer’s algorithm.

If
you didn’t read my previous article. I recommend you to first go through my
previous articles on optimizers mentioned below and then come back to this
article for better understanding:

Let’s
start…

Therefore, it is well suited when dealing with sparse data..

But the same update rate may not fit all parameters. For example, some parameters may have reached a stage where only fine adjustment is required, but some parameters need to be adjusted significantly due to the small number of matching samples.

Adagrad raised this issue, an algorithm that offers different learning rates at different parameters between them. What this means is that for each parameter, as its total distance updated increases, its learning rate is also slow.

GloVe word embedding uses **Adagrad **where rare words need major updates and common words need a little update.

Adagrad
removes the need to manually tune the learning rate.

There
are mainly three problems that arise with the Adagrad algorithm.

- The learning rate is monotonically decreasing.
- The learning rate in the late training period is very small.
- it needs to be set by doing the initial global learning rate.

Let’s
see how it works

In this algorithm, we try to change the learning rate (alpha) for each
update. The learning rate changes during each update as it will decrease if the weight is significantly updated in the short term again and it will increase if the weight is not significantly updated.

First,
each weight has its own cache value, which collects the squares of
the gradients up to the current point.

Cache updating for Adagrad | insideaiml

The cache value will continue to increase as training continues. Now a new update formula can be provided as mentioned below:

Adagrad Update Formula | insideaiml

The above formula is the same as the original gradient descent formula except that
here the learning rate (alpha) constantly changes throughout the training
process. The E in the denominator which is shown in the above formula is a very small value which helps us to ensure that the division by zero does not occur.

Essentially
what’s happening here is that if a weight has been having **very huge
updates**, its cache value is also going to **increase**.
As a result, the learning rate will be lower and the size of the weight update will decrease over time.

On the other hand, if a weight has not been having any significant update, its
cache value is going to be very less, and hence its learning rate will increase,
forcing it to take bigger updates. This is the basic principle of the
Adagrad optimizer.

However, the disadvantage of this algorithm is that even if there are previous weight gradients, the cache will always increase by a certain amount because the square cannot be negative. Therefore the learning rate of all weights will eventually drop to a very low level until the training is less intense.

Adagrad
can be visualize as:

Adagrad | insideaiml

```
def adagrad():
weights, bais, eta = init_w, init_b, 0.1
v_w, v_b, eps = 0, 0, 1e-8
for i in range(max_epochs):
dw, db = 0, 0
for x,y in zip(X,Y):
dw += grad_w(weights, bais, x, y)
db += grad_b(weights, bais, x, y)
v_w = v_w + dw**2
v_b = v_b + db**2
weights_2 = weights - (eta / np.sqrt(v_w + eps)) * dw
bais_2 = bais - (eta / np.sqrt(v_b + eps)) * db
```

To overcome the problem of Adagrad there is many other optimizers algorithm available. One of
them is **Adadelta**.

In Adadelta, we do not need to set a default reading rate as we take the effective rate of past steps to the current gradient.

There are three major problems that arise with the Adagrad algorithm.

- The learning rate is monotonically decreasing.
- The learning rate during the late training period is very low.
- it needs to be set by doing the initial global learning rate.

Adadelta is an extension of Adagrad and also tries to reduce Adagrad's rate of learning, excessively.

It does this by limiting the gradient window that has been exceeded to a certain size *w*. Running average at time *t* then depends on the previous average and the current gradient.

In
Adadelta, we** **don't have to set the default learning rate as we take the ratio
of the running average of the previous time steps to the current gradient.

Let’s
see and understand how its work

In the Adadelta optimizer
algorithm, it will try not to accumulate all past squared gradients values. It
instead tries to restrict the window of accumulated past gradients to some
fixed size **(say w).**

Here, it Instead of
inefficiently storing w previous squared
gradients value, the sum of gradients is recursively defined as a decaying
average of all past squared gradients.

The running
average **E[g2]t** at time step** t** then depends (as a fraction** γ** similarly to the Momentum term) only on the
previous average and the current gradient value:

current gradient value | insideaiml

Next, we set **γ **to a similar value as the momentum term, say around
0.8. To be more specific, lets now rewrite our vanilla SGD update as shown in the
image below according to the parameter update vector Δθt:

update vector | insideaiml

The parameter update vector of
Adagrad that we derived previously can also be written as shown below:

vector of Adagrad | insideaiml

Now we can simply
replace the diagonal matrix** Gt** with the decaying average over past squared
gradients **E[g2]t **as shown below:

past squared gradients | insideaiml

Now as the denominator
is just the root mean squared (RMS) error of the gradient, we can replace it
with the criterion short-hand as shown below:

root mean squared (RMS) error of the gradient | insideaiml

Note: The units in this update
(as well as in SGD, Momentum, or Adagrad) are incompatible, meaning that the update must have the same assumptions as of the parameter. To realize this, we have to first
define another exponentially decaying average, this time not of squared
gradients but of squared parameter updates. It is shown below:

squared gradients | insideaiml

The **root mean squared error**
of parameter updates can be given by as follows:

root mean squared error | insideaiml

Since **RMS[Δθ]t **is unknown, we approximate it with the RMS of
parameter updates until the previous time step. Replacing the learning rate** η **in the previous update rule with **RMS[Δθ]t−1**. Finally, the Adadelta update rule can be
given as shown below:

RMS of parameter updates | insideaiml

Note: With Adadelta, we do not even need to
set a default learning rate, as it has been eliminated from the update rule.

```
# Adadalta
def adadelta(params, sqrs, deltas, rho, batch_size):
eps_stable = 1e-5
for param, sqr, delta in zip(params, sqrs, deltas):
g = param.grad / batch_size
sqr[:] = rho * sqr + (1. - rho) * nd.square(g)
cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g
delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
# update weight
param[:] -= cur_delta
```

Adadelta | insideaiml

I hope after reading this article, finally, you came to know **what is Adagrad and Adadelta optimizers algorithms and their importance**.
In the next articles, I will come with a detailed explanation of some other types of optimizers.** **For more blogs/courses on data science, machine learning,
artificial intelligence and new technologies do visit us at **InsideAIML.**

Thanks for reading…

We're Online!

Chat now for any query