I was trying to understand the following result more intuitively (for linearly separable data):

$$ k leq frac{R^2}{gamma_g^2}$$

where:

k = is the number of mistakes the perceptron algorithm does

R = is the distance at which the data points are confined to.

$gamma_g = text{the smallest geometric margin for the current data set i.e.} = min_{1 leq i leq n}{ gamma(i) }$

$gamma(i) = text{the perpendicular geometric distance from the decision boundary for data i .}$

The way I wanted to understand better is, I was thinking of assessing the level of difficulty of a data set according to the margin they could have or the distance from the origin they had. i.e. by the number of mistakes the algorithm could make. i.e. by looking at how big/small each upper bound is different for each training data set.

So for example, I currently think of this result as the following:

if we can achieve a large geometric margin (i.e. if there exists a large geometric margin) then we can have higher confidence of our classifier, thus, for large margins, the denominator for the above result will be larger, which means that the number of mistakes (k) will decrease, which in turn means that, if a higher margin is possible, then the classification task is "easier". Which makes sense, if there exists a decision boundary that can clearly distinguish +'s from -'s then, the classification task should be considered "easier".

However, I was having difficulties coming up with a similar explanation for the numerator. Obviously, the larger the numerator is, from the above equation, the classification task might have more mistakes (so one can consider the task "harder" for the current data set). However, I was not really sure why R has that effect on the number of mistakes nor did I have a good intuition on why R affect the perceptron the way it does.

NOTE:

I know the number of mistakes perceptron makes depends on the initialization when it starts cycling. However, that is not what I care about. I care about just arguing the intuition for the number of mistakes soley on the upper bound of the perceptron. How to initilize it is a disscussion for a different quesiton.

**Contents**hide

#### Best Answer

(My answer is with regard to the single-layered perceptron, as described in wikipedia.)

Let $d$ be the dimension of a feature vector, including the dummy component for the bias (which is the constant $1$).

Then you can think of the learning of the perceptron as an exploration of $mathbb R^d$, in search for an appropriate weights vector $winmathbb R^d$ (which defines a hyperplane that distinguishes between the positive and negative examples).

Equivalently, you can think of the learning as an exploration of the space of hyperplanes, in order to find an appropriate one.

If the maximum margin is smaller, then the area in $mathbb R^d$ that contains appropriate weights vectors is smaller. Smaller areas are easier to miss while exploring, and so in the worst case, it would take the perceptron more time to find this smaller area in $mathbb R^d$.

In the image you can see that the appropriate hyperplanes are quite limited, i.e. the area of appropriate hyperplanes (in the hyperplanes space) is small:

If the maximum margin is larger, then the area in $mathbb R^d$ that contains appropriate weights vectors is larger. Bigger areas are harder to miss while exploring, and so in the worst case, it would take the perceptron less time to find this bigger area in $mathbb R^d$.

In the image you can see that the appropriate hyperplanes are less limited, i.e. the area of appropriate hyperplanes (in the hyperplanes space) is bigger:

If $R$ is smaller, then the general area in $mathbb R^d$ that the perceptron would explore is smaller, as it won't bother looking in places which are clearly irrelevant.

In the image you can see that for a smaller $R$ there is a bigger area of clearly irrelevant hyperplanes, which leaves a smaller area in the hyperplanes space for the perceptron to explore:

Similarly, if $R$ is larger, then the perceptron can rule out a smaller area in $mathbb R^d$. That leaves a bigger area to explore, which would take more time to cover.

Finally, I wrote a perceptron for $d=3$ with an animation that shows the hyperplane defined by the current $w$. I think that visualizing the way it learns from different examples and with different parameters might be illuminating.

`import numpy as np import matplotlib.pyplot as plt from matplotlib.animation import FuncAnimation import itertools import time EXAMPLES1 = np.array([[9, 0], [8, 0]], dtype='float') EXAMPLES2 = np.array([[5, 0], [6, 0]], dtype='float') # closer to the origin EXAMPLES_CLASSES1_2 = np.array([1, -1]) EXAMPLES3 = np.array([[20, 26], [16, -11], [21, 2], [18, 25], [26, -19], [11, 6], [20, -10], [17, 17]], dtype='float') EXAMPLES4 = EXAMPLES3 - 11 * np.ones(EXAMPLES3.shape) # closer to the origin EXAMPLES5 = np.array([[20, 26], [16, 27], [21, 24], [15, 25], [22, 23], [14, 27], [20, 20], [13, 24]], dtype='float') EXAMPLES_CLASSES3_4_5 = np.array([1, -1, 1, -1, 1, -1, 1, -1]) W0_0 = np.zeros(3) W0_1 = np.array([1., 0, -2]) LEARNING_RATE_1 = 0.00156 W0_2 = np.array([1., 0, -2]) / 0.00156 LEARNING_RATE_2 = 1 EXAMPLES = EXAMPLES1 EXAMPLES_CLASSES = EXAMPLES_CLASSES1_2 LEARNING_RATE = LEARNING_RATE_2 W0 = W0_2 INTERVAL = 3e2 INTERVAL = 1 INTERVAL = 3e1 N = len(EXAMPLES) assert(EXAMPLES.shape[1] == 2) assert(N == len(EXAMPLES_CLASSES)) assert(LEARNING_RATE > 0) assert(len(W0) == 3) fig, ax = plt.subplots() XS = EXAMPLES[:, 0] YS = EXAMPLES[:, 1] max_abs_x = max(max(XS), abs(min(XS)), 1) max_abs_y = max(max(YS), abs(min(YS)), 1) bottom_limit = -2 * max_abs_y top_limit = 2 * max_abs_y left_limit = -2 * max_abs_x right_limit = 2 * max_abs_x plt.xlim(left_limit, right_limit) plt.ylim(bottom_limit, top_limit) x_range_size = right_limit - left_limit y_range_size = top_limit - bottom_limit # plot the examples for x, y in zip(EXAMPLES, EXAMPLES_CLASSES): if y == 1: plt.plot(x[0], x[1], marker='o', markersize=4, color="green") else: plt.plot(x[0], x[1], marker='o', markersize=4, color="red") # add a dummy column of ones for the bias. EXAMPLES = np.hstack((EXAMPLES, np.ones((N, 1)))) hyperplane, = plt.plot([], [], color='blue') hyperplane.set_visible(False) k_text = plt.text(ax.get_xlim()[0] + x_range_size / 70, ax.get_ylim()[1] - y_range_size / 25, 'mistakes: 0') next_example_idx = 0 k = 0 w = W0 def get_hyperplane(w): w_x, w_y, bias = w if w_x == w_y == 0: print(f'No hyperplane is defined by w') return None if w_y == 0: x1 = x2 = -bias / w_x y1 = bottom_limit y2 = top_limit if not left_limit <= x1 <= right_limit: print(f'The hyperplane is out of range of our graph') return None else: print(f'The hyperplane defined by w is `x = {x1}`') else: # find the hyperplane representation as y = m * x + b. m = -w_x / w_y # m = -1 / (w_y / w_x) b = -bias / w_y # b = -bias / np.linalg.norm(np.array([w_x, w_y])) print(f'The hyperplane defined by w is `y = {m} * x + {b}`') if m == 0: y1 = y2 = m * left_limit + b x1 = left_limit x2 = bottom_limit else: potential_points_on_limits = { (left_limit, m * left_limit + b), (right_limit, m * right_limit + b), ((bottom_limit - b) / m, bottom_limit), ((top_limit - b) / m, top_limit)} for x, y in set(potential_points_on_limits): if not left_limit <= x <= right_limit: potential_points_on_limits.remove((x, y)) elif not bottom_limit <= y <= top_limit: potential_points_on_limits.remove((x, y)) if len(potential_points_on_limits) < 2: print(f'The hyperplane is out of range of our graph') return None x1, y1 = potential_points_on_limits.pop() x2, y2 = potential_points_on_limits.pop() return (x1, x2), (y1, y2) def update(frame): global k, w, next_example_idx for j in itertools.chain(range(next_example_idx, N), range(next_example_idx)): x = EXAMPLES[j] y = EXAMPLES_CLASSES[j] perdicted_y = 1 if np.dot(w, x) > 0 else -1 if perdicted_y != y: print(f'nMisclassification of example {x}') w += LEARNING_RATE * y * x k += 1 k_text.set_text(f'mistakes: {k}') print(f'This is mistake number {k}') print(f'Update w to {w}') hyperplane_xs_and_ys = get_hyperplane(w) if hyperplane_xs_and_ys: hyperplane.set_data(hyperplane_xs_and_ys) hyperplane.set_visible(True) else: hyperplane.set_visible(False) next_example_idx = j + 1 return hyperplane, k_text # The perceptron successfully predicted all of the examples. time.sleep(3) exit() ani = FuncAnimation(fig, update, blit=True, repeat=False, interval=INTERVAL) plt.show() `

### Similar Posts:

- Solved – Intuition on upper bound of the number of mistakes of the perceptron algorithm and how to classify different data sets as “easier” or “harder”
- Solved – Perceptrons and Decision Boundaries
- Solved – Difference between a SVM and a perceptron
- Solved – Can SVM be used for dimensionality reduction from n dimensions to n-1
- Solved – Can a perceptron be modified so as to converge with non-linearly separable data