
In maximum
likelihood and Bayesian parameter estimation, we treated supervised learning
under the assumption that the forms of the underlying density functions were
known. In most pattern recognition applications, however, the common parametric
forms rarely fit the densities actually encountered in practice. In particular,
all of the classical parametric densities are unimodal (have a single local
maximum), whereas many practical problems involve multimodal densities. In this
chapter, we shall examine nonparametric procedures *that can be used with
arbitrary distributions* and *without the assumption that the forms of the
underlying densities are known*.

Figure 11.1: The classification of non-parametric methods in pattern recognition.

There are several
types of nonparametric methods of interest in pattern recognition (Figures 11.1
and 11.2). One consists of procedures for estimating the density
functions *p*(**x**|*ω_j*) from sample patterns. If
these estimates are satisfactory, they can be substituted for the true
densities when designing the classifier. Another consists of procedures for
directly estimating the posterior probabilities *P*(*ω_j*|**x**), bypassing
density estimation and going directly to the decision rule, as the
nearest-neighbor methods discussed later in this chapter do.

Figure 11.2: Non-parametric density estimation.


The basic ideas
behind many of the methods of estimating an unknown probability density
function are very simple, and rely on the fact that the probability *P* that
a vector **x** will fall in a region *R* is given by

P = ∫_R p(x′) dx′    (11.1)

Thus *P* is a
smoothed or averaged version of the density function *p*(**x**), and
we can estimate this smoothed value of *p* by estimating the probability *P*.
Suppose that *n* samples **x**_1, …, **x**_n
are drawn independently and identically distributed (i.i.d.) according to
the probability law *p*(**x**). Then the probability that exactly *k* of these
*n* samples fall in *R* is given by the binomial law

P_k = C(n, k) P^k (1 − P)^(n−k)    (11.2)

where the binomial coefficient is

C(n, k) = n! / (k! (n − k)!)    (11.3)

and the expected
value for *k* is

E[k] = nP    (11.4)
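As a quick numerical check of eq. 11.4, the short Python sketch below (illustrative only; the function name and the values of n and P are ours) evaluates the mean of the binomial law of eq. 11.2 directly and confirms that it equals nP:

```python
from math import comb

def binomial_pmf(k, n, P):
    """Probability that exactly k of n i.i.d. samples fall in R (eq. 11.2)."""
    return comb(n, k) * P**k * (1 - P)**(n - k)

# Mean of k under the binomial law: sum over k of k * P_k.
n, P = 20, 0.3
mean_k = sum(k * binomial_pmf(k, n, P) for k in range(n + 1))
print(mean_k)  # equals n * P = 6.0, up to floating-point rounding
```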

Moreover, this
binomial distribution for *k* peaks very sharply about the mean, so that
we expect the ratio *k/n* to be a very good estimate for the
probability *P*, and hence for the smoothed density function. This
estimate is especially accurate when *n* is very large. If we now assume
that *p*(**x**) is continuous and that the region *R* is so small that *p* does not vary
appreciably within it, we can write

P ≅ p(x) V    (11.5)

where **x** is a
point within *R*, *n* is the total
number of samples, *k* is the number of samples that fall in *R*,
and *V* is the volume enclosed by *R*. Combining eqs. 11.1, 11.4, and 11.5, we
arrive at the following obvious estimate for *p*(**x**):

p(x) ≅ (k/n) / V    (11.6)

There are several problems that remain, some
practical and some theoretical. If we fix the volume *V* and take more and
more training samples, the ratio *k/n* will converge (in probability) as
desired, but then we have only obtained an estimate of the space-averaged value
of *p*(**x**),

P/V = ∫_R p(x′) dx′ / ∫_R dx′    (11.7)

If we want to
obtain *p*(**x**) rather than just an averaged version of it, we
must be prepared to let *V* approach zero. However, if we fix the number *n*
of samples and let *V* approach zero, the region will eventually
become so small that it will enclose no samples, and our estimate *p*(**x**) ≅ 0 will be useless.
If, by chance, one or more of the training samples coincide at **x**, the
estimate diverges to infinity, which is equally useless.

From a practical
standpoint, we note that the number of samples is always limited. Thus, the
volume *V *cannot be allowed to become arbitrarily small. If this kind of
estimate is to be used, one will have to accept a certain amount of variance in
the ratio *k/n *and a certain amount of averaging of the density *p*(**x**).

From a theoretical
standpoint, it is interesting to ask how these limitations can be circumvented
if an unlimited number of samples is available. Suppose we use the following
procedure. To estimate the density at **x**, we form a sequence of
regions *R_1*, *R_2*, …, containing **x**: the first
region to be used with one sample, the second with two, and so on. Let *V_n*
be the volume of *R_n*, *k_n* the number of samples falling in *R_n*, and
*p_n*(**x**) the *n*th estimate for *p*(**x**):

p_n(x) = (k_n/n) / V_n    (11.8)

If *p_n*(**x**) is to converge to *p*(**x**), three conditions appear to be required:

**1.** lim_{n→∞} V_n = 0: assures that the space-averaged *P/V*
will converge to *p*(**x**), provided the regions shrink
uniformly and *p*(·) is continuous at **x**.

**2.** lim_{n→∞} k_n = ∞: only makes
sense if *p*(**x**) ≠ 0, and assures that the frequency ratio *k_n/n* will converge (in
probability) to the probability *P*.

**3.** lim_{n→∞} k_n/n = 0: necessary
if *p_n*(**x**) given by eq. 11.8 is to converge at all.
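The two classical choices discussed next, a shrinking volume V_n = V_1/√n and a growing neighbor count k_n = √n, both satisfy the three conditions. A minimal Python check (illustrative only; the choice V_1 = 1 is ours):

```python
import math

def V_n(n):
    """Parzen-style shrinking volume, V_n = V_1 / sqrt(n) with V_1 = 1."""
    return 1.0 / math.sqrt(n)

def k_n(n):
    """Nearest-neighbor-style growing sample count, k_n = sqrt(n)."""
    return math.sqrt(n)

for n in (10**2, 10**4, 10**6):
    print(n, V_n(n), k_n(n), k_n(n) / n)
# As n grows: V_n -> 0, k_n -> infinity, and k_n/n -> 0,
# matching conditions 1-3 above.
```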

Figure 11.3: Non-parametric methods.

There are two
common ways of obtaining sequences of regions that satisfy these conditions (Figure
11.3). One is to shrink an initial region by specifying the volume *V_n*
as some function of *n*, such as V_n = V_1/√n; this is the basis of the
Parzen-window method examined next. The other is to specify *k_n* as some
function of *n*, such as k_n = √n, and grow the volume *V_n* until it
encloses *k_n* neighbors of **x**; this is the *k_n*-nearest-neighbor
estimation method.

The Parzen-window
approach to estimating densities can be introduced by temporarily assuming
that the region *R_n* is a *d*-dimensional hypercube. If *h_n* is the length
of an edge of that hypercube, then its volume is given by

V_n = h_n^d    (11.9)

as shown in Figure 11.4.

Figure 11.4: The hypercube defined by the region *R.*

We can obtain an
analytic expression for *k_n*, the number of samples falling in the hypercube,
by defining the following window function:

φ(u) = 1 if |u_j| ≤ 1/2, j = 1, …, d; and 0 otherwise    (11.10)

as shown in Figure 11.5.

Figure 11.5: The kernel function in two dimensions.

Thus, φ(**u**) defines a unit hypercube centered at the origin. It
follows that φ((**x** − **x**_i)/h_n) is equal to unity if **x**_i falls
within the hypercube of volume *V_n* centered at **x**, and is zero otherwise.
The number of samples in this hypercube is therefore given by

k_n = Σ_{i=1}^{n} φ((x − x_i)/h_n)    (11.11)

as shown in Figure 11.6, and when we substitute this into eq. 11.8 we obtain the estimate

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n)    (11.12)

Figure 11.6: The calculation of the number of samples within the kernel.
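Equations 11.9 through 11.12 can be sketched directly in code. The following Python fragment is an illustrative sketch, not part of the chapter's MATLAB material; the function names and toy samples are ours:

```python
def phi(u):
    """Hypercube window of eq. 11.10: 1 if every |u_j| <= 1/2, else 0."""
    return 1.0 if all(abs(u_j) <= 0.5 for u_j in u) else 0.0

def parzen_estimate(x, samples, h):
    """p_n(x) of eq. 11.12, with V_n = h**d as in eq. 11.9."""
    d, n = len(x), len(samples)
    V = h ** d
    k = sum(phi([(x_j - xi_j) / h for x_j, xi_j in zip(x, xi)])
            for xi in samples)                  # eq. 11.11
    return k / (n * V)

samples = [[0.0], [0.1], [0.2], [0.9]]
# A window of width h = 0.5 centered at x = 0.1 covers [-0.15, 0.35],
# capturing 3 of the 4 samples: p_n = (3/4) / 0.5 = 1.5
print(parzen_estimate([0.1], samples, h=0.5))  # → 1.5
```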

Equation 11.12
suggests a more general approach to estimating density functions. Rather than
limiting ourselves to the hypercube window function of eq. 11.10, suppose we
allow a more general class of window functions. In such a case, eq. 11.12
expresses our estimate for *p*(**x**) as an average of functions
of **x** and the samples **x**_i. In essence, the window
function is being used for interpolation: each sample contributes to the
estimate in accordance with its distance from **x**.

It is natural to
ask that the estimate *p_n*(**x**) be a legitimate density function, that is,
that it be nonnegative and integrate to one. This can be assured by requiring
the window function itself to be a density function, that is, by requiring

φ(u) ≥ 0    (11.13)

and

∫ φ(u) du = 1    (11.14)

and if we maintain
the relation V_n = h_n^d, it then follows that *p_n*(**x**) also satisfies
these conditions.

Let us examine the
effect that the *window width h_n* has on *p_n*(**x**). If we define the
function δ_n(**x**) by

δ_n(x) = (1/V_n) φ(x/h_n)    (11.15)

then we can write *p_n*(**x**) as the average

p_n(x) = (1/n) Σ_{i=1}^{n} δ_n(x − x_i)    (11.16)

Because V_n = h_n^d, the window width *h_n* clearly affects both the
amplitude and the width of δ_n(**x**).

Figure 11.7: The effect of the Parzen-window width *h_n* on δ_n(**x**).

*If h_n is very large*, then the amplitude of δ_n is small, and **x** must be
far from **x**_i before δ_n(**x** − **x**_i) changes much from δ_n(**0**). In
this case *p_n*(**x**) is the superposition of *n* broad, slowly changing
functions and is a very smooth, "out-of-focus" estimate of *p*(**x**). *If h_n
is very small*, the peak value of δ_n(**x** − **x**_i) is large and occurs near
**x** = **x**_i. In the limit,

lim_{h_n→0} δ_n(x − x_i) = δ(x − x_i)    (11.17)

where δ(·) is the Dirac delta function. Thus, as *h_n*
approaches zero, *p_n*(**x**) approaches a superposition of sharp pulses
centered at the samples, an erratic, "noisy" estimate.

The choice of *h_n*
(or equivalently *V_n*) therefore has an important effect on the estimate: if
*V_n* is too large, the estimate suffers from too little resolution; if it is
too small, the estimate suffers from too much statistical variability. With an
unlimited number of samples, however, we can let *V_n* slowly approach zero as
*n* increases and have *p_n*(**x**) converge to the unknown density *p*(**x**).

Figure 11.8: The effect of window width on the density function estimate.

In discussing
convergence, we must recognize that we are talking about the convergence of a
sequence of random variables, because for any fixed **x** the value of
*p_n*(**x**) depends on the random samples **x**_1, …, **x**_n. Thus,
*p_n*(**x**) has some mean p̄_n(**x**) and variance σ_n²(**x**), and we say that
the estimate *p_n*(**x**) converges to *p*(**x**) if

lim_{n→∞} p̄_n(x) = p(x)    (11.18)

and

lim_{n→∞} σ_n²(x) = 0    (11.19)

Figure 11.9: The effect of Parzen-window width *h _{n}* on the
estimated density.

To prove
convergence we must place conditions on the unknown density *p*(**x**),
on the window function φ(**u**), and on the window
width *h_n*. In general, continuity of *p*(·) at **x** is required, and the
conditions imposed by eqs. 11.13 and 11.14 are invoked. It can be shown that
the following additional conditions assure convergence:

sup_u φ(u) < ∞    (11.20)

lim_{‖u‖→∞} φ(u) Π_{i=1}^{d} u_i = 0    (11.21)

lim_{n→∞} V_n = 0    (11.22)

and

lim_{n→∞} n V_n = ∞    (11.23)

Equations 11.20 and
11.21 keep φ(·) well behaved, and they are satisfied by most density functions
that one might think of using for window functions. Equations 11.22 and 11.23
state that the volume *V_n* must approach zero, but at a rate
slower than 1/*n*.

Figure 11.10: A simulation of how the Parzen-Window method works.

The Parzen-window
estimate with a hypercube window can be considered a sum of boxes centered at
the observations, while a smooth kernel estimate is a sum of "bumps" placed at
the data points (Figure 11.10). The kernel function determines the shape of the
bumps. The parameter *h_n*, also called the *smoothing parameter* or
*bandwidth*, determines their width.
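A common smooth choice of kernel is a Gaussian, so the estimate becomes a sum of Gaussian bumps of width *h* centered at the samples. A minimal one-dimensional Python sketch (illustrative; the names and toy data are ours), which also checks numerically that the estimate integrates to one, as eqs. 11.13 and 11.14 guarantee:

```python
import math

def gaussian_parzen(x, samples, h):
    """Sum of Gaussian 'bumps' of bandwidth h, one per sample."""
    n = len(samples)
    norm = h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) / norm
               for xi in samples) / n

samples = [-1.0, -0.5, 0.0, 2.0, 2.5]
h = 0.4

# Riemann sum over [-10, 10]: a legitimate density integrates to ~1.
dx = 0.01
total = dx * sum(gaussian_parzen(i * dx, samples, h)
                 for i in range(-1000, 1001))
print(round(total, 3))  # → 1.0
```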

The problem of
choosing the window width is crucial in density estimation. A large window width will over-smooth the
density and mask the structure in the data. A small bandwidth will yield a
density estimate that is spiky and very hard to interpret (Figure 11.11). We would
like to find a value of the smoothing parameter that minimizes the error
between the estimated density and the true density. A natural measure is the
mean square error at the estimation point **x**, which decomposes into a
squared-bias term plus a variance term. This decomposition is an example of the
bias-variance tradeoff: the bias can be reduced at the expense of the variance,
and vice versa. The bias of an estimate is the systematic error incurred in the
estimation; the variance of an estimate is the random error. Applied to
window-width selection, the bias-variance dilemma simply means that a large
window width will reduce the differences among the estimates of the density
function for different data sets (the variance), while a small window width
will reduce the bias of the estimate, at the expense of a larger variance in
the estimates (Figure 11.12).

Figure 11.11: A simulation of the problem of choosing the window width.

Figure 11.12: The bias-variance tradeoff.

Consider first p̄_n(**x**), the mean of *p_n*(**x**). Because the samples
**x**_i are drawn i.i.d. according to *p*(**x**), we have

p̄_n(x) = E[p_n(x)]
       = (1/n) Σ_{i=1}^{n} E[(1/V_n) φ((x − x_i)/h_n)]
       = ∫ (1/V_n) φ((x − v)/h_n) p(v) dv
       = ∫ δ_n(x − v) p(v) dv    (11.24)

This equation shows
that the expected value of the estimate is an averaged value of the unknown
density: a *convolution* of the unknown density and the window function.
Thus, p̄_n(**x**) is
a blurred version of *p*(**x**) as seen through the averaging
window. But as *V_n* approaches zero, δ_n(**x** − **v**) approaches a delta
function centered at **x**, and so p̄_n(**x**) approaches *p*(**x**) wherever
*p* is continuous.

Equation 11.24
shows that there is no need for an infinite number of samples to make p̄_n(**x**) approach *p*(**x**);
one can achieve this for any *n* merely by letting *V_n* approach
zero. Of course, for a particular set of *n* samples the resulting "spiky"
estimate would be useless, which is why we must also consider the variance of
the estimate. Because *p_n*(**x**) is a sum of functions of statistically
independent random variables, its variance is the sum of the variances of the
separate terms, and hence

σ_n²(x) = Σ_{i=1}^{n} E[((1/(nV_n)) φ((x − x_i)/h_n) − (1/n) p̄_n(x))²]
        = (1/(nV_n)) ∫ (1/V_n) φ²((x − v)/h_n) p(v) dv − (1/n) p̄_n²(x)    (11.25)

By dropping the second term, bounding φ(·), and using eq. 11.24, we obtain

σ_n²(x) ≤ sup(φ(·)) p̄_n(x) / (n V_n)    (11.26)

Clearly, to obtain
a small variance we want a large value for *V_n*, not a small one:
a large *V_n* smooths out the local variations in density. However, because
the numerator of eq. 11.26 stays finite as *V_n* approaches zero, we can let
*V_n* approach zero and still drive the variance to zero, provided that *nV_n*
approaches infinity. For example, we can let V_n = V_1/√n or V_1/ln n, or any
other function satisfying eqs. 11.22 and 11.23.

This is the
principal theoretical result. Unfortunately, it does not tell us how to choose φ(·) and
*V_n* to obtain good results with a finite number of samples.

The Parzen-window method can be implemented as a multilayer neural network
known as a *probabilistic neural network* (PNN), shown in Figure 11.13. The
PNN is trained in the following way. First, each pattern **x** of the training
set is normalized to have unit length, that is, scaled so that Σ_i x_i² = 1.
The first normalized training pattern is placed on the input units. The
modifiable weights linking the input units and the first hidden unit are set
such that **w**_1 = **x**_1, and a single connection is made from the first
hidden unit to the output unit corresponding to the known class of that
pattern. The process is repeated for each of the remaining patterns, setting
**w**_k = **x**_k for k = 2, 3, …, n.

Figure 11.13: PNN implementation as multilayer neural network.

__Algorithm (PNN Training)__

**begin initialize** *j* ← 0, *n*, *a_ji* ← 0 for *j* = 1, …, *n*; *i* = 1, …, *c*

    **do** *j* ← *j* + 1

        normalize: *x_jk* ← *x_jk* / (Σ_i x_ji²)^{1/2}

        train: *w_jk* ← *x_jk*

        **if** **x**_j ∈ ω_i **then** *a_ji* ← 1

    **until** *j* = *n*

**end**

The trained network
is then used for classification in the following way. A normalized test
pattern **x** is placed at the input units. Each hidden unit computes the
inner product to yield the *net activation*, or simply *net*,

net_k = w_k^T x    (11.27)

and emits a
nonlinear function of *net_k*; each output unit sums the
contributions from all pattern units connected to it. The nonlinear function is

exp[(net_k − 1)/σ²]    (11.28)

where σ is a parameter set by the user, and where we have used the
normalization conditions **x**^T **x** = **w**_k^T **w**_k = 1.

__Algorithm (PNN Classification)__

**begin initialize** *k* ← 0, *n*, **x** ← test pattern

    **do** *k* ← *k* + 1

        *net_k* ← **w**_k^T **x**

        **if** *a_ki* = 1 **then** *g_i* ← *g_i* + exp[(net_k − 1)/σ²]

    **until** *k* = *n*

    **return** class ← arg max_i *g_i*(**x**)

**end**
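The two algorithms above can be sketched compactly in Python (an illustrative translation, not the chapter's MATLAB implementation; the names, toy data, and the default σ = 0.5 are ours). Training simply stores the normalized patterns as weight vectors (**w**_k = **x**_k); classification computes net_k = **w**_k^T **x** and accumulates exp[(net_k − 1)/σ²] per class, as in eqs. 11.27 and 11.28:

```python
import math

def normalize(x):
    """Scale x to unit length so that sum_i x_i**2 = 1."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def pnn_train(patterns, labels):
    """PNN training: store each normalized pattern as a weight vector w_k."""
    return [(normalize(x), c) for x, c in zip(patterns, labels)]

def pnn_classify(weights, x, sigma=0.5):
    """Each hidden unit emits exp((net_k - 1)/sigma**2); sum per class."""
    x = normalize(x)
    g = {}
    for w, c in weights:
        net = sum(w_i * x_i for w_i, x_i in zip(w, x))             # eq. 11.27
        g[c] = g.get(c, 0.0) + math.exp((net - 1.0) / sigma ** 2)  # eq. 11.28
    return max(g, key=g.get)

patterns = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = ['w1', 'w1', 'w2', 'w2']
net_weights = pnn_train(patterns, labels)
print(pnn_classify(net_weights, [1.0, 0.0]))  # → w1
```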

One of the benefits
of PNNs is their speed of learning, because the learning rule (i.e., setting
**w**_k = **x**_k) is simple and requires only a single pass through the
training data.

As we have seen,
one of the problems encountered in the Parzen-window/PNN approach concerns the
choice of the sequence of cell-volume sizes *V_1*, *V_2*, …, or the
overall window size (or indeed other window parameters, such as shape or
orientation). For example, if we take V_n = V_1/√n, the results for any finite
*n* will be very sensitive to the initial choice *V_1*; moreover, a cell size
appropriate for one region of feature space may be entirely unsuitable in
another.

These techniques
can be used to estimate the posterior probabilities *P*(ω_i|**x**) from a set
of *n* labeled samples. Suppose that we place a cell of volume *V* around
**x** and capture *k* samples, *k_i* of which turn out to be labeled ω_i.
Then the obvious estimate for the joint probability *p_n*(**x**, ω_i) is

p_n(x, ω_i) = (k_i/n) / V    (11.29)

and thus a
reasonable estimate for *P*(ω_i|**x**) is

P_n(ω_i|x) = p_n(x, ω_i) / Σ_{j=1}^{c} p_n(x, ω_j) = k_i / k    (11.30)

That is, the
estimate of the posterior probability that ω_i is the state of
nature is merely the fraction of the samples within the cell that are labeled ω_i.

When it comes to
choosing the size of the cell, it is clear that we can use either the
Parzen-window approach or the *k_n*-nearest-neighbor approach.
In the first case, *V_n* would be some specified function of *n*, such as
V_n = 1/√n; in the second, *V_n* would be expanded until some specified number
of samples, such as k_n = √n, were captured.

There are two
overarching approaches to nonparametric estimation for pattern classification:
In one the densities are estimated (and then used for classification), while in
the other the category is chosen directly. Parzen windows and their hardware
implementation, probabilistic neural networks, exemplify the former approach.
The latter is exemplified by *k*-nearest-neighbor and several forms of
relaxation networks.

A potential
solution for the problem of the unknown *best window function* is to let
the cell volume be a function of the training data, rather than some
arbitrary function of the overall number of samples. For example, to estimate *p*(**x**) from *n*
training samples or *prototypes*, we can center a cell about **x** and
let it grow until it captures *k_n* samples, where *k_n* is some specified
function of *n*. These samples are the *k_n* nearest neighbors of **x**.
Figure 11.14: *k _{n}*-Nearest Neighbor estimation (kNN).

Figure 11.15: The difference between estimation methods for density function.


If the density is
high near **x** (Figure 11.14), the cell will be relatively small, which leads to good resolution.
If the density is low, it is true that the cell will grow large, but it will
stop soon after it enters regions of higher density. In either case, if we take

p_n(x) = (k_n/n) / V_n    (11.31)

we want *k_n*
to go to infinity as *n* goes to infinity, since this assures that *k_n/n*
will be a good estimate of the probability that a point falls within the cell
of volume *V_n*; at the same time, we want *k_n* to grow sufficiently slowly
that the cell needed to capture *k_n* samples shrinks to zero volume.

If we base our
decision solely on the label of the single nearest neighbor of **x**, we can
obtain comparable performance. We begin by letting D^n = {**x**_1, …, **x**_n}
denote a set of *n* labeled prototypes, and letting **x**′ ∈ D^n be the
prototype nearest to a test point **x**. The nearest-neighbor rule then assigns
**x** the label associated with **x**′.

Figure 11.16: Eight points in one dimension and the
k-nearest-neighbor density estimates, for *k*=3
and 5. The discontinuities in the slopes in the estimates generally lie away
from the positions of the prototype points.

An extension of the
nearest-neighbor rule is the *k-nearest-neighbor rule.* In the *k*NN
method we grow the volume surrounding the estimation point **x** until it
encloses a total of *k* samples, and the density estimate becomes

p(x) ≅ (k/n) / V = k / (nV)    (11.32)

Since the volume is
chosen to be spherical, the volume *V* can be written in terms
of the volume *c_d* of a unit sphere in *d* dimensions times the *d*th power
of the distance R_k(**x**) between the estimation point **x** and its *k*th
nearest neighbor:

V = c_d R_k^d(x)    (11.33)

Combining eqs. 11.32 and 11.33, the density estimate becomes

p(x) = k / (n c_d R_k^d(x))    (11.34)

where R_k(**x**) is the distance between **x** and its *k*th closest neighbor,
and *c_d* is the volume of the unit sphere in *d* dimensions.
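In one dimension the unit "sphere" is the interval [−1, 1], so c_1 = 2 and V = 2R_k. A minimal Python sketch of eq. 11.34 under that assumption (illustrative; the names and toy data are ours):

```python
def knn_density_1d(x, samples, k):
    """One-dimensional kNN density estimate, eq. 11.34 with c_1 = 2."""
    dists = sorted(abs(x - xi) for xi in samples)
    R_k = dists[k - 1]              # distance to the k-th nearest neighbor
    n = len(samples)
    return k / (n * 2.0 * R_k)      # V = c_1 * R_k = 2 * R_k

samples = [0.0, 1.0, 2.0, 4.0, 8.0]
# At x = 2 with k = 3: the 3rd nearest neighbor lies at distance 2
# (the sample at 0 or at 4), so p = 3 / (5 * 2 * 2) = 0.15
print(knn_density_1d(2.0, samples, k=3))  # → 0.15
```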

The volume of the unit sphere *c_d* is given by

c_d = π^{d/2} / Γ(d/2 + 1)    (11.35)

where c_1 = 2, c_2 = π, c_3 = 4π/3, and so on (Figure 11.17).
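Eq. 11.35 is easy to evaluate with the gamma function; the Python sketch below (illustrative; the function name is ours) reproduces the special cases listed above:

```python
import math

def unit_sphere_volume(d):
    """c_d = pi**(d/2) / Gamma(d/2 + 1), eq. 11.35."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)

for d in (1, 2, 3):
    print(d, unit_sphere_volume(d))
# d=1: 2,  d=2: pi ≈ 3.14159,  d=3: 4*pi/3 ≈ 4.18879
```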


Figure 11.17: a. The simulation of the *k*-nearest neighbor
method, and b. *k*=5 case. In this case, the test point x would be labeled the category of the
black points.


In general, the
estimates that can be obtained with the *k*NN method are not very satisfactory.
The estimates are prone to local noise, and the method produces estimates with
very heavy tails. Since the function R_k(**x**) is not differentiable, the
density estimate has discontinuous derivatives; moreover, the resulting
estimate is not a true density, since its integral over the whole sample space
diverges.


Figure 11.18: An illustration of the performance of *k*-nearest
neighbor method. a. The true density which is a mixture of two bivariate
Gaussians, and b. the density estimate for *k*=10 and *n*=200
examples.


Figure 11.19: Contour plot of Figure 11.18. a. The true density, and b. the kNN density estimate.

** **

The main advantage
of the *k*NN method is that it leads to a very simple approximation of the
(optimal) Bayes classifier. Assume
that we have a dataset with *n* examples, *n_i* of them from class
ω_i, and that we are interested in classifying an unknown
sample **x**. We draw a hypersphere of volume *V* around **x** that captures a
total of *k* examples, *k_i* of which belong to class ω_i. The
class-conditional likelihood is then estimated as

p(x|ω_i) = k_i / (n_i V)    (11.36)

Similarly, the unconditional density is estimated by

p(x) = k / (nV)    (11.37)

and the priors are approximated by

P(ω_i) = n_i / n    (11.38)

Putting everything together, the Bayes classifier becomes

P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x) = (k_i/(n_i V)) (n_i/n) / (k/(nV)) = k_i / k    (11.39)
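Note how the volume V and the counts n_i and n cancel when eqs. 11.36 through 11.38 are combined. A quick numeric check in Python (illustrative; the numbers are made up):

```python
def knn_posterior(k_i, n_i, k, n, V):
    """P(w_i|x) assembled from eqs. 11.36-11.38; reduces to k_i / k."""
    p_x_given_wi = k_i / (n_i * V)    # eq. 11.36
    prior = n_i / n                   # eq. 11.38
    p_x = k / (n * V)                 # eq. 11.37
    return p_x_given_wi * prior / p_x

# 5 captured neighbors, 3 from class w_i: the posterior should be 3/5.
print(knn_posterior(k_i=3, n_i=40, k=5, n=100, V=0.2))  # ≈ 0.6
```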

The *k*-Nearest
Neighbor Rule (*k*NN) is a very intuitive method that classifies unlabeled
examples based on their similarity to examples in the training set. For a given unlabeled example **x**_j, find the
*k* closest labeled examples in the training set and assign **x**_j to the
class that appears most frequently among them. The *k*NN rule requires:

· An integer *k*,

· A set of labeled examples (training data), and

· A metric to measure "closeness".

In the example in Figure
11.20, we have three classes and the goal is to find a class label for the
unknown example **x**_j. In this case we use the Euclidean distance and a value
of k = 5 neighbors.

Figure 11.20: kNN classification example.
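The classification rule itself takes only a few lines. A self-contained Python sketch (illustrative; the toy data and names are ours) using Euclidean distance and a majority vote:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))

def knn_classify(train, x, k):
    """Assign x the most frequent label among its k nearest training examples."""
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [([0.0, 0.0], 'A'), ([0.2, 0.1], 'A'),
         ([1.0, 1.0], 'B'), ([0.9, 1.1], 'B'), ([1.1, 0.9], 'B')]
print(knn_classify(train, [0.1, 0.1], k=3))  # → A
```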

*k*NN is considered as a *lazy* learning algorithm. It *defers *data processing until it receives
a request to classify an unlabelled example, *replies *to a request for
information by combining its stored training data, and *discards *the
constructed answer and any intermediate results.

__Advantages: __

· Analytically tractable,

· Simple implementation,

· Nearly optimal in the large sample limit (*n* → ∞),

· Uses local information, which can yield highly adaptive behavior, and

· Lends itself very easily to parallel implementations.


__Disadvantages__:

· Large storage requirements,

· Computationally intensive recall,

· Highly susceptible to the curse of dimensionality.


The probabilistic neural network and the classification algorithm are simulated using MATLAB. The network used for this demonstration is shown in Figure 11.21. Given the samples of three different classes as shown in Figure 11.22, the training is implemented; a new sample is then presented to the network for classification, as shown in Figure 11.23. The PNN effectively divides the input space into three classes, as shown in Figure 11.24.

__Matlab script:__ ccPnn.m

Figure 11.21: The PNN structure used for the classification example.

Figure 11.22: The samples of the three classes used for training.

Figure 11.23: Classification of a new sample by the trained PNN.

Figure 11.24: The division of the input space into three classes by the PNN.

[1] Duda, R.O., Hart, P.E., and Stork, D.G. (2001). *Pattern Classification* (2nd ed.). New York: Wiley-Interscience.

[2] *Neural Network Toolbox for Use with MATLAB*, User's Guide, ver. 4.0 (2003). The MathWorks, Inc.

[3] Gutierrez-Osuna, R. *Introduction to Pattern Analysis*, Course Notes, Department of Computer Science, Texas A&M University.

[4] Sima, J. (1998). *Introduction to Neural Networks*, Technical Report No. V-755, Institute of Computer Science, Academy of Sciences of the Czech Republic.

[5] Kröse, B., and van der Smagt, P. (1996). *An Introduction to Neural Networks* (8th ed.). University of Amsterdam Press.

[6] Gurney, K. (1997). *An Introduction to Neural Networks* (1st ed.). London: UCL Press.

[7] Paplinski, A.P. *Neural Nets*, Lecture Notes, Department of Computer Science and Software Engineering, Monash University, Clayton, Australia.

[8] Bain, L.J., and Engelhardt, M. (1991). *Introduction to Probability and Mathematical Statistics* (2nd ed.). California: Duxbury.