Learning Mechanisms which Construct Neighbourhood Representations -- ABSTRACT ------------------------------------------ Learning is currently the focus of much research activity in cognitive science. But, typically, this research is oriented towards either the symbol-processing paradigm or the connectionist paradigm and therefore tends to generate models of two quite different types. A satisfactory *theory* of learning will, presumably, deal with the phenomenon in general terms rather than in terms of two special cases, so there would appear to be a need to try to identify abstractions which generalise the two types of model typically produced. The aim of the present paper is to identify one such abstraction. It puts forward a framework in which the behaviours of two symbol-processing learning mechanisms are directly commensurable with the behaviours of three connectionist learning mechanisms and shows that there is at least one theoretical constraint affecting *all* mechanisms covered by the framework. -- Introduction -------------------------------------- At the present time computational research on learning falls into two broad categories. Firstly, there is research concentrating on the learning properties of symbol processing (SP) mechanisms, e.g. [Michalski et al. BOOK 1983,Michalski et al. BOOK 1986]; secondly there is research concentrating on the learning properties of parallel distributed processing (PDP) mechanisms, e.g. [Rumelhart McClelland 1986]. The first category of research is in the traditional AI paradigm; the second, the connectionist paradigm. At the implementation level, connectionist learning mechanisms have little in common with symbol-processing learning mechanisms. In order to perceive any commonality at all it is necessary to move to a more abstract view of the learning task where implementation details are effectively hidden. One framework providing such a view has been presented by Valiant [Valiant 1984,Valiant 1985]. 
Another has been presented by Michalski [Michalski, Understanding the Nature of Learning, 1986]. In Michalski's framework, learning is defined as the process of "constructing or modifying representations of what is being experienced." [Michalski, Understanding 1986, p. 10]. This seems to apply quite well to both PDP and symbol processing systems, c.f., [Rumelhart McClelland 1986,Quinlan 1983,Mitchell Utgoff Banerji 1983]; but although the definition is very general it is also, necessarily, quite vague. It has quite different meanings depending on the interpretation we give to "representation". In the present paper, we explore the consequences of assuming that "representation" as used in the Michalski definition means something specific, namely "neighbourhood representation". We first describe what sort of structure a neighbourhood representation is. We then try to show that the behaviour of two symbol processing mechanisms and three PDP mechanisms can be interpreted in terms of its construction. Finally, we argue that this particular strategy is subject to a constraint, and that the mechanisms which exploit it are therefore limited in terms of what particular types of experience (i.e. input) they can represent. The remainder of the paper is organised as follows. In the second section, the neighbourhood representation scheme is characterised and defined. Section three presents an illustrative example of the scheme. Sections four and five argue (respectively) that the behaviour of both the Classification algorithm [Quinlan 1983] and the Candidate-Elimination algorithm [Mitchell 1977] can be understood in terms of the construction of neighbourhood representations. The following three sections put forward similar arguments for the Perceptron [Rosenblatt 1962], the Pattern Associator [McClelland Rumelhart 1988] and the Back-Propagation mechanism [Rumelhart McClelland 1986].
Section nine argues that mechanisms which can only form neighbourhood representations are unable to effectively represent certain classes of input. The final section presents a summary of the points made.
-- Neighbourhood representations ---------------------
.nr Bp \n(Bp-2
.BS
                 unrestricted
                      |
           ------------------------
           |                      |
      non-coupled              coupled
                                  |
               -----------------------------
               |                           |
          generative                classificatory
               |                           |
        --------------              --------------
        |            |              |            |
 deterministic probabilistic  deterministic probabilistic
.BE +
The word "representation" has been used in AI to denote computational structures of various different types. Specific examples can be characterised in terms of the type of relationship which is presumed to hold between the representation in question, and the represented object or objects. Let us assume that C denotes an arbitrary set of objects and that c denotes a representation of C. Now, c may represent C merely by virtue of the fact that we imbue it with a certain interpretation; c.f., a list of names considered as a representation for a group of people. Alternatively, c may represent C in a more formal way: it may be a program which generates C, or it may be a predicate which determines whether some arbitrary object is a member of C. ((In this latter case, c forms a *concept* for the class.)) Let us say that c is "coupled" only if it has a formal, computational relationship with C. Let us say that it is "classificatory" if it forms a predicate which correctly classifies descriptions according to whether or not they are in C, and that it is "generative" if it is a program which generates C. Finally, let us say that it is "deterministic" if it generates/classifies (elements of) C perfectly, or "probabilistic" if it does so to some given level of accuracy. These distinctions are summarised in the taxonomy shown in Figure \nF. We now define a *neighbourhood representation* as a coupled representation of a certain type.
D will be assumed to denote the data language.((In concept learning work D would normally be called the description language. In connectionist work, it would be called the input language or input space.)) C will continue to label the set of elements which is represented by the hypothetical representation c. We will say that any coupled representation forms a neighbourhood representation if there is an interpretation of c in which the represented set C is defined in terms of the points inside m non-degenerate regions of an n-dimensional space.
-- Example -------------------------------------------
Imagine that a data element is a 2-tuple with the first component drawn from the set {huge vast big tiny med micro} and the second component drawn from the set {moped bike jet car prop glider}. Possible elements are then pairs such as <big, car> and <tiny, jet>.
.BS
        -----------------------------------------------------
micro  |        |        |        |        |        |        |
       |--------############################--------|--------|
tiny   |        #        |        |        #        |        |
       |--------#--------|--------|--------#--------|--------|
med    |        #        |        |        #        |        |
       |--------#--------|--------|--------#--------|--------|
big    |        #        |        |        #        |        |
       |--------############################--------|--------|
huge   |        |        |        |        |        |        |
       |--------|--------|--------|--------|--------|--------|
vast   |        |        |        |        |        |        |
        -----------------------------------------------------
          bike     moped     car      prop     jet    glider
.BE +
.ne 5
C is a set of nine 2-tuples:

{ <tiny, moped> <tiny, car> <tiny, prop>
  <med, moped>  <med, car>  <med, prop>
  <big, moped>  <big, car>  <big, prop> }

Consider the case where each set of terms forms the sequence of dimensional values for one dimension of the 2-dimensional space shown in Figure \nF. In this context, C can be represented in terms of the points inside a single region of the space. The region might be defined in terms of the 2-tuple corresponding to its top-left cell, and the 2-tuple corresponding to its bottom-right cell; i.e. <tiny, moped> and <big, prop>. A representation of this type counts as a neighbourhood representation because it defines C in terms of the points inside m regions of the space (with m = 1).
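The region-membership reading of this example can be made concrete with a short sketch. The paper itself contains no code; the following Python illustration is ours, with the dimension orderings and the region corners taken from the figure:

```python
# Illustrative sketch (not part of the original paper): a neighbourhood
# representation of the example class C as a single region of the
# 2-dimensional space, defined by its top-left and bottom-right cells.

SIZE = ["micro", "tiny", "med", "big", "huge", "vast"]       # dimension 1
VEHICLE = ["bike", "moped", "car", "prop", "jet", "glider"]  # dimension 2

TOP_LEFT = ("tiny", "moped")
BOTTOM_RIGHT = ("big", "prop")

def in_region(element):
    """True iff the 2-tuple lies inside the region (a neighbourhood test)."""
    size, vehicle = element
    return (SIZE.index(TOP_LEFT[0]) <= SIZE.index(size) <= SIZE.index(BOTTOM_RIGHT[0])
            and VEHICLE.index(TOP_LEFT[1]) <= VEHICLE.index(vehicle)
                <= VEHICLE.index(BOTTOM_RIGHT[1]))

# The elements of C are exactly the cells inside the region: a 3 x 3 block.
C = [(s, v) for s in SIZE for v in VEHICLE if in_region((s, v))]
```

Note that the representation is just the pair of corner tuples; the set C itself is recovered from it by the membership test, which is what makes it a coupled (classificatory) representation.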
-- Decision trees as neighbourhood representations ---
We now turn attention to the task of showing that the behaviour of some learning mechanisms can be interpreted in terms of the *construction* of neighbourhood representations. We will first consider two symbol processing mechanisms. Later we will consider some PDP mechanisms. One of the largest families of symbol processing learning mechanisms is derived from Quinlan's ID3 program [Quinlan, Induction of Decision Trees, 1986,Quinlan, Learning from Noisy Data, 1986,Quinlan, 1983,Quinlan, 1979,Shapiro Niblett 1982]. The main component in this program is the Classification algorithm [Bundy Silver Plummer 1985] which has roots in the Concept Learning System of Hunt, Marin and Stone [Hunt Marin Stone, 1966]. The Classification algorithm builds a decision tree characterising a given class of elements (i.e. instances). The tree can be interpreted as a classification rule or as a concept definition. The data (description) language for the Classification algorithm is a feature space and dimensions of this space range over sets of mutually exclusive feature values. If a data element has n attributes then there are n mutually exclusive feature sets and n dimensions in the feature space.
.BS
                   "colour"
                      |
       ---------------------------
       |             |           |
     green          blue        red
       |             |           |
    "shape"         yes       "shape"
       |                         |
 ------------------      ------------------
 |       |        |      |       |        |
brick  sphere  wedge   brick  sphere  wedge
 |       |        |      |       |        |
 no      no      yes     no      no      yes
.BE +
Consider the case where there are just two mutually exclusive sets of attributes: {red green blue} and {brick sphere wedge}. These sets contain the possible values for a "colour" feature and a "shape" feature respectively. The set C which we want to represent is known to contain the pairs <green, wedge>, <red, wedge>, <blue, brick>, <blue, sphere>, and <blue, wedge>; but to exclude <red, brick> and <green, sphere>. The sort of decision tree which might be learned by the Classification algorithm presented with the set of included and excluded elements (i.e. positive and negative examples) is shown in Figure \nF.
It corresponds to the disjunctive rule

	blue or (green and wedge) or (red and wedge)

which might be abbreviated to

	(blue or wedge)

For this to count as a neighbourhood representation it must be the case that there is an interpretation of the rule in which C is defined in terms of the points inside m regions of some n-dimensional space. Note that the exclusion of a specific feature value from a conjunct in a rule of the type shown captures the fact that elements of C may have *any* value for that feature. Thus each conjunct effectively identifies a certain region of the feature space. The boundaries of the region in any dimension (feature) which is unspecified enclose all the values in the dimension. The boundaries in any dimension which is specified enclose just the specified value.\**
.(f
\** The sequencing of dimensional values in this case is irrelevant.
.)f
Thus the rule (blue or wedge) which correctly classifies elements of C, can be interpreted as representing C in terms of a set of points within two rectangular regions of the feature space as shown in Figure \n+F. Descriptions which are known to be in C are shown as plus symbols. Descriptions which are known to be excluded are shown as minus symbols.
.BS
        ---------------ooooooooo
       |       |       o       o
red    |   -   |       o   +   o
       |       |       o       o
       |-------|-------o-------o
       |       |       o       o
green  |       |   -   o   +   o
       |       |       o       o
       #########################
       #       |       o       #
blue   #   +   |   +   o   +   #
       #       |       o       #
       #########################
         brick   sphere  wedge
.BE #
-- The Candidate-Elimination Algorithm ---------------
Another large family of symbol processing learning mechanisms centres on Mitchell's Candidate-Elimination algorithm [Mitchell 1977,Mitchell 1982]. This has (conceptual) roots in Winston's ARCH program [Winston 1970,Winston 1975] and has recently been provided with a simple implementation in the form of the Focussing algorithm [Bundy Silver Plummer 1985].((In fact, Focussing may not perfectly implement the Candidate-Elimination algorithm, see [Bundy Silver Plummer 1985, p.
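The equivalence between the two readings of the rule (a predicate, and a union of rectangular regions) can be checked mechanically. The following Python sketch is our own illustration; all names in it are ours:

```python
# Sketch (our illustration): the rule "(blue or wedge)" read both as a
# classificatory predicate and as a neighbourhood representation -- a
# union of two rectangular regions of the 2-dimensional feature space.

COLOUR = ["red", "green", "blue"]
SHAPE = ["brick", "sphere", "wedge"]

def rule(element):
    """The abbreviated decision-tree rule, as a predicate."""
    colour, shape = element
    return colour == "blue" or shape == "wedge"

# Each conjunct fixes some features and leaves the rest unrestricted,
# so it encloses a full row or column of the space: the "blue" row and
# the "wedge" column.
REGIONS = [
    {"colour": ["blue"], "shape": SHAPE},    # the blue row
    {"colour": COLOUR, "shape": ["wedge"]},  # the wedge column
]

def in_some_region(element):
    colour, shape = element
    return any(colour in r["colour"] and shape in r["shape"] for r in REGIONS)

# The two interpretations agree on every point of the feature space.
assert all(rule((c, s)) == in_some_region((c, s))
           for c in COLOUR for s in SHAPE)
```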
162].)) Systems using this algorithm include the LEX system [Mitchell Utgoff Banerji 1983] and the AQ programs associated with Michalski and co-workers [Michalski Chilausky 1980,Dietterich London 1982]. The Candidate-Elimination algorithm performs the task of *learning from examples*. That is to say, it derives a representation for some concept c from a sequence of positive and negative examples of c. Intermediate steps in the algorithm involve the manipulation of a version space and termination occurs when the version space becomes empty [Mitchell 1977]. The final output is a representation which covers (i.e. returns a true classification for) all the presented positive examples but none of the presented negative examples.
.BS
        undef_sized                       transport
            |                                 |
    -----------------                ------------------
    |               |                |                |
  small           large           vehicle           plane
    |               |                |                |
-----------    -----------     ------------     -----------
|    |    |    |    |    |     |    |     |     |    |    |
micro tiny med big vast huge  bike moped car  prop jet glider
.BE +
There are a variety of possible implementations for the Candidate-Elimination algorithm. We will consider the case where the data language is defined in terms of n generalisation hierarchies, c.f., [Bundy Silver Plummer 1985]. In this context the data (description) language D is just the cartesian product of the n sets of leaf nodes and a representation (concept) is any element of the cartesian product of the n sets of nodes. A representation covers all the elements of D which can be "grammatically derived" from it. An element can be grammatically derived from some representation if it can be constructed from leaf nodes which are descended from the nodes in the representation [Utgoff, Shift of Bias, 1986].
.nr T \nF
.BS
                            transport
                                |
                     ----------------------
                     |                    |
                 # vehicle              plane
                     |                    |
             ---------------       ---------------
             |      |      |       |      |      |
            bike  moped   car    prop   jet   glider

                      ########################-----------------------
         ----micro    #      |      |       #       |      |       |
         |            #------|------|-------#-------|------|-------|
 # small-- tiny       #      |      |       #       |      |       |
   |     |            #------|------|-------#-------|------|-------|
   |     ----- med    #      |      |       #       |      |       |
undef_sized           ########################------|------|-------|
   |     ----- big    |      |      |       |       |      |       |
   |     |            |------|------|-------|-------|------|-------|
   ---- large-- huge  |      |      |       |       |      |       |
         |            |------|------|-------|-------|------|-------|
         ---- vast    |      |      |       |       |      |       |
                      ----------------------------------------------
.ES 40
Figure \n+F. Region enclosing data elements (i.e. cells) covered by <small, vehicle>.
.BE
Consider the generalisation hierarchies shown in Figure \nT. These define a data language in which elements are 2-tuples such as <micro, bike> and <vast, glider>. Representations are 2-tuples which may contain non-terminals; e.g. <small, plane>. Note that in an n-dimensional space where each boundary ranges over the left-to-right sequence of the leaf nodes from one of the n generalisation hierarchies, specific elements correspond to specific cells of the space. Moreover, each term in a representation effectively defines a subrange of the corresponding dimension. In this context, a complete representation defines a rectangular (in general, a hyper-rectangular) region of the space. This region encloses all the cells corresponding to covered elements; see Figure \nF. We can, therefore, legitimately interpret a representation constructed by the Candidate-Elimination algorithm (in the case where the instance language is defined using n generalisation hierarchies) as a neighbourhood representation [Thornton, Hypercuboid-Formation Behaviour, Proceedings of, 1987].
-- The Perceptron ------------------------------------
Attention now turns from symbol processing learning mechanisms to PDP mechanisms.
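The cover test ("grammatical derivation") over generalisation hierarchies can be sketched in a few lines. This Python illustration is ours; the hierarchy tables transcribe the figures:

```python
# Sketch (our illustration) of the cover test for representations built
# over n generalisation hierarchies: an element is covered when each of
# its leaf terms descends from the corresponding term of the representation.

CHILDREN = {
    "undef_sized": ["small", "large"],
    "small": ["micro", "tiny", "med"],
    "large": ["big", "vast", "huge"],
    "transport": ["vehicle", "plane"],
    "vehicle": ["bike", "moped", "car"],
    "plane": ["prop", "jet", "glider"],
}

def leaves(node):
    """All leaf nodes grammatically derivable from a hierarchy node."""
    kids = CHILDREN.get(node, [])
    if not kids:
        return {node}
    return set().union(*(leaves(k) for k in kids))

def covers(representation, element):
    """True iff every component of the element descends from the
    corresponding term of the representation."""
    return all(e in leaves(r) for r, e in zip(representation, element))
```

Under the left-to-right leaf orderings of the hierarchies, the cells covered by a representation such as <small, vehicle> form exactly the contiguous block shown in the figure.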
For present purposes, a "PDP learning mechanism" will be taken to be a combination of some specific network architecture and some specific learning rule. In fact, we will only consider mechanisms which use the delta learning rule (in either its generalised or ungeneralised form) [Rumelhart & McClelland, 1986, chapters 2 and 8]. We will look at both the Pattern Associator (also known as the associative net) and the back-propagated feed-forward net which we will refer to as the Back-Propagation mechanism. But first we will consider a very simple case: that of the Perceptron [Rosenblatt 1962]. This mechanism has been subjected to a rather rigorous analysis by Minsky and Papert [Minsky Papert 1988] and in fact the interpretation of its behaviour which we will develop is close to the usual one for this mechanism; c.f., [Forsyth, Machine Learning Strategies, 1984]. The essential architecture of the Perceptron is very simple. A number of input units are connected to one output unit and each connection is associated with a weight. The set of weights is referred to as the weight vector. Elements of the data language (i.e. inputs) are sets of activation values for the input units (i.e. input vectors) and each activation value is just a real number between 0 and 1. The output unit is associated with a threshold and produces a 1 if the inner product of the input vector and the weight vector (i.e. the sum produced by multiplying each input value with the appropriate connection weight) exceeds the threshold; otherwise it outputs a 0. The Perceptron has derived a classificatory representation for some set C if it always outputs a 1 whenever the input vector is in C, and 0 in all other cases. This representation consists of the weight vector and the threshold. Since the elements of D are n-tuples of real numbers they are very easily interpreted as points in an n-dimensional space. 
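The Perceptron's decision rule just described can be transcribed directly. The following Python sketch is our own illustration; the weight vector and threshold in it are arbitrary choices, not learned values:

```python
# Sketch (our illustration) of the Perceptron's classificatory
# representation: a weight vector plus a threshold, applied to an
# input vector of activation values in [0, 1].

def perceptron(weights, threshold, inputs):
    """Output 1 iff the inner product of the input vector and the
    weight vector exceeds the threshold; otherwise output 0."""
    inner = sum(w * x for w, x in zip(weights, inputs))
    return 1 if inner > threshold else 0

# A hypothetical weight vector and threshold (ours, for illustration):
weights, threshold = [0.6, 0.6], 0.9
assert perceptron(weights, threshold, [1.0, 1.0]) == 1   # 1.2 > 0.9
assert perceptron(weights, threshold, [1.0, 0.0]) == 0   # 0.6 < 0.9
```

The representation here is just the pair (weights, threshold); the inputs it accepts are exactly the points on one side of the corresponding hyperplane.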
In this case the weight vector and threshold form a hyperplane separating the inputs which produce a 1 output from the inputs which produce a 0 output [Minsky Papert 1988]. This representation effectively treats the data (input) language as a space and defines the represented class C in terms of the points in a single region of this space; i.e. the region falling to one side of the hyperplane. Representations constructed by the Perceptron are therefore legitimately viewed as neighbourhood representations. -- The Pattern Associator ---------------------------- The Pattern-Associator (or associative net) is closely related to the Perceptron; its basic architecture is quite similar, but whereas, in the Perceptron, the n input units are connected to a single output unit, in the Pattern Associator, the n input units are connected to a *set* of m output units. Furthermore, there is no thresholding. The output of the mechanism for a given input is the set of activations of the output units. In general, the Pattern-Associator is construed as a mechanism which forms associations between input and output patterns (hence the name). However, for present purposes we are interested in viewing its behaviour in terms of the construction of coupled representations. We will say that a Pattern-Associator has formed a classificatory representation for some set of inputs C if it always gives the same output when presented with an element of C, but gives a different output in all other cases. The activation of a given output unit for a given input is the inner product of the input vector with the output unit's weight vector. If vectors are normalised the inner product of two vectors is just the projection of one on the other [Rumelhart McClelland 1986]. This means that we can think about the relationship between an input vector and a given activation of an output unit in geometric terms. 
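The projection reading of the inner product can be checked numerically. The sketch below is our own illustration (the vectors chosen are arbitrary):

```python
# Sketch (our illustration): for normalised vectors, the inner product
# of an input vector with a weight vector equals the length of the
# projection of one onto the other -- which is what licenses the
# geometric reading of output-unit activations.

import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    length = math.sqrt(inner(v, v))
    return [a / length for a in v]

u = norm([3.0, 4.0])   # the unit vector (0.6, 0.8)
v = [1.0, 0.0]         # a unit vector along the first axis

# The projection of u on v is u's component along v, here 0.6, and this
# is exactly the inner product of the two (unit-length) vectors.
assert abs(inner(u, v) - 0.6) < 1e-9
```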
.BS
.so pics/pa_1invec.di
.BE +
.nr T \nF
.BS
.so pics/pa_2invec.di
.BE +
.nr X \nF
.BS
.so pics/pa_3invec.di
.BE +
.ds Od \\f2O\\d\\s-2d\\s+2\\u\\f1
Let us consider the special case where the Pattern-Associator has only one output unit u and input vectors have only two components. We know that the activation of u for some particular input is just the projection of the input vector on u's weight vector. This means that, to achieve some particular activation \*(Od for a given input, u's weight vector must be positioned such that its tip lies somewhere along a perpendicular bisecting the input vector at a point corresponding to the desired output; see Figure \nT. (Note that the vectors in the diagrams are not drawn to scale.) To achieve some particular output for two different input vectors, the tip of the weight vector must lie along two different perpendiculars (i.e. the perpendiculars which strike the input vectors at the point corresponding to \*(Od). This means that the tip of the weight vector must be positioned at the intersection of the two perpendiculars; see Figure \nX.
.nr X \nF
.BS
.so pics/pa_error1.di
.BE +
In the case where we want u to give the same output for three different input vectors, there will be three perpendiculars and no unique point of intersection. However, the outputs provided by u will be as near as possible to \*(Od if the total distance between the tip of the weight vector and the perpendiculars is minimised; see Figure \nX. Application of the delta learning rule is, in fact, guaranteed to achieve this effect [Rumelhart McClelland 1986]. The general implication is that the Pattern-Associator cannot form deterministic representations for more than two data elements (input vectors) in the case where elements have only two components.
In general, it cannot form deterministic representations for more than n elements in an n-dimensional input space.((This is just a manifestation of the fact that the Pattern-Associator can only learn to associate *linearly independent* patterns [Rumelhart McClelland 1986].)) Probabilistic representations are a possibility however. We can view the Pattern-Associator as having formed a representation for some set of elements C if the output given by u in the case where the input is in C is always within some small range of values. This allows us to view the weight vector shown in Figure \nF as a representation for a class C consisting of the three elements shown. The output produced for any input in C will always be within some given range centred on the desired value \*(Od, as shown in Figure \nF.
.BS
.so pics/pa_gen1.di
.BE +
The interesting thing about this representation is that it automatically generalises all vectors which lie between the given three inputs. The output produced by u will always be within the given range whenever there is a perpendicular through the input vector which (1) passes through the tip of the weight vector and (2) strikes the input vector at a point which is within the defined output range; see Figure \nF. The set of elements satisfying this description form a region in the vector space; see Figure \n+F. Thus, in this context, we can interpret the representations formed by the Pattern-Associator as neighbourhood representations.
.BS
.so pics/pa_gen2.di
.BE #
-- The Back-Propagation Mechanism --------------------
We now turn attention to the back-propagated, feed-forward net [Rumelhart McClelland 1986, chapter 8], which we will refer to as the "Back-Propagation mechanism". The architecture of this mechanism consists of a layer of input units which feed activation to n layers of hidden units. Units in the last layer of hidden units feed activation to n output units.
The Perceptron and Pattern-Associator correspond to special cases of the Back-Propagation mechanism. The Perceptron corresponds to the case where there is one output unit, no hidden units, and the activation function used is effectively a threshold (step) function. The Pattern-Associator corresponds to the case where there are n output units, no hidden units and activations are computed using a linear function. Since the Back-Propagation mechanism generalises both the Perceptron and the Pattern-Associator we can infer that it is capable of forming neighbourhood representations in data space in both Perceptron-like fashion and Pattern-Associator-like fashion. It turns out that these are not the only possibilities. Let us consider the classic case in which Back-Propagation is used to construct a network which can compute the exclusive-or relation. This is, of course, the paradigm example of a task which cannot be performed by the Perceptron [Minsky Papert 1988]. We need not concern ourselves with the details of the learning mechanism (the generalised delta rule) since we are only interested in the form of the representations which are learned. In order to maintain uniformity, we will view a network which can compute exclusive-or as a network which has effectively constructed a representation for a certain subset of data elements (input vectors). That is to say, we will consider that the ability to compute the exclusive-or of pairs of binary inputs is the same as having a classificatory representation for the set {<0,1>, <1,0>} in the case where the data language is just the set of all pairs of binary digits. Figure \n+F is a representation of the sort of network which might be produced by Back-Propagation for the exclusive-or task (after [Rumelhart McClelland, 1986, p. 321]). 
Small rectangles correspond to units: rectangles in the lowest layer are the input units, the rectangle in the middle layer is the (single) hidden unit while the unit in the top layer is the (single) output unit. Lines between rectangles correspond to links between units. The number appearing in the top-left corner of a rectangle is a label for the corresponding unit, while the number appearing within a rectangle is the unit's threshold. Numbers on lines are just the weights on the corresponding links. The network works in a very simple way. The result of turning both input units on (i.e. of presenting the element <1,1>), is that the hidden unit comes on and turns the output unit off. If an output of 0 is interpreted as implying that the input is *not* in the represented set, then the behaviour in this case corresponds to a correct classification decision. The result of turning just one input unit on and the other off is that the hidden unit remains off but the output unit comes on. The result of turning neither input unit on is that both the hidden unit and the output unit stay off. All of these cases correspond to correct classifications.
.BS
                 ^
                 |
                 |
              4 ---
               |0.5|
                ---
         -------|-------
        |       |-2     |
        |       |       |
        |    3 ---      |
        |+1    |1.5|  +1|
        |       ---     |
        |      |   |    |
        |    +1|   |+1  |
      1 --     |   |  2 --
        |  ----     ----|
        --              --
        ^               ^
        |               |
.ES 35
\*(Fs. Exclusive-OR Network (after [Rumelhart McClelland 1986, p. 321])
.BE
Is there any way of viewing the overall state of the network as a neighbourhood representation for the set {<1,0>, <0,1>}? Each unit in the network can be in one of two states (either on or off). Thus we can view the set of possible states for the two non-input units in terms of a 2-dimensional space in which each boundary ranges over the possible states of one unit. If we then write into each cell in this space the data element which achieves the given state (of the two non-input units), we find that the elements in the represented set occupy contiguous cells.
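The behaviour just walked through can be checked against a direct transcription of the network. The Python sketch below is our own; the weights and thresholds are those shown in the figure:

```python
# Sketch (our illustration) of the exclusive-or network of the figure:
# one hidden unit (threshold 1.5), one output unit (threshold 0.5),
# weights +1 from each input to both the hidden and output units, and
# -2 from the hidden unit to the output unit.

def step(x, threshold):
    """Threshold (step) activation: on iff input exceeds the threshold."""
    return 1 if x > threshold else 0

def xor_net(i1, i2):
    hidden = step(i1 + i2, 1.5)             # comes on only for <1,1>
    return step(i1 + i2 - 2 * hidden, 0.5)  # hidden unit vetoes <1,1>

# The net forms a classificatory representation of {<0,1>, <1,0>}:
assert [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```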
Obviously the representation implemented is made up of the connection weights. But in light of the remarks made above, we can see the weights as mapping states of the input units into the state-space for the non-input units in such a way that the input states corresponding to represented elements are mapped onto points in a *region* of the state-space. Under this interpretation, an "on" state in the output unit is just an indication that the data element has instantiated a point in the state-space which is inside the defined region. Thus, there is an interpretation of the representation in which the elements in the represented set C are defined in terms of the points inside a region of the activation space for a subset of (non-input) units; see Figure \n+F. The overall conclusion is that, in this particular case, the representation constructed by Back-Propagation forms a neighbourhood representation and that it does so in an activation space for a subset of non-input units.((Broomhead and Lowe [Broomhead & Lowe, Multivariable Functional Interpolation, 1988] have shown recently that network architectures developed by Back-Propagation can be viewed as "devices for the interpolation of data in multi-dimensional space" (p. 346) using radial basis functions. The notion that the Back-Propagation mechanism can form neighbourhood representations in unit-activation space may turn out to be an informal characterisation of their result.))
.BS
            ---------###########
           |         #         #
       on  |  <1,1>  #  <1,0>  #
           |         #         #
 unit 3    |---------#---------#
           |         #         #
      off  |  <0,0>  #  <0,1>  #
           |         #         #
            ---------###########
               off       on
                   unit 4
.ES 30
\*(Fs. The state space for units 3 and 4. Numbers in cells represent the input pairs which achieve the given state.
.BE
- NETtalk -------------------------------------------
A compelling demonstration of the power of the Back-Propagation mechanism is presented in [Sejnowski Rosenberg 1987].
This describes how Back-Propagation was used to train a network to produce correct pronunciations (or at least descriptions of correct pronunciations) for words in a large corpus of text. The network used had one layer of hidden units. Inputs were effectively descriptions of the letters in a small, moving "window" onto the corpus of text. Outputs were descriptions of the correct pronunciation for the letter in the centre of the window. The network achieved a rather impressive level of performance on this task. We can view its final state as a kind of complex, classificatory representation for a large set of classes. We see each of the n possible outputs as the name of a class of input (i.e. the class of inputs which all require a certain pronunciation), and we see the output in a specific case as classifying the input in terms of these classes. Can these classificatory representations be viewed as neighbourhood representations? The answer is not straightforward. The number of hidden units in the network means that the sort of approach taken in the case of the exclusive-or network is out of the question. However, some insight can be gained by considering the results of the hierarchical cluster analysis which the authors applied to the system. The aim of this analysis was to try to determine whether the network had constructed an internal representation of the mapping between specific letters and specific sounds. The analysis was carried out in terms of *functional vectors* each of which corresponded to a particular transformation implemented by the network [Churchland 1989]. Where did these functional vectors come from? Note that in any particular case, the network accepts an input and produces an output. In the process, all the hidden units take on a given activation. Each activation is just a real number between 0 and 1 so the set of activations in the n hidden units can conveniently be viewed in terms of a point in an n-dimensional space.
The states of the hidden part of the network produced by each of the possible inputs thus form a set of points in an n-dimensional space. Now, in each particular learning experiment, the network arrived at a different global state. But this does not imply that in each case it arrived at a different representation for the associations between letters and correct pronunciations. The representation for a particular association is just the functional vector involved in the implementation of that association. And the representation for the set of associations is the complete set of functional vectors. Application of hierarchical cluster analysis to the functional vectors for the NETtalk networks showed quite clearly that, in each case, the learning procedure produced the *same* internal representation. Moreover, it showed clearly that the representation was a "spatial" one: each letter/sound association was represented by a particular functional vector and the closeness of one vector to another symbolised the degree of similarity between the corresponding associations [Sejnowski & Rosenberg 1987]. In effect, the learning procedure produced a representation in which regions of functional-vector space symbolised particular classes of letter/sound association. A dramatic demonstration of this was the fact that in the representation, the vector space was partitioned into two distinct regions, with one region corresponding to the class of vowel pronunciations, the other corresponding to the class of consonant pronunciations [Sejnowski Rosenberg 1987]. Now, clearly, this sort of representation counts as a neighbourhood representation; but it is important to distinguish it from the type of representation constructed by the Perceptron and Pattern-Associator (data space representation) and by the Back-Propagation mechanism in the previous example (unit-activation space representation).
The NETtalk system appears to construct neighbourhood representations in functional-vector space. Thus, overall, the Back-Propagation mechanism appears to be capable of constructing neighbourhood representations in at least three different ways. This implies that the theoretical constraints on Back-Propagation may not be as severe as they are in the case of the other mechanisms. It is to a consideration of one of these constraints that we now turn.
-- A theoretical constraint ---------------------------
The arguments presented above have tried to show that a variety of PDP and symbol processing learning mechanisms can be construed as forming neighbourhood representations. All but one of the mechanisms can be shown to do this *necessarily*. The single exception - the Back-Propagation mechanism - can be shown to construct neighbourhood representations in certain cases and to be capable of doing so in more than one space. Given the wide exploitation of this particular learning strategy, it is interesting to inquire about its generality and efficacy. In particular, we would like to know whether classes can always be represented in terms of regions of points in some particular n-dimensional space. To begin with, however, we will consider a simpler question; namely, "can classes always be represented as subsets?" This generalises the original question since it is clearly the case that any class which can be represented in terms of m regions of points in some particular n-dimensional space can also be represented in terms of a particular subset of points. Let us continue to use D to label the set of data elements and C to label the class of instances we are interested in. From here on, let us also use R to label the set of all representations and say that any class C is "concrete" with respect to D if and only if C \(sb D. Now, imagine a situation in which R = D, and D = the set of pairs which describe individual playing cards: { [3, hearts] [queen, diamonds] [9, spades] ...
} If C is the class of all black cards (i.e. the set of all pairs which describe black cards), then clearly, C \(sb D. C is classified as concrete with respect to D since it can certainly be represented in terms of a subset of D. We now imagine the case where D remains the same but R is defined as the powerset of D. In this case classes may contain instances involving multiple elements of D. An example is the class of all "straights"; i.e. the set of all poker hands (5-tuples on D) which can be arranged as a continuous sequence of card values. We can define this class in terms of D but not in terms of any subset of D. We characterise this situation by saying that C is "abstract" with respect to D.
- Very abstract classes ----------------------------------------------
Now consider the class of all "close straights". An instance of this class is a set of N hands, all of which are straights with roughly equal overall values. This class can be described in terms of playing cards but clearly not in terms of any subset of playing cards. Thus it is an abstract class with respect to D. But this is not the end of the story. Imagine D' is the set of all descriptions of complete poker hands (5-tuples on D). Now, the class of all straights is abstract with respect to D but concrete with respect to D'. But the class of all close straights is abstract with respect to both. This suggests that the class of close straights is *more* abstract with respect to D than the class of all straights. A formal model of this *abstractness* property is given in [Thornton, Defining Higher-Order Classes, 1989]. For present purposes, it is sufficient to note the implication that a class which can be defined in terms of some particular set of objects or descriptions may not necessarily be representable in terms of a *subset* of those objects.
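The concrete/abstract distinction can be sketched in code. Assuming a pair-and-tuple encoding of the card descriptions (the encoding itself is an illustrative choice), a concrete class such as the black cards is literally a subset of D, whereas an abstract class such as the straights can only be captured as a predicate over 5-tuples on D:

```python
from itertools import product

# The data set D: pairs describing individual playing cards
ranks = list(range(2, 11)) + ["jack", "queen", "king", "ace"]
suits = ["hearts", "diamonds", "clubs", "spades"]
D = set(product(ranks, suits))          # 52 card descriptions

# A concrete class: the black cards *are* a subset of D
black_cards = {(r, s) for (r, s) in D if s in ("clubs", "spades")}

# An abstract class: a "straight" is a property of 5-tuples on D.
# It is definable in terms of D, but no subset of D represents it.
RANK_VALUE = {r: v for v, r in enumerate(ranks, start=2)}

def is_straight(hand):
    """True iff the five card values can be arranged as a consecutive run."""
    values = sorted(RANK_VALUE[r] for (r, _s) in hand)
    return len(set(values)) == 5 and values[-1] - values[0] == 4

hand = [(3, "hearts"), (4, "spades"), (5, "clubs"), (6, "clubs"), (7, "diamonds")]
```

Note that `black_cards` is an element of the powerset of D, while `is_straight` ranges over D-to-the-fifth; no collection of individual cards, however chosen, picks out the straights.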
Since, as we have already noted, any neighbourhood representation represents a particular class in terms of a given subset, we can infer that there will be cases in which a mechanism only capable of forming neighbourhood representations cannot learn representations for definable classes. In addition to those definable classes which cannot be represented in terms of subsets, there are classes which *can* be represented in terms of subsets but not in terms of any set of non-degenerate regions. This may occur, for example, when a class is known to exclude certain negative instances as well as including certain positive instances. If, in the representation space, any region enclosing more than one positive instance would inevitably enclose at least one negative instance, then any neighbourhood representation will be degenerate. A situation fitting this description (in which the representation space is assumed to be identical to the data space) is depicted in Figure \n+F (members and non-members of C are depicted as plusses and minuses in the usual way). In the concept learning literature, such situations are assumed to indicate the need for a disjunctive definition or, perhaps, an enhanced description language [Mitchell Utgoff Banerji 1983,Utgoff, Shift of Bias, 1986].
.BS
         -------------------------
         |       |       |       |
 red     |       |       |       |
         |       |       |       |
         |-------|-------|-------|
         |       |       |       |
 green   |       |   -   |   +   |
         |       |       |       |
         |-------|-------|-------|
         |       |       |       |
 blue    |       |   +   |       |
         |       |       |       |
         -------------------------
           brick   sphere   wedge
.BE
- How the constraint applies to the Back-Propagation mechanism --------
We noted above that any mechanism which can only form neighbourhood representations will be incapable of forming a representation for any class which is abstract with respect to the representation language. We now consider the degree to which this constraint affects the Back-Propagation mechanism. There are various issues to be taken into account here.
Firstly, as we have noted, Back-Propagation appears to be able to form neighbourhood representations in at least three different spaces: data (or input) space, unit-activation space, and functional-vector space. If S is the complete set of spaces in which Back-Propagation forms neighbourhood representations, then we might assume that its use of this learning strategy constrains it in the sense that it renders it unable to learn a representation for any class which is abstract (or undefinable) with respect to some member of S. Unfortunately, this assumption is invalid since it is easily shown that Back-Propagation *can* form representations for classes which are abstract with respect to the data language. Note that each unit in the first layer of hidden units in a feed-forward network computes some function of the activations of the input units (i.e. some function of a data element). The activation of any given hidden unit will vary depending on what the input actually is and we can therefore regard it as implementing a probabilistic representation for a certain set of inputs; namely, the inputs which lead to the unit having a high activation. The activation of a given unit in the second layer of hidden units depends on the activations of the units in the first layer. Thus, we can regard its activation as a representation for a subset of activation patterns on those units. Now, it will typically be impossible to represent the class of inputs represented by a given unit in the *second* layer of hidden units in terms of any subset of D; thus the class represented is abstract with respect to D. 
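The layering argument can be illustrated with hand-chosen threshold units (no learning is involved; the weights and thresholds below are assumptions for illustration, not the product of Back-Propagation). The second-layer unit responds to a subset of first-layer activation patterns, and the class of inputs it thereby represents (exclusive-or) is not expressible as any single region of the input space:

```python
def step(x, threshold):
    """A simple threshold unit: activation 1.0 iff the input reaches the threshold."""
    return 1.0 if x >= threshold else 0.0

def first_layer(x1, x2):
    # Two first-layer units computing functions of the data element (x1, x2):
    # roughly an OR-detector and an AND-detector
    h1 = step(x1 + x2, 0.5)
    h2 = step(x1 + x2, 1.5)
    return h1, h2

def second_layer(h1, h2):
    # A second-layer unit: high only for the first-layer pattern (h1 on, h2 off),
    # i.e. it represents a subset of first-layer activation patterns
    return step(h1 - h2, 0.5)

# The class of inputs this unit ends up representing is exclusive-or
xor_class = {(x1, x2): second_layer(*first_layer(x1, x2))
             for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

The second-layer unit "sees" only first-layer activations, yet the class of inputs it responds to is defined over D without being a neighbourhood of D, which is the sense in which deeper layers escape the data-space version of the constraint.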
The general conclusion, then, is that our original formulation of the constraint applying to Back-Propagation must be amended: if S is the complete set of spaces in which Back-Propagation forms neighbourhood representations, then its use of this learning strategy constrains it in the sense that it renders it unable to learn a representation for any class which is abstract (or undefinable) with respect to any member of S except D.
-- Summary -------------------------------------------
The paper began by suggesting that the Michalski definition of learning provides a perspective from which the behaviour of symbol processing learning mechanisms is commensurable with the behaviour of PDP learning mechanisms. It provided some evidence in favour of this view by (1) making the assumption that the word "representation" as used in the definition can be interpreted as meaning "neighbourhood representation", and by (2) describing two symbol processing mechanisms and three PDP mechanisms which can all be said to construct this sort of representation. Arguments have also been presented which show that any mechanism which can *only* construct neighbourhood representations will be, in principle, unable to represent certain classes of input (i.e. "experience"); and the degree to which this constraint applies to specific mechanisms was considered.