


AgglomClustering


Unit: SDL_math2
Class: none
Declaration: function AgglomClustering (Sender: TObject; InMat: TMatrix; DistanceMeasure: TDistMode; ClusterMethod: TClusterMethod; alpha: double; var ClustResult: TIntMatrix; var ClustDist: TVector; var DendroCoords: TVector; Feedback: TFeedbackProc; OnDistCalc : TOnCalcDistanceEvent): integer;

The function AgglomClustering performs an agglomerative hierarchical cluster analysis on the data contained in the matrix InMat. Each data object is represented by one row of the matrix; the columns hold the variables. The parameter DistanceMeasure specifies the type of distance measure. AgglomClustering may be aborted at any time by setting the global variable AbortMathProc to TRUE. In this case the function returns -1; otherwise it returns zero.

The parameter ClusterMethod specifies the type of clustering method used. If cmFlexLink is used as the clustering method, the parameter alpha must also be specified. Alpha may take any value between 0.5 and 1.0. A value of 0.5 results in an average linkage clustering (cmAvgLink). Higher values increase the divisive effects of the clustering process. Usually a value between 0.6 and 0.7 is preferred.
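For orientation, flexible linkage is commonly defined through the Lance-Williams update rule. The formula below is the textbook form and is not taken from the SDL sources; it assumes that the parameter alpha corresponds to the coefficient α of that rule. Here d(k, i ∪ j) is the distance between an existing cluster k and the cluster formed by joining i and j:

```latex
d(k,\, i \cup j) = \alpha\, d(k,i) + \alpha\, d(k,j) + (1 - 2\alpha)\, d(i,j)
```

Under this assumption, alpha = 0.5 makes the last term vanish, so the new distance is the plain mean of d(k,i) and d(k,j), which is consistent with the average linkage behavior described above. Larger values of alpha make the coefficient 1 - 2α increasingly negative.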

The result of the clustering process is returned in the parameters ClustResult, ClustDist, and DendroCoords. The integer matrix ClustResult contains the clustering information, describing which clusters (or objects) are joined to form a new cluster. It consists of InMat.NrOfRows-1 rows and three columns: the numbers of the two clusters being joined and the number of the resulting cluster. The rows are ordered by increasing cluster distance, which is stored in the parameter ClustDist.

The parameter Sender contains the object which called AgglomClustering; it is used by the callback routine specified by the parameter Feedback. For simple applications (and small data sets) these two parameters (Sender and Feedback) may be set to NIL. The parameter OnDistCalc can be used to pass an event routine to the subroutine Matrix.CalcDist, which is called internally to calculate the object distances. The OnDistCalc event is triggered only if the parameter DistanceMeasure is set to dmUserDef.

An example should clarify the situation. The results of the cluster analysis shown below have been obtained from a set of 20 observations (objects) with four variables by applying Ward's algorithm (ClusterMethod = cmWard) to it.

   ------------- ClustResult --------------        ClustDist
   number of       number of       number of
   cluster 1       cluster 2       new cluster     distance
   -----------------------------------------------------------
   2               19              21              5.0945
   1               16              22              5.3573
   3               6               23              7.2815
   9               10              24              10.2774
   8               14              25              10.6847
   12              18              26              13.0239
   4               25              27              13.5628
   24              15              28              16.0441
   5               13              29              16.5704
   7               17              30              19.2583
   23              27              31              24.1079
   11              29              32              24.2236
   26              20              33              24.6635
   22              21              34              26.9456
   31              34              35              39.2175
   32              28              36              52.7880
   36              30              37              90.4147
   35              33              38              109.4378
   37              38              39              315.1660
   -----------------------------------------------------------

The table above is to be interpreted as follows: clusters (objects) 2 and 19 are joined to form the new cluster 21; the distance between the two original clusters is 5.09. Next, clusters 1 and 16 are joined to form cluster 22 at a distance of 5.36, and so on. Note that cluster numbers less than or equal to InMat.NrOfRows designate the original objects, whereas higher numbers designate clusters built up of other objects and/or clusters. The results of a cluster analysis are normally displayed as a dendrogram:

[Figure: DENDRO.gif (dendrogram)]

To facilitate the drawing of a dendrogram, the parameter DendroCoords (a vector of 2*InMat.NrOfRows-1 elements) contains the coordinates of the lines of the corresponding dendrogram. The first InMat.NrOfRows coordinates are those of the objects; the rest refer to the clusters as numbered in the matrix ClustResult (see the example program CLUSTER for details on how to use the array DendroCoords).
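To make the cluster numbering concrete, the following sketch expands a cluster number from the example table into the original objects it contains. It is a stand-alone console program and does not call the SDL library: the constant array Merges is a hand-made copy of the three ClustResult columns of the example, and the names NObj, Merges, and ListMembers are hypothetical. It relies on the numbering seen in the example, where row k of the table creates cluster number NObj+k.

```pascal
program ExpandCluster;

{$APPTYPE CONSOLE}

const
  NObj = 20;  // number of original objects (InMat.NrOfRows in the example)

  // Hand-made copy of the ClustResult columns from the example table:
  // (cluster 1, cluster 2, new cluster)
  Merges: array[1..NObj-1, 1..3] of integer = (
    ( 2, 19, 21), ( 1, 16, 22), ( 3,  6, 23), ( 9, 10, 24),
    ( 8, 14, 25), (12, 18, 26), ( 4, 25, 27), (24, 15, 28),
    ( 5, 13, 29), ( 7, 17, 30), (23, 27, 31), (11, 29, 32),
    (26, 20, 33), (22, 21, 34), (31, 34, 35), (32, 28, 36),
    (36, 30, 37), (35, 33, 38), (37, 38, 39));

// Prints the original objects contained in cluster number c.
procedure ListMembers(c: integer);
begin
  if c <= NObj then
    Write(c, ' ')  // numbers up to NObj are original objects
  else
  begin
    // Assumption: cluster NObj+k was created in row k of the table
    ListMembers(Merges[c - NObj, 1]);
    ListMembers(Merges[c - NObj, 2]);
  end;
end;

begin
  Write('Cluster 35 contains objects: ');
  ListMembers(35);
  WriteLn;
end.
```

For cluster 35, which joins clusters 31 and 34, the recursion arrives at the objects 3, 6, 4, 8, 14, 1, 16, 2, and 19. With real results, the same walk would read the rows of the ClustResult matrix returned by AgglomClustering instead of a constant array.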

Hint: The running time of the clustering algorithm scales roughly with the square of the number of data objects (= number of rows of the matrix InMat). Thus, the function AgglomClustering becomes increasingly slow for more than approximately 1000 data objects.



Last Update: 2012-Oct-20