SOCR EduMaterials AnalysisActivities HierarchicalClustering

Revision as of 11:20, 22 July 2012 by IvoDinov

SOCR Analysis Hierarchical Clustering Analysis Activity

Overview

This SOCR Activity demonstrates how to use the SOCR Analyses package for statistical computing. In particular, it shows how to run Hierarchical Clustering and how to interpret the output.

Background

Hierarchical Clustering is a cluster-analysis technique that builds a hierarchy (a top-down ordering) of clusters. It is classified as a connectivity-based clustering algorithm because it builds models from distance connectivity (i.e., distances between different clusters). There are various ways to compute these distances. For example, in the single linkage method, the distance between two clusters is the distance between their two closest elements, whereas in the complete linkage method, it is the distance between their two farthest elements. In essence, hierarchical clustering comes down to the following steps:

  1. Assign every object/item to its own cluster: if you have X items, there are X clusters, each containing exactly one object.
  2. Calculate the distances between all the current clusters. If two clusters each contain only one object, the distance between them is the distance between those two objects. If a cluster contains more than one item, the distance between each pair of clusters is determined using one of the connectivity-based methods: single linkage, complete linkage, unweighted average, weighted average, unweighted centroid, weighted centroid, joint between-within, etc.
  3. Based on the distances calculated in step 2, find the closest pair of clusters and merge them into a single cluster, so that there is now one cluster fewer.
  4. Repeat steps 2 and 3 until all objects are merged into a single cluster of size X.

The result of the above algorithm is usually presented as a dendrogram, which is a tree structure displaying the top-down ordering of clusters under the selected method of distance calculation.
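The four steps above can be sketched in a few lines of Python. This is a minimal illustration only, not the SOCR applet's implementation: the function names `cluster_distance` and `agglomerate`, the four items and the distances in `example_dist` are all hypothetical, and only single and complete linkage are covered.

```python
from itertools import combinations


def cluster_distance(a, b, dist, linkage="single"):
    """Distance between clusters a and b (tuples of item labels).

    Single linkage uses the closest pair of elements, complete linkage
    the farthest pair.
    """
    pairwise = [dist[frozenset((x, y))] for x in a for y in b]
    return min(pairwise) if linkage == "single" else max(pairwise)


def agglomerate(items, dist, linkage="single"):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    clusters = [(item,) for item in items]  # step 1: one cluster per item
    merges = []
    while len(clusters) > 1:  # step 4: repeat until one cluster remains
        # Steps 2 and 3: compute all cluster distances, pick the closest pair.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], dist, linkage),
        )
        d = cluster_distance(a, b, dist, linkage)
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Hypothetical pairwise distances between four items (illustration only).
example_dist = {
    frozenset(("a", "b")): 1.0,
    frozenset(("a", "c")): 4.0,
    frozenset(("a", "d")): 6.0,
    frozenset(("b", "c")): 2.0,
    frozenset(("b", "d")): 5.0,
    frozenset(("c", "d")): 3.0,
}

for method in ("single", "complete"):
    print(method, agglomerate(["a", "b", "c", "d"], example_dist, method))
```

Each entry in the merge history corresponds to one junction of the dendrogram, with the merge distance giving the height of that junction. With these example distances the two linkage methods produce different trees, which is why the choice of method matters when reading the output.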

Activity Goals

The goals of this activity are to expose learners to:

  • Inputting data in the correct formats;
  • Reading results of Hierarchical Clustering;
  • Interpreting the resulting dendrogram.


SOCR Hierarchical Clustering Analysis Applet: Data Input

Go to SOCR Analyses and select Hierarchical Clustering from the drop-down list of SOCR analyses in the left panel. There are three ways to enter data in the SOCR Hierarchical Clustering applet:

  • Click on the Example button on the top of the right panel.
  • Load a text file containing correctly formatted data by clicking on the “Load” button inside the menu bar of the DENDROGRAM panel. See below for currently supported data formats.
  • Paste your own data from a spreadsheet into the SOCR Hierarchical Clustering data table.

The following data formats are supported. They apply both to files opened with the “Load” button and to data copied and pasted from a spreadsheet.

  • Matrix-like file format:

Each line in the text file contains a data matrix row. The characteristics of these files are:

    • The matrix must be symmetric, and the diagonal elements must be zeros.
    • Within each row, the elements are separated by: spaces (‘ ’), tab character, semicolon (‘;’), comma (‘,’) or vertical bar (‘|’).
    • It is possible to include the names in an additional first row or column, but not in both.
    • If present, the labels of the nodes cannot contain any of the separators listed above. Some different representations for the previous matrix could be:
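As a hedged illustration of the matrix-like rules above, the following Python sketch parses such text and checks the symmetry and zero-diagonal requirements. It is not the applet's own parser: the function name `parse_matrix`, the shape-based label detection, the default numeric labels for unlabeled input, the treatment of consecutive separators as one, and the three-node sample data are all assumptions made for this example.

```python
import re

# Any run of the supported separators: space, tab, ';', ',' or '|'.
SEPARATORS = re.compile(r"[ \t;,|]+")


def parse_matrix(text):
    """Parse matrix-like text into (labels, matrix) and validate it."""
    rows = [SEPARATORS.split(line.strip())
            for line in text.splitlines() if line.strip()]
    n_rows, n_cols = len(rows), len(rows[0])
    if n_rows == n_cols + 1:      # an extra first row holds the labels
        labels, body = rows[0], rows[1:]
    elif n_cols == n_rows + 1:    # an extra first column holds the labels
        labels = [row[0] for row in rows]
        body = [row[1:] for row in rows]
    elif n_rows == n_cols:        # no labels: number the nodes (assumption)
        labels = [str(i + 1) for i in range(n_rows)]
        body = rows
    else:
        raise ValueError("matrix is not square")
    matrix = [[float(value) for value in row] for row in body]
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix is not square")
    for i in range(n):
        if matrix[i][i] != 0:
            raise ValueError(f"diagonal element {i} is not zero")
        for j in range(i):
            if matrix[i][j] != matrix[j][i]:
                raise ValueError(f"matrix is not symmetric at ({i}, {j})")
    return labels, matrix


# Hypothetical data: labels in the first column, values separated by spaces.
sample = "a 0 2 4\nb 2 0 3\nc 4 3 0"
print(parse_matrix(sample))
```

The same hypothetical matrix could equally be written with the labels in a first row and a different separator, for instance `a;b;c` followed by the three numeric rows; the sketch accepts both layouts.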


  • List-like file format:

Each line in the text file contains three elements, which represent the labels of two nodes and the distance (or weight) between them. The characteristics of these files are:

    • The separators between the three elements may be: spaces (‘ ’), tab character, semicolon (‘;’), comma (‘,’) or vertical bar (‘|’).
    • The labels of the nodes cannot contain any of the separators listed above.
    • Distances from an element to itself (e.g. “a a 0.0”) must not be included.
    • The Hierarchical Clustering applet accepts either the presence or absence of the symmetric data elements, i.e., if the distance between nodes a and b is 2.0, it is possible to include in the list the line “a b 2.0”, or both “a b 2.0” and “b a 2.0”. If both are present, the program checks if they are equal.
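The list-like rules above can be sketched in Python as follows. This is an illustrative reading of the rules rather than the applet's code: the function name `parse_distance_list` and the sample lines are hypothetical, and rejecting a mismatched symmetric pair with an error is an assumption about what the equality check leads to.

```python
import re

# Any run of the supported separators: space, tab, ';', ',' or '|'.
SEPARATORS = re.compile(r"[ \t;,|]+")


def parse_distance_list(text):
    """Parse 'label label distance' lines into {frozenset({a, b}): distance}."""
    distances = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        a, b, value = SEPARATORS.split(line.strip())
        if a == b:
            raise ValueError(f"self-distance must not be included: {line!r}")
        key = frozenset((a, b))  # unordered pair, so 'a b' and 'b a' coincide
        d = float(value)
        if key in distances and distances[key] != d:
            raise ValueError(f"conflicting distances for {a} and {b}")
        distances[key] = d
    return distances


# Hypothetical input mixing separators and including one symmetric duplicate.
sample = "a b 2.0\nb a 2.0\na,c,4.0\nb|c|3.0"
print(parse_distance_list(sample))
```

Storing each pair as an unordered `frozenset` means that a list with only “a b 2.0” and a list with both “a b 2.0” and “b a 2.0” produce the same result.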

Examples

The following simple examples demonstrate the file formats discussed above:

  • Simple list:
  • Complete list:
  • Matrix-like with node labels in the first column, data separated by spaces: