k-means Clustering

The k-means algorithm, a straightforward and widely used clustering algorithm. Given a set of objects (records), the goal of clustering or segmentation is to divide these objects into groups or “clusters” such that objects within a group tend to be more similar to one another as compared to objects belonging to different groups. In other words, clustering algorithms place similar points in the same cluster while placing dissimilar points in different clusters.Note that,in contrast to supervised tasks such as regression or classification where there is a notion of a target value or class label, the objects that form the inputs to a clustering procedure do not come with an associated target. Therefore, clustering is often referred to as unsupervised learning. Because there is no need for labeled data, unsupervised algorithms are suitable for many applications where labeled data is difficult to obtain. Unsupervised tasks such as clustering are also often used to explore and characterize the dataset before running a supervised learning task. Since clustering makes no use of class labels, some notion of similarity must be defined based on the attributes of the objects. The definition of similarity and the method in which points are clustered differ based on the clustering algorithm being applied. Thus, different clustering algorithms are suited to different types of datasets and different purposes. The “best” clustering algorithm to use therefore depends on the application. It is not uncommon to try several different algorithms and choose depending on which is the most useful.

   1: Input: Dataset D, number clusters k

   2: Output: Set of cluster representatives C, cluster membership vector m

   3:     /* Initialize cluster representatives C */

   4:     Randomly choose k data points from D

   5: 5: Use these k points as initial set of cluster representatives C

   6:     repeat

   7:         /* Data Assignment */

   8:         Reassign points in D to closest cluster mean

   9:         Update m such that mi is cluster ID of ith point in D

  10: 10: /* Relocation of means */

  11:         Update C such that cj is mean of points in jth cluster

  12: until convergence of objective function summation(i=1 to N)(argminj||xi −cj||2 2)

C4.5 Classification

C4.5 is a suite of algorithms for classification problems in machine learning and data mining. It is targeted at supervised learning: Given an attribute-valued dataset where instances are described by collections of attributes and belong to one of a set of mutually exclusive classes, C4.5 learns a mapping from attribute values to classes that can be applied to classify new, unseen instances.C4.5, designed by J. Ross Quinlan, is so named because it is a descendant of the ID3 approach to inducing decision trees,which in turn is the third incarnation in a series of “iterative dichotomizers.” A decision tree is a series of questions systematically arranged so that each question queries an attribute and branches based on the value of the attribute. At the leaves of the tree are placed predictions of the class variable. The algorithm is given below.

   1: Input: an attribute-valued dataset D 

   2: Tree ={}

   3: if D is “pure” OR other stopping criteria met then

   4: terminate

   5: end if

   6: for all attribute a ∈ D do

   7: Compute information-theoretic criteria if we split on a

   8: end for

   9: abest = Best attribute according to above computed criteria

  10: Tree = Create a decision node that tests abest in the root

  11: Dv = Induced sub-datasets from D based on abest

  12: for all Dv do

  13: Treev = C4.5(Dv)

  14: Attach Treev to the corresponding branch of Tree

  15: end for

  16: return Tree