Building Blocks for Global Data Quality Success

The following building blocks for global data quality success are described in the December 2013 edition of MSDN Magazine:

  • Address Verification
  • Phone Verification
  • Email Verification
  • Rooftop Geocoding
  • Name Parsing and Genderizing
  • Full Identity Verification

k-means Clustering

The k-means algorithm is a straightforward and widely used clustering algorithm. Given a set of objects (records), the goal of clustering or segmentation is to divide these objects into groups, or “clusters,” such that objects within a group tend to be more similar to one another than to objects belonging to different groups. In other words, clustering algorithms place similar points in the same cluster while placing dissimilar points in different clusters.

Note that, in contrast to supervised tasks such as regression or classification, where there is a notion of a target value or class label, the objects that form the inputs to a clustering procedure do not come with an associated target. Clustering is therefore often referred to as unsupervised learning. Because there is no need for labeled data, unsupervised algorithms are suitable for many applications where labeled data is difficult to obtain. Unsupervised tasks such as clustering are also often used to explore and characterize a dataset before running a supervised learning task.

Since clustering makes no use of class labels, some notion of similarity must be defined based on the attributes of the objects. The definition of similarity and the method by which points are clustered differ from one clustering algorithm to another, so different clustering algorithms are suited to different types of datasets and different purposes. The “best” clustering algorithm therefore depends on the application, and it is not uncommon to try several different algorithms and choose whichever proves most useful.

   1: Input: Dataset D, number of clusters k
   2: Output: Set of cluster representatives C, cluster membership vector m
   3:     /* Initialize cluster representatives C */
   4:     Randomly choose k data points from D
   5:     Use these k points as the initial set of cluster representatives C
   6:     repeat
   7:         /* Data assignment */
   8:         Reassign points in D to the closest cluster mean
   9:         Update m such that m_i is the cluster ID of the ith point in D
  10:         /* Relocation of means */
  11:         Update C such that c_j is the mean of the points in the jth cluster
  12:     until convergence of the objective function sum_{i=1}^{N} min_j ||x_i − c_j||_2^2
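The pseudocode above can be sketched in plain Python. This is a minimal illustration of the algorithm, not an optimized implementation; the helper names `dist2` and `kmeans` are my own, and convergence is detected when the membership vector stops changing.

```python
import random

def dist2(p, q):
    # Squared Euclidean distance between two points (tuples of floats).
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(D, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    C = rng.sample(D, k)           # steps 4-5: k random points as initial representatives
    m = [-1] * len(D)              # cluster membership vector
    for _ in range(max_iters):
        # Data assignment: each point joins the closest cluster mean.
        new_m = [min(range(k), key=lambda j: dist2(x, C[j])) for x in D]
        if new_m == m:             # assignments stable -> objective has converged
            break
        m = new_m
        # Relocation of means: c_j becomes the mean of the points in cluster j.
        for j in range(k):
            members = [x for x, mi in zip(D, m) if mi == j]
            if members:
                C[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return C, m

# Example: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
C, m = kmeans(points, k=2)
```

On this toy dataset the first three points end up in one cluster and the last three in the other, whichever labels the random initialization happens to assign.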


We know what prediction is: in short, a prediction is a statement about something likely to occur in the future. There are plenty of fortune tellers who claim to tell us the future; their claims may or may not come true, and we tend not to believe them. In the computational world, however, a system's forecast is more likely to be believed, because the "fortune" is told by analyzing past data and records. Weather forecasting is one of the successful applications of prediction. Despite the differing nature of such applications, prediction is always made using some knowledge of past elements, possibly together with other available information. So how is this prediction done? Here is one basic version: the sequential prediction problem.

In the sequential prediction problem, the forecaster observes the elements of a sequence and guesses the next element on the basis of the previous observations. In the classical statistical theory of sequential prediction, the elements are assumed to be a realization of a stationary stochastic process, and the properties of the process are estimated from past observations. The risk of a prediction rule can then be derived from a loss function that measures the difference between the predicted value and the actual outcome.
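As a toy illustration of this setup, the sketch below (the function names are my own) predicts each next element as the running mean of the observations so far, a natural estimator when the sequence comes from a stationary process, and accumulates squared-error loss against the revealed outcomes:

```python
def predict_next(history):
    # Forecast the next element as the running mean of past observations.
    return sum(history) / len(history) if history else 0.0

def cumulative_loss(sequence):
    # Accumulate squared-error loss between each prediction and the outcome
    # that is revealed immediately afterwards.
    loss = 0.0
    history = []
    for outcome in sequence:
        prediction = predict_next(history)
        loss += (prediction - outcome) ** 2
        history.append(outcome)
    return loss
```

On a constant sequence the forecaster pays a loss only on the very first round, before it has seen any data; after that the running mean matches every outcome exactly.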

Without proper probabilistic modeling, the idea of risk cannot be defined at all, and many alternative formulations exist. In our basic model, the performance of the forecaster is measured by the loss accumulated over many predictions, where the loss is computed by some fixed loss function. To provide a better baseline, reference forecasters are introduced: they make their forecasts before the next outcome is revealed. The forecaster can make its own prediction while taking the reference forecasters' predictions as advice, and it uses their loss records to produce better forecasts in the future.
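One classical way to combine the advice of reference forecasters is the exponentially weighted average forecaster, which down-weights each expert in proportion to its accumulated loss. The sketch below is a minimal illustration under assumptions of my own choosing: squared loss, forecasts and outcomes in [0, 1], and a fixed learning rate `eta`.

```python
import math

def weighted_average_forecaster(expert_forecasts, outcomes, eta=2.0):
    """Combine reference forecasters ("experts") by exponential weighting.

    expert_forecasts: one sequence of forecasts per expert, values in [0, 1]
    outcomes: the sequence revealed one element at a time, values in [0, 1]
    """
    n = len(expert_forecasts)
    weights = [1.0] * n            # every expert starts with equal weight
    total_loss = 0.0
    for t, y in enumerate(outcomes):
        advice = [f[t] for f in expert_forecasts]
        w_sum = sum(weights)
        # The forecaster's own prediction: advice weighted by each expert's record.
        prediction = sum(w * a for w, a in zip(weights, advice)) / w_sum
        total_loss += (prediction - y) ** 2
        # After the outcome is revealed, shrink the weight of every expert
        # in proportion to the loss it just suffered.
        weights = [w * math.exp(-eta * (a - y) ** 2)
                   for w, a in zip(weights, advice)]
    return total_loss, weights
```

With one expert that always predicts the true outcome and one that is always wrong, the bad expert's weight decays exponentially, and the forecaster's cumulative loss stays close to that of the best expert.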