What is Big Data, and what does it look like?

If we want a definition of 'big data', it is data that can't be processed by conventional database systems: the data is too big, moves too fast, or doesn't fit the structures of traditional database architectures. To get information out of big data, we need to choose an approach that can actually process it. As a recent breakthrough in IT, 'big data' can feel like brand-new technology. A layperson thinks of it simply as data that is big, and we cannot say that is wrong, since it is data that requires processing before any information can be extracted from it.

We all know that data has become a valuable asset in today's world. Because the data is so massive, many cost-effective approaches have emerged to tame its volume, variability, and velocity. The question remains: what is the value of big data to an organization? The answer is simple. Its value falls into two categories: analytical use, and enabling new products.

Looking back over the past decade, the successful web startups are prime examples of big data at work. The emergence of big data in the enterprise brings with it a necessary counterpart: agility. Successfully exploiting the value in big data requires experimentation and exploration. Whether creating new products or looking for ways to gain competitive advantage, the job calls for curiosity and an entrepreneurial outlook.

As a catch-all term, "big data" can be pretty nebulous, in the same way that the term "cloud" covers diverse technologies. Input data to big data systems could come from social networks, web server logs, traffic flow sensors, satellite imagery, audio and video streams, banking transactions, MP3s of music, the contents of web pages, scans of documents, GPS trails, telemetry from automobiles, financial market data, and much more; the list goes on. So are these all the same thing? To clarify the issue, the three Vs of volume, velocity, and variety are commonly used to characterize different aspects of big data. They're a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit it. You will most likely contend with each of the Vs to one degree or another.

Building Blocks for Global Data Quality Success

Regarding data quality success, I found the following building blocks in the December 2013 issue of MSDN Magazine (a toy sketch of two of these checks follows the list):

  • Address Verification
  • Phone Verification
  • Email Verification
  • Rooftop Geocoding
  • Name Parsing and Genderizing
  • Full Identity Verification
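
As a small illustration of what the phone and email items might involve at the cheapest, purely syntactic level, here is a minimal Python sketch. The regular expressions and function names are my own illustrative assumptions, not anything from the MSDN article; the real building blocks it describes are full verification services that check data against authoritative sources (mailboxes, carriers, postal data), not just syntax.

    import re

    # Purely illustrative patterns; a real verification service validates
    # against carrier and mailbox data, not just the shape of the string.
    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    PHONE_RE = re.compile(r"^\+?[0-9()\-\s]{7,20}$")

    def looks_like_email(value):
        # Cheap first-pass syntax check before calling a verification service.
        return bool(EMAIL_RE.match(value))

    def looks_like_phone(value):
        # Cheap first-pass check; says nothing about whether the number exists.
        return bool(PHONE_RE.match(value))

    print(looks_like_email("user@example.com"))   # True
    print(looks_like_phone("+1 (425) 555-0100"))  # True

Checks like these are only a pre-filter; the point of the building blocks above is that true quality comes from verifying the data against the real world.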

k-means Clustering

The k-means algorithm is a straightforward and widely used clustering algorithm. Given a set of objects (records), the goal of clustering or segmentation is to divide these objects into groups, or "clusters", such that objects within a group tend to be more similar to one another than to objects belonging to different groups. In other words, clustering algorithms place similar points in the same cluster while placing dissimilar points in different clusters.

Note that, in contrast to supervised tasks such as regression or classification, where there is a notion of a target value or class label, the objects that form the inputs to a clustering procedure do not come with an associated target. Clustering is therefore often referred to as unsupervised learning. Because there is no need for labeled data, unsupervised algorithms are suitable for many applications where labeled data is difficult to obtain. Unsupervised tasks such as clustering are also often used to explore and characterize a dataset before running a supervised learning task.

Since clustering makes no use of class labels, some notion of similarity must be defined based on the attributes of the objects. The definition of similarity and the method by which points are clustered differ depending on the clustering algorithm being applied. Different clustering algorithms are thus suited to different types of datasets and different purposes. The "best" clustering algorithm therefore depends on the application, and it is not uncommon to try several different algorithms and choose whichever proves the most useful.

   Input:  dataset D, number of clusters k
   Output: set of cluster representatives C, cluster membership vector m

       /* Initialize cluster representatives C */
       Randomly choose k data points from D
       Use these k points as the initial set of cluster representatives C
       repeat
           /* Data assignment */
           Reassign each point in D to its closest cluster mean
           Update m such that m_i is the cluster ID of the ith point in D
           /* Relocation of means */
           Update C such that c_j is the mean of the points in the jth cluster
       until convergence of the objective function sum_{i=1..N} min_j ||x_i - c_j||_2^2
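
To make the pseudocode concrete, here is a minimal NumPy sketch of the same loop. This is an illustrative implementation, not code from the original source; the function name, the convergence tolerance, and the empty-cluster handling are my own assumptions.

    import numpy as np

    def kmeans(D, k, max_iters=100, tol=1e-6, seed=0):
        """Cluster the (N, d) array D into k groups.

        Returns (C, m): cluster representatives C of shape (k, d) and a
        membership vector m where m[i] is the cluster ID of the ith point.
        """
        rng = np.random.default_rng(seed)

        # Initialize C: randomly choose k distinct data points from D.
        C = D[rng.choice(len(D), size=k, replace=False)].astype(float)

        prev_objective = np.inf
        for _ in range(max_iters):
            # Data assignment: reassign each point to its closest cluster mean.
            distances = np.linalg.norm(D[:, None, :] - C[None, :, :], axis=2)
            m = distances.argmin(axis=1)

            # Relocation of means: c_j becomes the mean of the points in cluster j.
            for j in range(k):
                members = D[m == j]
                if len(members) > 0:  # keep the old mean if a cluster empties out
                    C[j] = members.mean(axis=0)

            # Objective: sum over all points of the squared distance to the
            # nearest mean; stop once it no longer improves meaningfully.
            objective = (distances.min(axis=1) ** 2).sum()
            if prev_objective - objective < tol:
                break
            prev_objective = objective

        return C, m

    # Example usage on toy 2-D data with two well-separated blobs:
    points = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    C, m = kmeans(points, k=2)
    print(C)  # two means, roughly (0, 0) and (5, 5)

On well-separated data like this, the loop typically converges in a handful of iterations, exactly as the repeat-until structure of the pseudocode suggests.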