## Archive for the ‘**Data Analysis**’ Category

## How busy is your grocer?

*This post is part of a series on the statistical comparison of MapReduce and Parallel Databases. See the earlier posts motivating the subject and modeling failures on clusters.*

Who knew that coming up with a statistical model that fits your problem requires a bit of daydreaming, coffee and people-watching. I got my inspiration for modeling MapReduce behavior in the face of failures from my grocer, Raj. I was trying to figure out, how much free time Raj gets in between customers and is that idle time enough to say read a few pages from a book. Read the rest of this entry »

## The decay of clusters

*This post follows an earlier post motivating a statistical comparison of the performance of MapReduce and parallel databases in the face of failures.*

I rarely paid attention in 10th grade Chem, but radioactivity was too cool to sleep through and so I still remember this: The life of a radioactive, Carbon-14, nucleus is unstable and unpredictable. Eventually, it disintegrates (into a more a stable nucleus), but always with a bang (it emits radiation). We can’t predict when an atom in a lump of Carbon-14 will decay but we can predict the collective decay rate of that lump and that in about 57 hundred years, half of the Carbon-14 atoms in the lump will disintegrate. Ten years later, I see the relevance of high-school Chem to cluster computing: A cluster of machines is not too different from a Carbon-14 lump. System admins can’t really predict which machine will fail in the next second. With experience, they can say how many machines might fail in a day; they can estimate the cluster’s decay rate. Read the rest of this entry »

## Data processing in a failing world

With petabytes of data to process, we are limited to using clusters of shared-nothing parallel machines. No single machine has the memory or processing capacity to handle such amounts of data. So we divide and conquer: we divide the processing work and data across many machines. The more data we have, the more machines we throw in: we scale horizontally instead of scaling vertically (i.e. we add more commodity machines instead of using a more powerful machine with more memory, more CPU power, more disk, etc). Database systems have done this since 1990, when the first horizontally-scalable parallel database, *Gamma*, was created. Many commercial systems followed. Database systems however never scaled past a 100 machines. They didn’t need to … Read the rest of this entry »

## I do hate sums

I do hate sums.

There is no greater mistake than to call arithmetic an exact science.

There are … hidden laws of number which it requires a mind like mine to perceive.

For instance, if you add a sum from the bottom up, and then again from the top down, the result is always different.”

— Mrs. La Touche, Letters of a Noble Woman

The abacus is perhaps the oldest computational device humans used. Its primary purpose is to aid in addition. Computational devices at our disposal evolved from the first abacus (circa 2400 BC) to warehouses of thousands of computers capable of performing complex computations, simulations and data analysis. The most important operation performed by any computational device, however, is still addition. It is hard to find an operation more central to computing than repeated addition or summation. Sums are everywhere. No mean, no variance, no model can be computed without summing up some data points. When approximating irrational numbers like π or integrals of functions, we use summations. Despite our technological advances, we haven’t created an error-free summation device. The reason for this deficiency is that computers are incapable of dealing with the continuity of real numbers. They live in a discrete world, while numbers live in a continuous one. I will describe the sources of error in a single addition, how these errors add-up with summation and how to compensate for these errors. Read the rest of this entry »