## What to expect with failures?

*This is the last post in a series comparing MapReduce to Parallel Databases using stochastic models. See earlier posts for an introduction, cluster failure modeling and MapReduce failure modeling.*

Parallel databases or MapReduce, your technology of choice for data processing on clusters will depend on performance. The technology that delivers a response to your information quests first wins. We could argue on the relative importance of performance vs. ease of use, but at the end of the day, faster is stronger. This is the zeitgeist of the 21st century. As Daft Punk sings it: “Work it harder, make it better; *Do it faster, makes us stronger*; More than ever, Hour after; Our work is never over.” So putting this debate to rest by quoting lyrics from a pop-album, lets settle which technology is faster. Read the rest of this entry »

## How busy is your grocer?

*This post is part of a series on the statistical comparison of MapReduce and Parallel Databases. See the earlier posts motivating the subject and modeling failures on clusters.*

Who knew that coming up with a statistical model that fits your problem requires a bit of daydreaming, coffee and people-watching. I got my inspiration for modeling MapReduce behavior in the face of failures from my grocer, Raj. I was trying to figure out, how much free time Raj gets in between customers and is that idle time enough to say read a few pages from a book. Read the rest of this entry »

## The decay of clusters

*This post follows an earlier post motivating a statistical comparison of the performance of MapReduce and parallel databases in the face of failures.*

I rarely paid attention in 10th grade Chem, but radioactivity was too cool to sleep through and so I still remember this: The life of a radioactive, Carbon-14, nucleus is unstable and unpredictable. Eventually, it disintegrates (into a more a stable nucleus), but always with a bang (it emits radiation). We can’t predict when an atom in a lump of Carbon-14 will decay but we can predict the collective decay rate of that lump and that in about 57 hundred years, half of the Carbon-14 atoms in the lump will disintegrate. Ten years later, I see the relevance of high-school Chem to cluster computing: A cluster of machines is not too different from a Carbon-14 lump. System admins can’t really predict which machine will fail in the next second. With experience, they can say how many machines might fail in a day; they can estimate the cluster’s decay rate. Read the rest of this entry »

## Data processing in a failing world

With petabytes of data to process, we are limited to using clusters of shared-nothing parallel machines. No single machine has the memory or processing capacity to handle such amounts of data. So we divide and conquer: we divide the processing work and data across many machines. The more data we have, the more machines we throw in: we scale horizontally instead of scaling vertically (i.e. we add more commodity machines instead of using a more powerful machine with more memory, more CPU power, more disk, etc). Database systems have done this since 1990, when the first horizontally-scalable parallel database, *Gamma*, was created. Many commercial systems followed. Database systems however never scaled past a 100 machines. They didn’t need to … Read the rest of this entry »

## Information Theory: 400 years in the making.

Mathematics is a language. We use it to describe and quantify things. Our first exposure to the language is when we learn to describe counts of things: one apple, two cats, three dogs, etc. Later in life, we use Mathematics innocuously: when we order a pizza, we order a certain diameter – 16″. Our sub-concious mathematician, visualizes the area of the pizza as π*(16″/2)². It then splits the pizza eight-ways and figures out that we probably need another large pizza to feed the guests on game night. In our day-to-day lives, this deductive language is never spoken except when it renders a result (we need another pizza) or succinctly describes an event (a 16″ pizza). We are out of practice when it comes to communicating with each other using our innate mathematical language. And so like a student that learns French grammar for a year and is dropped in Paris, a student with less than a year of college calculus finds himself incapable of communicating more than his name and awkwardly revealing how English is his first (and only) language when dropped into a graduate Maths course. Read the rest of this entry »

## I do hate sums

I do hate sums.

There is no greater mistake than to call arithmetic an exact science.

There are … hidden laws of number which it requires a mind like mine to perceive.

For instance, if you add a sum from the bottom up, and then again from the top down, the result is always different.”

— Mrs. La Touche, Letters of a Noble Woman

The abacus is perhaps the oldest computational device humans used. Its primary purpose is to aid in addition. Computational devices at our disposal evolved from the first abacus (circa 2400 BC) to warehouses of thousands of computers capable of performing complex computations, simulations and data analysis. The most important operation performed by any computational device, however, is still addition. It is hard to find an operation more central to computing than repeated addition or summation. Sums are everywhere. No mean, no variance, no model can be computed without summing up some data points. When approximating irrational numbers like π or integrals of functions, we use summations. Despite our technological advances, we haven’t created an error-free summation device. The reason for this deficiency is that computers are incapable of dealing with the continuity of real numbers. They live in a discrete world, while numbers live in a continuous one. I will describe the sources of error in a single addition, how these errors add-up with summation and how to compensate for these errors. Read the rest of this entry »