Friday, January 30, 2009

Mass of a multi-dimensional cube

Recently I've been thinking a lot about multi-dimensional arrays. Don't ask me why, I can't tell you.

It's quite common to comment on the sparsity in a multi-dimensional array. In data warehousing this can be quite helpful, or it can be the bane of your data management strategies. Keeping track of indexes and pointers can also be a royal pain.

Given a few dimensions and the expected size of each individual key space, it is easy to calculate the volume of the multi-dimensional space.
Say you have two dimensions, Friends and Cities. You have 100 Friends. You are tracking 100 Cities. The number of possible combinations of these keys is 100 x 100, or 10,000 cells in the space. But what if you wanted to estimate how many of your friends had visited each of the cities? The number of filled cells would certainly be much smaller.

So, you might estimate that most of your friends have probably visited 10 of the cities. Thus, there are only 10 x 100, or 1,000, cells filled in. But then, many of them may have visited the same cities, so you want to think about how many of the cities have been visited by any friends. Maybe most of the cities have been visited by 5 of your friends on average. Certainly there is a high likelihood that some of the cities haven't been visited by any of your friends.
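The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the variable names are made up for the example, and the figure of 10 cities per friend is the guess from the paragraph above, not measured data.

```python
from math import prod

# Sizes of each key space, taken from the Friends and Cities example.
dimension_sizes = {"friends": 100, "cities": 100}

# Volume: every possible combination of keys.
volume = prod(dimension_sizes.values())

# Assumption: each friend has visited about 10 of the cities.
cities_per_friend = 10
filled_cells = dimension_sizes["friends"] * cities_per_friend

# Density: the fraction of the space that actually holds data.
density = filled_cells / volume

# Average number of visiting friends per city implied by the guess above.
avg_friends_per_city = filled_cells / dimension_sizes["cities"]

print(f"volume={volume}, filled={filled_cells}, density={density:.0%}")
```

Under these assumptions only a tenth of the space is occupied. Note that 1,000 filled cells spread over 100 cities works out to 10 friends per city on average, so a guess of 5 friends per city would imply closer to 500 filled cells. The two estimates constrain each other.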

Now, imagine that you have 6 or more dimensions. You are actually trying to model a much more complex set of keys and their combinations where not all combinations are likely to happen. Thus, you have lots of sparsity. But what about where you do have combinations?

Well, what's been racing through my mind is a visualization of the clusters and the density of the combinations, and that leads me to the notion of calculating the "mass of the multi-dimensional cube". OK, it's really a multi-dimensional array, as a cube has three dimensions, but data warehousing seems to use the cube concept.
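The post does not pin down what "mass" means, so here is one possible reading, purely as a sketch: store the cube sparsely as a dictionary keyed by tuples, call the number of occupied cells the mass, and divide by the volume to get a density. The function names `cube_mass` and `cube_density`, the third "year" dimension, and the sample data are all hypothetical.

```python
from math import prod

def cube_mass(cells):
    """Count the occupied cells in a sparse cube stored as {key_tuple: value}."""
    return sum(1 for value in cells.values() if value is not None)

def cube_density(cells, dimension_sizes):
    """Occupied cells divided by the total number of possible cells."""
    return cube_mass(cells) / prod(dimension_sizes)

# Hypothetical 3-dimensional cube: (friend, city, year) -> number of visits.
cells = {
    ("alice", "paris", 2007): 2,
    ("alice", "tokyo", 2008): 1,
    ("bob", "paris", 2008): 1,
}

# Assumed key-space sizes: 100 friends, 100 cities, 5 years.
sizes = (100, 100, 5)

print(cube_mass(cells), cube_density(cells, sizes))
```

A different definition, such as summing the values instead of counting cells, would weight heavily visited combinations more. Which one deserves the name "mass" is left open here.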

There you have it: nothing conclusive, but I thought I would coin my phrase.

1 comment:

White Rabbit said...

A provocative thought process for sure. I found myself thinking of multidimensional data mining as having a dynamic dimensionality.

The data pool established, 100 friends in 100 cities, is, without criteria to sort it, theoretically a one-dimensional array.

Now we want to find out how many cities your friends have visited, so we establish a new filter criterion, or in other words add another dimensional coordinate.

We have a 2-dimensional array to hold all of the data and sort it by cities that have been visited by friends. We want to know which friends visited the same city, so we add yet another dimension to the array. We are now searching by 3 coordinates or criteria, creating a 3-dimensional array. This continues ad nauseam until we have successfully queried the desired data.

My theory in essence: the database has a nebulous dimensionality until it is queried, similar to how Schrödinger's cat is neither definitely alive nor definitely dead until the box is opened.