The basic ideas of the book Willful Ignorance are:
1st) Uncertainty has two dimensions: doubt and ambiguity.
Doubt is a matter that can be dealt with directly by probability and the statistical testing methods as we know them today. It is a matter of whether something belongs or not to a certain category.
Ambiguity, on the other hand, is a matter of defining the correct categories into which the data should be classified (or not).
Doubt is the territory of researchers, and Ambiguity pertains to practitioners in general. In other words: Ambiguity deals with finding the correct question, while Doubt is our measure of certainty that our answer is right. Still, we must remind ourselves that it is better to have an imperfect answer to the correct question than a perfect answer to the wrong question.
2nd) The statistical methods available today were developed in an era of Small Data. At that time (which is not so far back in history: the first half and most of the second half of the twentieth century) a lot of work was done before producing a small table with, let's say, a thousand numbers put together. The ambiguity had to be resolved by the time the table was finished, so the table was ready for statistical analysis as we know it today (hypothesis testing, for instance). In that case, all that was left was to resolve the doubt issues.
Nowadays we live in an era of Big Data, where large chunks of bits and bytes are produced every second. In this world, lots of relationships will appear among the pieces of data themselves, BUT most of them will have been produced by noise, not by real relationships. In this brave new world, validation techniques are as important as the statistical techniques we have used so far.
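To make this point concrete, here is a small sketch of my own (not from the book): if we compute pairwise correlations among columns of pure noise, a fair number of them will look "significant" by chance alone. The number of variables, the sample size, and the 0.05 threshold are arbitrary choices of mine, and SciPy is assumed to be available.

```python
# Illustration: "significant" correlations appear among noise-only variables.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
data = rng.normal(size=(200, 50))      # 50 columns of pure noise, 200 rows each

spurious, pairs = 0, 0
for i in range(50):
    for j in range(i + 1, 50):
        _, p = pearsonr(data[:, i], data[:, j])   # correlation test for one pair
        pairs += 1
        spurious += p < 0.05
print(f"{spurious} of {pairs} noise-only pairs look 'significant' at p < 0.05")
```

With roughly 1,200 pairs, we should expect around 5% of them (about 60 pairs) to pass the test even though no real relationship exists.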
3rd) All these techniques will try to provide a new kind of validation that comes from replication of the results. The first proposal is to separate the data into at least two subsets: learning and testing. On the learning subset, the statistical techniques would be applied to estimate, for instance, the regression coefficients of a model. The model would then be put to the test on a different subset (the so-called testing set), and its results would then be confirmed or not.
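A minimal sketch of this learning/testing split, using only NumPy. The data set, the 70/30 split, and the use of out-of-sample R² as the confirmation measure are illustrative assumptions of mine, not prescriptions from the book.

```python
# Learning/testing split: fit on one subset, confirm on the held-out subset.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))                      # three hypothetical predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

# Randomly separate the data into a learning subset and a testing subset
idx = rng.permutation(n)
learn, test = idx[:700], idx[700:]

# Estimate the regression coefficients on the learning subset only
X_learn = np.column_stack([np.ones(len(learn)), X[learn]])
coef, *_ = np.linalg.lstsq(X_learn, y[learn], rcond=None)

# Confirm (or not) the model on the testing subset it has never seen
X_test = np.column_stack([np.ones(len(test)), X[test]])
pred = X_test @ coef
ss_res = np.sum((y[test] - pred) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
print("out-of-sample R^2:", 1 - ss_res / ss_tot)
```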
I'd say that you could devise a meta-technique that would start by dividing the whole data set into several subsets (ten, for instance). Each of them would generate a model (a linear regression model, for instance), and its coefficients would be recorded, creating a second table. The data in this second table could then be subjected to a new hypothesis test, trying to find evidence that the coefficients differ from one subset to another. If we could not detect such a difference (with an ANOVA test, for instance), we would proceed to a second round of validation in which each model would have its predictive power tested against the other (in this case nine) subsets; a rough sketch of the idea appears below.
At this moment I can't provide a numerical rule of thumb for approving (or not) a model based on this approach, but it could be the basis for a research program on validation techniques.
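The sketch below shows the mechanical part of this meta-technique: ten subsets, one regression per subset, the table of coefficients, and the second round in which each model is scored on the other nine subsets. The data, the subset count, and the use of out-of-sample R² as the score are illustrative assumptions of mine; the ANOVA comparison of the coefficient table is not coded here, since it still requires deciding how the per-subset estimates should be treated as groups.

```python
# Meta-technique sketch: per-subset models, coefficient table, cross-subset scoring.
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 10
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=n)

def fit(Xs, ys):
    # Ordinary least squares with an intercept column
    A = np.column_stack([np.ones(len(ys)), Xs])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef

def r2(coef, Xs, ys):
    pred = np.column_stack([np.ones(len(ys)), Xs]) @ coef
    return 1 - np.sum((ys - pred) ** 2) / np.sum((ys - ys.mean()) ** 2)

# Divide the whole data set into k subsets and fit one model per subset
folds = np.array_split(rng.permutation(n), k)
coef_table = np.array([fit(X[f], y[f]) for f in folds])    # the "second table"
print("spread of each coefficient across subsets:", coef_table.std(axis=0))

# Second round: test each model's predictive power on the other nine subsets
for i, coef in enumerate(coef_table):
    scores = [r2(coef, X[f], y[f]) for j, f in enumerate(folds) if j != i]
    print(f"model {i}: mean out-of-sample R^2 = {np.mean(scores):.3f}")
```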
4th) Other approaches that were mentioned are n-fold cross-validation, the bootstrap, and the jackknife. I would add Monte Carlo simulation, although I am not in a position (right now) to provide methodological guidelines for using Monte Carlo simulation in validation designs.
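For completeness, here is a short illustration of two of these approaches, n-fold cross-validation and the bootstrap, assuming scikit-learn is available. The data set, the fold count, and the number of bootstrap resamples are illustrative assumptions of mine.

```python
# n-fold cross-validation and a simple bootstrap of the regression coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=n)

# n-fold (here 5-fold) cross-validation: average out-of-sample R^2
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("5-fold CV R^2:", cv_scores.mean())

# Bootstrap: refit on resampled data to see how stable the coefficients are
boot_coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)             # sample rows with replacement
    boot_coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
print("bootstrap std. of coefficients:", np.array(boot_coefs).std(axis=0))
```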
Good luck, so far,
Gustavo,