## Why practitioners discretize their continuous data

Yihui asked this question yesterday. My supervisor, Dr. Hau, has also criticized the routine discretization of continuous variables into groups. In my 2007 classes I came across two plausible reasons, one negative and the other at least conditionally positive.

The first is a variant of the old Golden Hammer law: if the only tool is ANOVA, every continuous predictor needs discretization. The second reason is empirical: ANOVA with discretization steals df(s), spending k-1 degrees of freedom on k groups where a straight line spends only one. Let's demo it with a diagram.
The red points are the population and the black points are the sample. Which predicts the population better: the green continuous line or the discretized blue dashes? R simulation code is given.
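A minimal sketch of such a simulation is below. Everything specific in it is an assumption made for illustration, not the original setup: the curved population relation, the population and sample sizes, the three equal-width groups, and the names `pop`, `smp`, `fit_line` and `fit_group`. The colors follow the description above.

```r
## Sketch only: the population curve, the sizes, and the number of groups
## are assumptions, not the settings of the original simulation.
set.seed(2007)
N <- 2000                                   # size of the "population"
pop <- data.frame(x = runif(N, 0, 10))
pop$y <- 5 * sin(pop$x / 2) + rnorm(N)      # assumed nonlinear relation
smp <- pop[sample(N, 40), ]                 # the observed sample

## Three equal-width groups, fixed in advance over the range of x
breaks <- c(0, 10 / 3, 20 / 3, 10)
smp$g <- cut(smp$x, breaks, include.lowest = TRUE)
pop$g <- cut(pop$x, breaks, include.lowest = TRUE)

fit_line  <- lm(y ~ x, data = smp)          # continuous predictor, 1 df
fit_group <- lm(y ~ g, data = smp)          # discretized predictor, 2 df

## Mean squared prediction error over the whole population
mse <- c(line  = mean((pop$y - predict(fit_line,  pop))^2),
         group = mean((pop$y - predict(fit_group, pop))^2))
print(mse)

## Draw the picture only in an interactive session
if (interactive()) {
  plot(pop$x, pop$y, col = "red", pch = ".", xlab = "x", ylab = "y")
  points(smp$x, smp$y, pch = 19)            # sample in black
  abline(fit_line, col = "green")           # continuous fit
  o <- order(pop$x)
  lines(pop$x[o], predict(fit_group, pop)[o], col = "blue", lty = 2)
}
```

Swapping in a straight-line population, for example `pop$y <- pop$x + rnorm(N)`, is a quick way to see how the comparison changes when the linear model is actually right.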

## Type III ANOVA in R

The Type III ANOVA SS for factor A in the presence of its interaction with factor B is defined as $SS_{A:B+A+B}-SS_{A:B+B}$, where A:B is the pure interaction effect, orthogonal to the main effects of A and B and to the intercept. Getting the pure interaction dummy IV(s) in R takes some care.
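For orientation, the three definitions used in this post for factor A can be put side by side. The $RSS$ notation is an addition here: $SS_M$ is the regression sum of squares of model $M$ and $RSS_M$ is its residual sum of squares. Because $SS_M + RSS_M$ is the same total for every model fitted to the same observations, each difference of $SS$ equals the matching difference of $RSS$ with the order reversed.

$$
\begin{aligned}
\text{Type I:}\quad & SS_{A} \\
\text{Type II:}\quad & SS_{A+B}-SS_{B} = RSS_{B}-RSS_{A+B} \\
\text{Type III:}\quad & SS_{A:B+A+B}-SS_{A:B+B} = RSS_{A:B+B}-RSS_{A:B+A+B}
\end{aligned}
$$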

The data come from the SAS PROC GLM documentation, Example 30.3: Unbalanced ANOVA for Two-Way Design with Interaction.

##
##Data from http://www.otago.ac.nz/sas/stat/chap30/sect52.htm
##
drug <- factor(rep(1:4, each = 3)); ## Factor A: 1,1,1,2,2,2,3,3,3,4,4,4
disease <- factor(rep(1:3, times = 4)); ## Factor B: 1,2,3,1,2,3,...
y <- t(matrix(c(
42 ,44 ,36 ,13 ,19 ,22
,33 ,NA ,26 ,NA ,33 ,21
,31 ,-3 ,NA ,25 ,25 ,24
,28 ,NA ,23 ,34 ,42 ,13
,NA ,34 ,33 ,31 ,NA ,36
,3 ,26 ,28 ,32 ,4 ,16
,NA ,NA ,1 ,29 ,NA ,19
,NA ,11 ,9 ,7 ,1 ,-6
,21 ,1 ,NA ,9 ,3 ,NA
,24 ,NA ,9 ,22 ,-2 ,15
,27 ,12 ,12 ,-5 ,16 ,15
,22 ,7 ,25 ,5 ,12 ,NA
),nrow=6));
## verify data with http://www.otago.ac.nz/sas/stat/chap30/sect52.htm
(cbind(drug,disease,y));
##
## stack the 12 x 6 table into long vectors, one element per observation
y <- c(y);
drug <- rep(drug,6);
disease <- rep(disease,6);
##
## Design the PURE interaction dummy variables
m <- model.matrix(lm(rep(0,length(disease)) ~ disease + drug + disease:drug));
##! If lm(y ~ ...) were used here, the rows with is.na(y) would be dropped. The residuals would then be orthogonal to the observed A and B rather than to the designed cells of A and B, and the result would be Type II SS rather than Type III SS.
is_interaction <- attr(m,"assign")==3; ## columns of the disease:drug term (the third term)
(IV_Interaction <- residuals(lm(m[,is_interaction] ~ m[,!is_interaction])));
##
## check the data by comparing the Type I & II ANOVA with http://www.otago.ac.nz/sas/stat/chap30/sect52.htm
## Type I ANOVA of A, defined by SS_A --
anova(lm(y~drug*disease));
##
## Type II ANOVA of A, defined by SS_{A+B}-SS_B --
library(car); ## provides Anova(); stops with an error if car is not installed
Anova(lm(y~drug*disease),type='II');
anova(lm(y~disease),lm(y~drug + disease))
##
##
## Type III ANOVA of A defined by SS_{A:B+A+B}-SS_{A:B+B}
t(t(c(
 anova(lm(y~IV_Interaction+disease),lm(y~disease*drug))$'Sum of Sq'[2] ## drug
,anova(lm(y~IV_Interaction+drug),lm(y~disease*drug))$'Sum of Sq'[2] ## disease
,anova(lm(y~disease+drug),lm(y~disease*drug))$'Sum of Sq'[2] ## disease:drug
)));
##
## Currently, Anova(...) in Prof. John Fox's car package (v. 1.2-8 or 1.2-9) uses "impure" interaction dummy IV(s), which makes its Type III result depend on the order of the factor levels. I think its next version will adopt the "pure" interaction dummy IV(s) to give consistent Type III SS.
##
## [update:] In Prof. John Fox's car package, when the contrasts are set in the lm object that is passed in, example(Anova) gives Type III SS consistent with other software. In this case, the code line should be --
Anova(lm(y~drug*disease, contrasts=list(drug=contr.sum, disease=contr.sum)),type='III');
## Contrast patterns are defined within lm(...) rather than Anova(...). An lm object with the default contrasts is inappropriate for calculating Type III SS, because the result will then depend on the level names of any nominal factor --
library(car);
M2 <- Moore;
M2$f1 <- M2$fcategory;
M2$f2 <- as.factor(-as.integer(M2$fcategory)); ## same groups, reversed level order
mod1 <- lm(formula = conformity ~ f1 * partner.status, data=M2);
mod2 <- lm(formula = conformity ~ f2 * partner.status, data=M2);
c(Anova(mod1,type='III')$'Sum Sq'[3], Anova(mod2,type='III')$'Sum Sq'[3])
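The pure-interaction construction can also be cross-checked without the car package. The sketch below is not from the original post: the 2 x 3 layout, the simulated numbers, and the names `grid`, `pure`, `ss_manual`, `ss_drop1` and `ss_treat` are all made up for illustration. It builds the pure interaction columns on a balanced grid, computes the Type III SS of A the same way as above, and compares it with base R's `drop1()` applied to a fit that uses sum-to-zero contrasts.

```r
## Hypothetical unbalanced 2 x 3 layout; names and numbers are illustrative.
set.seed(1)
grid <- expand.grid(A = factor(1:2), B = factor(1:3), rep = 1:4)
grid$y <- rnorm(nrow(grid), mean = as.integer(grid$A) * as.integer(grid$B))
grid$y[c(2, 5, 8, 9, 16, 24)] <- NA         # unbalance it, leaving no empty cell

## Pure interaction columns, built on the full balanced grid (y is not used)
m <- model.matrix(~ A * B, data = grid)
is_int <- attr(m, "assign") == 3            # columns of the A:B term
pure <- residuals(lm(m[, is_int] ~ m[, !is_int]))

## Type III SS of A as SS_{A:B+A+B} - SS_{A:B+B}
full <- lm(y ~ A * B, data = grid)
ss_manual <- anova(lm(y ~ pure + B, data = grid), full)$"Sum of Sq"[2]

## Same quantity from drop1() on a fit with sum-to-zero contrasts
fit_sum <- lm(y ~ A * B, data = grid,
              contrasts = list(A = "contr.sum", B = "contr.sum"))
ss_drop1 <- drop1(fit_sum, scope = ~ ., test = "F")["A", "Sum of Sq"]

## For contrast: the same call with the default treatment contrasts
ss_treat <- drop1(full, scope = ~ ., test = "F")["A", "Sum of Sq"]

print(c(manual = ss_manual, drop1_sum = ss_drop1, drop1_treatment = ss_treat))
```

With sum-to-zero contrasts the interaction columns are already orthogonal to the main effects over the balanced grid, which is why the first two numbers should agree. With treatment contrasts, dropping the A columns while keeping A:B tests the effect of A at the reference level of B only, so the third number will generally differ.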

There was a heated discussion of Type III ANOVA on the R-help mailing list. Thomas Lumley argued that the "Types" of SS do not have to make any real sense nowadays:

http://tolstoy.newcastle.edu.au/R/help/05/04/3009.html

This is one of many examples of an attempt to provide a mathematical answer to something that isn't a mathematical question.

As people have already pointed out, in any practical testing situation you have two models you want to compare. If you are working in an interactive statistical environment, or even in a modern batch-mode system, you can fit the two models and compare them. If you want to compare two other models, you can fit them and compare them.

However, in the Bad Old Days this was inconvenient (or so I'm told). If you had half a dozen tests, and one of the models was the same in each test, it was a substantial saving of time and effort to fit this model just once.

This led to a system where you specify a model and a set of tests: eg I'm going to fit y~a+b+c+d and I want to test (some of) y~a vs y~a+b, y~a+b vs y~a+b+c and so on. Or, I want to test (some of) y~a+b+c vs y~a+b+c+d, y~a+b+d vs y~a+b+c+d and so on. This gives the "Types" of sums of squares, which are ways of specifying sets of tests. You could pick the "Type" so that the total number of linear models you had to fit was minimized. As these are merely a computational optimization, they don't have to make any real sense. Unfortunately, as with many optimizations, they have gained a life of their own.

The "Type III" sums of squares are the same regardless of order, but this is a bad property, not a good one. The question you are asking when you test "for" a term X really does depend on what other terms are in the model, so order really does matter. However, since you can do anything just by specifying two models and comparing them, you don't actually need to worry about any of this.

-thomas
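Lumley's point that every test is just a comparison of two fitted models can be sketched in a few lines. The example is an addition to the post and uses R's built-in `warpbreaks` data; the names `small`, `large`, `cmp` and `seq_tab` are only illustrative.

```r
## Illustration on the built-in warpbreaks data (not part of the original post)
small <- lm(breaks ~ wool, data = warpbreaks)
large <- lm(breaks ~ wool + tension, data = warpbreaks)

## The explicit comparison of two nested models
cmp <- anova(small, large)
print(cmp)

## The sequential (Type I) table of the larger model reports the same sum of
## squares for tension, because it encodes exactly this comparison.
seq_tab <- anova(large)
ss_compare <- cmp$"Sum of Sq"[2]
ss_seq <- seq_tab["tension", "Sum Sq"]
print(c(compare = ss_compare, sequential = ss_seq))
```

Any other "Type" of test can be reproduced the same way by choosing which two models go into `anova()`.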