## Why practitioners discretize their continuous data

Yihui asked this question yesterday, and my supervisor Dr. Hau has also criticized the routine practice of discretizing continuous data by grouping. I encountered two plausible reasons in classes in 2007: one negative, the other at least conditionally positive.

The first is a variant of the old Golden Hammer law -- if the only tool you have is ANOVA, every continuous predictor needs discretization. The second reason is empirical -- ANOVA on discretized data steals degrees of freedom. Let's demonstrate it with a diagram.
The red points are the population, and the black ones are the samples. Which predicts the population better -- the continuous green line, or the discretized blue dashes? The R simulation code is given.
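Since the diagram and script are not attached here, a minimal sketch (my own toy numbers, not the original simulation) shows how discretizing a continuous predictor spends degrees of freedom that a single slope would keep:

```r
## Simulate a linear population, then fit the same data two ways.
set.seed(1)
N <- 60
x <- runif(N, 0, 10)
y <- 2 + 0.5 * x + rnorm(N, sd = 1)     # linear population plus noise

m.cont <- lm(y ~ x)                     # continuous: 1 df for the slope
m.disc <- lm(y ~ cut(x, breaks = 5))    # discretized: 4 df for group means

c(df.cont = m.cont$df.residual,         # 58 residual df
  df.disc = m.disc$df.residual)         # 55 residual df
anova(m.cont)
anova(m.disc)
```

The continuous model spends one df on the slope; cutting x into five groups spends four df on group means, with correspondingly fewer residual df.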

## DV predicted by two IVs, as a triangular pyramid

-- Diagram from Wikipedia

It is easier to imagine the relation among three spatial vectors by their angles than by their correlations. For a standardized $DV$ $Y=\left(y_{1},y_{2},\dots,y_{N}\right)^{\tau}$ and $IV$s $X_{1}=\left(x_{1,1},x_{2,1},\dots,x_{N,1}\right)^{\tau}$, $X_{2}=\left(x_{1,2},x_{2,2},\dots,x_{N,2}\right)^{\tau}$, the cosines of the three angles of the triangular pyramid determine the correlation matrix, and thus all statistics of the regressions $Y=\beta_{1}X_{1}+\beta_{2}X_{2}+\varepsilon$ and $Y=\beta_{1}X_{1}+\varepsilon$. Unexpected but easily visualized results on the impact of introducing $X_{2}$ are --

1. Both $IV$s are nearly independent of the $DV$, yet together they predict the $DV$ almost perfectly ($\angle YX_{1}=\angle YX_{2}=89^{\circ}$ and $\angle X_{1}X_{2}=177.9^{\circ}$).

2. Both $IV$s are almost perfectly correlated with the $DV$, yet together one of the regression coefficients is significantly negative ($\angle YX_{1}=1^{\circ}$, $\angle YX_{2}=0.6^{\circ}$ and $\angle X_{1}X_{2}=0.5^{\circ}$).

3. Redundancy (Cohen, Cohen, West, & Aiken, 2003) increases to full, then decreases to zero, and even turns negative ($\angle YX_{1}=60^{\circ}$, $\angle YX_{2}=45^{\circ}$, and $\angle X_{1}X_{2}$ closing from $90^{\circ}$ to $45^{\circ}$ and then to $15^{\circ}+\epsilon$).
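The first case can be checked numerically. Here is a small sketch (my own code, not from the original post) that builds the correlation matrix from the stated angles and computes the squared multiple correlation $R^{2}=r_{xy}^{\tau}R_{xx}^{-1}r_{xy}$:

```r
## Convert angles to correlations, then compute R^2 for case 1.
deg2cos <- function(a) cos(a * pi / 180)
r.y1 <- deg2cos(89)          # r(Y, X1)
r.y2 <- deg2cos(89)          # r(Y, X2)
r.12 <- deg2cos(177.9)       # r(X1, X2), nearly -1

Rxx <- matrix(c(1, r.12, r.12, 1), 2, 2)   # IV correlation block
rxy <- c(r.y1, r.y2)
drop(t(rxy) %*% solve(Rxx) %*% rxy)        # R^2 is about .9, although
                                           # each zero-order r is only .017
```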

--
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

## Anscombe’s 4 Regressions — A Trivially Updated Demo

```r
## This is a trivially updated version based on the R document "?anscombe".
require(stats); require(graphics)
anscombe

## -- now some "magic" to do the 4 regressions in a loop:
ff <- y ~ x
for (i in 1:4) {
  ff[2:3] <- lapply(paste(c("y", "x"), i, sep = ""), as.name)
  assign(paste("lm.", i, sep = ""), lmi <- lm(ff, data = anscombe))
}

## See how close they are (numerically!)
sapply(objects(pattern = "lm\\.[1-4]$"), function(n) coef(get(n)))
lapply(objects(pattern = "lm\\.[1-4]$"), function(n) coef(summary(get(n))))

## Now, do what you should have done in the first place: PLOTS
op <- par(mfrow = c(4, 3), mar = .1 + c(4, 4, 1, 1), oma = c(0, 0, 2, 0))
for (i in 1:4) {
  ff[2:3] <- lapply(paste(c("y", "x"), i, sep = ""), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange",
       cex = 1.2, xlim = c(3, 19), ylim = c(3, 13))
  abline(get(paste("lm.", i, sep = "")), col = "blue")
  plot(lm(ff, data = anscombe), which = 1, col = "red", pch = 21,
       bg = "orange", cex = 1.2, sub.caption = "", caption = "")
  plot(lm(ff, data = anscombe), which = 2, col = "red", pch = 21,
       bg = "orange", cex = 1.2, sub.caption = "", caption = "")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
par(op)
```

Anscombe, F. J. (1973). Graphs in statistical analysis. American Statistician, 27, 17–21.

## The Popperian falsifiability behind Regression Discontinuity Design (RDD)

Figure linked from http://www.socialresearchmethods.net/kb/statrd.php (Trochim, 2006, Figure 2). The red line is the fallacious treatment effect.

Causal analysis entails a counterfactual comparison between the treatment and the control conditions (Mark, 2003; Maris, 1998). To define a causal effect, two imaginary latent groups are introduced: the comparison is between the same subjects in the actual treatment group and in an imaginary control group, or vice versa. For example, student A registered her RSS online and missed the collective entertainment these days, while student B did not bother to register and took part in the entertainment. To ask whether RSS registration caused the skipped entertainment, the causal statement compares the actual A with RSS registration to an imaginary A without it, rather than the actual A to the actual B.

A full experimental design with randomization ensures that the two actual groups are identical in population before treatment. This identity covers both the pretest and the relationship between post-test and pretest, so the mean post-test of the imaginary control group can be unbiasedly estimated from, and then replaced by, that of the actually observed control group, or vice versa.

Nevertheless, RDD only assumes that the two actual groups are identical in the relationship between post-test and pretest, and that this relationship is modeled appropriately. It usually also assumes that the two groups were divided by a cutoff on the pretest, although this is not strictly necessary. In my opinion, RDD is a special instance of bi-group analysis. A typical RDD context is teaching students in accordance with their aptitude (in Chinese, 因材施教).

The critical difference between a full experimental design and RDD is that, in RDD, the identity of the pre-post relationship between the two actual groups and its model are just hypotheses to be tested by Popperian falsifiability, while the population identity between groups in a full experimental design is free of uncertainty thanks to manipulated randomization. If the relationship between pretest and post-test is curvilinear or otherwise non-linear, a linear regression analysis will report a fallacious treatment effect (Trochim, 2006, Figure 2).
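A hedged sketch of this fallacy (my own toy numbers, not Trochim's data): the true pre-post relationship is quadratic with no treatment effect at all, yet the usual linear RDD model reports a sizable "effect" at the cutoff:

```r
## Curvilinear truth, linear RDD model: a spurious treatment effect.
set.seed(1)
pre   <- runif(200, 0, 10)
post  <- 0.1 * pre^2 + rnorm(200, sd = 0.5)  # quadratic, zero true effect
treat <- as.numeric(pre >= 7)                # assignment by a pretest cutoff

summary(lm(post ~ pre + treat))              # "treat" is spuriously large
summary(lm(post ~ pre + I(pre^2) + treat))   # correct model: "treat" vanishes
```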

If we had precision comparable to classic physics experiments, the relationship between pretest and post-test could be shown with high Popperian falsifiability. The true model would then be recognized without uncertainty, and statistical hypothesis tests would be mere surplus. In fact, we typically have only .7 or .8 reliability in social science measurement, and an approximation to the true model (like RMSEA in SEM) is usually necessary. An RDD conclusion therefore relies critically on the assumption that the relationship is modeled appropriately.

There are two conventional models for comparing two groups -- the gain score (Gain) versus the residual with covariate adjustment (Cov. Adj.). Maris (1998) discussed them in depth. The difference between them in the context of Lord's paradox is well known to researchers. However, there are still many confusions, some of which Maris cleared up or tried to. He asserted that Regression Toward the Mean (RTM) and the biases of the Gain model do not imply one another, and that measurement errors need not be the reason for the biases of the Gain model. Note that Maris explicitly stated that his definition of RTM differs from some versions in the earlier literature (p. 322). If ubiquity should be a feature of RTM, Maris's definition does not fit this criterion.

Maris pointed out that a sufficient condition for the Gain model to be unbiased is that the gain scores are independent of the groups (p. 320). A stronger version is that the gain (= posttest - pretest) scores are independent of the pretests; in a figure, this amounts to a constant unit slope for each regression line. Such a relationship between posttest and pretest is more constrained than the general linear relationship assumed by Cov. Adj., just as the latter is more constrained than a curvilinear relationship. Considering the low level of Popperian falsifiability in the modeling, these constraints on the relationship will remain a source of controversy among researchers.
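A small sketch of the contrast (hypothetical data of my own): when the common pre-post slope is .6 rather than the unit slope the Gain model needs, the two models disagree even though there is no group effect at all:

```r
## Two groups differ at pretest; no treatment effect; slope is .6, not 1.
set.seed(1)
g    <- rep(0:1, each = 100)
pre  <- rnorm(200, mean = ifelse(g == 1, 1, 0))   # groups differ at pretest
post <- 0.6 * pre + rnorm(200, sd = 0.5)          # common slope .6, no g effect

coef(summary(lm(post - pre ~ g)))["g", ]  # Gain model: biased "effect" near -.4
coef(summary(lm(post ~ pre + g)))["g", ]  # Cov. Adj.: estimate near zero
```

Here the gain equals -0.4·pre plus noise, so any pretest difference between groups leaks into the Gain model's group coefficient, while covariate adjustment recovers the null effect.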

--

Maris, E. (1998). Covariance Adjustment Versus Gain Scores – Revisited. Psychological Methods, 3, 309-327.

Mark, M. M. (2003). Program evaluation. In Schinka, J. A. & Velicer, W. F. (Eds.), Handbook of psychology. Vol. 2: Research methods in psychology. (pp. 323-347). New York: Wiley.

Trochim, W. (2006). Regression-Discontinuity Analysis. Retrieved Sep. 15, 2007, from http://www.socialresearchmethods.net/kb/statrd.php

## The cognitive bias behind the RTM artifact (RTMA)

[The horizontal axis is the predictor and the vertical axis is the predicted variable; the known value of the predictor falls at the position of the blue, red, and green lines. The blue line and the red line add up to the green line; the red arrowhead is the statistically unbiased estimate of the predicted variable; the starting point of the red arrow is the instinctive, biased prediction; and the red arrow itself shows the degree of regression toward the mean. Figure taken from the slides of my October 2006 lecture at Beijing Normal University.]

Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80, 237-251.

Kahneman, D., Slovic, P. and Tversky, A. (1982). Judgment under uncertainty: heuristics and biases. New York: Cambridge University Press.

Kane, M. T. (2006). Validation. In Brennan, R. L. (Ed.), Educational measurement (4th ed., pp. 17-64). Washington, DC: American Council on Education and National Council on Measurement in Education.

Li, X., Hau, K., & Marsh, H. W. (2006, April). Comparison of strategies for value-added analyses: Problems of the Regression Toward the Mean artifact and the Matthew effect. Paper presented at the American Educational Research Association Annual Meeting, San Francisco, CA.

Maris, E. (1998). Covariance Adjustment Versus Gain Scores - Revisited. Psychological Methods, 3, 309-327.

Marsh, H. W. & Hau, K. (2002). Multilevel modeling of longitudinal growth and change: substantive effects or Regression Toward the Mean Artifacts? Multivariate Behavioral Research, 37, 245-282.

Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.

Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. New York: Henry Holt & Company.

Wainer, H., & Robinson, D. H. (2003). Shaping up the practice of null hypothesis significance testing. Educational Researcher, 32(7), 22-30.

P.S. I just found that "Kahneman" had been misspelled as "Khaneman" throughout my earlier lecture notes.