Why practitioners discretize their continuous data

Yihui asked this question yesterday. My supervisor, Dr. Hau, has also criticized routine discretization by grouping. I came across two plausible reasons in my 2007 classes, one negative and the other at least conditionally positive.

The first is a variant of the old Golden Hammer law -- if the only tool is ANOVA, every continuous predictor needs discretization. The second reason is empirical -- ANOVA with discretization steals df(s). Let's demonstrate it with a diagram.
The red points are the population and the black points are the sample. Which predicts the population better -- the green continuous line or the discretized blue dashes? R simulation code is given.
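
A minimal sketch of such a comparison follows. It is not the original simulation: the population shapes, sample size, noise level, and number of groups are all illustrative assumptions, and `compare_mse` is a hypothetical helper name. It fits a straight line and a grouped (ANOVA-style) model to a small sample and scores both against the population.

```r
## Sketch: continuous regression vs. discretized (ANOVA-style) prediction.
## All settings (n_pop, n_smp, k, noise_sd, the two curves) are assumptions.
compare_mse <- function(f, n_pop = 10000, n_smp = 60, k = 3, noise_sd = 1) {
  x_pop  <- runif(n_pop, 0, 10)
  y_pop  <- f(x_pop) + rnorm(n_pop, sd = noise_sd)
  breaks <- seq(0, 10, length.out = k + 1)   # k equal-width groups
  pop <- data.frame(x = x_pop, y = y_pop,
                    g = cut(x_pop, breaks = breaks, include.lowest = TRUE))
  smp <- pop[sample(n_pop, n_smp), ]         # the observed sample

  fit_cont <- lm(y ~ x, data = smp)          # 2 parameters
  fit_disc <- lm(y ~ g, data = smp)          # k parameters (group means)

  ## mean squared prediction error over the whole population
  c(continuous  = mean((pop$y - predict(fit_cont, newdata = pop))^2),
    discretized = mean((pop$y - predict(fit_disc, newdata = pop))^2))
}

set.seed(2007)
mse_linear <- compare_mse(function(x) 1 + 0.5 * x)       # linear population
mse_curved <- compare_mse(function(x) 0.2 * (x - 5)^2)   # U-shaped population
rbind(linear = mse_linear, curved = mse_curved)
```

Under these assumptions the straight line wins when the population is linear, while the grouped model wins when the population is U-shaped, because the extra df it spends buy flexibility the line does not have. That is the sense in which the second reason is only conditionally positive.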

DV predicted by two IVs, vs. triangular pyramid

-- Diagram from Wiki

It is easier to picture the relations among three vectors in space by their angles than by their correlations. For a standardized DV and two standardized IVs, the cosines of the three angles of the triangular pyramid determine the correlation matrix and thus all statistics of the regressions. Unexpected but imaginable results on the impact of introducing the second IV are --

1. Both IVs are nearly independent of the DV. Together they predict it almost perfectly.

2. Both IVs are almost perfectly correlated with the DV. Together, one of the regression coefficients is significantly negative.

3. Redundancy (Cohen, Cohen, West, & Aiken, 2003) increases to full and then decreases to zero and even becomes negative.
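
The first two cases can be checked numerically. The sketch below computes standardized regression coefficients and R-squared directly from three correlations, which are the cosines of the three angles. The helper name `std_reg` and the correlation values are illustrative assumptions, chosen only to form valid correlation matrices; they are not the values from the original post.

```r
## Sketch: standardized two-IV regression from correlations (= cosines).
## r_y1, r_y2: correlations of the DV with each IV; r_12: between the IVs.
std_reg <- function(r_y1, r_y2, r_12) {
  Rxx  <- matrix(c(1, r_12, r_12, 1), nrow = 2)  # IV correlation matrix
  rxy  <- c(r_y1, r_y2)
  beta <- solve(Rxx, rxy)                        # standardized coefficients
  list(beta       = beta,
       R2         = sum(beta * rxy),             # squared multiple correlation
       angles_deg = acos(c(r_y1, r_y2, r_12)) * 180 / pi)
}

## Case 1 (assumed values): IVs nearly orthogonal to the DV, nearly collinear
## with each other, yet jointly a near-perfect predictor.
case1 <- std_reg(r_y1 = 0.05, r_y2 = -0.05, r_12 = 0.995)

## Case 2 (assumed values): both IVs almost perfectly correlated with the DV,
## yet one coefficient turns negative once both are entered.
case2 <- std_reg(r_y1 = 0.99, r_y2 = 0.95, r_12 = 0.98)

case1$beta; case1$R2
case2$beta; case2$R2
```

In case 1 the DV vector lies almost in the plane of the two IV vectors while being nearly perpendicular to each of them, which is only possible when the two IVs are separated by a very small angle.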

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Anscombe’s 4 Regressions — A Trivially Updated Demo

## This is a trivially updated version based on the R document "?anscombe".
require(stats); require(graphics)

##-- now some "magic" to do the 4 regressions in a loop:
ff <- y ~ x
for (i in 1:4) {
  ## swap the response and predictor of the formula for y<i> and x<i>
  ff[2:3] <- lapply(paste(c("y", "x"), i, sep = ""), as.name)
  assign(paste("lm.", i, sep = ""), lmi <- lm(ff, data = anscombe))
}

## See how close they are (numerically!)
sapply(objects(pattern = "lm\\.[1-4]$"), function(n) coef(get(n)))
lapply(objects(pattern = "lm\\.[1-4]$"),
       function(n) coef(summary(get(n))))

## Now, do what you should have done in the first place: PLOTS
## One row per data set: scatterplot, residuals vs. fitted, normal Q-Q
op <- par(mfrow = c(4, 3), mar = .1 + c(4, 4, 1, 1), oma = c(0, 0, 2, 0))
for (i in 1:4) {
  ff[2:3] <- lapply(paste(c("y", "x"), i, sep = ""), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13))
  abline(get(paste("lm.", i, sep = "")), col = "blue")
  plot(lm(ff, data = anscombe), which = 1, col = "red", pch = 21,
       bg = "orange", cex = 1.2, sub.caption = "", caption = "")
  plot(lm(ff, data = anscombe), which = 2, col = "red", pch = 21,
       bg = "orange", cex = 1.2, sub.caption = "", caption = "")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
par(op)  # restore the previous graphics settings


## Anscombe, F. J. (1973). Graphs in statistical analysis. American Statistician, 27, 17–21.

Understanding the nominal IV

The Popperian falsifiability behind Regression Discontinuity Design (RDD)

Figure linked from http://www.socialresearchmethods.net/kb/statrd.php (Trochim, 2006, Figure 2). The red line is the fallacious treatment effect.

Causal analysis entails a counterfactual comparison between the treatment and control conditions (Mark, 2003; Maris, 1998). To define a causal effect, two imaginary latent groups are introduced, one for each actual group. The comparison is between the same subjects in the actual treatment group and in an imaginary control group, or vice versa. For example, student A registered for her RSS online and missed the group entertainment these days. Student B did not bother to register for her RSS and took part in the group entertainment. To ask whether RSS attendance caused skipping the entertainment, the causal statement compares the actual A with RSS attendance to an imaginary A without RSS attendance, rather than the actual A to the actual B.

A full experimental design with randomization ensures that the two actual groups are identical in population before treatment. The identity covers both the pretest and the relationship between post-test and pretest, so the mean post-test of the imaginary control group can be estimated without bias from, and then replaced by, that of the actually observed control group, or vice versa.

RDD, however, only assumes that the two actual groups are identical in the relationship between post-test and pretest, and that this relationship is modeled appropriately. It usually also assumes that the two groups were divided by a cutoff on the pretest, although this is not necessary. In my opinion, RDD is a special instance of two-group analysis. A typical RDD context is teaching students in accordance with their aptitude (in Chinese, 因材施教).

The critical difference between a full experimental design and RDD is that, in RDD, the identity and the model of the pre-post relationship between the two actual groups are only hypotheses to be tested by Popperian falsifiability, whereas in a full experimental design the population identity between groups is secured by manipulated randomization. If the relationship between pretest and post-test is curvilinear or otherwise non-linear, a linear regression analysis would report a fallacious treatment effect (Trochim, 2006, Figure 2).
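
The fallacious effect can be reproduced in a small simulation. The sketch below is an illustration under assumed settings, not Trochim's example: the cubic pre-post curve, the cutoff at the pretest mean, the noise level, and the sample size are all assumptions. The true treatment effect is set to zero.

```r
## Sketch: a spurious RDD "treatment effect" from a misspecified linear model.
## Assumed setup: cutoff at the pretest mean, cubic pre-post curve, zero effect.
set.seed(2006)
n    <- 2000
pre  <- rnorm(n, mean = 50, sd = 10)
u    <- pre - 50                    # pretest centered at the cutoff
trt  <- as.numeric(u < 0)           # below the cutoff -> treated
post <- 50 + 0.5 * u + 0.001 * u^3 + rnorm(n, sd = 5)   # no trt term at all

fit_linear <- lm(post ~ u + trt)                      # misspecified relationship
fit_cubic  <- lm(post ~ u + I(u^2) + I(u^3) + trt)    # matches the true curve

effect_linear <- coef(fit_linear)[["trt"]]
effect_cubic  <- coef(fit_cubic)[["trt"]]
c(linear = effect_linear, cubic = effect_cubic)
```

With these settings the linear model assigns the curvature it cannot fit to the treatment indicator, while the cubic model returns an estimate near zero. Nothing in the data alone says the cubic is the true model, which is why the conclusion rests on the modeling assumption.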

If we had precision comparable to classic physics experiments, the relationship between pretest and post-test could be established with high Popperian falsifiability. The true model would then be recognized without uncertainty, and statistical hypothesis tests would be superfluous. In reality, social science measurement typically has a reliability of only .7 or .8, and the true model usually has to be approximated (as RMSEA acknowledges in SEM). An RDD conclusion therefore relies critically on the assumption that the relationship is modeled appropriately.

There are two conventional models for comparing two groups -- the gain score (Gain) vs. the residual with covariate adjustment (Cov. Adj.). Maris (1998) discussed them in depth. The difference between them in the context of Lord's paradox is well known to researchers. However, a lot of confusion remains, some of which Maris cleared up or tried to clear up. He asserted that Regression Toward the Mean (RTM) and bias in the Gain model do not imply one another, and that measurement error need not be the reason for bias in the Gain model. Note that Maris explicitly stated that his RTM definition differs from some versions in the earlier literature (p. 322). If ubiquity should be a feature of RTM, Maris's definition does not meet this criterion.

Maris pointed out that a sufficient condition for the Gain model to be unbiased is that the gain scores are independent of the groups (p. 320). A stricter sufficient condition is that the gain scores (= post-test - pretest) are independent of the pretests. In the figure, this amounts to a constant unit slope for each regression line. Such a relationship between post-test and pretest is more constrained than the general linear relationship of Cov. Adj., just as the latter is more constrained than a curvilinear relationship. Given the low level of Popperian falsifiability in the modeling, the constraints on the relationship will be a source of controversy among researchers.
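
The unit-slope constraint can be made concrete with a sketch in the style of Lord's paradox. The data-generating values below (a 10-point difference in group means, a within-group slope of 0.5, the noise levels) are illustrative assumptions. The sketch shows that the Gain model is the Cov. Adj. model with the pretest slope fixed at 1, and that the two models can disagree on the same data.

```r
## Sketch: Gain vs. Cov. Adj. on the same simulated data (assumed settings).
set.seed(1998)
n     <- 1000                              # subjects per group
group <- rep(c(0, 1), each = n)
mu    <- 50 + 10 * group                   # groups differ in mean level
pre   <- mu + rnorm(2 * n, sd = 10)
post  <- mu + 0.5 * (pre - mu) + rnorm(2 * n, sd = 8)  # within-group slope 0.5

gain_fit   <- lm(I(post - pre) ~ group)         # Gain model
offset_fit <- lm(post ~ group + offset(pre))    # same model: slope fixed at 1
covadj_fit <- lm(post ~ pre + group)            # Cov. Adj.: slope estimated

gain_effect   <- coef(gain_fit)[["group"]]
offset_effect <- coef(offset_fit)[["group"]]
covadj_effect <- coef(covadj_fit)[["group"]]
c(gain = gain_effect, offset = offset_effect, cov_adj = covadj_effect)
```

Here the Gain estimate is near 0 and the Cov. Adj. estimate is near 5 = (1 - 0.5) x 10. Which of the two is unbiased for a causal effect cannot be read off these numbers; it depends on how subjects came to be in their groups, which is the issue Maris (1998) analyzes.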


Maris, E. (1998). Covariance Adjustment Versus Gain Scores – Revisited. Psychological Methods, 3, 309-327.

Mark, M. M. (2003). Program evaluation. In J. A. Schinka & W. F. Velicer (Eds.), Handbook of psychology: Vol. 2. Research methods in psychology (pp. 323-347). New York: Wiley.

Trochim, W. (2006). Regression-Discontinuity Analysis. Retrieved Sep. 15, 2007, from http://www.socialresearchmethods.net/kb/statrd.php




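Last year, while preparing the slides for an October lecture at Beijing Normal University, I found that my earlier understanding of the Regression Toward the Mean Artifact (RTMA) was ambiguous in many places. For example, I used to think that improving a model until it gave unbiased estimates amounted to eliminating RTMA (Li, Hau, & Marsh, 2006), whereas the commonplace is that RTM is ubiquitous. Later it dawned on me that the issue is whether a subjective interpretation of an artifact is involved. While preparing another lecture afterwards, I tried to distinguish two different kinds of "RTMA". One is the classic RTMA: intuition takes the standardized z-score of the predictor to be the standardized z-score of the predicted value of the outcome (Galton, 1886; Kahneman & Tversky, 1973). The other, which I am not sure should still be called RTMA, is this: the researcher obtains a correct description of regression toward the mean in the observed scores, but wrongly extends the result to the true scores as latent variables and attributes it to a substantial shift of the true scores toward the mean (Pedhazur & Schmelkin, 1991, p. 226; Marsh & Hau, 2002). At the time I vaguely felt that resolving the artifact only requires the observer to think it through; there is no need to modify the model so that its estimates agree with the observer's faulty intuition.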
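
The classic intuition can be checked against a simple simulation. In the sketch below, the correlation of 0.5, the sample size, and the selection threshold of 1.5 are illustrative assumptions. For standardized variables with correlation r, the regression prediction of the outcome's z-score is r times the predictor's z-score, not the predictor's z-score itself.

```r
## Sketch: regression toward the mean in standardized scores (assumed r = 0.5).
set.seed(1886)
n   <- 100000
r   <- 0.5
z_x <- rnorm(n)                                 # predictor, standardized
z_y <- r * z_x + sqrt(1 - r^2) * rnorm(n)       # outcome, variance 1, cor r

slope <- coef(lm(z_y ~ z_x))[["z_x"]]           # close to r, not to 1

top    <- z_x > 1.5                             # cases extreme on the predictor
top_zx <- mean(z_x[top])
top_zy <- mean(z_y[top])                        # about r * top_zx
c(slope = slope, mean_zx = top_zx, mean_zy = top_zy)
```

Nothing "pulls" the extreme cases back: the second variable is simply an imperfectly correlated draw, so the artifact lies in expecting `top_zy` to equal `top_zx`.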

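Last month I read a paper on when the Gain Score (Gain) model and the Covariance Adjustment Residual (Cov.Adj.) model are each inappropriate for causal analysis (Maris, 1998). I found that the second kind of "RTMA", in the setting of school value-added analysis, can be read as a problem of choosing between the Gain model and the Cov.Adj. model. The illusion of an artifact arises only because Cov.Adj. should have been used rather than Gain. This "should or should not" can also be shifted, like a logical seesaw, from the rights and wrongs of data and model to the rights and wrongs of interpreting the statistical results: the same model, the same data, and the same estimates can be given an appropriate interpretation or an inappropriate one. For example, students' gain scores after admission correlate negatively with the school's admission cutoff. If this is interpreted as the good students of high-cutoff schools regressing toward the population mean, the interpretation matches the Gain model. If it is interpreted as two students with the same entrance score changing differently in different schools, the interpretation conflicts with the Gain Score model. I picked up this subtlety of interpretation from the Validation chapter that opens the new edition of Educational Measurement (Kane, 2006). That chapter stresses that the object of validation is the interpretation, not the measurement result. But looking up the passage in Marsh & Hau (2002) that cites Lord's Paradox shows that this insight has long been a commonplace.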

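For the past two weeks I have been preparing classes and wanted to cover some of the psychology of bounded rationality within quantitative methods themselves. I read that in 1974 Suppes, P., responding to Tversky, proposed an axiomatized model of parsimonious probability in the manner of a five-point scale (Salsburg, 2001, p. 307). I then searched everywhere for the original source without success. I went through the impenetrable mathematical papers from 1974-75 on Suppes's commemorative homepage, and it was not there either. On Google Scholar I only found Wainer, H. and Robinson, D. (2003), who indirectly attribute it to the collection edited by Kahneman, Slovic, and Tversky (1982). So I borrowed that hefty volume; it cites Suppes only once, and the citation predates 1974. Perhaps Wainer and Robinson, like me, read Salsburg's anecdotal book, could not find the source, and left the attribution vague. Unexpectedly, the empirical work of Kahneman and Tversky immediately drew me in. When Kahneman won the Nobel Prize, the organization where I worked part-time assigned me to make an introductory slide deck; at the time I only skimmed the press releases and assumed it was all about risk, utility, and economics laboratories. Now that I am reading the originals carefully, I realize they were building a cognitive psychology of the use (and misuse) of statistics. Unlike an encyclopedic dinosaur such as Simon, H., Kahneman and Tversky had purely experimental-psychology training, and their impact on economics was quite unintended. I happened to read a slide deck, reposted in forcode's reading notes, arguing that statistics should not be taught the way mathematics is trained -- then how should it be taught? I think teaching it the way cognitive psychology is trained would suit it best. However, teaching materials of this kind are not only absent in Chinese; I have hardly heard of any in other languages either, and I believe there is a great deal to be done here.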

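The things I used to relish talking about, that eight-tentacled aliens might use a p-value of 1/16, and the distinction between RTM and RTMA, all become unoriginal common knowledge against the background of Kahneman and Tversky's work. In the end, the A in RTMA is not a statistical question of whether the wind or the flag is moving, but a cognitive question of the mind that is moving. Kahneman and Tversky pointed out that the instinct of prediction is stubbornly biased (1973). My present guess is that this bias can be explained along two lines. One is static adaptation to a typical environment: the bias may have a rational component with respect to the current typical environment. The other may involve the dynamics of evolution: the bias may promote the formation of such a typical environment. Seen this way, the spurious school value-added effect of the Matthew Effect, which Professor Marsh has kept urging us to study (Li, Hau, & Marsh, 2006), can even be told as a grand story: the Matthew Effect in the human environment, as another inherent process, offsets regression toward the mean, and this could explain the bias of the human predictive instinct.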

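While writing this I again consulted Pedhazur & Schmelkin's brick of a textbook (1991, p. 227) and found that it already cites Kahneman and Tversky's article and examples (1973) quite explicitly. Going back to ponder why Pedhazur and Schmelkin do not separate the two kinds of "RTMA", I now think the second kind can be restated this way: having seen RTM and being unable to accept it, one pins it on an external cause (school value-added) or an internal cause (the true scores themselves regressing). Put this way, it differs little from the first kind of RTMA, apart from the added true-score-versus-error interpretation, which is independent of RTM. Pedhazur & Schmelkin also explain RTM in terms of error, but add that RTM occurs even without error. I think the classic example of parents' and children's heights (Galton, 1886) suffices to show that the true-score-versus-error interpretation is unrelated to RTM. If innate height is interpreted as the true score, then measurement error is the relatively large acquired deviation, and RTM might be attributed wholly or largely to measurement error. If only the imprecision of the measuring instrument counts as error and the acquired height is interpreted as the true score, then the contribution of measurement error to RTM is negligible. Yet the interpretation of the true score does not affect the RTM in the data at all. Apart from the added true-score interpretation, the remaining conceptual difference between the second kind of "RTMA" and the first may be too trivial: RTMA insists on pinning an exogenous cause on RTM, while how to pin it, and whether some suspect can be cleared of the charge, is a matter of causal analysis and no longer of RTM. To eliminate, reduce, avoid, or solve RTM (Li, Hau, & Marsh, 2006) are all wrong wordings. The right wording is to interpret and accept RTM, and to interpret and dispel the cognitive bias of being unable to accept RTM (RTMA).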

Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263.

Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80, 237-251.

Kahneman, D., Slovic, P. and Tversky, A. (1982). Judgment under uncertainty: heuristics and biases. New York: Cambridge University Press.

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Washington, DC: American Council on Education and National Council on Measurement in Education.

Li, X., Hau, K. & Marsh, H. W. (2006, Apr). Comparison of strategies for value-added analyses: problems of Regression Toward the Mean artifact and Matthew effect. Paper Presented at American Educational Research Association Annual Meeting, San Francisco, CA.

Maris, E. (1998). Covariance Adjustment Versus Gain Scores - Revisited. Psychological Methods, 3, 309-327.

Marsh, H. W. & Hau, K. (2002). Multilevel modeling of longitudinal growth and change: substantive effects or Regression Toward the Mean Artifacts? Multivariate Behavioral Research, 37, 245-282.

Pedhazur, E. J. & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.

Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. New York: Henry Holt & Company.

Wainer, H. & Robinson, D. H. (2003). Shaping Up the Practice of Null Hypothesis Significance Testing. Educational Researcher, 32(7), 22-30.