The tail(s) of p value


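For any given $H_0$ vs $H_1$, the p value of any given point $x$ is $p(x) = P\left(\Lambda(X) \ge \Lambda(x) \mid H_0\right)$, where $\Lambda(x) = \dfrac{\sup_{\theta \in H_1} f(x \mid \theta)}{\sup_{\theta \in H_0} f(x \mid \theta)}$ is the likelihood ratio of $H_1$ to $H_0$ at $x$.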

-- See R. Weber's Statistics Note (Chap 6.2 & 7.1)

I made a mistaken comment on the pdf of The Null Ritual (Gigerenzer, Krauss, & Vitouch, 2004), where three types of significance level (rather than p value) are discussed. My comment was that the chapter ignored the role of $H_1$ in the definition of the p value. In almost every textbook, the two-tail p and the single-tail p are distinguished. Usually, the two-tail p is defined by an $H_1$ like $\mu \neq \mu_0$.

Here I demonstrate a three-tail p value case in R.


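# Under H0, z is chisq(df = 5) * binomial(-1 vs 1): a chi-square magnitude with a random sign
z <- (-1000:1000) * 0.02
f <- 0.5 * dchisq(abs(z), df = 5)
# height of the H0 density at |z| = 10
h <- 0.5 * dchisq(10, df = 5)
# grey bars: density above h; black bars: density at or below h (the tails)
plot(z, f, type = "h", col = c("black", "grey")[1 + (f > h)])
lines(c(-20, 20), c(h, h))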

Do you agree that the region near zero under the "V" curve (the part below the horizontal line) should be the 3rd tail? I think so, if $H_1$ includes only all the other possible distributions of the same shape.

You'll also agree that there will be two asymmetrical tails if $H_1$ includes just two asymmetrical curves while $H_0$ is the standard normal distribution.
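As a rough numerical companion to the plot, the three-tail p value can be obtained by adding up the probability of every region whose density is no higher than the density at the observed point. The sketch below assumes that the observed point is $|x| = 10$ (the value used for `h` in the plot) and that $|Z|$ follows a chi-square distribution with 5 degrees of freedom under $H_0$. The names `inner_edge`, `p_outer`, `p_inner`, and `p_three_tail` are introduced here only for illustration.

```r
# Density height at the observed point |x| = 10 (chi-square part only;
# the factor 0.5 for the random sign cancels on both sides of the comparison).
h <- dchisq(10, df = 5)

# The chi-square(5) density rises from 0 to its mode at 3, so there is a
# second point below the mode with the same height h. Find it numerically.
inner_edge <- uniroot(function(x) dchisq(x, df = 5) - h,
                      lower = 1e-8, upper = 3, tol = 1e-10)$root

# The two outer tails together: P(|Z| >= 10).
p_outer <- pchisq(10, df = 5, lower.tail = FALSE)
# The third tail around zero: P(|Z| <= inner_edge).
p_inner <- pchisq(inner_edge, df = 5)

p_three_tail <- p_outer + p_inner
print(c(inner_edge = inner_edge, p_outer = p_outer,
        p_inner = p_inner, p_three_tail = p_three_tail))
```

The third tail adds only a small amount of probability here, but it is part of the rejection region as soon as the tails are defined by density height rather than by distance from the center.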

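The geometry of the correlation coefficient: the cosine of the angle between the residual vectors left after projection onto the intercept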


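I had always carelessly assumed that the inner product of two column vectors is the correlation coefficient of the variables they represent. Today this embarrassed me in front of my students: I tried to make the correlation between a column of constants and another column vector come close to 1. As everyone knows, the covariance between a column of constants and any column vector is necessarily zero, so their correlation can never be close to 1 (strictly speaking it is undefined, 0/0).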

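My mistake was forgetting that, in the expression for the covariance, each column vector is centered before the inner product is taken: the column mean is subtracted. What is subtracted is actually a vector, equal to the column mean times the all-ones vector $\mathbf{1}$, that is, the projection onto the intercept vector, the direction of the "diagonal" axis. What remains after subtracting this projection is the residual of a regression with no explanatory variables and only an intercept term. This residual vector is perpendicular to the intercept direction, so it lies in the linear subspace (the dial plate of a sundial) perpendicular to the "diagonal" intercept vector (the gnomon of the sundial). The covariance is in fact the inner product of two such residual vectors (divided by the degrees of freedom), and the correlation coefficient is the cosine of the angle between the two residual vectors.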
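The geometry can be checked numerically. The following is a minimal sketch on simulated data; the helper names `residual_from_intercept` and `cosine` and the data themselves are made up for illustration.

```r
set.seed(1)
n <- 20
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
ones <- rep(1, n)                      # the intercept ("diagonal") vector

# Residual after projecting v onto the intercept vector.
residual_from_intercept <- function(v) v - ones * sum(v * ones) / sum(ones * ones)
# Cosine of the angle between two vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a * a) * sum(b * b))

rx <- residual_from_intercept(x)
ry <- residual_from_intercept(y)

cos_centered <- cosine(rx, ry)         # agrees with cor(x, y)
cos_raw <- cosine(x + 10, y + 10)      # uncentered vectors: near 1, not the correlation
print(c(cos_centered = cos_centered, cor = cor(x, y), cos_raw = cos_raw))

# The same residuals come from an intercept-only regression.
print(all.equal(rx, unname(resid(lm(x ~ 1)))))
```

The uncentered cosine `cos_raw` shows the original slip: two vectors that both lie close to the intercept direction have a small angle between them regardless of how the variables are correlated.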

The Popperian falsifiability behind Regression Discontinuity Design (RDD)


Figure linked from http://www.socialresearchmethods.net/kb/statrd.php (Trochim, W., 2006, Figure 2). The red line is the fallacious treatment effect.

Causal analysis entails a counterfactual comparison between the treatment and the control conditions (Mark, 2003; Maris, 1998). To define a causal effect, two imaginary latent groups are introduced, one for each condition. The comparison is between identical subjects in the actual treatment group and in an imaginary control group, or vice versa. For example, student A registered her RSS online and missed the collective entertainment these days. Student B did not bother to register her RSS and took part in the collective entertainment. To ask whether RSS attendance caused the skipping of the entertainment, the causal statement compares the actual A with RSS attendance to an imaginary A without RSS attendance, rather than the actual A to the actual B.

A full experimental design with randomization ensures that the two actual groups are identical in population before treatment. The identity covers both the pretest and the relationship between post-test and pretest, so the mean post-test of the imaginary control group can be estimated without bias from, and then replaced by, that of the actually observed control group, or vice versa.

Nevertheless, RDD assumes only that the two actual groups are identical in the relationship between post-test and pretest, and that this relationship is modeled appropriately. It usually also assumes that the two groups are divided by a cutoff on the pretest, although this is not necessary. In my opinion, RDD is a special instance of two-group analysis. A typical RDD context is teaching students in accordance with their aptitude (in Chinese, 因材施教).

The critical difference between a full experimental design and RDD is that, in RDD, the identity and the model of the pre-post relationship across the two actual groups are merely hypotheses to be tested by Popperian falsifiability, whereas in a full experimental design the population identity between groups is free of that uncertainty thanks to manipulated randomization. If the relationship between pretest and post-test is curvilinear or otherwise non-linear, a linear regression analysis would report a fallacious treatment effect (Trochim, 2006, Figure 2).
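A small simulation can illustrate the point about misspecification. This is only a sketch under stated assumptions, not Trochim's own example: the cubic pre-post curve, the cutoff at 0, the rule that scores below the cutoff receive the program, and all variable names are made up for illustration, and the simulated treatment effect is exactly zero.

```r
set.seed(42)
n <- 2000
pre <- rnorm(n)
treated <- as.numeric(pre < 0)   # assumed rule: scores below the cutoff 0 get the program

# Assumed truth: a curvilinear pre-post relationship and NO treatment effect.
post <- pre + 0.3 * pre^3 + rnorm(n, sd = 0.5)

fit_linear <- lm(post ~ pre + treated)             # misspecified: straight line
fit_cubic  <- lm(post ~ pre + I(pre^3) + treated)  # matches the simulated curve

effect_linear <- unname(coef(fit_linear)["treated"])
effect_cubic  <- unname(coef(fit_cubic)["treated"])
print(c(effect_linear = effect_linear, effect_cubic = effect_cubic))
```

Under these assumptions the straight-line model attributes part of the curvature to the treatment indicator, while the model that matches the curve estimates an effect near zero. With real data the true curve is unknown, which is exactly why the modeling assumption carries so much weight.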

If we had precision comparable to classical physics experiments, the relationship between pretest and post-test would be established with high Popperian falsifiability. The true model would then be recognized without uncertainty, and statistical hypothesis tests would be superfluous. In reality, our social science measurements typically have a reliability of only .7 or .8, and an approximation to the true model (like RMSEA in SEM) is usually necessary. An RDD conclusion therefore relies critically on the assumption that the relationship is modeled appropriately.

There are two conventional models for comparing two groups: gain scores (Gain) and residuals with covariate adjustment (Cov. Adj.). Maris (1998) discussed them in depth. The difference between them in the context of Lord's paradox is well known to researchers. However, a lot of confusion remains, some of which Maris cleared up or tried to clear up. He asserted that Regression Toward the Mean (RTM) and the bias of the Gain model do not imply one another, and that measurement error need not be the reason for the bias of the Gain model. Note that Maris explicitly stated that his definition of RTM differs from some versions in the earlier literature (p. 322). If ubiquity is supposed to be a feature of RTM, Maris's definition does not meet this criterion.

Maris pointed out that a sufficient condition for the Gain model to be unbiased is that the gain scores are independent of the groups (p. 320). A stronger version of this condition is that the gain scores (= posttest - pretest) are independent of the pretests. Graphically, this amounts to a constant unit slope for each group's regression line. Such a relationship between posttest and pretest is more constrained than the general linear relationship of Cov. Adj., just as the latter is more constrained than a curvilinear relationship. Given the low level of Popperian falsifiability in the modeling, these constraints on the relationship will be a source of controversy among researchers.
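The unit-slope condition can be illustrated with a simulation. This is a sketch under assumptions chosen for illustration, not Maris's own setup: two groups differ by one unit on the pretest, the posttest depends on the pretest with a common slope, and there is no group effect at all. The function `simulate_estimates` and its arguments are hypothetical names.

```r
set.seed(7)

# Returns the estimated group effect under the Gain model and under Cov. Adj.
simulate_estimates <- function(slope, n = 5000) {
  group <- rep(c(0, 1), each = n)
  pre <- rnorm(2 * n, mean = group)              # groups differ by 1 on the pretest
  post <- slope * pre + rnorm(2 * n, sd = 0.5)   # common slope, no group effect
  c(gain    = unname(coef(lm(I(post - pre) ~ group))["group"]),
    cov_adj = unname(coef(lm(post ~ pre + group))["group"]))
}

est_half_slope <- simulate_estimates(slope = 0.5)
est_unit_slope <- simulate_estimates(slope = 1)
print(rbind(half_slope = est_half_slope, unit_slope = est_unit_slope))
```

With a slope of 0.5 the Gain model reports a group difference of about (0.5 - 1) times the pretest gap even though no effect was simulated, while Cov. Adj. stays near zero. With a unit slope the two models agree. Which model is "right" in real data depends on a relationship that can only be assumed.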

--

Maris, E. (1998). Covariance Adjustment Versus Gain Scores – Revisited. Psychological Methods, 3, 309-327.

Mark, M. M. (2003). Program evaluation. In Schinka, J. A. & Velicer, W. F. (Eds.), Handbook of psychology. Vol. 2: Research methods in psychology. (pp. 323-347). New York: Wiley.

Trochim, W. (2006). Regression-Discontinuity Analysis. Retrieved Sep. 15, 2007, from http://www.socialresearchmethods.net/kb/statrd.php

A pleasant surprise: wordpress.com supports LaTeX by default

I had originally planned to move everything to yo2.cn http://lixiaoxu.lxxm.com, but now there is no need :)

A quick test of how it looks

--

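One student reported that on my lixiaoxu.wordpress.com the formula images take a long time to appear. Another student, who is not in Shenzhen, cannot open the site at all. Judging from this, students who use wordpress.com have probably had to struggle just to post their notes, and uploading any file must be even harder. I am not in mainland China at the moment and have not experienced such painfully slow connections to overseas sites, so my recommendation to everyone was inappropriate. I am very sorry.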

To make access easier from within mainland China, my study notes have been moved to lixiaoxu.yo2.cn.

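On yo2.cn, displaying formulas requires enabling the installed plugin in the admin panel. You can see the result after I enabled it; I write the formula first and then copy it over.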

lxxm.com is based on the WordPress MU platform, where the plugins enabled by default can be customized. This WordPress MU plugin is based on John Forkosh's mimetex cgi.

Reply: on footnote 2 of the spurious "law of small numbers"

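I am posting this as a main post because the Baidu blog reported that the reply was too long and refused to publish it. For the original post, see the deadwind study notes blog.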

---

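I have only skimmed this paper (Tversky & Kahneman, 1971), but their review in Science also touches on this kind of problem (Tversky & Kahneman, 1974). They regard the representativeness bias as instinctive. Comparing it with other psychological research on bounded rationality, my guess is that the representativeness bias is determined by the way humans actually think, whereas statistical estimation rests on the idealized assumption of unbounded rationality.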

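I feel that reading the literature provokes two broad kinds of thinking. One kind concerns technical questions: what experimental design the paper uses to support what idea, and how the authors came to have this idea when others did not. Such questions focus on the paper's line of reasoning and research craft; what one learns is fairly solid and easily leads to scholarly consensus. The other kind focuses on the object of study and the topic itself, together with all the related background. This kind of thinking makes reading exciting and motivating and easily sparks discussion, but what one learns is less solid and less direct. My guess above belongs to the second kind. Put abstractly, the first kind of question is epistemological and the second is cosmological. On epistemological questions, persuasion is possible: persuading listeners with the listeners' own logic and standpoint. Cosmological questions turn all too easily into preaching: persuading listeners with the speaker's logic and standpoint.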

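Back to the point, here is an explanation of the footnote (footnote 2 in the original pdf). With a sample of 20, the z value is 2.23, which is significant in a two-tailed z test at the .05 Type I error level. If a new sample of 10 is then drawn, the questionnaire asks researchers to give a subjective estimate of the probability that this sample of ten is significant in a one-tailed z test at the .05 Type I error level.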

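Because it is a z test, the population standard deviation is known. Because the problem does not depend on the measurement scale, we can rescale so that $\sigma = 1$. Only the mean $\mu$ is tested. From the frequentist point of view, without a given $\mu$ one cannot know the (frequentist) probability that the test is significant (Gigerenzer, Krauss, & Vitouch, 2004). But the researchers must answer with a subjective probability. One kind of researcher will equate this subjective probability with the frequentist probability under some neutral scenario, and the neutral scenario they choose is that the true value of $\mu$ is exactly the unbiased estimate of $\mu$ from the first sample of 20.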

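Computing with Excel, the absolute value of the first unbiased estimate is 2.23/SQRT(20). Our problem does not depend on the sign of $\mu$, so we may take the first estimate to be positive. The one-tailed .05 z value =NORMINV(1-0.05,0,1). The one-tailed rejection region for the sample of ten is: mean of the ten observations / its true standard deviation > NORMINV(0.95,0,1). The mean of the ten observations is a statistic; the true variance of its distribution is $\sigma^2/10 = 1/10$, its true standard deviation is therefore 1/SQRT(10), and the true mean of its distribution is $\mu$ = 2.23/SQRT(20).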

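P(mean of the ten observations > (1/SQRT(10))*NORMINV(0.95,0,1) | true $\mu$ = 2.23/SQRT(20), true standard deviation of the sampling distribution of the mean = 1/SQRT(10)), computed in Excel as =1-NORMDIST(NORMINV(1-0.05,0,1)/SQRT(10),2.23/SQRT(20),1/SQRT(10),TRUE), which is about .473.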
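From this footnote's example one can appreciate how much so-called power analysis depends on knowledge of the true distribution, whereas in the standard frequentist framework the true distribution is never known, nor even the probability that the true distribution falls in some particular range. The chapter by Gigerenzer, Krauss, & Vitouch (2004) deserves a close reading; I plan to list it as required reading for the second class (of sixteen in total).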
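The Excel computation can be reproduced in R, where `qnorm` and `pnorm` play the roles of NORMINV and NORMDIST. This is a minimal sketch; the variable names and the extra grid of assumed means (`mu_grid`) are made up for illustration and are not part of the original footnote.

```r
sigma <- 1                       # scale chosen so that the known SD is 1
mu_assumed <- 2.23 / sqrt(20)    # first-sample estimate taken as the true mean (the "neutral" assumption)
n_new <- 10
se_new <- sigma / sqrt(n_new)    # true SD of the mean of the ten new observations
critical_mean <- qnorm(0.95) * se_new   # one-tailed .05 rejection threshold for that mean

# Probability that the new sample is significant, given the assumed true mean.
power_new <- pnorm(critical_mean, mean = mu_assumed, sd = se_new, lower.tail = FALSE)
print(power_new)

# The answer moves with the assumed true mean, which the frequentist framework never supplies.
mu_grid <- c(0, 0.25, mu_assumed, 0.75, 1)
power_grid <- pnorm(critical_mean, mean = mu_grid, sd = se_new, lower.tail = FALSE)
print(round(cbind(mu = mu_grid, power = power_grid), 3))
```

At an assumed mean of 0 the probability is just the Type I error rate of .05, and it rises as the assumed mean grows. The single number from the footnote is therefore one point on a curve, picked out by the choice of "neutral" true value.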

--
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were afraid to ask. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 391–408). Thousand Oaks, CA: Sage.

Tversky, A. & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.

Tversky, A. & Kahneman, D. (1974). Judgment under Uncertainty: Heuristics and Biases. Science, 185, 1124-1131.

The wisdom of "not arguing"

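Yesterday, while preparing a class, I read the chapter on Bayesian statistics by Neapolitan & Morris (2004) and came across the sentence "...used (physical probability) as if they exist but without philosophical commitment...". I suddenly realized that my lecture notes contained a great deal of general (or so-called philosophical) background on statistics and probability, yet I had forgotten to stress that the wisdom of probabilists and statisticians lies precisely in sidestepping philosophical disputes and concentrating on innovations built on deep technical consensus.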

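"Not arguing" is not only academic wisdom but also political wisdom. The picture below shows the author of the saying "Not arguing is one of my inventions" --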

The original article related to the picture, "Why Deng Xiaoping advocated not arguing" (《邓小平为什么提倡不争论》), can be found at

--
Neapolitan, R. E., & Morris, S. (2004). Probabilistic modeling with Bayesian networks. In D. Kaplan (Ed.), The Sage Handbook of Quantitative Methodology for the Social Sciences (pp. 371-390). Thousand Oaks, CA: Sage.

The cognitive bias behind RTMA

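[The horizontal axis is the predictor and the vertical axis is the predicted variable; the predictor is known to be cut at the positions of the blue, red, and green lines. The blue line plus the red line equals the green line. The head of the red arrow is the statistically unbiased estimate of the predicted variable; the tail of the red arrow is the instinctive, biased prediction, and the red arrow shows the degree of regression toward the mean. Figure taken from the PPT of the 2006/10 lecture at Beijing Normal University.]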

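Last year, while preparing the PPT for the October lecture at Beijing Normal University, I found that my earlier understanding of the Regression Toward the Mean Artifact (RTMA) was vague in many places. For example, I used to think that improving the model so that it gave unbiased estimates amounted to eliminating RTMA (Li, Hau, & Marsh, 2006), whereas the commonplace is that RTM is everywhere. Then it dawned on me that the crux is whether there is a subjective reading of an Artifact. While preparing a later lecture, I tried to distinguish two different kinds of "RTMA". One is the classic RTMA: subjective intuition takes the standardized z value of the predictor to be the standardized z value of the estimate of the predicted variable (Galton, 1886; Kahneman & Tversky, 1973). The other, which I am not sure should still be called RTMA, is this: the researcher obtains a correct description of regression toward the mean in the observed values, but wrongly extends this result to the true values as latent variables, believing that the cause is a substantial shift of the true values toward the mean (Pedhazur & Schmelkin, 1991, p. 226; Marsh & Hau, 2002). At the time I vaguely felt that to resolve the Artifact it is enough to set the observer's thinking straight; there is no need to modify the model specifically so that its estimates agree with the observer's faulty thinking.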
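The classic RTMA can be made concrete with a simulation. In this sketch the correlation `r = 0.6`, the threshold of 1.5 for "high" predictor scores, and all variable names are assumptions chosen for illustration. The intuitive prediction copies the predictor's z value, while the unbiased prediction shrinks it by the correlation.

```r
set.seed(2006)
n <- 10000
r <- 0.6                                   # assumed correlation, chosen for illustration
x <- rnorm(n)
y <- r * x + sqrt(1 - r^2) * rnorm(n)      # outcome with population correlation r to x

zx <- as.numeric(scale(x))                 # standardized predictor
zy <- as.numeric(scale(y))                 # standardized outcome
slope <- unname(coef(lm(zy ~ zx))["zx"])   # equals the sample correlation

high <- zx > 1.5                           # cases far above the mean on the predictor
intuitive_prediction <- mean(zx[high])     # "instinct": predicted z equals predictor z
unbiased_prediction  <- slope * mean(zx[high])
observed_outcome     <- mean(zy[high])
print(c(slope = slope, intuitive = intuitive_prediction,
        unbiased = unbiased_prediction, observed = observed_outcome))
```

The observed outcomes of the high group land near the shrunken prediction, not near the intuitive one. Nothing in the data is an artifact; the gap between the two predictions exists only in the observer's expectation.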

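Last month I read a paper discussing when the Gain Score (Gain) model and the Covariance Adjustment Residual (Cov. Adj.) model are each unsuitable for causal analysis (Maris, 1998), and found that the second kind of "RTMA" can, in the setting of school value-added analysis, be read as a problem of choosing between the Gain model and the Cov. Adj. model. The illusion of an Artifact arises only because Cov. Adj. should have been used rather than Gain. And this question of should or should not can also shift, like a logical seesaw, from the rights and wrongs of data and model to the rights and wrongs of the interpretation of the statistical results: the same model, the same data, and the same estimates may be put to an appropriate interpretation or an inappropriate one. For example, suppose students' gain scores after admission are negatively correlated with the schools' admission cutoffs. Interpreting this as the scores of good students in schools with high cutoffs regressing toward the population mean matches the Gain model; interpreting it as two students with the same entrance score changing differently in different schools conflicts with the Gain model. I picked up this subtlety about interpretation from the Validation chapter that opens the new edition of the Educational Measurement handbook (Kane, 2006). That chapter stresses that the object of validation is the interpretation, not the measurement result. But checking the part of Marsh & Hau (2002) that cites Lord's paradox shows that this insight has in fact long been commonplace.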

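For the past two weeks I have been preparing classes and wanted to talk about the psychology of bounded rationality within quantitative methods themselves. I read that in 1974 P. Suppes, responding to Tversky, proposed an axiomatized model of parsimonious probability in the form of a five-point scale (Salsburg, 2001, p. 307), and then searched everywhere for the original paper without success. I even went through the impenetrable mathematical papers of 1974-75 on the Suppes commemorative homepage, but it was not there either. On Google Scholar I found only H. Wainer and D. Robinson (2003), who say indirectly that it comes from the collection by Kahneman, Slovic, and Tversky (1982). So I borrowed this hefty collection; it cites Suppes in only one place, and that citation predates 1974. Perhaps Wainer and Robinson, like me, read Salsburg's anecdotal book, could not find the paper, and glossed over it. Unexpectedly, the empirical work of Kahneman and Tversky immediately captivated me. When Kahneman won the Nobel Prize, the organization where I worked part time assigned me to prepare a PPT introduction; back then I only skimmed the press releases and took it all to be about risk, utility, and economics laboratories. Now that I have read the original works carefully, I realize they were building a cognitive psychology of the use (and misuse) of statistics. Unlike an encyclopedic dinosaur such as H. Simon, Kahneman and Tversky were trained as pure experimental psychologists, and the response in economics was quite unintended. I happened to read a PPT, reposted in forcode's reading notes, arguing that statistics should not be taught the way mathematics is taught. Then how should it be taught? I think teaching it the way cognitive psychology is taught would suit it best. Teaching material of this kind, however, is not only absent in Chinese; I have hardly heard of any in other languages either, and I humbly think there is much to be done here.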

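As for the things I used to talk about with such relish, that the p value used by eight-tentacled aliens might be 1/16, and the distinction between RTM and RTMA, against the background of Kahneman and Tversky's work they all become unoriginal common sense. In the end, the A in RTMA is not a statistical question of whether the wind or the banner is moving, but a cognitive question of the observer's mind moving. Kahneman and Tversky pointed out that the instinct for prediction is biased (1973). My current conjecture is that this bias may be explained in two ways. One is static adaptation to the typical environment: the bias may have a rational component with respect to the current typical environment. The other may involve the dynamics of evolution: the bias may promote the formation of this typical environment. Seen this way, the spurious school value-added Matthew Effect that Professor Marsh has long urged us to study (Li, Hau, & Marsh, 2006) could even be told as a grand story: the Matthew Effect in the human environment, as another intrinsic process, offsets regression toward the mean, which can explain the bias in the human predictive instinct.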

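While writing this I again consulted Pedhazur & Schmelkin's brick of a textbook (1991, p. 227) and found that it already cites Kahneman and Tversky's paper and examples quite explicitly (1973). Thinking again about why Pedhazur and Schmelkin do not separate the two kinds of "RTMA", I now think the second kind can be restated as follows: one sees RTM, cannot accept it, and so pins it on an external cause (school value added) or an internal cause (the true values themselves regressing). Put this way, it differs very little from the first kind of RTMA, apart from the additional true value vs. error reading, which is independent of RTM. Pedhazur & Schmelkin also explain RTM in terms of error, but add that RTM occurs even without error. I think the classic example of parents' and offspring's heights (Galton, 1886) suffices to show that the true value vs. error reading has nothing to do with RTM. If innate height is read as the true value, the measurement error is the fairly large deviation acquired later, and RTM can perhaps be attributed wholly or largely to measurement error. If only the imprecision of the measuring instrument is treated as error and acquired height is read as the true value, the contribution of measurement error to RTM is negligible. Yet the reading of the true value does not affect the RTM in the data at all. Apart from this extra reading of the true value, the remaining conceptual difference between the second kind of "RTMA" and the first is probably too trivial: RTMA insists on pinning RTM on an exogenous cause, while how to pin it, and whether some suspect can be spared, is a matter of causal analysis and no longer of RTM. "Eliminating", "reducing", "avoiding", or "solving" RTM (Li, Hau, & Marsh, 2006) are all wrong wordings. The right wording is to interpret and accept RTM, and to interpret and dispel the cognitive bias of being unable to accept RTM (RTMA).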


Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263.

Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80, 237-251.

Kahneman, D., Slovic, P. and Tversky, A. (1982). Judgment under uncertainty: heuristics and biases. New York: Cambridge University Press.

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Washington, DC: American Council on Education and National Council on Measurement in Education.

Li, X., Hau, K. & Marsh, H. W. (2006, Apr). Comparison of strategies for value-added analyses: problems of Regression Toward the Mean artifact and Matthew effect. Paper Presented at American Educational Research Association Annual Meeting, San Francisco, CA.

Maris, E. (1998). Covariance Adjustment Versus Gain Scores - Revisited. Psychological Methods, 3, 309-327.

Marsh, H. W. & Hau, K. (2002). Multilevel modeling of longitudinal growth and change: substantive effects or Regression Toward the Mean Artifacts? Multivariate Behavioral Research, 37, 245-282.

Pedhazur, E. J. & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.

Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. New York: Henry Holt & Company.

Wainer, H. & Robinson, D. H. (2003). Shaping up the practice of null hypothesis significance testing. Educational Researcher, 32(7), 22-30.

P.S. I found that Kahneman had been misspelled as Khaneman throughout my earlier lecture notes.