Understanding QQ plots

## Try other generators such as rchisq, rt, runif, or rf to see heavy or light tails on the left or right.

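n <- 30
ry <- rnorm(n)
qqnorm(ry)  ## baseline QQ plot with black o(s); this call is assumed, it is missing from the note
## view and guess what the x(s) and y(s) are
I <- rep(1, n)
## qr are the sample quantile levels: (rank - 0.5)/n, with ties counted as one half
qr <- ((ry %*% t(I) > I %*% t(ry)) + .5 * (ry %*% t(I) == I %*% t(ry))) %*% I * (1/n)
points(qr, ry, col = "blue")  ## ry against its quantile levels
## to check this, try the following
rx <- qnorm(qr)
points(rx, ry, col = "red", cex = 2)  ## assumed call: larger red circles at (rx, ry)
## Red O(s) circle the black o(s) exactly.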
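
The matrix expression for qr can also be cross-checked without any plotting. The sketch below is only an illustration: the seed and the name q_levels are not part of the original note, and it relies on qqnorm() using the levels (i - 0.5)/n through ppoints(), which holds for n > 10.

```r
set.seed(1)                      # illustrative seed, not in the original note
n  <- 30
ry <- rnorm(n)
I  <- rep(1, n)

# Quantile levels from the original matrix expression
qr <- ((ry %*% t(I) > I %*% t(ry)) + .5 * (ry %*% t(I) == I %*% t(ry))) %*% I * (1/n)

# The same levels written with rank(): (rank - 0.5) / n
q_levels <- (rank(ry) - 0.5) / n

# Theoretical quantiles that qqnorm() would put on the x axis
qq <- qqnorm(ry, plot.it = FALSE)

all.equal(as.vector(qr), q_levels)       # TRUE when the two constructions agree
all.equal(as.vector(qnorm(qr)), qq$x)    # TRUE when red would circle black exactly
```

For n <= 10, ppoints() switches to a different offset, so the second comparison is only expected to hold for larger samples.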

03DEC2007 R workshop sponsored by the Dept. of Psychology, ZSU (= SYSU, Guangzhou)

Here is the updated PPT for the afternoon talk. Its third page includes the zipped example code and the set-up steps for the evening workshop. The anonymous online test on p-value interpretation (with its result statistics) is cited indirectly from Gigerenzer, Krauss, & Vitouch (2004).

The announcement is at http://www.psy.sysu.edu.cn/detail_news.asp?id=258 and a formal CV of the speaker is available at http://lixiaoxu.googlepageS.com

Classic Neyman-Pearson approach demo

Note here that the N-P approach does not use the information in the exact p value. In fact, when the N-P approach was first devised, the exact p value was usually not available. Now almost all statistical software reports exact p values, and the N-P approach has become obsolete. Wilkinson & APA TFSI (1999) recommended reporting the exact p value rather than just significance/insignificance, unless p is smaller than any meaningful precision.
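
To make the contrast concrete, here is a minimal sketch on a made-up sample. The data, seed, and alpha are hypothetical. It puts the N-P style accept/reject decision next to the exact p value that t.test() reports for the same test.

```r
set.seed(2007)                        # illustrative seed
x     <- rnorm(25, mean = 0.4)        # hypothetical sample
alpha <- 0.05

tt <- t.test(x, mu = 0)               # two-sided one-sample t test

# N-P style: compare |t| with the critical value and keep only the decision
t_crit    <- qt(1 - alpha / 2, df = tt$parameter)
np_reject <- unname(abs(tt$statistic) > t_crit)

# Reporting style recommended by Wilkinson & APA TFSI (1999): the p value itself
p_exact <- tt$p.value

c(np_reject = np_reject, p_below_alpha = p_exact < alpha)
p_exact
```

The two logical values always agree, but only p_exact tells the reader how far the result lies from the threshold.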


Wilkinson, L. & APA TFSI (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Different corr(s) of different IV scopes with same regression coef

Y=\alpha+\beta X+\varepsilon,\;\varepsilon\sim N\left(0,\sigma^{2}\right)

With \alpha,\beta, \sigma known in the linear relationship, can the correlation in the scatter plot of Y against X be estimated from the linear formula?

You may recall from the Hierarchical Linear Model class that the range of W dramatically affects the regression coefficients of F~W in the following R demo (hlm.jpg). This time, however, the regression coefficient is fixed at a known \beta, so the range of X never affects it. Even so, the correlation r can range from zero to one (or to -1 when \beta is negative) depending on the variance of X, as the final closed form shows: r=\frac{\beta\mbox{Var}\left(X\right)}{\mbox{Std}\left(Y\right)\mbox{Std}\left(X\right)}=\beta\frac{\mbox{Std}\left(X\right)}{\sqrt{\beta^{2}\mbox{Var}\left(X\right)+\sigma^{2}}}.
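
A small simulation can illustrate the closed form. Everything specific in this sketch is an assumption made for the demo: the parameter values, the normal distribution of X, the seed, and the helper name r_closed.

```r
# Closed-form correlation implied by Y = alpha + beta * X + eps
r_closed <- function(beta, sd_x, sigma) {
  beta * sd_x / sqrt(beta^2 * sd_x^2 + sigma^2)
}

set.seed(33)                          # illustrative seed
alpha <- 1; beta <- 0.5; sigma <- 1   # hypothetical known parameters
n     <- 1e5
sd_xs <- c(0.1, 1, 10)                # narrow, medium, and wide range of X

# Simulate each range of X with the same alpha, beta, sigma
r_sim <- sapply(sd_xs, function(s) {
  x <- rnorm(n, sd = s)
  y <- alpha + beta * x + rnorm(n, sd = sigma)
  cor(x, y)
})

# Same slope in all three cases, yet r moves from near 0 towards 1
round(cbind(sd_x = sd_xs, r_sim = r_sim,
            r_closed = r_closed(beta, sd_xs, sigma)), 3)
```

The simulated and closed-form columns should agree to about two decimals at this sample size.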

As a final word, let me quote Cohen (1994, p. 1001), where the DV takes the role played here by the IV, as in typical contexts like ANOVA:

... standardized effect size measures, such as d and f, developed in power analysis (Cohen, 1988) are, like correlations, also dependent on population variability of the dependent variable and are properly used only when that fact is kept in mind.


Cohen, J. (1994). The earth is round (p<.05). American Psychologist, 49, 997-1003.


Compare with the following case: different corr(s) over different IV ranges with hierarchical regression coefficients.

“Effect Size” — same data, different interpretations

Just a short R script note to illustrate the 3-page paper by Rosenthal & Rubin (1982).

Table 1 (p. 167) lists a simple set-up with a between-subject treatment. The control group includes 34 alive cases and 66 dead cases. The treatment group includes 66 alive cases and 34 dead cases. The question is: what percentage of the variance is explained by the nominal IV indicating the group?

The authors pointed out that one reader may interpret the result as a death rate reduced by 32 percentage points (from 66% to 34%), while another may interpret the same data as only 10.24% of the variance explained. To make the contrast more dramatic, imagine that just 4% of explained variance corresponds to a 20-point reduction in the death rate.
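
The two readings can be reproduced from the counts alone. In the sketch below the helper r_from_counts is a hypothetical name, and the 40 vs 60 split for the second case is an assumed way to realise a 20-point reduction with rates symmetric around 50%.

```r
# Correlation between a 0/1 group indicator and a 0/1 survival outcome,
# given the number of survivors in each group
r_from_counts <- function(alive_control, alive_treatment, n_per_group = 100) {
  group <- rep(c(0, 1), each = n_per_group)
  alive <- c(rep(c(1, 0), c(alive_control,   n_per_group - alive_control)),
             rep(c(1, 0), c(alive_treatment, n_per_group - alive_treatment)))
  cor(group, alive)
}

r1 <- r_from_counts(34, 66)   # Table 1 set-up: death rate 66% vs 34%
r2 <- r_from_counts(40, 60)   # assumed variant: death rate 60% vs 40%

c(r = r1, r_squared = r1^2)
c(r = r2, r_squared = r2^2)
```

With equal group sizes and rates symmetric around 50%, r equals the difference in survival rates, which is why 0.32 and 0.1024 describe the same table.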


Rosenthal, R. & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.

Anscombe’s 4 Regressions — A Trivially Updated Demo

## This is a trivially updated version based on the R document "?anscombe".
require(stats); require(graphics)

##-- now some "magic" to do the 4 regressions in a loop:
ff <- y ~ x
for(i in 1:4) {
  ## swap in the i-th response and predictor names, e.g. y1 ~ x1
  ff[2:3] <- lapply(paste(c("y","x"), i, sep=""), as.name)
  assign(paste("lm.",i,sep=""), lmi <- lm(ff, data = anscombe))
}

## See how close they are (numerically!)
sapply(objects(pattern="lm\\.[1-4]$"), function(n) coef(get(n)))
lapply(objects(pattern="lm\\.[1-4]$"), function(n) coef(summary(get(n))))

## Now, do what you should have done in the first place: PLOTS
## one row per data set: scatter plot with fit, residuals vs fitted, normal QQ plot
op <- par(mfrow=c(4,3), mar=.1+c(4,4,1,1), oma=c(0,0,2,0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste(c("y","x"), i, sep=""), as.name)
  plot(ff, data = anscombe, col="red", pch=21, bg="orange", cex=1.2,
       xlim=c(3,19), ylim=c(3,13))
  abline(get(paste("lm.",i,sep="")), col="blue")
  plot(lm(ff, data = anscombe), which=1, col="red", pch=21, bg="orange", cex=1.2,
       sub.caption="", caption="")
  plot(lm(ff, data = anscombe), which=2, col="red", pch=21, bg="orange", cex=1.2,
       sub.caption="", caption="")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex=1.5)
par(op)  ## restore the previous graphics settings


## Anscombe, F. J. (1973). Graphs in statistical analysis. American Statistician, 27, 17–21.
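
Besides the regression coefficients, the usual summary statistics are also nearly identical across the four sets. The following sketch tabulates them. The object name anscombe_stats and the choice of statistics are illustrative, not taken from "?anscombe".

```r
require(stats)   # the anscombe data frame itself ships with base R (datasets)

# Mean, variance, and correlation for each of the four (x, y) pairs
anscombe_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y), cor_xy = cor(x, y))
})
colnames(anscombe_stats) <- paste0("set", 1:4)

round(anscombe_stats, 2)
```

Each column describes one data set, so near-identical columns next to very different plots is the whole point of the demo.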


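This is the illustration from "The geometry of the correlation coefficient: the cosine of the angle between the residual vectors after projection onto the intercept". It also happens to explain why the \chi^2 distribution followed by \sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2} has df equal to n-1 rather than n.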

Here X_{i}\equiv\mu+\varepsilon_{i}, and \left[\begin{array}{c}\varepsilon_{1}\\\varepsilon_{2}\\\vdots\\\varepsilon_{n}\end{array}\right] is a standard normal random vector in n-dimensional space. It is then easy to see that \sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}=\sum_{i=1}^{n}\left(\varepsilon_{i}-\bar{\varepsilon}\right)^{2}. This expression is the squared length of the vector \left[\begin{array}{c}\varepsilon_{1}\\\varepsilon_{2}\\\vdots\\\varepsilon_{n}\end{array}\right]-\left[\begin{array}{c}\bar{\varepsilon}\\\bar{\varepsilon}\\\vdots\\\bar{\varepsilon}\end{array}\right]. We already know that \left[\begin{array}{c}\bar{\varepsilon}\\\bar{\varepsilon}\\\vdots\\\bar{\varepsilon}\end{array}\right] is the projection of \left[\begin{array}{c}\varepsilon_{1}\\\varepsilon_{2}\\\vdots\\\varepsilon_{n}\end{array}\right] onto the intercept vector (the gnomon of the sundial) \left[\begin{array}{c}1\\1\\\vdots\\1\end{array}\right]. Naturally, \left[\begin{array}{c}\varepsilon_{1}\\\varepsilon_{2}\\\vdots\\\varepsilon_{n}\end{array}\right]-\left[\begin{array}{c}\bar{\varepsilon}\\\bar{\varepsilon}\\\vdots\\\bar{\varepsilon}\end{array}\right] is the residual vector of the projection onto the intercept term, that is, the projection onto the dial plate of the sundial.

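The space the sundial lives in has n equal to 3. If we sample \left[\begin{array}{c}\varepsilon_{1}\\\varepsilon_{2}\\\varepsilon_{3}\end{array}\right] many times, we see a scatter plot of the standard normal distribution in three-dimensional space, symmetric in every direction. The projection of this scatter onto the dial plate is a scatter plot of the two-dimensional standard normal distribution. The squared lengths of the vectors corresponding to these points in the dial plate are naturally samples from \chi^2_{df=2}.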
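
The sundial picture can be checked numerically. The sketch below is a simulation under stated assumptions: the seed, the number of sampled vectors, and the names eps, res_vec, and ss are all illustrative. It samples standard normal vectors in three dimensions, removes their projection onto the intercept vector, and compares the squared lengths with \chi^2_{df=2}.

```r
set.seed(92)                       # illustrative seed
n    <- 3                          # dimension of the space the sundial lives in
reps <- 1e5                        # number of sampled vectors (arbitrary)

# Each row is one standard normal vector in n-dimensional space
eps <- matrix(rnorm(reps * n), nrow = reps)

# Residual after projecting each row onto the intercept vector (1, 1, 1)
res_vec <- eps - rowMeans(eps)
ss      <- rowSums(res_vec^2)      # squared length within the dial plate

# Chi-square with df = n - 1 has mean n - 1 and variance 2 * (n - 1)
c(mean_ss = mean(ss), var_ss = var(ss))

# Empirical quantiles against the theoretical chi-square(df = 2) quantiles
rbind(empirical   = quantile(ss, c(.5, .9, .99)),
      theoretical = qchisq(c(.5, .9, .99), df = n - 1))
```

Changing df to n in the last line makes the two rows disagree, which is the n-1 versus n point in numerical form.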


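A researcher always looks at the computed statistic before deciding whether to run a one-tailed or a two-tailed test of the null hypothesis \mu=0. If the statistic \bar{X}>0, he sets the alternative hypothesis to \mu>0. If \bar{X}<0, he sets it to \mu<0. Suppose his \alpha=0.05. What is his actual Type I error rate? More specifically, over many experiments in which the truth is always \mu=0, what value will the proportion of significant rejections approach?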
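
One way to explore the question is by simulation. The sketch below assumes a z test with known \sigma=1, since the note does not say which test the researcher uses. The sample size, number of experiments, and seed are arbitrary. Choosing the tail after seeing the sign amounts to rejecting whenever |z| exceeds the one-tailed critical value, so analytically the rate is 2\alpha=0.10, and the simulated rate should land close to that.

```r
set.seed(95)                  # illustrative seed
alpha <- 0.05
n     <- 10                   # hypothetical sample size per experiment
reps  <- 20000                # hypothetical number of experiments

# Every experiment is generated under the null: mu = 0, known sigma = 1
x <- matrix(rnorm(reps * n), nrow = reps)
z <- rowMeans(x) * sqrt(n)    # z statistic of each experiment

# Pick the tail after seeing the sign of the mean, then test one-tailed at alpha
p_one_tailed <- ifelse(z > 0, 1 - pnorm(z), pnorm(z))

type1_rate <- mean(p_one_tailed < alpha)   # long-run rejection rate of this habit
type1_rate
```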







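Knight's original book is not easy to read. Even just looking up the Uncertainty and Knight entries in The New Palgrave: A Dictionary of Economics (1987 edition, 1996 Chinese translation) leaves one lost in a fog. The Knight entry was written by G. J. Stigler, who voices mild reservations about Knight's "contribution" on Uncertainty. Note 1 of Chapter 7 in Knight's book also carefully points out that he intends to sidestep any discussion of epistemology (the theory of knowledge). To me this feels like discussing an object that has been defined as "an object that by nature cannot be discussed". Bear in mind that the only content of Uncertainty in Knight's book is unmeasurability, so every reduction (eliminate) of it is a negation of it. As soon as one compares how "unmeasurable" it is, one negates the essence of being "unmeasurable". Given the experience of Russell's paradox, I strongly suspect that comparing degrees of "unmeasurability" is bound to lead to a paradox.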



Knight, F. H. (1921). Risk, Uncertainty, and Profit. Boston, MA: Hart, Schaffner & Marx.

Understanding the nominal IV