# “Confidence interval of R-square”, but, which one?

In linear regression, the confidence interval (CI) of the population DV (the conditional mean) is narrower than that of the predicted DV (a new observation). Assuming generalizability, the CI of $\tilde{Y}_{\left[1\times1\right]}$ at $x_{\left[1\times p\right]}$ is

$\;\hat{Y}\pm\left(x\left(X^{\tau}X_{\left[N\times p\right]}\right)^{-1}x^{\tau}\right)^{\frac{1}{2}}\hat{\sigma}t_{\frac{\alpha}{2},N-p}$,

while the CI of $Y\left(x\right)=\tilde{Y}\left(x\right)+\varepsilon$ is

$\;\hat{Y}\pm\left(1+x\left(X^{\tau}X_{\left[N\times p\right]}\right)^{-1}x^{\tau}\right)^{\frac{1}{2}}\hat{\sigma}t_{\frac{\alpha}{2},N-p}$.

The pivotal quantities behind both intervals are quite similar, as follows.

$\;\frac{\hat{Y}-\tilde{Y}}{s_{\hat{Y}}}\sim t_{df=N-p}$ ,

so $\tilde{Y}_{critical}=\hat{Y}-s_{\hat{Y}}\times t_{critical}$ .

$\;\frac{\hat{Y}-Y}{s_{\left(\hat{Y}-Y\right)}}\sim t_{df=N-p}$,

so $Y_{critical}=\hat{Y}-s_{\left(\hat{Y}-Y\right)}\times t_{critical}=\hat{Y}-s_{\left(\hat{Y}-\tilde{Y}-\varepsilon\right)}\times t_{critical}$ .
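To make the comparison concrete, here is a minimal sketch, not a reference implementation. It assumes the special case of one IV plus an intercept ($p=2$), where $x\left(X^{\tau}X\right)^{-1}x^{\tau}$ reduces to $\frac{1}{N}+\frac{\left(x_{0}-\bar{x}\right)^{2}}{S_{xx}}$. The function name `interval_half_widths`, the toy data, and the caller-supplied `t_crit` are all hypothetical; the critical value could come, for example, from `scipy.stats.t.ppf(1 - alpha / 2, N - 2)`.

```python
import math


def interval_half_widths(xs, ys, x0, t_crit):
    """Return (y_hat, half-width for the mean response, half-width for a new Y).

    Simple regression with an intercept (p = 2), so the quadratic form
    x (X'X)^{-1} x' equals 1/N + (x0 - mean(x))^2 / Sxx.
    t_crit is the t quantile with N - 2 degrees of freedom, supplied by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Ordinary least squares estimates
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual standard deviation with N - p = N - 2 degrees of freedom
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(sse / (n - 2))

    leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    y_hat = intercept + slope * x0
    mean_hw = math.sqrt(leverage) * sigma_hat * t_crit        # CI of the mean response
    pred_hw = math.sqrt(1.0 + leverage) * sigma_hat * t_crit  # CI of a new observation
    return y_hat, mean_hw, pred_hw


# Hypothetical toy data; 2.776 is the 97.5% t quantile with 4 degrees of freedom
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
y_hat, mean_hw, pred_hw = interval_half_widths(xs, ys, x0=3.5, t_crit=2.776)
print(y_hat, mean_hw, pred_hw)
```

Because of the extra $1$ under the square root, `pred_hw` exceeds `mean_hw` for any data set and any `x0`.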

$R^{2}$ of linear regression is the point estimate of

$\;\eta^{2}\equiv\frac{SS\left(\tilde{Y}_{\left[N\times1\right]}\right)}{SS\left(\tilde{Y}_{\left[N\times1\right]}\right)+N\sigma^{2}}$

for a model with fixed IV(s). Alternatively, it is the point estimate of $\rho^{2}$, where $\rho$ denotes the correlation between Y and $X\beta$, the linear combination of the random IV(s). Given the same $R^{2}$ and confidence level, the CI of $\rho^{2}$ is wider than that of $\eta^{2}$.

[update] Obviously, the CI of $\rho^{2}$ depends on the distribution assumed for the IV(s) and the DV, since fixed IV(s) are just a special case of generally random IV(s). Usually, the assumption is that the IV(s) and the DV jointly follow a multivariate normal distribution.

In the bivariate normal case with a single random IV, a CI of the re-sampled $R^{\prime2}=r^{\prime2}$ can also be constructed through Fisher's z-transform of Pearson's r. Intuitively, it should be wider than the CI of $\rho^{2}$.

$\;\tanh^{-1}\left(r\right)\equiv\frac{1}{2}\log\frac{1+r}{1-r}\;{appr\atop \sim}\; N\left(\tanh^{-1}\left(\rho\right),\frac{1}{N-3}\right)$

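Thus, for an independent replication $r^{\prime}$ with the same sample size $N$, the two variances add:

$\;\tanh^{-1}\left(r^{\prime}\right)-\tanh^{-1}\left(r\right)\;{appr\atop \sim}\;N\left(0,\frac{2}{N-3}\right)$ .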

The CI of $\tanh^{-1}\left(r^{\prime}\right)$ can be constructed as $\tanh^{-1}\left(r\right)\pm\sqrt{\frac{2}{N-3}}z_{1-\frac{\alpha}{2}}$ . With the inverse transform $\tanh\left(.\right)$, the CI bounds of $R^{\prime2}$ are

$\;\left(\max\left(0,\tanh\left(\tanh^{-1}\left(R\right)-\sqrt{\frac{2}{N-3}}z_{1-\frac{\alpha}{2}}\right)\right)\right)^{2}$

and

$\;\left(\tanh\left(\tanh^{-1}\left(R\right)+\sqrt{\frac{2}{N-3}}z_{1-\frac{\alpha}{2}}\right)\right)^{2}$.
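The two bounds above can be sketched in a few lines of Python. This is an illustration under the stated bivariate-normal approximation, not the algorithm of any package mentioned in this post. The function name `fisher_interval_r2` and the example values `r = 0.5`, `n = 50` are hypothetical. Setting `replication=False` switches the variance from $\frac{2}{N-3}$ to $\frac{1}{N-3}$, which gives the usual Fisher-z interval for $\rho^{2}$ for comparison.

```python
import math
from statistics import NormalDist


def fisher_interval_r2(r, n, alpha=0.05, replication=False):
    """Approximate interval for a squared correlation via Fisher's z-transform.

    replication=False: interval for rho^2, using variance 1 / (n - 3).
    replication=True: interval for r'^2 from an independent replication
    of the same size n, using variance 2 / (n - 3).
    Assumes bivariate normality and 0 <= r < 1.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    variance = (2.0 if replication else 1.0) / (n - 3)
    half_width = z_crit * math.sqrt(variance)
    center = math.atanh(r)

    # Back-transform with tanh, clip the lower bound at 0, then square
    lower = max(0.0, math.tanh(center - half_width)) ** 2
    upper = math.tanh(center + half_width) ** 2
    return lower, upper


# Hypothetical example values
lo_rho, hi_rho = fisher_interval_r2(0.5, 50)
lo_rep, hi_rep = fisher_interval_r2(0.5, 50, replication=True)
print((lo_rho, hi_rho), (lo_rep, hi_rep))
```

For the same `r`, `n`, and `alpha`, the replication interval always contains the $\rho^{2}$ interval, since its half-width on the z scale is larger by a factor of $\sqrt{2}$.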

In the case of multiple IVs (p of them), Fisher's z-transform is

$\;\left(N-2-p\right)\left(\tanh^{-1}\left(R\right)\right)^{2}\;{appr\atop \sim}\;\chi_{df=p,ncp=\left(N-2-p\right)\left(\tanh^{-1}\left(\rho\right)\right)^{2}}^{2}$ .

Although it could also be used to construct a CI of $\rho^{2}$ , it is inferior to the noncentral F approximation of R (Lee, 1971). The latter is the algorithm adopted by the MS-DOS program R2 (Steiger & Fouladi, 1992) and by the R function ci.R2(...) in the package MBESS (Kelley, 2008).

In the literature, "CI(s) of R-square" are hardly ever the literal CI(s) of $R^{2}$ in one more replication. Most of them actually refer to the CI of $\rho^{2}$ . Authors in social science who are unfamiliar with LaTeX hate to type $\rho$ when it is more convenient to type r or R. Users of experimentally designed, fixed IV(s) should report the CI of $\eta^{2}$ . However, if they are familiar with Steiger's software R2 but have overlooked his series of papers on CIs of effect sizes, there is a considerable chance that they will report the looser CI of $\rho^{2}$, under the even looser name "CI of $R^{2}$".

----

Lee, Y. S. (1971). Some results on the sampling distribution of the multiple correlation coefficient. Journal of the Royal Statistical Society, B, 33, 117–130.

Kelley, K. (2008). MBESS: Methods for the Behavioral, Educational, and Social Sciences. R package version 1.0.1. [Computer software]. Available from http://www.indiana.edu/~kenkel

Steiger, J. H., & Fouladi, R. T. (1992). R2: A computer program for interval estimation, power calculation, and hypothesis testing for the squared multiple correlation. Behavior Research Methods, Instruments, & Computers, 24(4), 581–582.

R Code of Part I:

R Code of Part II: