Let's say my data consists of 25 observations of four features, which are in groups of 2. So we'll call my variables $x_1, x_2, y_1, y_2$. We have a partitioned sample mean vector given by $begin{bmatrix}bar{x}\ bar{y}end{bmatrix}$, where $bar{x} = begin{bmatrix}bar{x_1}\ bar{x_2}end{bmatrix}$, $bar{y} = begin{bmatrix}bar{y_1}\ bar{y_2}end{bmatrix}$. We can partition the covariance matrix $S$ by
begin{equation*}
S = begin{bmatrix}S_{xx}&S_{yx}\S_{xy}&S_{yy}end{bmatrix}
end{equation*}
where $S_{xx}$ is the covariance matrix of just the $x$ variables and $S_{yy}$ is the covariance matrix of just the $y$ variables. I'm trying to write an R script that will calculate $S_{yx}$ using just the input data. Clearly $S_{yx}$ consists of
begin{equation*}
S_{yx} = begin{bmatrix}Cov(x_1, y_1)&Cov(x_1, y_2)\Cov(x_2, y_1)&Cov(x_2, y_2)end{bmatrix}
end{equation*}
So if $X$ is the $25 times 2$ matrix of $x$ data and $Y$ is the $25 times 2$ matrix of $Y$ data, then my idea for calculating is is like so:
begin{equation*}
S_{yx} = frac{1}{n-1} big(X^TY – nbar{x}bar{y}^Tbig)
end{equation*}
However, when I type into R:
(1/(n-1)) * ((t(X) %*% Y) - (n * xbar %*% t(ybar)))
I get results which are wildly different from the corresponding results in the full covariance matrix. Is there a problem with my reasoning or my R code?
EDIT: ftp://ftp.wiley.com/public/sci_tech_med/multivariate_analysis_3e/ My input is the "T3_8_SONS.DAT" file in this directory. I don't know if there is a better way to include my input, sorry.
Best Answer
You can make your life a lot easier by using R's covariance function cov
. So, all you need to do is read in the data to a dataframe, change the order of the columns, and then compute the covariance matrix. The partitions can then be plucked from the main covariance matrix. Here is some code to get $S$, $Sxx$, $Syx$.
> #Read in Data > mydf<-read.table("D:\T3_8_SONS.DAT") > names(mydf)<-c("y1", "y2", "x1", "x2") > #Change column order so S is created in the same order as in OP's question > mydf2<-data.frame(mydf$x1, mydf$x2, mydf$y1, mydf$y2) > names(mydf2)<-c("x1", "x2", "y1", "y2") > #Print to compare to Table 3.8 in book > head(mydf2) x1 x2 y1 y2 1 179 145 191 155 2 201 152 195 149 3 185 149 181 148 4 188 149 183 153 5 171 142 176 144 6 192 152 208 157 > #Obtain full variance covariance matrix > S <- cov(mydf2) > #Obtain covariance partitioned matrices > Sxx <- S[1:2,1:2] > Syy <- S[3:4,3:4] > Syx <- Sxy <- S[1:2, 3:4] > S x1 x2 y1 y2 x1 100.80667 56.54000 69.66167 51.31167 x2 56.54000 45.02333 46.11167 35.05333 y1 69.66167 46.11167 95.29333 52.86833 y2 51.31167 35.05333 52.86833 54.36000 > Sxx x1 x2 x1 100.8067 56.54000 x2 56.5400 45.02333 > Syy y1 y2 y1 95.29333 52.86833 y2 52.86833 54.36000 > Syx y1 y2 x1 69.66167 51.31167 x2 46.11167 35.05333
We could use your method, which is mathematically correct, to obtain $Syx$ too, but it's more work. Here are the alternate calculations:
> #Your method > n<-25 > xbar<-apply(mydf2, 2, mean)[1:2] > ybar<-apply(mydf2, 2, mean)[3:4] > Syx2<-(t(mydf2[1:2])%*%as.matrix(mydf2[3:4])-n*(xbar%*%t(ybar)))/(n-1) > Syx2 y1 y2 x1 69.66167 51.31167 x2 46.11167 35.05333 >
Note that the $Syx$ here matches the computations I provided previously for $Syx$.
Similar Posts:
- Solved – Expectation of a matrix for variance-covariance
- Solved – Find cointegrating vectors and loadings from a trivariate VAR(1) equation
- Solved – Understanding Covariance Matrix- Formula
- Solved – least squares coefficient estimates in calculus and matrix calculus
- Solved – In linear regression, are the noise terms independent of the coefficient estimators