# Solved – Unbiased estimation of covariance matrix for multiply censored data

Chemical analyses of environmental samples are often censored below at reporting limits or various detection/quantitation limits. The latter can vary, usually in proportion to the values of other variables. For example, a sample with a high concentration of one compound might need to be diluted for analysis, resulting in proportional inflation of the censoring limits for all other compounds analyzed at the same time in that sample. As another example, sometimes the presence of a compound can alter the response of the test to other compounds (a "matrix interference"); when this is detected by the laboratory, it will inflate its reporting limits accordingly.

I am seeking a practical way to estimate the entire variance-covariance matrix for such datasets, especially when many of the compounds experience more than 50% censoring, which is often the case. A conventional distributional model is that the logarithms of the (true) concentrations are multinormally distributed, and this appears to fit well in practice, so a solution for this situation would be useful.

(By "practical" I mean a method that can reliably be coded in at least one generally available software environment like R, Python, SAS, etc., in a way that executes quickly enough to support iterative recalculations such as occur in multiple imputation, and which is reasonably stable [which is why I am reluctant to explore a BUGS implementation, although Bayesian solutions in general are welcome].)

Contents

I have not full internalized the issue of matrix interference but here is one approach. Let:

\$Y\$ be a vector that represents the concentration of all the target compounds in the undiluted sample.

\$Z\$ be the corresponding vector in the diluted sample.

\$d\$ be the dilution factor i.e., the sample is diluted \$d\$:1.

Our model is:

\$Y sim N(mu,Sigma)\$

\$Z = frac{Y}{d} + epsilon\$

where \$epsilon sim N(0,sigma^2 I)\$ represents the error due to dilution errors.

Therefore, it follows that:

\$Z sim N(frac{mu}{d}, Sigma + sigma^2 I)\$

Denote the above distribution of \$Z\$ by \$f_Z(.)\$.

Let \$O\$ be the observed concentrations and \$tau\$ represent the test instrument's threshold below which it cannot detect a compound. Then, for the \$i^{th}\$ compound we have:

\$O_i = Z_i I(Z_i > tau) + 0 I(Z_i le tau)\$

Without loss of generality let the first \$k\$ compounds be such that they are below the threshold. Then the likelihood function can be written as:

\$L(O_1, … O_k, O_{k+1},…O_n |- ) = [prod_{i=1}^{i=k}{Pr(Z_i le tau)}] [prod_{i=k+1}^{i=n}{f(O_i |-)}]\$

where

\$f(O_i |-) = int_{jneq i}{f_Z(O_i|-) I(O_i > tau)}\$

Estimation is then a matter of using either maximum likelihood or bayesian ideas. I am not sure how tractable the above is but I hope it gives you some ideas.

Rate this post