# Solved – Integrating an empirical CDF

I have an empirical distribution \$G(x)\$. I calculate it as follows

``    x <- seq(0, 1000, 0.1)     g <- ecdf(var1)     G <- g(x) ``

I denote \$h(x) = dG/dx\$, i.e., \$h\$ is the pdf while \$G\$ is the cdf.

I now want to solve an equation for the upper limit of integration (say, \$a\$), such that the expected value of \$x\$ is some \$k\$.

That is, integrating from \$0\$ to \$b\$, I should have \$int xh(x)dx = k\$. I want to solve for \$b\$.

Integrating by parts, I can rewrite the equation as

\$bG(b) – int_0^b G(x)dx = k\$, where the integral is from \$0\$ to \$b\$ ——- (1)

I think I can calculate the integral as follows

``    intgrl <- function(b) {         z <- seq(0, b, 0.01)         G <- g(z)         return(mean(G))      } ``

But when I try to use this function with

``    library(rootSolve)     root <- uniroot.All(fun, c(0, 1000)) ``

where fun is eq (1), I get the following error

``    Error in seq.default(0, b, by = 0.01) : 'to' must be of length 1   ``

I think the issue is that my function `intgrl` is evaluated at a numeric value, while `uniroot.All` is passing the interval `c(0,1000)`

How should I solve for \$b\$ in this situation in R?

Contents

Let the sorted data be \$x_1 le x_2 le cdots le x_n\$. To understand the empirical CDF \$G\$, consider one of the values of the \$x_i\$–let's call it \$gamma\$–and suppose that some number \$k\$ of the \$x_i\$ are less than \$gamma\$ and \$t ge 1\$ of the \$x_i\$ are equal to \$gamma\$. Pick an interval \$[alpha, beta]\$ in which, of all the possible data values, only \$gamma\$ appears. Then, by definition, within this interval \$G\$ has the constant value \$k/n\$ for numbers less than \$gamma\$ and jumps to the constant value \$(k+t)/n\$ for numbers greater than \$gamma\$.

Consider the contribution to \$int_0^b x h(x) dx\$ from the interval \$[alpha,beta]\$. Although \$h\$ is not a function–it is a point measure of size \$t/n\$ at \$gamma\$–the integral is defined by means of integration by parts to convert it into an honest-to-goodness integral. Let's do this over the interval \$[alpha,beta]\$:

\$\$int_alpha^beta x h(x) dx = left(x G(x)right)vert_alpha^beta – int_alpha^beta G(x) dx = left(beta G(beta) – alpha G(alpha)right) -int_alpha^beta G(x) dx. \$\$

The new integrand, although it is discontinuous at \$gamma\$, is integrable. Its value is easily found by breaking the domain of integration into the parts preceding and following the jump in \$G\$:

\$\$int_alpha^beta G(x)dx = int_alpha^gamma G(alpha) dx + int_gamma^beta G(beta) dx = (gamma-alpha)G(alpha) + (beta-gamma)G(beta).\$\$

Substituting this into the foregoing and recalling \$G(alpha)=k/n, G(beta)=(k+t)/n\$ yields

\$\$int_alpha^beta x h(x) dx = left(beta G(beta) – alpha G(alpha)right) – left((gamma-alpha)G(alpha) + (beta-gamma)G(beta)right) = gammafrac{t}{n}.\$\$

In other words, this integral multiplies the location (along the \$X\$ axis) of each jump by the size of that jump. The size of the jump is

\$\$frac{t}{n} = frac{1}{n} + cdots + frac{1}{n}\$\$

with one term for each of the data values that equals \$gamma\$. Adding the contributions from all such jumps of \$G\$ shows that

\$\$int_0^b x h(x) dx = sum_{i:, 0 le x_i le b} left(x_ifrac{1}{n}right) = frac{1}{n}sum_{x_ile b}x_i.\$\$

We might call this a "partial mean," seeing that it equals \$1/n\$ times a partial sum. (Please note that it is not an expectation. It can be related to the expectation of a version of the underlying distribution that has been truncated to the interval \$[0,b]\$: you must replace the \$1/n\$ factor by \$1/m\$ where \$m\$ is the number of data values within \$[0,b]\$.)

Given \$k\$, you wish to find \$b\$ for which \$frac{1}{n}sum_{x_ile b}x_i = k.\$ Because the partial sums are a finite set of values, usually there is no solution: you will need to settle for the best approximation, which can be found by bracketing \$k\$ between two partial means, if possible. That is, upon finding \$j\$ such that

\$\$frac{1}{n}sum_{i=1}^{j-1} x_i le k lt frac{1}{n}sum_{i=1}^j x_i,\$\$

you will have narrowed \$b\$ to the interval \$[x_{j-1}, x_j)\$. You can do no better than that using the ECDF. (By fitting some continuous distribution to the ECDF you can interpolate to find an exact value of \$b\$, but its accuracy will depend on the accuracy of the fit.)

`R` performs the partial sum calculation with `cumsum` and finds where it crosses any specified value using the `which` family of searches, as in:

``set.seed(17) k <- 0.1 var1 <- round(rgamma(10, 1), 2) x <- sort(var1) x.partial <- cumsum(x) / length(x) i <- which.max(x.partial > k) cat("Upper limit lies between", x[i-1], "and", x[i]) ``

The output in this example of data drawn iid from an Exponential distribution is

Upper limit lies between 0.39 and 0.57

The true value, solving \$0.1 = int_0^b x exp(-x)dx,\$ is \$0.531812\$. Its closeness to the reported results suggests this code is accurate and correct. (Simulations with much larger datasets continue to support this conclusion).

Here is a plot of the empirical CDF \$G\$ for these data, with the estimated values of the upper limit shown as vertical dashed gray lines:

Rate this post