Solved – Integrating an empirical CDF

I have an empirical distribution $G(x)$. I calculate it as follows

    x <- seq(0, 1000, 0.1)     g <- ecdf(var1)     G <- g(x) 

I denote $h(x) = dG/dx$, i.e., $h$ is the pdf while $G$ is the cdf.

I now want to solve an equation for the upper limit of integration (say, $a$), such that the expected value of $x$ is some $k$.

That is, integrating from $0$ to $b$, I should have $int xh(x)dx = k$. I want to solve for $b$.

Integrating by parts, I can rewrite the equation as

$bG(b) – int_0^b G(x)dx = k$, where the integral is from $0$ to $b$ ——- (1)

I think I can calculate the integral as follows

    intgrl <- function(b) {         z <- seq(0, b, 0.01)         G <- g(z)         return(mean(G))      } 

But when I try to use this function with

    library(rootSolve)     root <- uniroot.All(fun, c(0, 1000)) 

where fun is eq (1), I get the following error

    Error in seq.default(0, b, by = 0.01) : 'to' must be of length 1   

I think the issue is that my function intgrl is evaluated at a numeric value, while uniroot.All is passing the interval c(0,1000)

How should I solve for $b$ in this situation in R?

Let the sorted data be $x_1 le x_2 le cdots le x_n$. To understand the empirical CDF $G$, consider one of the values of the $x_i$–let's call it $gamma$–and suppose that some number $k$ of the $x_i$ are less than $gamma$ and $t ge 1$ of the $x_i$ are equal to $gamma$. Pick an interval $[alpha, beta]$ in which, of all the possible data values, only $gamma$ appears. Then, by definition, within this interval $G$ has the constant value $k/n$ for numbers less than $gamma$ and jumps to the constant value $(k+t)/n$ for numbers greater than $gamma$.

ECDF

Consider the contribution to $int_0^b x h(x) dx$ from the interval $[alpha,beta]$. Although $h$ is not a function–it is a point measure of size $t/n$ at $gamma$–the integral is defined by means of integration by parts to convert it into an honest-to-goodness integral. Let's do this over the interval $[alpha,beta]$:

$$int_alpha^beta x h(x) dx = left(x G(x)right)vert_alpha^beta – int_alpha^beta G(x) dx = left(beta G(beta) – alpha G(alpha)right) -int_alpha^beta G(x) dx. $$

The new integrand, although it is discontinuous at $gamma$, is integrable. Its value is easily found by breaking the domain of integration into the parts preceding and following the jump in $G$:

$$int_alpha^beta G(x)dx = int_alpha^gamma G(alpha) dx + int_gamma^beta G(beta) dx = (gamma-alpha)G(alpha) + (beta-gamma)G(beta).$$

Substituting this into the foregoing and recalling $G(alpha)=k/n, G(beta)=(k+t)/n$ yields

$$int_alpha^beta x h(x) dx = left(beta G(beta) – alpha G(alpha)right) – left((gamma-alpha)G(alpha) + (beta-gamma)G(beta)right) = gammafrac{t}{n}.$$

In other words, this integral multiplies the location (along the $X$ axis) of each jump by the size of that jump. The size of the jump is

$$frac{t}{n} = frac{1}{n} + cdots + frac{1}{n}$$

with one term for each of the data values that equals $gamma$. Adding the contributions from all such jumps of $G$ shows that

$$int_0^b x h(x) dx = sum_{i:, 0 le x_i le b} left(x_ifrac{1}{n}right) = frac{1}{n}sum_{x_ile b}x_i.$$

We might call this a "partial mean," seeing that it equals $1/n$ times a partial sum. (Please note that it is not an expectation. It can be related to the expectation of a version of the underlying distribution that has been truncated to the interval $[0,b]$: you must replace the $1/n$ factor by $1/m$ where $m$ is the number of data values within $[0,b]$.)

Given $k$, you wish to find $b$ for which $frac{1}{n}sum_{x_ile b}x_i = k.$ Because the partial sums are a finite set of values, usually there is no solution: you will need to settle for the best approximation, which can be found by bracketing $k$ between two partial means, if possible. That is, upon finding $j$ such that

$$frac{1}{n}sum_{i=1}^{j-1} x_i le k lt frac{1}{n}sum_{i=1}^j x_i,$$

you will have narrowed $b$ to the interval $[x_{j-1}, x_j)$. You can do no better than that using the ECDF. (By fitting some continuous distribution to the ECDF you can interpolate to find an exact value of $b$, but its accuracy will depend on the accuracy of the fit.)


R performs the partial sum calculation with cumsum and finds where it crosses any specified value using the which family of searches, as in:

set.seed(17) k <- 0.1 var1 <- round(rgamma(10, 1), 2) x <- sort(var1) x.partial <- cumsum(x) / length(x) i <- which.max(x.partial > k) cat("Upper limit lies between", x[i-1], "and", x[i]) 

The output in this example of data drawn iid from an Exponential distribution is

Upper limit lies between 0.39 and 0.57

The true value, solving $0.1 = int_0^b x exp(-x)dx,$ is $0.531812$. Its closeness to the reported results suggests this code is accurate and correct. (Simulations with much larger datasets continue to support this conclusion).

Here is a plot of the empirical CDF $G$ for these data, with the estimated values of the upper limit shown as vertical dashed gray lines:

Figure of ECDF

Similar Posts:

Rate this post

Leave a Comment