# Solved – Iterative proportional fitting in R

## The mission

I am trying to find a way to do Iterative Proportional Fitting in R. The logic of the procedure is like this: one has a table with, for example, the sample distribution of some variables. Let us say it is this one:

```r
sample1 <- structure(c(6L, 14L, 46L, 16L, 6L, 21L, 62L, 169L, 327L, 174L,
                       44L, 72L, 43L, 100L, 186L, 72L, 23L, 42L),
                     .Dim = c(6L, 3L),
                     .Dimnames = list(c("Primary", "Lowersec", "Highersec",
                                        "Highershort", "Higherlong", "University"),
                                      c("B", "F", "W")))
```

Another table is from some other source, say another sample:

```r
sample2 <- structure(c(171796L, 168191L, 240671L, 69168L, 60079L, 168169L,
                       954045L, 1040981L, 1872732L, 726410L, 207366L, 425786L,
                       596239L, 604826L, 991640L, 323215L, 134066L, 221696L),
                     .Dim = c(6L, 3L),
                     .Dimnames = list(c("Primary", "Lowersec", "Highersec",
                                        "Highershort", "Higherlong", "University"),
                                      c("B", "F", "W")))
```

Now, we want to preserve the relationships between variables found in `sample1`, but apply these relationships to the marginal distributions we find in `sample2`. Iterative Proportional Fitting does this, as described here (I could not possibly offer a better explanation). I have tried doing it in LEM, with the following result:

```
                    B         F        W
Primary     124204.64  960173.6 637701.7
Lowersec    119749.12 1081459.0 612789.9
Highersec   336934.21 1792001.6 976107.2
Highershort  90512.27  736464.1 291816.6
Higherlong   43486.91  238593.0 119431.0
University  163186.85  418628.6 233835.5
```

I am not 100% sure that this result is correct, but chances are (say 99%) that it is: the odds ratios from the first table are preserved in the resulting table, while the marginal distributions (row and column sums) match the second input table.
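For intuition, the row-and-column scaling just described can be sketched in a few lines of base R on a toy table (a minimal illustration, not a full implementation with convergence checks):

```r
# Toy IPF sketch: rake a 2x2 seed toward new target margins.
seed <- matrix(c(10, 20, 30, 40), nrow = 2)
target_rows <- c(40, 60)
target_cols <- c(55, 45)

fit <- seed
for (i in 1:100) {
  fit <- sweep(fit, 1, target_rows / rowSums(fit), "*")  # match row sums
  fit <- sweep(fit, 2, target_cols / colSums(fit), "*")  # match column sums
}

rowSums(fit)  # ~ c(40, 60)
colSums(fit)  # ~ c(55, 45)
```

After enough alternating sweeps the margins match the targets, while the odds ratio of the seed is preserved.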

## The problem

Strangely, this quite useful algorithm is not readily available in R, at least not in a user-friendly form. One function that looks relevant is `cat::ipf()`, but I cannot figure out how to use its `margins=` argument, and I am certainly not alone in this. The help example uses a three-dimensional table, which makes things even more confusing.

In addition, there are some user-written functions: one is here and the other is to be found here. Unfortunately, the first one gives an erroneous result. The second one is also rather opaque, requiring specially pre-formatted CSV files as input instead of R matrix objects.

## The question

1. Can anyone please explain how to actually use the `cat::ipf()` function?
2. Are there any alternative functions to achieve the IPF adjustment task, using matrices as input?
3. (SOLVED) Can this function be fixed to deliver a proper result?

Thank you.

ADDENDUM: I was able to get a proper output from the function in (3). After some consideration it turned out that the function does not accept a matrix as input for the marginal distributions, but simply a list of them. So practically the question is solved. However, a proper answer for 1 and 2 would be of use to the larger community, since IPF is quite essential in loglinear models.
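As for question 1, my best reading of the `cat::ipf()` help page is the following (an untested guess, so treat the whole call as an assumption): the `table` argument supplies the target margins, `margins` is a zero-separated vector of margin index sets (as in `cat::ecm`), and `start` supplies the seed whose interaction structure is preserved.

```r
library(cat)

# Hypothetical usage for the raking task above: fit the one-way margins
# {1} and {2} of sample2 (hence margins = c(1, 0, 2)), starting from
# sample1 so that its odds ratios are carried over.
res <- ipf(sample2, margins = c(1, 0, 2), start = sample1)
```

If this reading is right, `res` should reproduce the LEM result shown earlier; I have not verified it.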

Also reproducing a working IPF function here seems like a good idea, for those who search in the future:

```r
ipf <- function(Margins_, seedAry, maxiter = 100, closure = 0.001) {
  # Check to see if the sum of each margin is equal
  MarginSums. <- unlist(lapply(Margins_, sum))
  if (any(MarginSums. != MarginSums.[1])) warning("sum of each margin not equal")

  # Replace margin values of zero with 0.001
  Margins_ <- lapply(Margins_, function(x) {
    if (any(x == 0)) warning("zeros in margins replaced with 0.001")
    x[x == 0] <- 0.001
    x
  })

  # Check that the number of dimensions in the seed array equals
  # the number of margins supplied
  numMargins <- length(dim(seedAry))
  if (length(Margins_) != numMargins) {
    stop("number of margins not equal to number of dimensions in seedAry")
  }

  # Set initial values
  resultAry <- seedAry
  iter <- 0
  marginChecks <- rep(1, numMargins)
  margins <- seq(1, numMargins)

  # Iteratively proportion margins until closure or iteration criteria are met
  while (any(marginChecks > closure) && iter < maxiter) {
    for (margin in margins) {
      marginTotal <- apply(resultAry, margin, sum)
      marginCoeff <- Margins_[[margin]] / marginTotal
      marginCoeff[is.infinite(marginCoeff)] <- 0
      resultAry <- sweep(resultAry, margin, marginCoeff, "*")
      marginChecks[margin] <- sum(abs(1 - marginCoeff))
    }
    iter <- iter + 1
  }

  # If IPF stopped due to the iteration limit, say so
  if (iter == maxiter) cat("IPF stopped due to number of iterations\n")

  # Return the balanced array
  resultAry
}
```

(Two small bugs in the original are fixed here: the margin-sum check compared the vector to itself and could never fire, and the iteration message was missing its newline escape.)

Note that this function will not accept a matrix for the marginal distributions; it expects a list. The following will help:

```r
m1 <- rowSums(sample2)
m2 <- colSums(sample2)
m <- list(m1, m2)
```

And then supply `m` as the first argument.
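Putting the pieces together (this assumes `sample1`, `sample2`, and the `ipf()` function above are already defined in the workspace):

```r
# Rake sample1 toward the margins of sample2
m <- list(rowSums(sample2), colSums(sample2))
adjusted <- ipf(m, sample1)

# The margins of the result should now match sample2 (up to the
# closure tolerance), while the odds ratios come from sample1.
round(rowSums(adjusted) - rowSums(sample2))
round(colSums(adjusted) - colSums(sample2))
```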


This is old, but here we go:

As @Henrico wrote, it seems that what you are trying to achieve is indeed raking. Instead of using `survey::rake` you can fit the distribution to the marginals "by hand" with a Poisson GLM, as suggested by @DWin. To get the right frequencies you need to use an offset.
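For completeness, here is my sketch of the `survey::rake` route. The setup details (expanding `sample1` to one record per observation, and passing the population margins as data frames with a `Freq` column) are my reading of the survey package documentation, not something tested against this data:

```r
library(survey)

# Expand sample1 into one row per observation
d <- as.data.frame(as.table(sample1))
d <- d[rep(seq_len(nrow(d)), d$Freq), c("Var1", "Var2")]

# Unweighted design over the individual records
des <- svydesign(ids = ~1, data = d)

# Rake to the row and column margins of sample2
pop_rows <- data.frame(Var1 = rownames(sample2), Freq = rowSums(sample2))
pop_cols <- data.frame(Var2 = colnames(sample2), Freq = colSums(sample2))
raked <- rake(des, sample.margins = list(~Var1, ~Var2),
              population.margins = list(pop_rows, pop_cols))

# The weighted table should agree with the IPF/GLM results
svytable(~Var1 + Var2, raked)
```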

Let

• \$n_{ij}\$ be your `sample1`
• \$N_{ij}\$ be the expected frequencies for the adjusted table
• \$\hat{N}_{ij}\$ be the fitted values for the adjusted table

We need to fit a model (see Little & Wu 1991 in JASA):

\$\$\log \left( \frac{N_{ij}}{n_{ij}} \right) = \lambda + \lambda^1_i + \lambda^2_j\$\$

thus we have

\$\$\log \hat{N}_{ij} - \log n_{ij} = \hat{\lambda} + \hat{\lambda}^1_i + \hat{\lambda}^2_j\$\$

where \$\log n_{ij}\$ is the mentioned offset.

You can estimate it with any GLM software by

1. Create an artificial table of \$N_{ij}\$ that satisfies independence and has the desired marginals.
2. Fit a main-effects log-linear/Poisson model to \$N_{ij}\$ with the \$n_{ij}\$ (observed frequencies, `sample1`) as an offset.
3. Get the fitted values.

For example, this will get you the target frequencies `f`:

```r
# Your data
sample1 <- structure(c(6L, 14L, 46L, 16L, 6L, 21L, 62L, 169L, 327L, 174L,
                       44L, 72L, 43L, 100L, 186L, 72L, 23L, 42L),
                     .Dim = c(6L, 3L),
                     .Dimnames = list(c("Primary", "Lowersec", "Highersec",
                                        "Highershort", "Higherlong", "University"),
                                      c("B", "F", "W")))

sample2 <- structure(c(171796L, 168191L, 240671L, 69168L, 60079L, 168169L,
                       954045L, 1040981L, 1872732L, 726410L, 207366L, 425786L,
                       596239L, 604826L, 991640L, 323215L, 134066L, 221696L),
                     .Dim = c(6L, 3L),
                     .Dimnames = list(c("Primary", "Lowersec", "Highersec",
                                        "Highershort", "Higherlong", "University"),
                                      c("B", "F", "W")))

library(dplyr)

# Turn to a data frame
d1 <- as_data_frame(as.table(sample1), stringsAsFactors = FALSE)

# Create artificial freqs based on sample2 and join with d1
N <- sum(sample2)
d <- outer(rowSums(sample2)/N, colSums(sample2)/N) %>%
  as.table() %>%
  as_data_frame() %>%
  mutate(
    p = n / sum(n),
    N = round(p * sum(sample2))
  ) %>%
  select(Var1, Var2, p, N) %>%
  left_join(d1)
#> Joining, by = c("Var1", "Var2")

# Fit the model
mod <- glm(N ~ Var1 + Var2 + offset(log(n)), data = d, family = poisson("log"))

# Get the fitted values
d$f <- predict(mod, type = "response")
d
#> # A tibble: 18 x 6
#>           Var1  Var2           p       N     n          f
#>          <chr> <chr>       <dbl>   <dbl> <int>      <dbl>
#> 1      Primary     B 0.018763534  168442     6  124197.33
#> 2     Lowersec     B 0.019765059  177432    14  119743.66
#> 3    Highersec     B 0.033832098  303713    46  336937.75
#> 4  Highershort     B 0.012190206  109432    16   90514.26
#> 5   Higherlong     B 0.004374806   39273     6   43486.57
#> 6   University     B 0.008887215   79781    21  163193.43
#> 7      Primary     F 0.111702426 1002761    62  960181.53
#> 8     Lowersec     F 0.117664672 1056285   169 1081463.51
#> 9    Highersec     F 0.201408086 1808056   327 1792009.27
#> 10 Highershort     F 0.072570318  651469   174  736456.24
#> 11  Higherlong     F 0.026043943  233798    44  238592.76
#> 12  University     F 0.052907063  474951    72  418616.69
#> 13     Primary     W 0.061364877  550877    43  637701.15
#> 14    Lowersec     W 0.064640297  580281   100  612790.82
#> 15   Highersec     W 0.110645603  993274   186  976095.99
#> 16 Highershort     W 0.039867250  357891    72  291821.50
#> 17  Higherlong     W 0.014307508  128440    23  119431.67
#> 18  University     W 0.029065039  260919    42  233840.87
```
