The mission
I am trying to find a way to do Iterative Proportional Fitting in R. The logic of the procedure is like this: one has a table with e.g. sample distribution of some variables. Let us say it is this one:
sample1 <- structure(c(6L, 14L, 46L, 16L, 6L, 21L, 62L, 169L, 327L, 174L, 44L, 72L, 43L, 100L, 186L, 72L, 23L, 42L), .Dim = c(6L, 3L), .Dimnames = list( c("Primary", "Lowersec", "Highersec", "Highershort", "Higherlong", "University"), c("B", "F", "W")))
Another table is from some other source, say another sample:
sample2 <- structure(c(171796L, 168191L, 240671L, 69168L, 60079L, 168169L, 954045L, 1040981L, 1872732L, 726410L, 207366L, 425786L, 596239L, 604826L, 991640L, 323215L, 134066L, 221696L), .Dim = c(6L, 3L ), .Dimnames = list(c("Primary", "Lowerse", "Highersec", "Highershort", "Higherlong", "University"), c("B", "F", "W")))
Now, we want to preserve the relationships between variables found in sample1
, but we want to apply these relationships to the marginal distribution we find in sample2
. Iterative Proportional Fitting does this as is described here (I could not possibly offer a better explanation). I have tried doing it in LEM, with the following result:
B F W Primary 124204.64 960173.6 637701.7 Lowerse 119749.12 1081459.0 612789.9 Highersec 336934.21 1792001.6 976107.2 Highershort 90512.27 736464.1 291816.6 Higherlong 43486.91 238593.0 119431.0 University 163186.85 418628.6 233835.5
I am not 100% sure that this result is, but chances are (say 99%) that it is. The odds ratios from the first table are preserved in the resulting table, while the marginal distributions (row and column sums) are identical to the second input table.
The problem
Strangely, this quite useful algorithm is not readily available in R, at least not in a user-friendly form. One function that is likely to be relevant is cat::ipf()
. However, I cannot figure out how to use the margins=
argument. I am certainly not alone in this question. The help example uses a 3-dimensional table, which makes things even more confusing.
In addition, there exist some user-written functions, one is here and the other is to be found here. The first one gives an erroneous result, unfortunately. The second one is also very non-transparent, requiring specifically pre-formatted CSV files as input, instead of R matrix objects.
The question
- Can anyone please explain how to actually use the
cat::ipf()
function? - Are there any alternative functions to achieve the IPF adjustment task, using matrices as input?
- (SOLVED) Can this function be fixed to deliver a proper result?
Thank you.
ADDENDUM: I was able to get a proper output from the function in (3). After some consideration in turned out that the function does not accept matrix as input for marginal distribution, but simply a list of these marginal distributions. So practically the question is solved. However, a proper answer for 1 and 2 would be of use to the larger community, since IPF is quite essential in loglinear models.
Also reproducing a working IPF function here seems like a good idea, for those who search in the future:
ipf <- function(Margins_, seedAry, maxiter=100, closure=0.001) { #Check to see if the sum of each margin is equal MarginSums. <- unlist(lapply(Margins_, sum)) if(any(MarginSums. != MarginSums.[1])) warning("sum of each margin not equal") #Replace margin values of zero with 0.001 Margins_ <- lapply(Margins_, function(x) { if(any(x == 0)) warning("zeros in marginsMtx replaced with 0.001") x[x == 0] <- 0.001 x }) #Check to see if number of dimensions in seed array equals the number of #margins specified in the marginsMtx numMargins <- length(dim(seedAry)) if(length(Margins_) != numMargins) { stop("number of margins in marginsMtx not equal to number of margins in seedAry") } #Set initial values resultAry <- seedAry iter <- 0 marginChecks <- rep(1, numMargins) margins <- seq(1, numMargins) #Iteratively proportion margins until closure or iteration criteria are met while((any(marginChecks > closure)) & (iter < maxiter)) { for(margin in margins) { marginTotal <- apply(resultAry, margin, sum) marginCoeff <- Margins_[[margin]]/marginTotal marginCoeff[is.infinite(marginCoeff)] <- 0 resultAry <- sweep(resultAry, margin, marginCoeff, "*") marginChecks[margin] <- sum(abs(1 - marginCoeff)) } iter <- iter + 1 } #If IPF stopped due to number of iterations then output info if(iter == maxiter) cat("IPF stopped due to number of iterationsn") #Return balanced array resultAry }
Note that this function will not accept a matrix for marginal distribution. The following will help:
m1 <- rowSums(sample2) m2 <- colSums(sample2) m <- list(m1,m2)
And then supply m
as the first argument.
Best Answer
This is old, but here we go:
As @Henrico wrote, it seems that what are you trying to achieve is indeed raking. Instead of using survey::rake
you might fit the distribution to the marginals "by hand" using a Poisson GLM, as suggested by @DWin. To get the right frequencies you need to use an offset.
Let
- $n_{ij}$ be your
sample1
- $N_{ij}$ the expected frequencies for the adjusted table
- $hat{N}_{ij}$ the fitted values for the adjusted table
We need to fit a model (see Little & Wu 1991 in JASA):
$$log left( frac{N_{ij}}{n_{ij}} right) = lambda + lambda^1_i + lambda^2_j$$
thus we have
$$log hat{N}_{ij} – log n_{ij} = hat{lambda} + hat{lambda^1_i} + hat{lambda^2_j}$$
where $log n_{ij}$ is the mentioned offset.
You can estimate it with any GLM software by
- Creating an artificial table of $N_{ij}$ that will satisfy independence and have the desired marginals.
- Fit a main effects log-linear/Poisson model to $N_{ij}$ with $n_{ij}$s (observed frequencies,
sample1
) as an offset. - Get the fitted values.
For example, this will get you the target frequencies f
:
# Your data sample1 <- structure(c(6L, 14L, 46L, 16L, 6L, 21L, 62L, 169L, 327L, 174L, 44L, 72L, 43L, 100L, 186L, 72L, 23L, 42L), .Dim = c(6L, 3L), .Dimnames = list( c("Primary", "Lowersec", "Highersec", "Highershort", "Higherlong", "University"), c("B", "F", "W"))) sample2 <- structure(c(171796L, 168191L, 240671L, 69168L, 60079L, 168169L, 954045L, 1040981L, 1872732L, 726410L, 207366L, 425786L, 596239L, 604826L, 991640L, 323215L, 134066L, 221696L), .Dim = c(6L, 3L ), .Dimnames = list(c("Primary", "Lowersec", "Highersec", "Highershort", "Higherlong", "University"), c("B", "F", "W"))) library(dplyr) # Turn to a data frame d1 <- as_data_frame( as.table(sample1), stringsAsFactors = FALSE) # Create artificial freqs based on sample2 and join with d1 N <- sum(sample2) d <- outer(rowSums(sample2)/N, colSums(sample2)/N) %>% as.table() %>% as_data_frame() %>% mutate( p = n / sum(n), N = round(p * sum(sample2)) ) %>% select(Var1, Var2, p, N) %>% left_join(d1) #> Joining, by = c("Var1", "Var2") # Fit the model mod <- glm( N ~ Var1 + Var2 + offset(log(n)), data=d, family=poisson("log") ) # Get the fitted values d$f <- predict(mod, type="response") d #> # A tibble: 18 x 6 #> Var1 Var2 p N n f #> <chr> <chr> <dbl> <dbl> <int> <dbl> #> 1 Primary B 0.018763534 168442 6 124197.33 #> 2 Lowersec B 0.019765059 177432 14 119743.66 #> 3 Highersec B 0.033832098 303713 46 336937.75 #> 4 Highershort B 0.012190206 109432 16 90514.26 #> 5 Higherlong B 0.004374806 39273 6 43486.57 #> 6 University B 0.008887215 79781 21 163193.43 #> 7 Primary F 0.111702426 1002761 62 960181.53 #> 8 Lowersec F 0.117664672 1056285 169 1081463.51 #> 9 Highersec F 0.201408086 1808056 327 1792009.27 #> 10 Highershort F 0.072570318 651469 174 736456.24 #> 11 Higherlong F 0.026043943 233798 44 238592.76 #> 12 University F 0.052907063 474951 72 418616.69 #> 13 Primary W 0.061364877 550877 43 637701.15 #> 14 Lowersec W 0.064640297 580281 100 612790.82 #> 15 Highersec W 0.110645603 993274 186 976095.99 #> 16 Highershort W 0.039867250 357891 72 291821.50 #> 17 Higherlong W 0.014307508 128440 23 119431.67 #> 18 University W 0.029065039 260919 42 233840.87
Similar Posts:
- Solved – What does “Conditioning on the margins of ____” mean
- Solved – How to interpret the marginal effect of a dumthe regressors in a logit model
- Solved – Exact test for m x n contingency table conditional (i.e. fixed by design) on one margin
- Solved – Iterative proportional fitting and independent variables
- Solved – What are the decision values at the margins in an SVM model