I recently posted a question with many parts and I'd like to focus in on just one issue that I didn't emphasize in the original post.
My data is a list of records, each one representing an educational seminar event. I have a continuous variable that represents the revenue brought in by each seminar, which is the response variable in my regression. I also have a number of categorical variables which are acting as factors/IVs.
One of those categorical factors is the speaker hosting the event. The trouble is that sometimes more than one speaker hosts a particular event. To date, all our speakers have been drawn from a pool of 154. Most of the time, just one speaker is used, but in about 10% of the data points, two, three or even four speakers were used. Currently, this is represented in my data with slashes ("Speaker One / Speaker Two / Speaker Three"). I've written a Python script that can find the average revenue over a given date interval for seminars that take a given level of a categorical variable (for example, it could return the average revenue for all seminars in 2008 for which Speaker One was the host). My script reads the multiple-speaker format fine, treating names on opposite sides of a " / " as separate speakers.
Unfortunately, R doesn't seem to be able to do anything like that. I've run a multiple regression on my data and obviously it treats "Speaker One", "Speaker Two" and "Speaker One / Speaker Two" as three different speakers. My multiple R-squared value is less than 0.5, so I'm hoping that resolving this issue would help improve the model. How best to proceed?
Two models come to mind: the revenue may have a contribution from the presence of each speaker or it may have a contribution from the speaker's presence, weighted by their participation in the event. In either case the coding would be similar: to each speaker corresponds a variable that is zero when the speaker is not involved and is nonzero when they are. In the first model the variable's value would be one whenever a speaker is involved in an event and zero otherwise. In the second model, those ones might be reweighted a priori.
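As a rough sketch of that coding, the slash-separated speaker strings can be split into one column per speaker. The data frame `seminars`, its column names, and the revenue figures below are made up for illustration, and the equal-share weighting in the second matrix is only one possible a priori choice.

```r
# Hypothetical example data: the `speaker` column uses the " / " format
# described in the question; names and revenues are invented.
seminars <- data.frame(
  revenue = c(1200, 950, 1800, 700),
  speaker = c("Speaker One",
              "Speaker Two",
              "Speaker One / Speaker Two",
              "Speaker Two / Speaker Three / Speaker One"),
  stringsAsFactors = FALSE
)

# Split each entry on the slash and trim the surrounding whitespace.
speaker.lists <- lapply(strsplit(seminars$speaker, "/", fixed = TRUE), trimws)
all.speakers <- sort(unique(unlist(speaker.lists)))

# First model: one 0/1 column per speaker (1 = speaker took part in the event).
presence <- t(sapply(speaker.lists, function(s) as.numeric(all.speakers %in% s)))
colnames(presence) <- make.names(all.speakers)

# Second model: the same columns, reweighted so that each row sums to 1
# (equal shares are an assumption; other a priori weights could be used).
share <- presence / rowSums(presence)
```

Either matrix can then be bound to the other predictors with `cbind()` or `data.frame()` before fitting.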
You can try several models with several weighting schemes to see what might work the best: after all, this problem has somewhat of an exploratory nature to it.
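One way such a comparison might look, on simulated data only: the sketch below fits the presence coding and an equal-share coding to the same made-up response and compares them by AIC. The number of speakers, the effect sizes, and the choice of AIC as the criterion are all assumptions for illustration.

```r
set.seed(1)
n <- 200
k <- 5   # hypothetical number of speakers

# Simulated 0/1 presence matrix: each event has one or two speakers.
presence <- t(sapply(1:n, function(j) {
  row <- numeric(k)
  row[sample(k, sample(1:2, 1))] <- 1
  row
}))
colnames(presence) <- sprintf("Speaker%d", 1:k)

# Equal-share weights: each row sums to 1.
share <- presence / rowSums(presence)

# Made-up additive speaker effects plus noise.
revenue <- as.vector(presence %*% c(5, 3, 1, 4, 2)) + rnorm(n)

fit.presence <- lm(revenue ~ ., data = data.frame(revenue, presence))
# The shares sum to 1 in every row, so the intercept is dropped here
# to avoid perfect collinearity with the speaker columns.
fit.share <- lm(revenue ~ . - 1, data = data.frame(revenue, share))

# Lower AIC indicates the better trade-off between fit and complexity.
c(presence = AIC(fit.presence), share = AIC(fit.share))
```

On real data, cross-validated prediction error would be a reasonable alternative to AIC.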
That leaves us the practical issue of coding the models. Creating one column for each speaker is straightforward, but the concerns expressed in comments have to do with the length and complexity of the resulting formula expressions. Fortunately, formulas can be created dynamically. Here is an illustration. First, let's create some simulated data. In each row two or three speakers are involved (almost always three):
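```r
set.seed(17)
n.records <- 1000
n.speakers <- 154
# Pool of three 1s and n.speakers-2 zeros (155 values in all)
i <- c(rep(1,3), rep(0, n.speakers-2))
# Each column draws n.speakers values from the pool without replacement,
# so it contains two or three 1s
x.matrix <- sapply(1:n.records, function(j) sample(i, n.speakers))
# Transpose so that rows are records and columns are speakers
x <- as.data.frame(t(x.matrix))
```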
Let's name the columns "Speaker1", "Speaker2", etc (and retain this list of names for later):
```r
colnames(x) <- colnames <- lapply(1:n.speakers, function(i) sprintf("Speaker%d",i))
```
Throw in a response variable:
```r
x$y <- rnorm(n.records)
```
Let's see how this response depends on the speaker data. To do this, we create a formula from the column names we retained earlier:
```r
# Build "y ~ Speaker1+Speaker2+..." from the stored column names
formula <- as.formula(paste("y ~", paste(colnames, collapse="+")))
fit <- lm(formula, data=x)
summary(fit)
```
There's no problem: R handles a formula of this length with aplomb. Extending this formula to include other variables is simple; e.g., hard-code the remainder and paste it on to the end of this computed formula.
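A minimal sketch of that extension, on a small simulated data frame: the `topic` factor and the three speaker columns are hypothetical stand-ins for whatever other variables the real data contain.

```r
set.seed(17)
n <- 50
# Hypothetical data: three speaker indicators plus one other factor.
d <- data.frame(
  Speaker1 = rbinom(n, 1, 0.5),
  Speaker2 = rbinom(n, 1, 0.5),
  Speaker3 = rbinom(n, 1, 0.5),
  topic    = factor(sample(c("A", "B"), n, replace = TRUE)),
  y        = rnorm(n)
)
speaker.names <- grep("^Speaker", names(d), value = TRUE)

# Computed part of the formula, with the hard-coded remainder pasted on.
speaker.terms <- paste(speaker.names, collapse = " + ")
formula.full <- as.formula(paste("y ~", speaker.terms, "+ topic"))
fit.full <- lm(formula.full, data = d)

# reformulate() from the stats package builds an equivalent formula
# directly from a vector of term labels.
formula.alt <- reformulate(c(speaker.names, "topic"), response = "y")
```

Pasting strings keeps the hard-coded part readable, while `reformulate()` avoids string handling when every term is already available as a name.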