Solved – Partial least squares regression for categorical factor in R

I am fitting a partial least squares regression to one categorical factor (2 levels – be or nottobe) with the pls package in R. I try to use the round() function on the predicted values to decide whether the result is the first or second level of my factor. Does this approach sound correct? … Read more
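
As a rough illustration of the rounding idea in the question, the sketch below maps numeric predictions to the two levels with an explicit 0.5 cut-off. It assumes the factor was coded 0/1 before fitting (an assumption, not something the post states), and the predicted values and the `classify` helper are hypothetical. An explicit threshold may be safer than plain rounding, because a linear method can predict values outside [0, 1] that would round to codes that are not valid levels.

```python
def classify(predictions, levels=("be", "nottobe"), threshold=0.5):
    """Map numeric predictions from a 0/1-coded regression to factor levels."""
    # Values below the threshold go to the first level (coded 0),
    # everything else goes to the second level (coded 1).
    return [levels[1] if p >= threshold else levels[0] for p in predictions]


# Hypothetical predicted values, including some outside the [0, 1] range.
preds = [-0.2, 0.31, 0.5, 0.74, 1.6]
labels = classify(preds)

# Plain rounding of the last value gives 2, which is not a valid 0/1 code.
rounded_last = round(preds[-1])
```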

Solved – Overall significance of a categorical variables in logistic regression

I have seen two approaches in binary logistic regression with categorical independent variables (IV) with more than two levels. In one approach, a reference category for the IV is defined and the rest of the categories are tested against this reference category, thus obtaining p-values for each category compared to the reference category (which is what … Read more

Solved – Regression with Lots of Categorical Variables

I'm facing a regression task with many categorical and few numeric features. I encoded them into dummies and removed the first dummy column for each feature. I am not getting a very good $R^2$ at all. I am wondering if, aside from creating dummies, there are any special strategies in these situations related to having so … Read more
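
The encoding step described in the question (one indicator column per level, with the first column dropped) can be sketched as follows. This is a minimal stdlib-only illustration, not the asker's code; the `dummy_encode` helper and the example feature are hypothetical, and "first" is taken here to mean first in sorted order.

```python
def dummy_encode(values, drop_first=True):
    """Return (column_names, rows) of 0/1 indicators for one categorical feature."""
    levels = sorted(set(values))
    # Dropping one level makes it the reference category and avoids
    # perfect collinearity with the intercept.
    kept = levels[1:] if drop_first else levels
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Hypothetical feature with three levels; "blue" becomes the reference level.
colours = ["red", "green", "blue", "green"]
names, rows = dummy_encode(colours)
```

A row of all zeros then stands for the dropped reference level.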

Solved – Dealing with a categorical variable that can take multiple levels simultaneously

I recently posted a question with many parts and I'd like to focus in on just one issue that I didn't emphasize in the original post. My data is a list of records, each one representing an educational seminar event. I have a continuous variable that represents the revenue brought in by each seminar, which … Read more
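
One common way to represent a categorical variable that can take several levels at once is a set of 0/1 indicator columns, one per level, where a record may have more than one column set. The sketch below is only an illustration of that idea; the `multi_hot` helper and the "topics" attached to each seminar are hypothetical and are not taken from the post.

```python
def multi_hot(records, levels=None):
    """Encode a list of level-sets as rows of 0/1 indicators, one column per level."""
    if levels is None:
        levels = sorted({lvl for rec in records for lvl in rec})
    rows = [[1 if lvl in rec else 0 for lvl in levels] for rec in records]
    return levels, rows


# Hypothetical seminars, each tagged with zero or more topics.
seminar_topics = [{"math", "science"}, {"science"}, set(), {"art", "math"}]
columns, indicators = multi_hot(seminar_topics)
```

Unlike ordinary dummy coding, no column needs to be dropped here, since the indicators do not sum to one per record.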

Solved – R caret package and dummy variables

I've been trying to run boosted regression tree modelling on spatial data using the caret package in R. My predictor variables were all extracted from raster files on the environment, e.g. soil type and landcover. Since these two latter variables are actually factors (but the codes are numeric), I have been creating dummy variables for … Read more

Solved – Correlation multinomial distribution

Problem 1.14 from Categorical Data Analysis, 2nd ed. For the multinomial distribution, show that $$\operatorname{corr}(n_j,n_k)=\frac{-\pi_j\pi_k}{\sqrt{\pi_j(1-\pi_j)\pi_k(1-\pi_k)}}$$ Show that $\operatorname{corr}(n_1,n_2)=-1$ when $c=2$. The multinomial density is $$p(n_1,n_2,\dots,n_{c-1})=\frac{n!}{n_1!\,\cdots\,n_c!}\pi_1^{n_1}\cdots\pi_c^{n_c}$$ Let $n_j=\sum_i y_{ij}$ where each $y_{ij}$ is Bernoulli with $E[y_{ij}\,y_{ik}]=0$ for $j\neq k$, $E[y_{ij}]=\pi_j$ and $E[y_{ik}]=\pi_k$. Then $\sum_j n_j=n$, with dimension $(c-1)$ since $n_c=n-(n_1+n_2+\,\dots\,+n_{c-1})$. So each $n_j\sim \operatorname{Bin}(n,\pi_j)$ $$\begin{cases}E[n_j]=n\pi_j\\ \operatorname{Var}(n_j)=n\pi_j(1-\pi_j)\end{cases}$$ then $$\operatorname{corr}(n_j,n_k)=\frac{-n\pi_j\pi_k}{\sqrt{n\pi_j(1-\pi_j)\,n\pi_k(1-\pi_k)}}=\frac{-\pi_j\pi_k}{\sqrt{\pi_j(1-\pi_j)\pi_k(1-\pi_k)}}.$$ Is that … Read more
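
The correlation formula in the problem can be checked numerically. The sketch below enumerates every outcome of a small multinomial, computes the exact correlation from the probability mass function, and compares it with the closed form. The function names and the probabilities used are illustrative choices, not part of the textbook problem.

```python
from math import factorial, sqrt


def compositions(n, c):
    """Yield every tuple of c non-negative integers summing to n."""
    if c == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, c - 1):
            yield (first,) + rest


def multinomial_pmf(counts, probs):
    """Probability of one vector of counts under a multinomial distribution."""
    coef = factorial(sum(counts))
    for count in counts:
        coef //= factorial(count)
    p = float(coef)
    for count, pi in zip(counts, probs):
        p *= pi ** count
    return p


def exact_corr(n, probs, j, k):
    """Correlation of n_j and n_k obtained by summing over all outcomes."""
    e_j = e_k = e_jj = e_kk = e_jk = 0.0
    for counts in compositions(n, len(probs)):
        p = multinomial_pmf(counts, probs)
        e_j += p * counts[j]
        e_k += p * counts[k]
        e_jj += p * counts[j] ** 2
        e_kk += p * counts[k] ** 2
        e_jk += p * counts[j] * counts[k]
    cov = e_jk - e_j * e_k
    return cov / sqrt((e_jj - e_j ** 2) * (e_kk - e_k ** 2))


def formula_corr(probs, j, k):
    """Closed-form correlation stated in the problem."""
    pj, pk = probs[j], probs[k]
    return -pj * pk / sqrt(pj * (1 - pj) * pk * (1 - pk))


# Illustrative check with three categories and n = 5 trials.
probs = [0.2, 0.3, 0.5]
difference = abs(exact_corr(5, probs, 0, 1) - formula_corr(probs, 0, 1))
```

With two categories, $\pi_2 = 1-\pi_1$, so the closed form reduces to $-1$, which `exact_corr(n, [p, 1 - p], 0, 1)` should reproduce up to floating-point error.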

Solved – Logistic regression on categorical data

I have a large dataset (around 2 million records and 300 features) with a lot of missing data. Most of the independent variables are categorical (some of these variables have more than 40 valid values). The outcome is either Y or N. The Y outcome is a rare event: around 98% of outcomes are N. I'm … Read more

Solved – How to develop a dataset for the gravity model of international Trade

The gravity model of international trade is used to estimate the determinants of bilateral trade between countries. In developing the dataset for the gravity model, do I need to manually pair the countries in terms of total value of exports/imports and the distance, or do I just need to enter them one by one with their … Read more

Solved – Calculating predicted values from categorical predictors in logistic regression

Context: I am working with an ordinal logistic model and trying to interpret/present the results. The model has two continuous predictors of interest, and a mix of continuous and categorical controls. I was hoping to graph the predicted likelihood of the top outcome (being accepted into a school) across multiple levels of my IVs of … Read more

Solved – Testing for contingency table with three variables

How does one draw conclusions from a contingency table with three variables? In the two-variable case, you can test for association through the independence test using Pearson's chi-squared statistic, but what happens with 3 variables? For example (hypothetical data set): I have gender (male/female), smoking (yes/no) and cancer (yes/no) data for a population. How does one find out if … Read more
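
One possible extension of the two-variable Pearson test is a test of mutual independence of all three variables, where the expected count in each cell is the product of the three one-way margins divided by $n^2$. The sketch below computes that statistic and its degrees of freedom. It is a minimal illustration, the counts are made up purely for demonstration, and mutual independence is only one of several hypotheses (joint or conditional independence are others) that can be examined for a three-way table.

```python
from itertools import product


def mutual_independence_chisq(table):
    """Pearson statistic and degrees of freedom for mutual independence.

    table[i][j][k] holds the count for levels (i, j, k) of the three variables.
    """
    I, J, K = len(table), len(table[0]), len(table[0][0])
    cells = list(product(range(I), range(J), range(K)))
    n = sum(table[i][j][k] for i, j, k in cells)
    # One-way marginal totals for each variable.
    row = [sum(table[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    col = [sum(table[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    lay = [sum(table[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    stat = 0.0
    for i, j, k in cells:
        expected = row[i] * col[j] * lay[k] / n ** 2
        stat += (table[i][j][k] - expected) ** 2 / expected
    df = I * J * K - I - J - K + 2
    return stat, df


# Made-up counts, indexed as [gender][smoking][cancer]; not real data.
counts = [[[20, 30], [10, 40]], [[25, 25], [5, 45]]]
stat, df = mutual_independence_chisq(counts)
```

The statistic would then be compared with a chi-squared distribution with `df` degrees of freedom.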