We are conducting research on neighborhood mail response behavior, i.e. what percentage of people in a neighborhood reply to a piece of mail.
Based on regression analysis, we know which factors (% black, % poor, etc.) influence mail response rates. I’m toying with the idea of using the significant variables from the regression model to construct clusters that could inform outreach/advertising in these different neighborhoods. In other words, clusters would help us identify what combination of reasons leads to different response rates in different areas.
How can this be done? I want the clustering to be informed by the mail response rates. Should I just include response rates as one of the clustering variables? Or is there a way to include response rates as a dependent variable? The clustering techniques I am familiar with are unsupervised, without a dependent variable.
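For concreteness, here is a minimal sketch of the first option above: treating the response rate as just one more clustering variable. It is an illustration, not a recommendation. The data are synthetic, the variable names (`pct_poor`, `pct_renters`, `response_rate`) and the choice of k = 3 are assumptions made for the example, and k-means is written out in plain Python only so that the snippet runs without extra libraries.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so no single variable dominates the distances."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [mean(dim) for dim in zip(*members)]
    return labels


# Synthetic neighborhoods (made-up numbers, for illustration only).
rng = random.Random(42)
neighborhoods = []
for _ in range(200):
    pct_poor = rng.uniform(0, 60)
    pct_renters = rng.uniform(10, 90)
    response_rate = 80 - 0.5 * pct_poor - 0.2 * pct_renters + rng.gauss(0, 3)
    neighborhoods.append([pct_poor, pct_renters, response_rate])

# The response rate is simply the third column fed to the clustering.
labels = kmeans(standardize(neighborhoods), k=3)

for j in range(3):
    members = [row for row, lab in zip(neighborhoods, labels) if lab == j]
    if members:
        avg = [round(mean(col), 1) for col in zip(*members)]
        print(f"cluster {j}: n={len(members)}, "
              f"mean [pct_poor, pct_renters, response_rate] = {avg}")
```

Note that in this setup the response rate gets no special status: after standardization it carries the same weight as each predictor, which is exactly why the question of a "dependent variable" in clustering arises.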
Trying to conclude "which factors influence mail response" from the regression has the same problem as inferring causation from correlation: it ignores any confounders that are not in the model.
You can find a much better approach to this exact kind of problem in a recent paper: "The Blessings of Multiple Causes" (Wang and Blei).