I have a total population of several hundred houses, which have different sizes, prices, and ages.

I want to select a random sample of a few dozen houses and test how representative it is of the entire portfolio. Critically, I want the comparison to cover all the variables at once, not just prices, for example. Is there a test that will give me a *p*-value indicating how confident I can be that my random sample is representative of the portfolio?

I'm sure this is probably a trivial question, but I'm a novice statistician, and was hoping for some advice on the best way to do this.

Thank you kindly.

#### Best Answer

You have several choices: multivariate analysis of variance (MANOVA), logistic regression, and discriminant analysis might all work. With any of these you could test how well the features (size, price, age, etc.) discriminate between the two groups of houses (your sample and the rest of the portfolio). You will obtain a *p*-value telling you how unusual the observed group differences are, given that both groups were drawn at random from the same population.
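
As a rough illustration of the MANOVA route: with only two groups, MANOVA reduces to Hotelling's two-sample *T*² test, which can be sketched with NumPy and SciPy. Everything about the data here is an assumption made up for the example (the simulated `houses` array, the variable order size, price, age, and the sample size of 30), and the test itself assumes roughly multivariate-normal variables with similar covariance in both groups, which skewed variables such as price may not satisfy.

```python
import numpy as np
from scipy import stats


def hotelling_t2(x, y):
    """Two-sample Hotelling's T^2 test with a pooled covariance matrix.

    x, y: 2-D arrays with one row per house and one column per variable.
    Returns (T^2, F statistic, p-value).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_x, p = x.shape
    n_y = y.shape[0]

    # Difference between the two group mean vectors
    diff = x.mean(axis=0) - y.mean(axis=0)

    # Pooled within-group covariance matrix
    pooled_cov = (
        (n_x - 1) * np.cov(x, rowvar=False) + (n_y - 1) * np.cov(y, rowvar=False)
    ) / (n_x + n_y - 2)

    t2 = (n_x * n_y) / (n_x + n_y) * diff @ np.linalg.solve(pooled_cov, diff)

    # Convert T^2 to an F statistic with (p, n_x + n_y - p - 1) degrees of freedom
    df2 = n_x + n_y - p - 1
    f_stat = t2 * df2 / (p * (n_x + n_y - 2))
    p_value = stats.f.sf(f_stat, p, df2)
    return float(t2), float(f_stat), float(p_value)


# Hypothetical portfolio: columns are size (m^2), price (thousands), age (years)
rng = np.random.default_rng(0)
n_houses = 400
houses = np.column_stack([
    rng.normal(150, 40, n_houses),
    rng.lognormal(5.5, 0.4, n_houses),
    rng.uniform(0, 80, n_houses),
])

# Draw a random sample of 30 houses and compare it with the remainder
in_sample = np.zeros(n_houses, dtype=bool)
in_sample[rng.choice(n_houses, size=30, replace=False)] = True

t2, f_stat, p_value = hotelling_t2(houses[in_sample], houses[~in_sample])
print(f"T^2 = {t2:.2f}, F = {f_stat:.2f}, p = {p_value:.3f}")
```

A large *p*-value here only says that the sample and the remainder are not detectably different in their mean vectors; it is not a direct measure of how representative the sample is.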

[EDIT] It's not as if that *p*-value will help you decide *whether* chance was at play: you'll know that it was, having sampled randomly. But in your case, the *p*-value will indirectly say something about the magnitude of the differences. The problem is that it is strongly affected by sample size, so you might be better off assessing the magnitude of those differences directly. That is, you can simply ask how large the group difference is in mean size, price, age, etc., probably in terms of an effect-size indicator such as Cohen's *d*, which expresses a mean difference relative to the within-group variability.
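
A minimal sketch of the effect-size approach, computing Cohen's *d* (mean difference divided by the pooled standard deviation) for each variable. The `portfolio` data are simulated purely for illustration, and comparing the sample against the remaining houses (rather than against the whole portfolio) is an assumption of this example.

```python
import numpy as np


def cohens_d(a, b):
    """Cohen's d for two 1-D arrays, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)
    ) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Hypothetical portfolio of 400 houses (values are made up)
rng = np.random.default_rng(0)
n_houses = 400
portfolio = {
    "size": rng.normal(150, 40, n_houses),
    "price": rng.lognormal(5.5, 0.4, n_houses),
    "age": rng.uniform(0, 80, n_houses),
}

# Random sample of 30 houses; the rest form the comparison group
in_sample = np.zeros(n_houses, dtype=bool)
in_sample[rng.choice(n_houses, size=30, replace=False)] = True

effect_sizes = {
    name: cohens_d(values[in_sample], values[~in_sample])
    for name, values in portfolio.items()
}
for name, d in effect_sizes.items():
    print(f"{name}: d = {d:+.2f}")
```

Values of *d* near zero on every variable suggest the sample's means sit close to those of the remaining houses, regardless of how many houses were sampled.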

You might also want to look into measures of similarity or of distance. There are many ways of characterizing distance in multivariate space, such as Euclidean distance on standardized variables or Mahalanobis distance.
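
One possible distance measure, offered only as an example of the idea, is the Mahalanobis distance between the sample's mean vector and the portfolio's mean vector, scaled by the portfolio's covariance matrix. The function name `mahalanobis_between_means` and the simulated data are assumptions for this sketch, not part of any particular library.

```python
import numpy as np


def mahalanobis_between_means(sample, population):
    """Mahalanobis distance from the sample mean vector to the population
    mean vector, using the population covariance matrix for scaling."""
    sample = np.asarray(sample, dtype=float)
    population = np.asarray(population, dtype=float)
    diff = sample.mean(axis=0) - population.mean(axis=0)
    cov = np.cov(population, rowvar=False)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


# Hypothetical portfolio: columns are size, price (thousands), age
rng = np.random.default_rng(0)
n_houses = 400
houses = np.column_stack([
    rng.normal(150, 40, n_houses),
    rng.lognormal(5.5, 0.4, n_houses),
    rng.uniform(0, 80, n_houses),
])

# Random sample of 30 houses drawn from the portfolio
sample = houses[rng.choice(n_houses, size=30, replace=False)]

distance = mahalanobis_between_means(sample, houses)
print(f"Mahalanobis distance between mean vectors: {distance:.3f}")
```

Because the distance is expressed in units of the portfolio's own variability and accounts for correlations between variables, it gives a single number summarizing how far the sample's averages sit from the portfolio's across all variables at once.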