Solved – Assuming two Gaussian distributions of equal mean and variance, how different can we expect the top X members of each group to be?

Here's the thread I got the idea from: http://www.quora.com/Do-men-have-a-wider-variance-of-intelligence-than-women/answer/Ed-Yong

Basically, this is a model that might be able to explain why there aren't more females in prestigious math/science competitions – it might be a statistical artifact arising from the simple fact that there are far more males than females in math/science. If this model applies, then we may not need to assume that male intelligence has higher variance than female intelligence.

The question I'd like to see addressed: if we assume equal means and equal variances (but different sample sizes), is the model in the paper still the best model for predicting, say, the gender composition of a team of the 5-10 best players, rather than just the gender of the single top-ranked grandmaster?

http://rspb.royalsocietypublishing.org/content/276/1659/1161.full#sec-3 has the diagram and use of the model

They basically paired the top 100 males with the top 100 females, rank for rank. Is that a valid approach, though? It works for grandmasters – that's true – but would it work if we're trying to select the top 10 people in any field? It's entirely possible, after all, that the expected distributions would look different if we selected the top 5 players of each gender as a group, rather than comparing the n-th ranked player of each gender.
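
For concreteness, here's a rough Monte Carlo sketch (my own, not the paper's code) of that rank-for-rank pairing under the equal-mean, equal-variance assumption: two standard-normal samples with an arbitrary 10:1 size ratio, comparing the k-th best of each.

import numpy as np

rng = np.random.default_rng(0)
n_small, n_large = 1000, 10000   # illustrative sizes, 10:1 ratio
nmonte, k = 500, 100             # Monte Carlo repeats; how far down the ranking to look

gaps = np.zeros(k)
for _ in range(nmonte):
    top_small = np.sort(rng.standard_normal(n_small))[-k:][::-1]   # ranks 1..k, descending
    top_large = np.sort(rng.standard_normal(n_large))[-k:][::-1]
    gaps += top_large - top_small
gaps /= nmonte

for rank in (1, 10, 100):
    print("rank %3d: mean gap (larger minus smaller group) = %.2f" % (rank, gaps[rank - 1]))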

As you increase the number of players you select for a "winning" team, the distributions may play out differently. I would expect the mean of the selected players to have higher variance in the smaller group than in the larger group. We know that holds when averaging over an entire sample (the variance of a sample mean falls off as σ²/n, which is the central limit theorem setting). But what if we only take the top 10 people from each population instead? And in practice a lot of "potentially" top people will drop out, because they would rather do something other than practice for hours a day to make a "winning team".
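
As a rough illustration (the sizes 100 vs. 1000 and the 10-member "team" are my own arbitrary choices), here's how much the mean of the top 10 bounces around from sample to sample for a small versus a large group drawn from the same standard normal:

import numpy as np

rng = np.random.default_rng(0)
nmonte, team = 5000, 10

def top_team_mean(n):
    # mean of the top `team` values out of n standard Gaussians, repeated nmonte times
    draws = rng.standard_normal((nmonte, n))
    return np.sort(draws, axis=1)[:, -team:].mean(axis=1)

small = top_team_mean(100)
large = top_team_mean(1000)
print("top-10 mean of  100 draws: mean %.2f, sd %.2f" % (small.mean(), small.std()))
print("top-10 mean of 1000 draws: mean %.2f, sd %.2f" % (large.mean(), large.std()))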

High variability of the extreme value, though – that makes sense if we're talking about the very top. In a large population the extreme value is going to be fairly consistent, whereas in a small population the extreme value will have a lot of variability – and because the distribution of the extreme value is right-skewed, it spends more time below its mean than above it. So in a head-to-head matchup held most years, the population with the larger sample size will usually win.
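
That head-to-head claim is easy to check. By symmetry, each of the 1100 combined draws is equally likely to be the overall maximum, so the best of 1000 should beat the best of 100 about 1000/1100 ≈ 91% of the time; a quick simulation (my own sketch, standard normals) agrees:

import numpy as np

rng = np.random.default_rng(0)
nmonte = 10000
best_small = rng.standard_normal((nmonte, 100)).max(axis=1)    # best of the smaller group
best_large = rng.standard_normal((nmonte, 1000)).max(axis=1)   # best of the larger group
print("P(best of 1000 beats best of 100) ~ %.2f" % (best_large > best_small).mean())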

The thing is, what about a head-to-head matchup of the top 10 members of each distribution? That should land somewhere between the model the paper used (rank-for-rank, 1-to-1 matchups) and a matchup of the two entire populations against each other.
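
Here's a sketch of that team matchup (again with my own arbitrary sizes, 100 vs. 1000, and scoring each round by the total of the top 10):

import numpy as np

rng = np.random.default_rng(0)
nmonte, team = 10000, 10
top_small = np.sort(rng.standard_normal((nmonte, 100)), axis=1)[:, -team:]
top_large = np.sort(rng.standard_normal((nmonte, 1000)), axis=1)[:, -team:]
win_rate = (top_large.sum(axis=1) > top_small.sum(axis=1)).mean()
print("P(top-10 of 1000 outscores top-10 of 100) ~ %.2f" % win_rate)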

Let's look at the top 3 of 100 Gaussians vs. the top 3 of 1000.
Real statisticians will give formulas for this and more; for the rest of us, here's a little Monte Carlo. The intent of the code is to give a rough idea of the distributions of $X_{(N-2)}, X_{(N-1)}, X_{(N)}$; running it gives

# top 3 of  100 Gaussians, medians: [[ 2.   2.1  2.4]]
# top 3 of 1000 Gaussians, medians: [[ 2.8  2.9  3.2]]

If someone could do this in R with rug plots, that would certainly be clearer.

#!/usr/bin/env python
# Monte Carlo the top 3 of 100 / of 1000 Gaussians
# top 3 of  100 Gaussians, medians: [[ 2.   2.1  2.4]]
# top 3 of 1000 Gaussians, medians: [[ 2.8  2.9  3.2]]
# http://stats.stackexchange.com/questions/12647/assuming-two-gaussian-distributions-of-equal-mean-and-variance-then-how-differen
# cf. Wikipedia World_record_progression_100_metres_men / women

import sys
import numpy as np

top = 3           # how many top order statistics to keep
Nx = 100          # size of the smaller sample
Ny = 1000         # size of the larger sample
nmonte = 100      # number of Monte Carlo repetitions
percentiles = [50]
seed = 1
exec("\n".join(sys.argv[1:]))  # command-line overrides: run this.py "top=5" ...

np.set_printoptions(precision=1)
np.random.seed(seed)
print("Monte Carlo the top %d of many Gaussians:" % top)

# sample Nx / Ny standard Gaussians, nmonte times --
X = np.random.normal(size=(nmonte, Nx))
Y = np.random.normal(size=(nmonte, Ny))

# top 3 or so of each sample --
Xtop = np.sort(X, axis=1)[:, -top:]
Ytop = np.sort(Y, axis=1)[:, -top:]

# medians across the nmonte runs (any percentiles, but how to display?) --
Xp = np.array(np.percentile(Xtop, percentiles, axis=0))
Yp = np.array(np.percentile(Ytop, percentiles, axis=0))
print("top %d of %4d Gaussians, medians: %s" % (top, Nx, Xp))
print("top %d of %4d Gaussians, medians: %s" % (top, Ny, Yp))
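
I can't offer the R version, but here's a matplotlib sketch in the same spirit as rug plots, drawing a tick mark for each simulated top-3 value at the two sample sizes:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
top, nmonte = 3, 100
Xtop = np.sort(rng.standard_normal((nmonte, 100)), axis=1)[:, -top:]
Ytop = np.sort(rng.standard_normal((nmonte, 1000)), axis=1)[:, -top:]

fig, ax = plt.subplots(figsize=(8, 2))
ax.eventplot([Xtop.ravel(), Ytop.ravel()], lineoffsets=[1, 2], linelengths=0.8)
ax.set_yticks([1, 2])
ax.set_yticklabels(["top 3 of 100", "top 3 of 1000"])
ax.set_xlabel("value")
plt.tight_layout()
plt.show()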
