This relates to a previous question of mine which didn't gain many responses, perhaps because it wasn't very clear nd well written. I hope this time I will be more accurate and get your much appreciated assistance.
I am analyzing results of a biological experiment. The results given as a single value ( non-negative integer) per genomic position. I am interested in valleys, or local minimam over this series of values.
I wish to control the false positives rate and get the significance for each local minima. I can shuffle the raw data which was used to produce the data.
So what I do is to shuffle the raw data, create the new series of values,search for all local minima and keep their values.
Now, I have something like this:
data_set local_minima_values ============================= true_data 4 9 1 27 12 0 0 2 5 32 0 1 5 70 2 sim_1 14 25 94 59 32 sim_2 52 0 14 74 82 12 54 ...
Note the number of local minima naturally varies between simulations.
So, my idea was to calculate an ECDF for each simulation and then combine those ECDFs into a single "average ECDF" which represents the null hypothesis. Then, I can assign a p-value for each local minima from the true data, and see how significant ('surprising') it is.
My questions are:
- Does this make sense?
- How do I create an average ECDF? I can't just merge the values from all simulation together and get and ECDF for this merged set, since the number of minima found in each simulation differs, and I think all simulations should have the same contribution to the average ECDF, or am I wrong?
- How should I take the number of simulations (shuffles) into account?
Thanks,
Dave
p.s. I'm working with R.
Best Answer
To average the ECDFs, I'd do something like:
impute_resolution = 1e3 values_to_impute = seq( min(my_data$true_data) , max(my_data$true_data) , length.out = impute_resoluton ) ecdfs = matrix(NA,nrow=length(values_to_impute)) for(i in 1:(ncol(my_data)-1)){ #assumes column 1 is true_data this_ecdf = ecdf(my_data[,i+1]) ecdfs[i,] = this_ecdf(values_to_impute) } mean_ecdf = colMeans(ecdfs) plot( x = values_to_impute , y = mean_ecdf , type = 'l' )
Similar Posts:
- Solved – Estimating the area between two ecdfs
- Solved – Neural networks – are local minima bad
- Solved – Strange behavior of R function ecdf
- Solved – Expected number I will be on after drawing cards until I get an ace, 2, 3, and so forth
- Solved – How to compare observed distribution to sampled distributions, using cumulative distibution function