I have two datasets consisting of metrics from several experiments. Dataset 1 is the collection of results of experiments E performed by user A on product A, repeated N times. Dataset 2 is the collection of results of the same experiments E performed by the same user A on product B, repeated the same N times.
N is small (typically around 15-20) and cannot be made larger due to practical limitations. The data CANNOT be assumed to be Gaussian. In some cases it is known to be non-normal, and in others the distribution is simply unknown. The only definite information we have is that the metrics cannot be negative.
Using this data, how do we compare product A and product B and give a result with some statistical significance? How do we devise a hypothesis test to check if A > B with some x% confidence?
You could do a permutation/randomization test.
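To make the idea concrete, here is a minimal sketch of a one-sided Monte Carlo permutation test in Python, assuming the two datasets are independent samples and that "A > B" means the mean metric of A exceeds that of B. The function name, the choice of the mean as test statistic, the number of resamples, and the sample numbers are all assumptions for illustration, not something from the question.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test_greater(a, b, n_resamples=10_000, seed=0):
    """One-sided Monte Carlo permutation test of H1: mean(a) > mean(b).

    Returns (observed difference in means, estimated p-value).
    """
    rng = random.Random(seed)
    observed = _mean(a) - _mean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_resamples):
        # Under H0 the product labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = _mean(pooled[:n_a]) - _mean(pooled[n_a:])
        # Small tolerance guards against floating-point round-off.
        if diff >= observed - 1e-12:
            hits += 1
    # Count the observed labelling itself so the estimate is never exactly 0.
    return observed, (hits + 1) / (n_resamples + 1)


# Made-up illustrative numbers, not real measurements.
product_a = [12.1, 15.3, 14.8, 13.9, 16.2, 15.0, 14.1, 13.7]
product_b = [11.0, 12.4, 13.1, 10.9, 12.8, 11.7, 12.2, 11.5]

observed, p_value = permutation_test_greater(product_a, product_b)
print(f"observed difference = {observed:.3f}, one-sided p = {p_value:.4f}")
```

You would reject "A is not better than B" at level alpha when the p-value falls below alpha. No distributional assumption is needed beyond exchangeability of the labels under the null. If the runs are naturally paired (same experiment, same user), a paired variant that randomly flips the sign of each within-pair difference may be more appropriate.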
The Wilcoxon rank-sum test may answer your question as well, although a permutation test is probably closer to what you want.
In R there is perm.test in the exactRankTests package, which seems made for your problem.
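For the rank-sum alternative, a rough sketch in Python might look like the following. It computes the Mann-Whitney U statistic from midranks and a one-sided p-value from the normal approximation with a continuity correction. The function names and sample numbers are assumptions for illustration, and no tie correction is applied to the variance, so treat the p-value as approximate when many values are tied.

```python
from math import sqrt
from statistics import NormalDist


def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_greater(a, b):
    """One-sided rank-sum test of H1: values in a tend to exceed those in b.

    Returns (U statistic for a, approximate p-value).
    Assumption: normal approximation, no tie correction.
    """
    n_a, n_b = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    w = sum(ranks[:n_a])              # rank sum of sample a
    u = w - n_a * (n_a + 1) / 2       # Mann-Whitney U for sample a
    mu = n_a * n_b / 2
    sigma = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mu - 0.5) / sigma        # continuity correction
    return u, 1 - NormalDist().cdf(z)


# Made-up illustrative numbers, not real measurements.
product_a = [12.1, 15.3, 14.8, 13.9, 16.2, 15.0, 14.1, 13.7]
product_b = [11.0, 12.4, 13.1, 10.9, 12.8, 11.7, 12.2, 11.5]

u_stat, p_value = rank_sum_greater(product_a, product_b)
print(f"U = {u_stat}, approximate one-sided p = {p_value:.4f}")
```

Because it uses only ranks, this test is insensitive to outliers and to the shape of the distribution, but it addresses "A tends to give larger values than B" rather than a difference in means.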