This is in the domain of website traffic.

Suppose I have two samples: a "pre"-survey taken before a change was made to a website and a "post"-survey taken after it. Because of the nature of website traffic data, it's impossible to get the same subjects to take both surveys.

Suppose also that the sample size of the pre-survey is around 1,800 and the post-survey's sample size is around 1,200. One question uses a 0-10 Likert scale asking how difficult or easy it was to perform a task (0 being most difficult, 10 being easiest), and I would like to know whether or not there was improvement from the pre-survey to the post-survey.

I am not familiar with working with Likert scales. But given my background (mathematical statistics is my forte), here are the concerns that come to mind:

- Measurement error is a really huge factor, especially given that there aren't concrete differences between individual responses on a 0-10 scale. There's not a concrete difference between, say, someone choosing a 2 over a 3. It's entirely plausible that someone could have done that depending on their mood, for example.
- The differing sample sizes are also a concern. The pre-survey has a sample size that is 50% larger than the post-survey's.

**What is a suitable metric for comparing these two outcomes?**

Here's what my thoughts are:

- Percents and mean comparisons are not suitable for this, particularly due to the varying sample sizes.
- Traditional hypothesis-testing $p$-values are not suitable for this either: with sample sizes this large, the $p$-value is going to be small anyway. (Also, as mentioned above, I don't think the mean is appropriate.)

I thought percentiles, based on the 0-10 scores, would be the most appropriate because these are not (explicitly) dependent on the sample size.

It may also be worth noting that the data are very skewed left, with around 1/3 of responses in both the pre-survey and post-survey responding with a 10.


#### Best Answer

I would conduct a $\chi^2$ goodness-of-fit test to assess whether the distribution of Likert scores differs before and after.

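In a conventional $\chi^2$ test, the test statistic is $\sum (\text{observed} - \text{expected})^2 / \text{expected}$. Take the $\text{expected}$ values from the pre-website survey percentages and the $\text{observed}$ values from the post-website survey, so that $\chi^2 = \sum (\text{Post} - \text{Pre})^2 / \text{Pre}$. The reference $\chi^2$ distribution applies to counts rather than raw percentages, so scale the pre-survey percentages by the post-survey sample size to get expected counts and compare them with the observed post-survey counts. You may have to combine some categories if any percentages in the pre-survey are zero. Remember to change the degrees of freedom accordingly.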

We are testing $H_0: \chi^2 = 0$ (no change) vs. $H_1: \chi^2 > 0$ (some change in Likert distribution). Under $H_0$, $\chi^2 \sim \chi^2_{10}$. The degrees of freedom are $10$ as you have $11$ classes.
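As a rough illustration of the test described above, here is a minimal Python sketch using `scipy.stats.chisquare`. The counts are made up purely for illustration (they only match the stated sample sizes of about 1,800 and 1,200, with about a third of responses at 10), and the helper name `gof_against_pre` is hypothetical. It assumes every pre-survey category is non-empty; otherwise categories would need merging first.

```python
import numpy as np
from scipy.stats import chisquare

# HYPOTHETICAL counts for scores 0..10, invented for illustration only.
pre_counts = np.array([60, 40, 50, 70, 90, 150, 160, 200, 220, 160, 600])   # n = 1800
post_counts = np.array([30, 20, 30, 40, 50, 90, 100, 130, 160, 150, 400])  # n = 1200


def gof_against_pre(pre, post):
    """Chi-square goodness-of-fit of post counts against pre proportions.

    Returns (statistic, degrees_of_freedom, p_value).
    Assumes no pre-survey category has a zero count.
    """
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    # Expected counts: pre-survey proportions scaled to the post sample size.
    expected = post.sum() * pre / pre.sum()
    stat, p_value = chisquare(f_obs=post, f_exp=expected)
    dof = len(post) - 1  # 11 classes -> 10 degrees of freedom
    return stat, dof, p_value


stat, dof, p_value = gof_against_pre(pre_counts, post_counts)
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p_value:.4f}")
```

This sketch treats the pre-survey proportions as fixed, as the description above does. If the sampling error in the pre-survey should also be accounted for, `scipy.stats.chi2_contingency` on the 2 x 11 table of counts is one alternative.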

In terms of summarising the data, I would present the pre and post percentages graphically side-by-side. If the $\chi^2$ test is significant, then you can assess the change in the categories visually.
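A side-by-side summary can be sketched in plain Python before reaching for a plotting library. The counts below are hypothetical, invented only to match the stated sample sizes, and `percent_table` is an illustrative helper name, not an established function.

```python
# HYPOTHETICAL counts for scores 0..10, invented for illustration only.
pre_counts = [60, 40, 50, 70, 90, 150, 160, 200, 220, 160, 600]    # n = 1800
post_counts = [30, 20, 30, 40, 50, 90, 100, 130, 160, 150, 400]   # n = 1200


def percent_table(pre, post):
    """Return rows of (score, pre %, post %, post % minus pre %)."""
    n_pre, n_post = sum(pre), sum(post)
    rows = []
    for score, (a, b) in enumerate(zip(pre, post)):
        pre_pct = 100.0 * a / n_pre
        post_pct = 100.0 * b / n_post
        rows.append((score, pre_pct, post_pct, post_pct - pre_pct))
    return rows


rows = percent_table(pre_counts, post_counts)
print(f"{'score':>5} {'pre %':>7} {'post %':>7} {'diff':>7}")
for score, pre_pct, post_pct, diff in rows:
    print(f"{score:>5} {pre_pct:>7.1f} {post_pct:>7.1f} {diff:>+7.1f}")
```

Because percentages are normalised by each survey's own sample size, the two columns are directly comparable even though the samples differ in size. The same rows could be passed to a grouped bar chart for the graphical version.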