Suppose I have an ordered vector where the first element is the number of visits to a website in a given period of time by the unique IP with the highest number of visits, the second element is the number of visits by the unique IP with the second highest number of visits, and so on. I understand there may be per site variations, but is there in general an assumed pattern to the shape of this vector? Does it, for instance, follow a power-law distribution?
No, unique visitors to a website do not follow a power law.
In the last few years, there has been increasing rigor in testing power law claims (e.g., Clauset, Shalizi and Newman 2009). Apparently, past claims often weren't well tested and it was common to plot the data on a log-log scale and rely on the "eyeball test" to demonstrate a straight line. Now that formal tests are more common, many distributions turn out not to follow power laws.
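To make the contrast with the eyeball test concrete, here is a minimal Python sketch of the maximum likelihood estimate of a power-law exponent in the style of Clauset, Shalizi and Newman. It assumes the continuous form of the estimator and a known lower cutoff xmin; the function names and the synthetic data are my own illustration, not anything from the paper's code. Visit counts are discrete, so treat this as an approximation. The full procedure also estimates xmin and runs a goodness-of-fit test, which this sketch omits.

```python
import math
import random

def powerlaw_alpha_mle(xs, xmin):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / xmin)) over x >= xmin."""
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha = 1.0 + n / log_sum
    stderr = (alpha - 1.0) / math.sqrt(n)  # large-sample standard error
    return alpha, stderr, n

def sample_powerlaw(n, alpha, xmin, rng):
    """Draw n values from a continuous power law by inverse transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

# Synthetic data with a known exponent (an assumption for illustration only).
rng = random.Random(0)
data = sample_powerlaw(20000, alpha=2.5, xmin=1.0, rng=rng)
alpha_hat, se, n_tail = powerlaw_alpha_mle(data, xmin=1.0)
print(f"alpha_hat = {alpha_hat:.3f} +/- {se:.3f} from {n_tail} points")
```

Recovering the exponent on data that really are power-law distributed is the easy part; the point of the formal approach is that you then test whether the fitted model is plausible for your actual data, instead of trusting a straight-looking line.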
The best two references I know that examine user visits on the web are Ali and Scarr (2007) and Clauset, Shalizi and Newman (2009).
Ali and Scarr (2007) looked at a random sample of user clicks on a Yahoo website and concluded:
Prevailing wisdom is that the distribution of web clicks and pageviews follows a scale-free power law distribution. However, we have found that a statistically significantly better description of the data is the scale-sensitive Zipf-Mandelbrot distribution and that mixtures thereof further enhances the fit. Previous analyses have three disadvantages: they have used a small set of candidate distributions, analyzed out-of-date user web behavior (circa 1998) and used questionable statistical methodologies. Although we cannot preclude that a better fitting distribution may not one day be found, we can say for sure that the scale-sensitive Zipf-Mandelbrot distribution provides a statistically significantly stronger fit to the data than the scale-free power-law or Zipf on a variety of verticals from the Yahoo domain.
Here is a histogram of individual user clicks over a month and the same data on a log-log plot, with the different models they compared. The data clearly do not fall on the straight log-log line expected from a scale-free power-law distribution.
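To see why a Zipf-Mandelbrot distribution bends on a log-log plot while a plain Zipf does not, here is a small Python sketch. It assumes the usual rank form, with probability proportional to (k + q)^(-s) for ranks k = 1..N, where q = 0 reduces to Zipf. The parameter values (q = 10, s = 1.2, N = 1000) are made up for illustration and are not taken from Ali and Scarr.

```python
import math

def zipf_mandelbrot_pmf(n_ranks, q, s):
    """Probabilities over ranks 1..n_ranks, proportional to (k + q) ** -s."""
    weights = [(k + q) ** (-s) for k in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def loglog_slope(pmf, k1, k2):
    """Slope of log(probability) against log(rank) between ranks k1 and k2."""
    rise = math.log(pmf[k2 - 1]) - math.log(pmf[k1 - 1])
    run = math.log(k2) - math.log(k1)
    return rise / run

zipf = zipf_mandelbrot_pmf(1000, q=0.0, s=1.2)  # scale-free: q = 0
zm = zipf_mandelbrot_pmf(1000, q=10.0, s=1.2)   # scale-sensitive: q > 0

zipf_head = loglog_slope(zipf, 1, 2)
zipf_tail = loglog_slope(zipf, 500, 1000)
zm_head = loglog_slope(zm, 1, 2)
zm_tail = loglog_slope(zm, 500, 1000)

print(f"Zipf slopes: head {zipf_head:.3f}, tail {zipf_tail:.3f}")
print(f"ZM slopes:   head {zm_head:.3f}, tail {zm_tail:.3f}")
```

The Zipf slope is the same everywhere, which is the straight line. With q > 0 the curve is much flatter among the top ranks and only approaches slope -s far out in the tail.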
Clauset, Shalizi and Newman (2009) compared power law explanations with alternative hypotheses using likelihood ratio tests and concluded both web hits and links "cannot plausibly be considered to follow a power law." Their data for the former were web hits by customers of the America Online Internet service in a single day and for the latter were links to web sites found in a 1997 web crawl of about 200 million web pages. The images below show the cumulative distribution functions P(x) and their maximum likelihood power-law fits.
For both these data sets, Clauset, Shalizi and Newman found that power-law distributions with exponential cutoffs to modify the extreme tail of the distribution were clearly better than pure power-law distributions and that log-normal distributions were also good fits. (They also looked at exponential and stretched exponential hypotheses.)
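For reference, these are the functional forms of the competing hypotheses as I understand them from Clauset, Shalizi and Newman, written up to a normalizing constant for x at or above a lower cutoff. The symbols are the conventional ones and not necessarily the paper's exact notation.

$$
\begin{aligned}
\text{power law:} \quad & p(x) \propto x^{-\alpha} \\
\text{power law with exponential cutoff:} \quad & p(x) \propto x^{-\alpha} e^{-\lambda x} \\
\text{exponential:} \quad & p(x) \propto e^{-\lambda x} \\
\text{stretched exponential:} \quad & p(x) \propto x^{\beta - 1} e^{-\lambda x^{\beta}} \\
\text{log-normal:} \quad & p(x) \propto \frac{1}{x} \exp\!\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right]
\end{aligned}
$$

The pure power law is the special case of the cutoff form with \lambda = 0, so those two are nested. The others are not special cases of the power law.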
If you have a dataset in hand and are not just idly curious, you should fit it with different models and compare them. In R, a likelihood ratio test for two nested models that differ by one parameter is: pchisq(2 * (logLik(model1) - logLik(model2)), df = 1, lower.tail = FALSE), where model1 is the larger model. I confess I have no idea offhand how to fit a zero-adjusted ZM model. Ron Pearson has blogged about ZM distributions and there is apparently an R package zipfR. Me, I would probably start with a negative binomial model but I am not a real statistician (and I'd love their opinions).
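If you are not working in R, the same likelihood ratio calculation is easy to sketch in Python with only the standard library. This assumes two nested models that differ by exactly one free parameter, so the test statistic is compared against a chi-square distribution with one degree of freedom. The function name and the log-likelihood values in the usage lines are made up for illustration.

```python
import math

def lr_test_pvalue_df1(loglik_small, loglik_large):
    """P-value of a likelihood ratio test with one degree of freedom.

    loglik_small: maximized log-likelihood of the restricted (nested) model.
    loglik_large: maximized log-likelihood of the model with one extra parameter.
    """
    stat = 2.0 * (loglik_large - loglik_small)
    stat = max(stat, 0.0)  # guard against tiny negative values from optimizer noise
    # For one degree of freedom, P(chi2 > stat) equals erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical log-likelihoods, not results from any real dataset.
p_value = lr_test_pvalue_df1(loglik_small=-1520.3, loglik_large=-1510.1)
print(f"p = {p_value:.2g}")
```

A small p-value says the extra parameter improves the fit by more than chance would explain. For models that are not nested, this chi-square comparison does not apply as written.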
(I also want to second commenter @richiemorrisroe above, who points out that the data are likely influenced by factors unrelated to individual human behavior, like programs crawling the web and IP addresses that represent many people's computers.)