As an example application, consider the following two properties of Stack Overflow users: reputation and profile view counts.
It is expected that for most users those two values will be proportional: high rep users attract more attention and therefore get more profile views.
Therefore, it is interesting to search for users who have a lot of profile views compared to their total reputation.
This could indicate that the user has an external source of fame, or maybe just an interestingly quirky profile picture and name.
More mathematically, each two-dimensional sample point is a user, and each user has two integer values ranging from 0 to +infinity:
- reputation
- number of profile views
Those two parameters are expected to be linearly dependent, and we'd like to find sample points which are the largest outliers to that assumption.
The naive solution would be of course to just take profile views, divide by reputation, and sort.
However, this would give results which are not statistically meaningful. For example, if a user answered one question, got 1 upvote, and for some reason had 10 profile views (which is easy to fake), then that user would appear ahead of a much more interesting candidate that has 1000 upvotes and 5000 profile views.
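To make the baseline concrete, here is a minimal sketch of that naive ranking (it assumes the space separated users_rep_view.dat format described further down; the script name is mine):

#!/usr/bin/env python3
# naive_sort.py: rank users by their raw views / reputation ratio.
# Minimal sketch; assumes the users_rep_view.dat format described below:
# "Id Reputation Views DisplayName", space separated, one user per line.
users = []
with open('users_rep_view.dat', encoding='utf-8') as f:
    for line in f:
        user_id, rep, views, name = line.split(' ', 3)
        try:
            rep, views = int(rep), int(views)
        except ValueError:
            continue  # skip a header line, if present
        # Guard against division by zero, just in case.
        ratio = views / rep if rep else float('inf')
        users.append((ratio, user_id, rep, views, name.strip()))

# Highest view/rep ratio first: dominated by low-information accounts,
# which is exactly the statistical problem described above.
users.sort(reverse=True)
for ratio, user_id, rep, views, name in users[:20]:
    print('{:10.1f} {:>9} {:>8} {:>9} {}'.format(ratio, user_id, rep, views, name))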
In a more "real world" use case, we could for example try to answer "which startups are the most meaningful unicorns?", since if you invest 1 dollar for a tiny amount of equity, you create a unicorn on paper: https://www.linkedin.com/feed/update/urn:li:activity:6362648516858310656
Concrete clean easy-to-use real world data
To test your solution to this problem, you can just use this small (75M compressed, ~10M users) preprocessed file extracted from the 2019-03 Stack Overflow data dump:
wget https://github.com/cirosantilli/media/raw/master/stack-overflow-data-dump/2019-03/users_rep_view.dat.7z
7z x users_rep_view.dat.7z
which produces the UTF-8 encoded file users_rep_view.dat, which has a very simple plain text space separated format:
Id Reputation Views DisplayName
-1 1 649 Community
1 45742 454747 Jeff_Atwood
2 3582 24787 Geoff_Dalgas
3 13591 24985 Jarrod_Dixon
4 29230 75102 Joel_Spolsky
5 39973 12147 Jon_Galloway
8 942 6661 Eggs_McLaren
9 15163 5215 Kevin_Dente
10 101 3862 Sneakers_O'Toole
This is what the data looks like on a log scale:
It would then be interesting to see if your solution actually helps us discover new unknown quirky users!
The initial data was obtained from the 2019-03 data dump as follows:
wget https://archive.org/download/stackexchange/stackoverflow.com-Users.7z
# Produces Users.xml
7z x stackoverflow.com-Users.7z
# Preprocess data to minimize it.
./users_xml_to_rep_view_dat.py Users.xml > users_rep_view.dat
7z a users_rep_view.dat.7z users_rep_view.dat
sha256sum stackoverflow.com-Users.7z users_rep_view.dat.7z > checksums
Source for users_xml_to_rep_view_dat.py.
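The actual script is the one linked above; purely for illustration, a minimal sketch of the same idea, assuming the usual attribute names in the data dump's Users.xml (Id, Reputation, Views, DisplayName), could look like:

#!/usr/bin/env python3
# Illustrative sketch of a Users.xml -> users_rep_view.dat converter,
# not the linked script itself. Assumes each user is a <row> element with
# Id, Reputation, Views and DisplayName attributes, as in the Stack
# Exchange data dump.
import sys
import xml.etree.ElementTree as ET

# iterparse streams the file so the multi-gigabyte XML never has to fit in memory.
for _, elem in ET.iterparse(sys.argv[1], events=('end',)):
    if elem.tag == 'row':
        # Replace spaces so the output stays a simple space separated format.
        name = (elem.get('DisplayName') or '').replace(' ', '_')
        print(elem.get('Id'), elem.get('Reputation', '0'), elem.get('Views', '0'), name)
        elem.clear()  # free already-processed elements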
After selecting your outliers by reordering users_rep_view.dat, you can get an HTML list with hyperlinks to quickly view the top picks with:
./users_rep_view_dat_to_html.py users_rep_view.dat | head -n 1000 > users_rep_view.html
xdg-open users_rep_view.html
Source for users_rep_view_dat_to_html.py.
This script can also serve as a quick reference of how to read the data into Python.
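That script is linked above; as a rough idea of what it does (this is not its actual source), the conversion boils down to something like:

#!/usr/bin/env python3
# Rough sketch of a users_rep_view.dat -> HTML converter, not the actual
# linked script. Emits one profile hyperlink per user, preserving the
# order of the input file, so whatever ordering you produced is what you see.
import html
import sys

print('<html><body><ul>')
with open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        user_id, rep, views, name = line.split(' ', 3)
        print('<li><a href="https://stackoverflow.com/users/{}">{}</a> rep={} views={}</li>'.format(
            user_id, html.escape(name.strip()), rep, views))
print('</ul></body></html>')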
Manual data analysis
Just by looking at the gnuplot graph, we immediately see that, as expected:
- the data is approximately proportional, with greater variance for low rep or low view count users
- low rep or low view count users are lighter in color, which means that they have higher account IDs, which means that their accounts are newer
In order to get some intuition about the data, I wanted to drill down into some of the far-out points in some interactive plotting software.
Gnuplot and Matplotlib could not handle such a large dataset, so I gave VisIt a shot for the first time and it worked. Here is a detailed overview of all plotting software I've tried: https://stackoverflow.com/questions/5854515/large-plot-20-million-samples-gigabytes-of-data/55967461#55967461
OMG that was hard to get running. I had to:
- download the executable manually, there is no Ubuntu package
- convert the data to CSV by quickly hacking up users_xml_to_rep_view_dat.py, because I could not easily find how to feed it space separated files (lesson learnt, next time I'll go straight for CSV)
- fight for 3 hours with the UI:
  - the default point size is one pixel, which gets confused with the dust on my screen, so I moved to 10 pixel spheres
  - there was a user with 0 profile views, and VisIt correctly refused to do the logarithm plot, so I used data limits to get rid of that point. This reminded me that gnuplot is very permissive and will happily plot anything you throw at it.
  - add axis titles, remove the username, and other things under "Controls" > "Annotations"
Here's what my VisIt window looked like after I got tired of this manual work:
The letters are points that I selected manually with the awesome Picks feature:
- you can see the exact Id for each point by increasing the floating point precision in the Picks window > "Float Format" to %.10g
- you can then dump all hand picked points to a txt file with "Save Picks as". This allows us to produce a clickable list of interesting profile URLs with some basic text processing
TODOs, learn how to:
- see the profile name strings, they get converted to 0 by default. I just pasted profile Ids into the browser
- pick all points in a rectangle in one go
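For the last two points, as a stopgap, here is a minimal sketch that takes a plain list of picked Ids (one per line, extracted by hand from the saved picks file, whose exact format I won't reproduce here) and looks the names and profile URLs up in users_rep_view.dat:

#!/usr/bin/env python3
# picks_to_urls.py: minimal sketch. Input (argv[1]): a plain text file with
# one picked user Id per line, hand-extracted from VisIt's "Save Picks as"
# output. Output: Id, rep, views, name and profile URL, looked up in
# users_rep_view.dat.
import sys

# Build an Id -> (rep, views, name) map from the preprocessed data file.
users = {}
with open('users_rep_view.dat', encoding='utf-8') as f:
    for line in f:
        user_id, rep, views, name = line.split(' ', 3)
        users[user_id] = (rep, views, name.strip())

with open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        user_id = line.strip()
        if user_id in users:
            rep, views, name = users[user_id]
            print(user_id, rep, views, name,
                  'https://stackoverflow.com/users/' + user_id)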
And so finally, here are a few users which should likely show high up on your ordering:
very low rep users with huge view counts and low information profiles.
These users are likely redirecting traffic from somewhere somehow.
Related: there was a meta thread about famous question gold badge manipulation by a user, but I can't find it now.
If there are too many such users, our analysis becomes difficult, and we would need to consider other parameters to avoid such "fraud":
- A 1 143100 2445750 https://stackoverflow.com/users/2445750/muhammad-mahtab-saleem
- D 60 51111 2139869 https://stackoverflow.com/users/2139869/xxn
- E 40 23067 5740196 https://stackoverflow.com/users/5740196/listcrawler
- F 11 7738 3313079 https://stackoverflow.com/users/3313079/rikitikitaco
- G 136 11123 4102129 https://stackoverflow.com/users/4102129/abhishek-deshpande
- K 377 24453 1012351 https://stackoverflow.com/users/1012351/overstack
- L 1489 57515 1249338 https://stackoverflow.com/users/1249338/frosty
- M 1767 40986 2578799 https://stackoverflow.com/users/2578799/naresh-walia
- I find this cluster of users interesting, all so nearby in the graph:
- H 58 4331 1818755 https://stackoverflow.com/users/1818755/eerongal
- I 103 4366 1816274 https://stackoverflow.com/users/1816274/angelov
- J 157 4688 688552 https://stackoverflow.com/users/688552/oylex
external fame:
- O 29799 170854 2274694 https://stackoverflow.com/users/2274694/lyndsey-scott ex Victoria's Secret model: https://en.wikipedia.org/wiki/Lyndsey_Scott
- P 45742 454747 1 https://stackoverflow.com/users/1/jeff-atwood SO co-founder
- Y 29230 75102 4 https://stackoverflow.com/users/4/joel-spolsky SO co-founder
- the highest reputation users tend to get more profile views because they appear on the "users with highest reputation" Google queries / listings:
- U 542861 401220 88656 https://stackoverflow.com/users/88656/eric-lippert involved in C# design
- V 852319 396830 157882 https://stackoverflow.com/users/157882/balusc top #2 user, insane amount of answers
quirky profiles:
- N 13690 108073 63550 https://stackoverflow.com/users/63550/peter-mortensen That owl picture! I also think he was a moderator previously.
- R 143904 144287 895245 https://stackoverflow.com/users/895245/ciro-santilli-%e6%96%b0%e7%96%86%e6%94%b9%e9%80%a0%e4%b8%ad%e5%bf%83996icu%e5%85%ad%e5%9b%9b%e4%ba%8b%e4%bb%b6
- T 291742 161929 560648 https://stackoverflow.com/users/560648/lightness-races-in-orbit
high rep users that were suspended at the time. Ah, the silly "your rep goes to 1" rule:
- B 1 77456 285587 https://stackoverflow.com/users/285587/your-common-sense
not sure, I'm tempted to say view manipulation:
- Q 65788 126085 50776 https://stackoverflow.com/users/50776/casperone
- S 15655 81541 293594 https://stackoverflow.com/users/293594/xnx
- W 12019 37047 2227834 https://stackoverflow.com/users/2227834/unheilig
- X 1421 27963 1255427 https://stackoverflow.com/users/1255427/jack-nicholson
Possible solutions
I've heard about the Wilson score confidence interval from https://www.evanmiller.org/how-not-to-sort-by-average-rating.html which "balance[s] the proportion of positive ratings with the uncertainty of a small number of observations", but I'm not sure how to map that to this problem.
In that blog post, the author recommends that algorithm to find items that have a lot more upvotes than downvotes, but I'm not sure if the same idea applies to the upvote / profile view problem. I was thinking of taking:
- profile views == upvotes there
- upvotes here == downvotes there (both "bad")
but I'm not sure if it makes sense, because in the up/downvote problem each item being sorted has N binary (0/1) vote events, while in my problem each item has two kinds of events associated with it: getting an upvote and getting a profile view.
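Concretely, under that (possibly wrong) mapping, I suppose the score from the blog post would be computed roughly like this; the 95% z value is the usual choice, and whether the mapping makes sense is exactly what I'm unsure about:

# Sketch of the Wilson score interval lower bound from the linked blog post,
# applied under the mapping above: profile views as "positive ratings" and
# reputation as "negative ratings". Both the mapping and z are my assumptions.
import math

def wilson_lower_bound(positive, total, z=1.96):
    # Lower bound of the Wilson score confidence interval for a
    # Bernoulli proportion positive/total, at ~95% confidence for z=1.96.
    if total == 0:
        return 0.0
    phat = positive / total
    return ((phat + z * z / (2 * total)
             - z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total))
            / (1 + z * z / total))

# score = wilson_lower_bound(views, views + rep), then sort descending.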
Is there a well known algorithm that gives good results for this kind of problem? Even knowing the precise problem name would help me find existing literature.
Bibliography
- https://meta.stackoverflow.com/questions/307117/are-profile-views-on-stack-overflow-positively-correlated-to-the-level-of-reputa
- Test for bivariate outliers
- https://stackoverflow.com/questions/41462073/multivariate-outlier-detection-using-r-with-probability
- Is there a simple way of detecting outliers?
- How should outliers be dealt with in linear regression analysis?
- https://math.meta.stackexchange.com/questions/26137/who-maximizes-the-ratio-of-people-reached-to-questions-answered
Tested in Ubuntu 18.10, VisIt 2.13.3.
Best Answer
I think the Wilson score confidence interval can be applied directly to your problem. The score used in the blog post is the lower bound of the confidence interval rather than an expected value.
Another method for such a problem is to correct (bias) our estimate towards some prior knowledge we have, for example the overall view/rep ratio.
Say the number of views per reputation point follows a normal distribution $v \sim N(\mu,\sigma)$; then the view/rep ratio can be interpreted as the maximum likelihood estimate (MLE) of the distribution mean $\mu$.
The problem is, as you mentioned, that when there is insufficient data the MLE has high variance (overfitting). A simple solution for this is to introduce a prior distribution over $\mu$ and do maximum a posteriori (MAP) estimation. The prior distribution $p(\mu)$ can be another normal distribution estimated from all the samples we have.
In practice, this is essentially a weighted average of the overall view/rep ratio and the user's view/rep ratio, $$\mu_{MAP}=\frac{n\mu_{MLE}+c\mu_0}{n+c}$$ where $n$ is the number of reps a user has, $c$ is a constant, $\mu_{MLE}$ is the view/rep ratio of the user, and $\mu_0$ is the overall view/rep ratio.
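A minimal sketch of this scoring on the data format from the question (the constant $c$ and the way $\mu_0$ is estimated are choices you have to make):

# Sketch of the MAP-style weighted average score described above.
# c is a tuning constant; mu0 is the overall view/rep ratio.
def map_score(views, rep, mu0, c=100.0):
    # Equivalent to (n*mu_MLE + c*mu0) / (n + c) with n = rep and
    # mu_MLE = views / rep: shrinks the per-user ratio towards mu0.
    # With few reps the score stays near mu0; with many reps it
    # approaches the user's own ratio views/rep.
    return (views + c * mu0) / (rep + c)

# Usage sketch, with users parsed as (id, rep, views, name) tuples:
# mu0 = sum(v for _, _, v, _ in users) / sum(r for _, r, _, _ in users)
# users.sort(key=lambda u: map_score(u[2], u[1], mu0), reverse=True)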
To compare the two methods (Wilson score lower bound and MAP): both give an accurate estimate when there is sufficient data (reps); when the number of reps is small, the Wilson lower bound method biases towards zero while MAP biases towards the mean.
Similar Posts:
- Solved – Trying to compute Gini index on StackOverflow reputation distribution
- Solved – Detecting a given face in a database of facial images
- Solved – How to generate the most common clickstream sequence
- Solved – Interpreting results of lightFM (factorization machines for collaborative filtering)