This is a pretty simple question and I'm wondering how to go about finding a solution. I have the following data.
ttt = data.frame(name=c("A","B","C","D","E","F"), count=c(150,250,350,550,150,50), returns=c(10,50,60,100,80,10), calls_bad=c(5,30,20,15,15,20), weight=c(0.20,0.30,0.40,0.40,0.20,0.10)) ttt
For each name, I'm trying to find the return rate, which is just returns/count. However, I want to know how to weight that by how high or low count is. Name F, which only has a count of 50, should be weighted less (weight is 0.10) than name D, who has a count of 550 (weight is 0.40).
ttt$returns_rate = round(ttt$returns/ttt$count, 2) ttt name count returns weight returns_rate 1 A 150 10 0.2 0.07 2 B 250 50 0.3 0.20 3 C 350 60 0.4 0.17 4 D 550 100 0.4 0.18 5 E 150 80 0.2 0.53 6 F 50 10 0.1 0.20 d1 = ddply(ttt, .(name), function(x) data.frame(score=weighted.mean(x$returns_rate, x$weight))) d1 name score 1 A 0.07 2 B 0.20 3 C 0.17 4 D 0.18 5 E 0.53 6 F 0.20
Here are the weights:
Count <= 100 = 10% Count >= 101 | Count <= 200 = 20% Count >= 201 | Count <= 300 = 30% Count >= 301 = 40%
Given what I'm trying to achieve, I thought I'd want a weight average but end up with the following results. However, this can't be right as the returns_rate is equal to the score.
How can I go about getting the rate while accounting for the number of count?
Thanks! I must've fallen asleep in basic math on this one.
EDIT:
In lieu of the comments =
Eventually, the plan is to combine the call_bad rate (calls_bad/count) and the returns rate (returns/count) into one 'score'/'metric' for each name. However, because count varies significantly by name, I'm working under the assumption that I need to 'weight' the data in order to account for the impact of small 'counts'.
Basically, I want to find (returns/count) and (calls_bad/count), and combine these values into one value, while accounting for the fact that someone with a count of 50 could influence the data in a bad way, thus why I'm thinking of using weights.
EDIT 2:
ttt = data.frame(name=c("A","B","C","D","E","F"), count=c(150,250,350,550,150,50), returns=c(10,50,60,100,80,10), calls_bad=c(5,30,20,15,15,20), weight=c(0.20,0.30,0.40,0.40,0.20,0.10)) ttt$returns_rate = round(ttt$returns/ttt$count, 2) ttt$calls_bad_rate = round(ttt$calls_bad/ttt$count, 2) ttt
There only two numbers to "combine" or "average", 'returns rate' and 'calls bad rate'.
ttt$combined = round(ttt$returns_rate + ttt$calls_bad_rate / 2, 2)
But given how count varies "significantly" by name, I thoughts weight were appropriate based on the number of count.
Best Answer
I was writing my own weighted average algorithm yesterday and it applies to what you’re looking to do. Your problem is what you're calling returns_rate and the logic behind it. For these examples we have a new field called weighted_score or you could call it weighted_returns if you wanted. The point is that its weighted.
id | name | count | returns | weight | weighted_score ----------------------------------------------------- 1 A 150 10 0.2 2 2 B 250 50 0.3 15 3 C 350 60 0.4 24 4 D 550 100 0.4 40 5 E 150 80 0.2 16 6 F 50 10 0.1 1 ----------------------------------------------------- sums 1.6 98 weighted_average 61.3
Logic:
weighted_score = returns*weight weighted_average = sum(weighted_scores)/sum(weights)
I see that you have some logic to determine the weight however your upper limit is 40%. There really should be 0% through 100% in there. You can also calculate the weight on the fly for any given data set by finding the highest “count” and dividing the “count” in each row by that highest number and this gives you the appropriate 0% – 100% weights.
Example:
id | name | count | returns | weight | weighted_score ----------------------------------------------------- 1 A 150 10 0.2727 2.7 2 B 250 50 0.4545 22.7 3 C 350 60 0.6364 38.2 4 D 550 100 1.0000 100 5 E 150 80 0.2727 21.8 6 F 50 10 0.0909 0.9 ----------------------------------------------------- sums 2.7273 186.3636 weighted_average 68.3
Logic:
weight = count/550
Another simpler option that doesn’t require so much processing trying to figure out the appropriate weight is to just create a static variable to use as your ceiling. Any count lower than the ceiling will be weighted with lower importance and anything above the ceiling is weighted at 100%.
id | name | count | returns | weight | weighted_score ----------------------------------------------------- 1 A 150 10 0.4983 2.7 2 B 250 50 0.8306 22.7 3 C 350 60 1.0000 38.2 4 D 550 100 1.0000 100 5 E 150 80 0.4983 21.8 6 F 50 10 0.1661 0.9 ----------------------------------------------------- sums 3.9934 248.0399 weighted_average 62.1
Logic:
if(count < 301) weight = count/301 if(count > 301) weight = count/count
Similar Posts:
- Solved – Understanding simple weighted average
- Solved – How to find the optimal threshold for the Weighted f1 score in a binary classification problem
- Solved – Make R report error on using non-existent column name in a data frame
- Solved – Combining Z Scores by Weighted Average. Sanity Check Please
- Solved – What does weighted cumulative frequency distribution mean