I read a paper on a multilabel classification task. The authors evaluate their models with the F1 score, but they do not mention whether this is the macro, micro, or weighted F1 score.

They only mention:

We chose F1 score as the metric for evaluating our multi-label classification system's performance. F1 score is the harmonic mean of precision (the fraction of returned results that are correct) and recall (the fraction of correct results that are returned).

From that, can I guess which F1 score I should use to reproduce their results with scikit-learn? Or is it obvious by convention which one is meant?

**Edit:**

I am not sure why this question was marked as off-topic or what would make it on-topic, so I will try to clarify my question, and I would be grateful for pointers on how and where to ask it.

As I understand it, the difference between the three F1-score calculations is the following:

- *macro* calculates an F1 score for each label and takes their unweighted mean, so every label carries the same weight: $F1 = \frac{1}{n}\sum_{i=1}^{n} F1_i$
- *weighted* calculates an F1 score for each label and sums them weighted by each label's support $w_i$: $F1 = \sum_{i=1}^{n} w_i F1_i$
- *micro* calculates a single overall F1 score by computing precision and recall from the total true positives, false positives, and false negatives.
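In scikit-learn, all three variants come from the same `f1_score` function via its `average` parameter. A minimal sketch with made-up multilabel data (the arrays are purely illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up multilabel data in indicator format: rows are samples,
# columns are labels, as scikit-learn expects for multilabel metrics.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

for avg in ("macro", "weighted", "micro"):
    print(avg, f1_score(y_true, y_pred, average=avg))
```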

The text in the paper seems to indicate that the micro F1 score is used, because nothing else is mentioned. Is it safe to assume so?


#### Best Answer

I thought the `macro` in `macro F1` refers to the `precision` and `recall` rather than to the F1 itself: we can calculate the macro precision as the unweighted mean of the per-label precisions, and by the same token the macro recall as the unweighted mean of the per-label recalls. Once we have the macro precision and macro recall, we can obtain the macro F1 as their harmonic mean (please refer to here for more information). The same goes for micro F1, except that we calculate precision and recall globally, by counting the total true positives, false negatives, and false positives.
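A sketch of that calculation, reusing made-up data, with one caveat worth knowing: scikit-learn's `average="macro"` averages the per-label F1 scores directly rather than taking the harmonic mean of macro precision and macro recall, so the two numbers can differ:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Same kind of made-up multilabel indicator data as before.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

# Macro F1 as described above: harmonic mean of the unweighted
# mean precision and the unweighted mean recall.
p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
print(2 * p * r / (p + r))          # ~0.741 on this toy data

# scikit-learn's macro F1 averages the per-label F1 scores instead,
# which is not the same calculation in general.
print(f1_score(y_true, y_pred, average="macro"))  # ~0.722
```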

Macro F1 weighs each class equally, while micro F1 weighs each sample equally. In this case the paper's F1 is most probably the macro F1: the tags will almost certainly not occur in equal amounts, and with such class imbalance the micro F1 would be dominated by the most common tags, hiding poor performance on the rare ones.
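A toy illustration of that imbalance effect, with fabricated data: one very common tag and one rare tag that the classifier misses completely:

```python
import numpy as np
from sklearn.metrics import f1_score

# Fabricated imbalanced tags: tag 0 appears on every sample,
# tag 1 on only 2% of them.
y_true = np.zeros((1000, 2), dtype=int)
y_true[:, 0] = 1
y_true[:20, 1] = 1

# The classifier gets the common tag right but never predicts the rare one.
y_pred = y_true.copy()
y_pred[:20, 1] = 0

# Micro F1 stays high because the common tag dominates the global counts;
# macro F1 drops to 0.5 because the rare tag scores 0.
print(f1_score(y_true, y_pred, average="micro"))                   # ~0.99
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # 0.5
```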

Coming back to the paper, we can probably get some more hints from this snippet:

To debug our multi-label classification system, we examined which of the 20 most common tags had the worst performing classifiers (lowest F1 scores).

A macro F1 also makes this kind of error analysis easier, since the per-tag F1 scores are computed along the way.
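With scikit-learn, that per-tag ranking falls out of `average=None`, which returns one F1 score per label. A small sketch with made-up tags:

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up tags and predictions in indicator format, just for illustration.
tag_names = np.array(["python", "pandas", "numpy"])
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

# average=None yields the per-label F1 scores, so the worst-performing
# tags can simply be read off the sorted result.
per_label = f1_score(y_true, y_pred, average=None, zero_division=0)
for i in np.argsort(per_label):  # worst tags first
    print(tag_names[i], per_label[i])
```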
