classification
Source: google-ml-course
Classification
- Logistic regression returns a probability
- This probability can be converted to a binary value by making use of a classification threshold
- Classification/decision threshold: the value above which the probability is mapped to the positive class and below which to the negative class
- See the confusion matrix, which tabulates the four outcome counts (TP, FP, TN, FN) used by the performance metrics
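A minimal sketch of thresholding in code (the probabilities, labels, and the 0.5 threshold are made-up illustrations; scikit-learn is assumed available):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical logistic-regression outputs and ground-truth labels.
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.05])
y_true = np.array([0, 0, 1, 1, 1, 0])

threshold = 0.5                        # classification/decision threshold
y_pred = (probs >= threshold).astype(int)

# Rows = actual class, columns = predicted class: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))
```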
Evaluation metrics
- One possibility: accuracy (fraction of correct predictions over total predictions)
- $\dfrac{\text{TP} + \text{TN}}{\text{all predictions}}$
- Poor metric when the dataset is class-imbalanced (a large gap between the number of positive and negative labels): if positives are rare, a model that almost always predicts false scores close to 100% accuracy, yet that says nothing about the model! (Toy demonstration after this list.)
- Better alternatives: precision $\dfrac{\text{TP}}{\text{TP} + \text{FP}}$ and recall $\dfrac{\text{TP}}{\text{TP} + \text{FN}}$
- Bias
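A toy demonstration of the imbalance problem: a degenerate "always negative" model on a 1%-positive dataset (both are assumptions for illustration) looks excellent by accuracy but useless by precision and recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)   # 1% positives
y_pred = np.zeros_like(y_true)            # model that always predicts "false"

print(accuracy_score(y_true, y_pred))     # 0.99 -- looks great
print(recall_score(y_true, y_pred))       # 0.0  -- misses every positive
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0, no true positives
```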
Model performance for all possible classification thresholds
Receiver Operating Characteristic (ROC) curve
- Each point on the curve is the true positive rate (TPR, i.e. recall) and false positive rate (FPR) at one decision threshold
$$
\begin{align}
\text{TPR} &= \dfrac{\text{TP}}{\text{TP} + \text{FN}} = \dfrac{\text{TP}}{\text{actual positives}} \\
\text{FPR} &= \dfrac{\text{FP}}{\text{FP} + \text{TN}} = \dfrac{\text{FP}}{\text{actual negatives}}
\end{align}
$$
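A sketch of how the curve is traced by sweeping the threshold: scikit-learn's `roc_curve` returns the FPR/TPR pairs defined above at every distinct threshold (labels and scores below are made up).

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55])

# One (FPR, TPR) point per threshold; together they form the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```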
Area under the ROC Curve (AUC)
- Probability of getting a pairwise comparison (pick a random positive and a random negative) correct, i.e. assigning the higher score to the positive: the probability that the model ranks a random positive example above a random negative example
- Aggregate measure of performance across all possible decision thresholds (as opposed to calculating TPR and FPR for each threshold in the ROC)
- e.g. with predictions laid out along a score axis: the probability that a random green point (actual positive) lies to the right of a random red point (actual negative)
- a perfect model separates the classes completely, giving AUC = 1.0
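A sketch checking the pairwise-ranking interpretation numerically: the fraction of correctly ordered (positive, negative) pairs matches `roc_auc_score` (the synthetic data and seed are arbitrary assumptions).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = y_true * 0.3 + rng.normal(0.5, 0.3, size=200)  # noisy but informative

pos, neg = scores[y_true == 1], scores[y_true == 0]
# P(random positive scores higher than random negative), ties counted half.
pairwise = ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

print(roc_auc_score(y_true, scores))  # AUC
print(pairwise)                       # same value, by the ranking definition
```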
Advantages:
- AUC is scale-invariant: it measures how well predictions are ranked, independent of their absolute values
- AUC is classification-threshold-invariant
Disadvantages:
- Scale invariance is not always desirable: when we need well-calibrated probability outputs, AUC cannot tell us whether we have them
- When classification-threshold-invariance isn’t desired, e.g. when there is a big cost difference between false negatives and false positives
- e.g. in email spam detection, a false positive (legit email marked as spam) is much worse than a false negative (spam comes through)
- we would therefore want to minimise false positives, i.e. raise the classification threshold (at the cost of letting more spam through)
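A sketch of picking a threshold for such a cost-sensitive case: cap the FPR at a budget and take the highest recall that fits (the toy spam scores and the 1% budget are illustrative assumptions).

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(7)
y_true = np.array([0] * 90 + [1] * 10)                 # 90 legit emails, 10 spam
scores = np.concatenate([rng.uniform(0.0, 0.6, 90),    # legit: lower spam scores
                         rng.uniform(0.3, 1.0, 10)])   # spam: higher spam scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
within_budget = fpr <= 0.01           # flag at most 1% of legit mail as spam
best = np.argmax(tpr * within_budget)  # highest recall within the FPR budget
print(f"threshold={thresholds[best]:.2f}  FPR={fpr[best]:.3f}  TPR={tpr[best]:.2f}")
```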