Assessing Model Fairness Across Binary Protected Attributes
fairmetrics.Rmd
Introduction
This vignette demonstrates how to obtain, report, and interpret model
fairness metrics for binary protected attributes with the
fairmetrics
package. We illustrate this through a case
study based on a preprocessed version of the MIMIC-II clinical database
[1], which has previously been studied to explore the association
between indwelling arterial catheters and mortality in hemodynamically
stable patients with respiratory failure [2]. The
original, unprocessed dataset is publicly available through PhysioNet
[3]. A preprocessed version of this dataset is included in the
fairmetrics
package as the mimic_preprocessed dataset
and is used in this vignette.
Data Split and Model Construction
In this setting, we construct a model to predict 28-day
mortality (day_28_flg
). To do this, we split the dataset
into training and testing sets and fit a random forest model. The
first 700 patients are used as the training set and the remaining
patients are used as the testing set. After the model is fit, it is used
to predict 28-day mortality. The predicted probabilities are saved as a
new column in the testing data and are used to assess model
fairness.
# Load required libraries
library(dplyr)
library(randomForest)
library(pROC)
library(fairmetrics)

# Set seed for reproducibility
set.seed(1)

# Load the MIMIC-II preprocessed data
data("mimic_preprocessed")

# Use the first 700 rows as the training set
train_data <- mimic_preprocessed %>%
  filter(row_number() <= 700)

# Test the model on the remaining data
test_data <- mimic_preprocessed %>%
  filter(row_number() > 700)

# Fit a random forest model
rf_model <- randomForest(factor(day_28_flg) ~ ., data = train_data, ntree = 1000)

# Save model predictions
test_data$pred <- predict(rf_model, newdata = test_data, type = "prob")[, 2]
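As a quick sanity check before the fairness evaluation (our addition, not part of the original workflow), the already-loaded pROC package can summarize the model's overall discrimination:
# Optional sanity check: overall discrimination (AUC) of the fitted model
roc_obj <- pROC::roc(test_data$day_28_flg, test_data$pred)
pROC::auc(roc_obj)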
Fairness Evaluation
The fairmetrics
package is used to assess model fairness
across binary protected attributes, so the protected attribute column
must contain exactly two unique values. To evaluate the
fairness of the random forest model we fit, we examine patient
gender as the binary protected attribute.
# Recode gender variable explicitly for readability:
test_data <- test_data %>%
  mutate(gender = ifelse(gender_num == 1, "Male", "Female"))
Since many fairness metrics require binary predictions, we threshold
the predicted probabilities using a fixed cutoff. We set a threshold of
0.41 to keep the overall false positive rate (FPR) at approximately
5%.
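As a minimal sketch of how such a cutoff can be found (the grid of candidate values and the 5% target are our assumptions for illustration):
# Illustrative only: scan candidate cutoffs and keep the one whose overall
# FPR on the test set is closest to the assumed 5% target
cutoffs <- seq(0.01, 0.99, by = 0.01)
fpr_at <- sapply(cutoffs, function(cut) {
  mean(test_data$pred[test_data$day_28_flg == 0] >= cut)
})
cutoffs[which.min(abs(fpr_at - 0.05))]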
To evaluate specific fairness metrics, we use the
eval_*
functions (for a list of the functions contained
in fairmetrics
, see the package documentation).
For example, if we are interested in calculating the statistical parity
of our model across gender (here assumed to be binary), we write:
eval_stats_parity(
  data = test_data,
  outcome = "day_28_flg",
  group = "gender",
  probs = "pred",
  cutoff = 0.41,
  message = TRUE
)
#> There is evidence that model does not satisfy statistical parity.
#> Metric GroupFemale GroupMale Difference 95% Diff CI Ratio
#> 1 Positive Prediction Rate 0.12 0.06 0.06 [0.02, 0.1] 2
#> 95% Ratio CI
#> 1 [1.32, 3.03]
The returned data frame gives the positive prediction rate for each of the
groups defined by the binary protected attribute
(GroupFemale
and GroupMale
in this case), the
difference and ratio between the groups, and the bootstrap
confidence intervals for the estimated difference and ratio. For
inference, a difference whose confidence interval contains 0, or a ratio
whose confidence interval contains 1, can be treated as insignificant.
Here the 95% CI for the difference, [0.02, 0.1], excludes 0, consistent
with the message that the model does not satisfy statistical parity.
All the eval_*
functions follow the same syntax, with
the exception of eval_cond_stats_parity()
, which is used to
evaluate conditional statistical parity and therefore additionally
requires a variable to condition on. If we are interested in calculating
statistical parity across male and female patients aged 60 and up, we write:
eval_cond_stats_parity(
  data = test_data,
  outcome = "day_28_flg",
  group = "gender",
  group2 = "age",
  condition = ">=60",
  probs = "pred",
  cutoff = 0.41,
  message = TRUE
)
#> There is not enough evidence that the model does not satisfy
#> statistical parity.
#> Metric GroupFemale GroupMale Difference 95% Diff CI Ratio
#> 1 Positive Prediction Rate 0.24 0.17 0.07 [0, 0.14] 1.41
#> 95% Ratio CI
#> 1 [0.97, 2.05]
To calculate various fairness metrics for the model simultaneously,
we pass our test data, with its predicted probabilities, to the
get_fairness_metrics()
function.
get_fairness_metrics(
  data = test_data,
  outcome = "day_28_flg",
  group = "gender",
  group2 = "age",
  condition = ">=60",
  probs = "pred",
  cutoff = 0.41
)
#> $performance
#> Metric GroupFemale GroupMale
#> 1 Positive Prediction Rate 0.12 0.06
#> 2 Conditional Positive Prediction Rate 0.24 0.17
#> 3 False Negative Rate 0.56 0.70
#> 4 False Positive Rate 0.06 0.03
#> 5 Avg. Predicted Positive Prob. 0.45 0.35
#> 6 Avg. Predicted Negative Prob. 0.15 0.11
#> 7 Positive Predictive Value 0.61 0.65
#> 8 Negative Predictive Value 0.92 0.91
#> 9 Brier Score 0.09 0.08
#> 10 Accuracy 0.74 0.73
#> 11 (False Negative)/(False Positive) Ratio 1.14 3.00
#>
#> $fairness
#> Metric Difference 95% Diff CI Ratio 95% Ratio CI
#> 1 Statistical Parity 0.06 [0.02, 0.1] 2.00 [1.33, 3.01]
#> 2 Conditional Statistical Parity 0.07 [-0.01, 0.15] 1.41 [0.97, 2.06]
#> 3 Equal Opportunity -0.14 [-0.29, 0.01] 0.80 [0.63, 1.02]
#> 4 Predictive Equality 0.03 [0, 0.06] 2.00 [0.95, 4.19]
#> 5 Balance for Positive Class 0.10 [0.04, 0.16] 1.29 [1.1, 1.5]
#> 6 Balance for Negative Class 0.04 [0.02, 0.06] 1.36 [1.16, 1.6]
#> 7 Positive Predictive Parity -0.04 [-0.24, 0.16] 0.94 [0.68, 1.29]
#> 8 Negative Predictive Parity 0.01 [-0.03, 0.05] 1.01 [0.97, 1.05]
#> 9 Brier Score Parity 0.01 [-0.01, 0.03] 1.12 [0.87, 1.46]
#> 10 Overall Accuracy Parity 0.01 [-0.04, 0.06] 1.01 [0.94, 1.09]
#> 11 Treatment Equality -1.86 [-4.27, 0.55] 0.38 [0.16, 0.9]
The result returned here separates the model performance metrics from the fairness criteria.
While fairmetrics
focuses only on assessing the fairness of
models across binary protected attributes, protected
attributes which consist of more than two groups can still be handled by using
“one-vs-all” comparisons and a little bit of data wrangling to create
the appropriate columns, as sketched below.
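For illustration, suppose the data contained a multi-level protected attribute ethnicity with a level "GroupA"; both the column and the level are hypothetical and not part of mimic_preprocessed. A one-vs-all comparison could then be set up as follows:
# Hypothetical: `ethnicity` and "GroupA" are illustrative; they are not a
# column or value in mimic_preprocessed
test_data <- test_data %>%
  mutate(eth_one_vs_all = ifelse(ethnicity == "GroupA", "GroupA", "Other"))

# The new binary column can then be passed as the protected attribute
eval_stats_parity(
  data = test_data,
  outcome = "day_28_flg",
  group = "eth_one_vs_all",
  probs = "pred",
  cutoff = 0.41
)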
Appendix: Confidence Interval Construction
The function get_fairness_metrics()
computes Wald-type
confidence intervals for both group-specific and disparity metrics using
a nonparametric bootstrap. To illustrate the construction of confidence
intervals (CIs), we use the following example involving the false
positive rate (FPR).
Let $\widehat{\text{FPR}}_a$ and $\text{FPR}_a$ denote the estimated and true FPR in group $a \in \{0, 1\}$. Then the difference $\hat{\delta} = \widehat{\text{FPR}}_1 - \widehat{\text{FPR}}_0$ satisfies (e.g., Gronsbell et al., 2018):
$$\hat{\delta} - \delta \;\overset{\cdot}{\sim}\; N\!\left(0, \operatorname{se}^2(\hat{\delta})\right).$$
We estimate the standard error of $\hat{\delta}$ using bootstrap resampling within groups, and form a Wald-style confidence interval:
$$\hat{\delta} \pm z_{1-\alpha/2}\, \widehat{\operatorname{se}}(\hat{\delta}).$$
For ratios, such as $\hat{\rho} = \widehat{\text{FPR}}_1 / \widehat{\text{FPR}}_0$, we apply a log transformation and use the delta method:
$$\widehat{\operatorname{se}}\!\left(\log \hat{\rho}\right) \approx \widehat{\operatorname{se}}(\hat{\rho}) / \hat{\rho}, \qquad \log \hat{\rho} \pm z_{1-\alpha/2}\, \widehat{\operatorname{se}}\!\left(\log \hat{\rho}\right).$$
Exponentiating the bounds yields a confidence interval for the ratio on the original scale:
$$\left( \exp\!\left\{ \log \hat{\rho} - z_{1-\alpha/2}\, \widehat{\operatorname{se}}(\log \hat{\rho}) \right\},\; \exp\!\left\{ \log \hat{\rho} + z_{1-\alpha/2}\, \widehat{\operatorname{se}}(\log \hat{\rho}) \right\} \right).$$
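To make this concrete, below is a minimal sketch of a bootstrap Wald-type CI for the FPR difference between the two gender groups; it is not the package's internal implementation, and the 1,000 resamples and the 0.41 cutoff are assumptions carried over from the examples above.
# Minimal sketch (not fairmetrics' internals): bootstrap Wald-type CI for
# the FPR difference (Female - Male) at the 0.41 cutoff
fpr_diff <- function(d) {
  fpr <- function(g) mean(d$pred[d$gender == g & d$day_28_flg == 0] >= 0.41)
  fpr("Female") - fpr("Male")
}
est <- fpr_diff(test_data)
boot <- replicate(1000, {
  resampled <- test_data %>%
    group_by(gender) %>% # resample within each group
    slice_sample(prop = 1, replace = TRUE) %>%
    ungroup()
  fpr_diff(resampled)
})
se <- sd(boot)
c(lower = est - 1.96 * se, upper = est + 1.96 * se)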
References
Raffa, J. (2016). Clinical data from the MIMIC-II database for a case study on indwelling arterial catheters (version 1.0). PhysioNet. https://doi.org/10.13026/C2NC7F.
Raffa, J. D., Ghassemi, M., Naumann, T., Feng, M. & Hsu, D. Data Analysis. In: Secondary Analysis of Electronic Health Records. Springer, Cham (2016).
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., … & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Gao, J. et al. What is Fair? Defining Fairness in Machine Learning for Health. arXiv.org https://arxiv.org/abs/2406.09307 (2024).
Gronsbell, J. L. & Cai, T. Semi-supervised approaches to efficient evaluation of model prediction performance. Journal of the Royal Statistical Society Series B (Statistical Methodology) 80, 579–594 (2018).
Hort, M., Chen, Z., Zhang, J. M., Harman, M. & Sarro, F. Bias Mitigation for Machine Learning Classifiers: A Comprehensive survey. arXiv.org https://arxiv.org/abs/2207.07068 (2022).
Hsu, D. J. et al. The association between indwelling arterial catheters and mortality in hemodynamically stable patients with respiratory failure. CHEST Journal 148, 1470–1476 (2015).
Efron, B. & Tibshirani, R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1, (1986).