
Introduction

This vignette demonstrates how to obtain, report, and interpret model fairness metrics for binary protected attributes with the fairmetrics package. We illustrate this through a case study based on a preprocessed version of the MIMIC-II clinical database [1], which has previously been used to study the association between indwelling arterial catheters and mortality in hemodynamically stable patients with respiratory failure [2]. The original, unprocessed dataset is publicly available through PhysioNet [3]. A preprocessed version of this dataset is included in the fairmetrics package as mimic_preprocessed and is used throughout this vignette.

Data Split and Model Construction

In this setting, we construct a model to predict 28-day mortality (day_28_flg). To do this, we split the dataset into training and testing sets and fit a random forest model. The first 700 patients are used as the training set and the remaining patients as the testing set. After the model is fit, it is used to predict 28-day mortality on the test set. The predicted probabilities are saved as a new column in the testing data and are used to assess model fairness.

# Load required libraries
library(dplyr)
library(randomForest)
library(pROC)
library(fairmetrics)
# Set seed for reproducibility
set.seed(1)
# Load the MIMIC-II preprocessed data
data("mimic_preprocessed")
# Use the first 700 patients to train the model
train_data <- mimic_preprocessed %>%
  filter(row_number() <= 700)

# Test the model on the remaining data
test_data <- mimic_preprocessed %>%
  filter(row_number() > 700)

# Fit a random forest model
rf_model <- randomForest(factor(day_28_flg) ~ ., data = train_data, ntree = 1000)
# Save model predictions (probability of 28-day mortality)
test_data$pred <- predict(rf_model, newdata = test_data, type = "prob")[, 2]
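
Because pROC is loaded, we can optionally check the model's overall discrimination on the test set before turning to fairness. This is a minimal sketch and is not required for the fairness evaluation that follows:

# Optional: overall discrimination (AUC) of the model on the test set
roc_obj <- pROC::roc(response = test_data$day_28_flg, predictor = test_data$pred)
pROC::auc(roc_obj)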

Fairness Evaluation

The fairmetrics package is used to assess model fairness across binary protected attributes, meaning the protected attribute column must contain exactly two unique values. To evaluate the fairness of the random forest model fit above, we examine patient gender as the binary protected attribute.

# Recode gender variable explicitly for readability: 
test_data <- test_data %>%
  mutate(gender = ifelse(gender_num == 1, "Male", "Female"))

Since many fairness metrics require binary predictions, we threshold the predicted probabilities using a fixed cutoff. We set a threshold of 0.41 to maintain the overall false positive rate (FPR) at approximately 5%. Individual fairness metrics can be evaluated with the eval_* functions (see the fairmetrics reference documentation for the full list of functions). For example, if we are interested in calculating the statistical parity of our model across gender (here assumed to be binary), we write:

eval_stats_parity(
  data = test_data, 
  outcome = "day_28_flg",
  group = "gender",
  probs = "pred",
  cutoff = 0.41,
  message = TRUE
)
#> There is evidence that model does not satisfy statistical parity.
#>                     Metric GroupFemale GroupMale Difference 95% Diff CI Ratio
#> 1 Positive Prediction Rate        0.12      0.06       0.06 [0.02, 0.1]     2
#>   95% Ratio CI
#> 1 [1.32, 3.03]

The data frame returned by eval_stats_parity() gives the positive prediction rate in each group defined by the binary protected attribute (GroupFemale and GroupMale in this case), the difference and ratio between the groups, and bootstrap confidence intervals for the estimated difference and ratio. For inference, a difference whose confidence interval contains 0, or a ratio whose confidence interval contains 1, can be regarded as not statistically significant.
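
As a quick sanity check on the choice of cutoff, the overall FPR at 0.41 can be verified directly on the test set. This is a minimal base R sketch rather than a fairmetrics function:

# Overall false positive rate at the chosen cutoff
pred_class <- as.integer(test_data$pred >= 0.41)
sum(pred_class == 1 & test_data$day_28_flg == 0) / sum(test_data$day_28_flg == 0)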

All the eval_* functions follow the same syntax, with the exception of eval_cond_stats_parity(), which evaluates conditional statistical parity and therefore requires a variable to condition on. If we are interested in calculating statistical parity across male and female patients aged 60 and up, we would write:

eval_cond_stats_parity(
  data = test_data, 
  outcome = "day_28_flg",
  group = "gender",
  group2 = "age",
  condition = ">=60",
  probs = "pred",
  cutoff = 0.41,
  message = TRUE
)
#> There is not enough evidence that the model does not satisfy
#>             statistical parity.
#>                     Metric GroupFemale GroupMale Difference 95% Diff CI Ratio
#> 1 Positive Prediction Rate        0.24      0.17       0.07   [0, 0.14]  1.41
#>   95% Ratio CI
#> 1 [0.97, 2.05]

To calculate multiple fairness metrics for the model simultaneously, we pass the test data with its predicted probabilities to the get_fairness_metrics() function.

get_fairness_metrics(
  data = test_data,
  outcome = "day_28_flg",
  group = "gender",
  group2 = "age",
  condition = ">=60",
  probs = "pred",
  cutoff = 0.41
 )
#> $performance
#>                                     Metric GroupFemale GroupMale
#> 1                 Positive Prediction Rate        0.12      0.06
#> 2     Conditional Positive Prediction Rate        0.24      0.17
#> 3                      False Negative Rate        0.56      0.70
#> 4                      False Positive Rate        0.06      0.03
#> 5            Avg. Predicted Positive Prob.        0.45      0.35
#> 6            Avg. Predicted Negative Prob.        0.15      0.11
#> 7                Positive Predictive Value        0.61      0.65
#> 8                Negative Predictive Value        0.92      0.91
#> 9                              Brier Score        0.09      0.08
#> 10                                Accuracy        0.74      0.73
#> 11 (False Negative)/(False Positive) Ratio        1.14      3.00
#> 
#> $fairness
#>                            Metric Difference   95% Diff CI Ratio 95% Ratio CI
#> 1              Statistical Parity       0.06   [0.02, 0.1]  2.00 [1.33, 3.01]
#> 2  Conditional Statistical Parity       0.07 [-0.01, 0.15]  1.41 [0.97, 2.06]
#> 3               Equal Opportunity      -0.14 [-0.29, 0.01]  0.80 [0.63, 1.02]
#> 4             Predictive Equality       0.03     [0, 0.06]  2.00 [0.95, 4.19]
#> 5      Balance for Positive Class       0.10  [0.04, 0.16]  1.29   [1.1, 1.5]
#> 6      Balance for Negative Class       0.04  [0.02, 0.06]  1.36  [1.16, 1.6]
#> 7      Positive Predictive Parity      -0.04 [-0.24, 0.16]  0.94 [0.68, 1.29]
#> 8      Negative Predictive Parity       0.01 [-0.03, 0.05]  1.01 [0.97, 1.05]
#> 9              Brier Score Parity       0.01 [-0.01, 0.03]  1.12 [0.87, 1.46]
#> 10        Overall Accuracy Parity       0.01 [-0.04, 0.06]  1.01 [0.94, 1.09]
#> 11             Treatment Equality      -1.86 [-4.27, 0.55]  0.38  [0.16, 0.9]

The result is divided into two parts: group-specific model performance metrics ($performance) and the fairness criteria ($fairness).
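
Since the output is a list, it can be stored and each component accessed separately; for example:

# Store the output of get_fairness_metrics() and access each component
fairness_results <- get_fairness_metrics(
  data = test_data,
  outcome = "day_28_flg",
  group = "gender",
  group2 = "age",
  condition = ">=60",
  probs = "pred",
  cutoff = 0.41
)
fairness_results$performance # group-specific performance metrics
fairness_results$fairness    # fairness criteria with confidence intervals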

While fairmetrics focuses only on assessing the fairness of models across binary protected attributes, it is possible to work with protected attributes that contain more than two groups by using “one-vs-all” comparisons and a small amount of data wrangling to create the appropriate binary columns, as sketched below.
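
For example, suppose the data contained a protected attribute with three levels. The column ethnic_group used below is hypothetical and included purely for illustration; a binary “one-vs-all” indicator can be created from it and passed to the eval_* functions as usual:

# Hypothetical three-level attribute "ethnic_group" with levels "A", "B" and "C"
test_data_ova <- test_data %>%
  mutate(group_a_vs_rest = ifelse(ethnic_group == "A", "A", "Not A"))

eval_stats_parity(
  data = test_data_ova,
  outcome = "day_28_flg",
  group = "group_a_vs_rest", # now a binary protected attribute
  probs = "pred",
  cutoff = 0.41
)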

Appendix: Confidence Interval Construction

The function get_fairness_metrics() computes Wald-type confidence intervals for both group-specific and disparity metrics using the nonparametric bootstrap. To illustrate the construction of confidence intervals (CIs), we use the following example involving the false positive rate (FPR).

Let $\widehat{\textrm{FPR}}_a$ and $\textrm{FPR}_a$ denote the estimated and true FPR in group $A = a$. Then the difference $\widehat{\Delta}_{\textrm{FPR}} = \widehat{\textrm{FPR}}_{a_1} - \widehat{\textrm{FPR}}_{a_0}$ satisfies (e.g., Gronsbell et al., 2018):

$$\sqrt{n}\left(\widehat{\Delta}_{\textrm{FPR}} - \Delta_{\textrm{FPR}}\right) \overset{d}{\to} \mathcal{N}(0, \sigma^2)$$

We estimate the standard error of $\widehat{\Delta}_{\textrm{FPR}}$ using bootstrap resampling within groups, and form a Wald-style confidence interval:

$$\widehat{\Delta}_{\textrm{FPR}} \pm z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left(\widehat{\Delta}_{\textrm{FPR}}\right)$$

For ratios, such as $\widehat{\rho}_{\textrm{FPR}} = \widehat{\textrm{FPR}}_{a_1} / \widehat{\textrm{FPR}}_{a_0}$, we apply a log transformation and use the delta method:

$$\log(\widehat{\rho}_{\textrm{FPR}}) \pm z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right]$$

Exponentiation of the bounds yields a confidence interval for the ratio on the original scale:

$$\left[ \exp\left\{\log(\widehat{\rho}_{\textrm{FPR}}) - z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right]\right\},\ \exp\left\{\log(\widehat{\rho}_{\textrm{FPR}}) + z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right]\right\} \right].$$
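
For illustration, the following minimal sketch applies this construction to the FPR difference and ratio between the two gender groups, using the test set and cutoff from above. It mirrors the logic described here but is not the package's internal implementation:

# Group-wise FPR: proportion of true negatives predicted positive
fpr <- function(y, yhat) sum(yhat == 1 & y == 0) / sum(y == 0)

y    <- test_data$day_28_flg
yhat <- as.integer(test_data$pred >= 0.41)
g    <- test_data$gender

# Point estimates of the difference and ratio
diff_hat  <- fpr(y[g == "Female"], yhat[g == "Female"]) -
  fpr(y[g == "Male"], yhat[g == "Male"])
ratio_hat <- fpr(y[g == "Female"], yhat[g == "Female"]) /
  fpr(y[g == "Male"], yhat[g == "Male"])

# Bootstrap resampling within each group
B <- 1000
boot <- replicate(B, {
  i_f <- sample(which(g == "Female"), replace = TRUE)
  i_m <- sample(which(g == "Male"), replace = TRUE)
  c(diff  = fpr(y[i_f], yhat[i_f]) - fpr(y[i_m], yhat[i_m]),
    ratio = fpr(y[i_f], yhat[i_f]) / fpr(y[i_m], yhat[i_m]))
})

# Wald-style 95% CI for the difference
diff_hat + c(-1, 1) * qnorm(0.975) * sd(boot["diff", ])

# Wald-style 95% CI for the ratio, formed on the log scale and exponentiated
exp(log(ratio_hat) + c(-1, 1) * qnorm(0.975) * sd(log(boot["ratio", ])))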

References

  1. Raffa, J. (2016). Clinical data from the MIMIC-II database for a case study on indwelling arterial catheters (version 1.0). PhysioNet. https://doi.org/10.13026/C2NC7F.

  2. Raffa J.D., Ghassemi M., Naumann T., Feng M., Hsu D. (2016) Data Analysis. In: Secondary Analysis of Electronic Health Records. Springer, Cham

  3. Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., … & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

  4. Gao, J. et al. What is Fair? Defining Fairness in Machine Learning for Health. arXiv.org https://arxiv.org/abs/2406.09307 (2024).

  5. Gronsbell, J. L. & Cai, T. Semi-supervised approaches to efficient evaluation of model prediction performance. Journal of the Royal Statistical Society Series B (Statistical Methodology) 80, 579–594 (2018).

  6. Hort, M., Chen, Z., Zhang, J. M., Harman, M. & Sarro, F. Bias Mitigation for Machine Learning Classifiers: A Comprehensive survey. arXiv.org https://arxiv.org/abs/2207.07068 (2022).

  7. Hsu, D. J. et al. The association between indwelling arterial catheters and mortality in hemodynamically stable patients with respiratory failure. CHEST Journal 148, 1470–1476 (2015).

  8. Efron, B. & Tibshirani, R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1, (1986).