University of Southampton Institutional Repository

Model monitoring in the absence of labeled data via feature attributions distributions


Mougan, Carlos (2025) Model monitoring in the absence of labeled data via feature attributions distributions. University of Southampton, Doctoral Thesis, 155pp.

Record type: Thesis (Doctoral)

Abstract

Model monitoring involves analyzing AI algorithms once they have been deployed and detecting changes in their behaviour.
This thesis explores machine learning (ML) model monitoring before predictions impact real-world decisions or users. This step is characterized by one particular condition: the absence of labelled data at test time, which makes it challenging, and often impossible, to calculate performance metrics.

The thesis is structured around two main themes: \emph{(i) AI alignment}, measuring whether AI models behave in a manner consistent with human values, and \emph{(ii) performance monitoring}, measuring whether models achieve specific accuracy goals or desiderata.

A common methodology unifies all sections of the thesis: the analysis of feature attribution distributions for both monitoring dimensions. By exploiting the theoretical properties of these feature attribution explanations, we derive guarantees and insights for model monitoring.

For AI alignment, we explore whether the distributions of feature attributions differ across social groups and propose a new formalization of equal treatment. This novel metric assesses how well AI decisions adhere to ethical standards and political-philosophical values. Our notion of Equal Treatment tests for statistical independence of the explanation distributions over populations with different protected characteristics. We show the theoretical properties of our formalization of equal treatment and devise an equal treatment inspector based on the AUC of a classifier two-sample test.
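The following minimal sketch illustrates this idea as a classifier two-sample test on the explanation distribution. It uses SHAP attributions and scikit-learn with entirely synthetic data and illustrative model and feature choices (it is not the thesis code): an audited model is trained without the protected attribute, its feature attributions are computed, and a second classifier tries to predict the protected group from those attributions; an AUC near 0.5 is consistent with equal treatment, while a clearly higher AUC is not.

# Equal-treatment check via a classifier two-sample test on SHAP explanation
# distributions. All data, feature names, and model choices are illustrative.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000
group = rng.integers(0, 2, size=n)            # protected characteristic (not a model input)
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n) + group,         # proxy feature correlated with the group
})
y = ((X["x1"] + X["x2"] + rng.normal(scale=0.5, size=n)) > 1).astype(int)

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X, y, group, random_state=0)

# 1) Train the model to be audited (without the protected attribute).
model = GradientBoostingClassifier().fit(X_tr, y_tr)

# 2) Compute feature attributions (the explanation distribution) on held-out data.
explainer = shap.TreeExplainer(model)
S_te = pd.DataFrame(explainer.shap_values(X_te), columns=X.columns)

# 3) Classifier two-sample test: predict the protected group from the explanations.
#    AUC close to 0.5 suggests equal treatment; a clearly higher AUC does not.
S_a, S_b, g_a, g_b = train_test_split(S_te, g_te, random_state=0)
inspector = LogisticRegression().fit(S_a, g_a)
auc = roc_auc_score(g_b, inspector.predict_proba(S_b)[:, 1])
print(f"Equal-treatment AUC: {auc:.3f}")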

For performance monitoring, we define \emph{explanation shift} as the statistical comparison between how predictions on training data are explained and how predictions on new data are explained. We propose explanation shift as a key indicator for investigating the interaction between distribution shifts and learned models. We introduce an Explanation Shift Detector that operates on explanation distributions and provides a more sensitive and explainable signal of changes in that interaction. Compared with methods based on input distribution shift, monitoring for explanation shift yields more sensitive indicators of varying model behaviour. We provide theoretical and experimental evidence and demonstrate the effectiveness of our approach on synthetic and real data.
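As a rough illustration of the same mechanism applied to performance monitoring, the sketch below (again with synthetic data and illustrative model choices, not the released implementation) compares the SHAP explanations of a model on reference data against its explanations on shifted data via a classifier two-sample test; the AUC of the detector acts as the explanation-shift indicator.

# Explanation-shift check: compare how the model's predictions are explained on
# reference data vs. new data via a classifier two-sample test on SHAP values.
# Data generation and model choices are illustrative assumptions.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000
X_ref = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
y_ref = X_ref["x1"] * X_ref["x2"] + rng.normal(scale=0.1, size=n)

# New data with a covariate shift that changes how the learned interaction is used.
X_new = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(loc=2.0, size=n)})

model = GradientBoostingRegressor().fit(X_ref, y_ref)
explainer = shap.TreeExplainer(model)
S_ref = pd.DataFrame(explainer.shap_values(X_ref), columns=X_ref.columns)
S_new = pd.DataFrame(explainer.shap_values(X_new), columns=X_new.columns)

# Label reference explanations 0 and new-data explanations 1, then measure how
# well a detector separates them. AUC ~ 0.5 means no explanation shift; higher
# values indicate a shift in how the model uses its features on the new data.
S = pd.concat([S_ref, S_new], ignore_index=True)
z = np.concatenate([np.zeros(len(S_ref)), np.ones(len(S_new))])
S_tr, S_te, z_tr, z_te = train_test_split(S, z, random_state=0, stratify=z)
detector = LogisticRegression().fit(S_tr, z_tr)
print("Explanation-shift AUC:", roc_auc_score(z_te, detector.predict_proba(S_te)[:, 1]))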

Finally, to explain model degradation, we use a second model that predicts the uncertainty estimates of the first.
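The abstract does not detail how the uncertainty estimates are obtained, so the sketch below makes an illustrative assumption: it approximates the first model's uncertainty by the disagreement of a bootstrapped ensemble, then fits a second, interpretable model to predict that uncertainty from the input features, so that its feature importances hint at what drives the degradation.

# Explaining degradation with a second model. The uncertainty of the first model
# is approximated here by the disagreement of a bootstrapped ensemble (an
# illustrative choice); the second model then predicts that uncertainty from the
# inputs so its feature importances point at likely drivers of degradation.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=2000)

# First model(s): a small bootstrapped ensemble whose spread serves as an
# uncertainty estimate.
ensemble = []
for seed in range(10):
    Xb, yb = resample(X, y, random_state=seed)
    ensemble.append(GradientBoostingRegressor(random_state=seed).fit(Xb, yb))

X_new = rng.normal(loc=1.0, size=(2000, 3))   # shifted deployment data, no labels
preds = np.stack([m.predict(X_new) for m in ensemble])
uncertainty = preds.std(axis=0)

# Second model: predicts the first model's uncertainty from the raw features.
explainer_model = DecisionTreeRegressor(max_depth=3).fit(X_new, uncertainty)
print(dict(zip(["x1", "x2", "x3"], explainer_model.feature_importances_.round(3))))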

Additionally, we release two open-source Python packages, \texttt{skshift} and \texttt{explanationspace}, which implement our methods and provide usage tutorials for further reproducibility.

Text
PhD_Carlos_Mougan (1) - Version of Record
Available under License University of Southampton Thesis Licence.
Download (3MB)
Text
Final-thesis-submission-Examination-Mr-Carlos-Mougan (2)
Restricted to Repository staff only

More information

Published date: February 2025

Identifiers

Local EPrints ID: 498652
URI: http://eprints.soton.ac.uk/id/eprint/498652
PURE UUID: 3729e041-bcc0-4512-a763-ce34c491e0af
ORCID for Steffen Staab: orcid.org/0000-0002-0780-4154
ORCID for Thanassis Tiropanis: orcid.org/0000-0002-6195-2852

Catalogue record

Date deposited: 25 Feb 2025 17:31
Last modified: 22 Aug 2025 02:13


Contributors

Author: Carlos Mougan
Thesis advisor: Steffen Staab
Thesis advisor: Thanassis Tiropanis
