[1911.02508] How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods

As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Using extensive evaluation with multiple real-world datasets (including COMPAS), we demonstrate how extremely biased classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations which do not reflect the underlying biases.
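The abstract compresses the mechanism, so here is a minimal sketch of how such a scaffold could work. LIME and SHAP probe a model with synthetic neighborhood samples that tend to fall off the data manifold, so an off-manifold detector can route those probes to a harmless surrogate. Everything below is an illustrative assumption rather than the authors' implementation: the class name `ScaffoldedClassifier`, the Gaussian-noise stand-in for LIME/SHAP perturbations, and the toy models `f` and `psi` are all hypothetical.

```python
# Hedged sketch of the "scaffolding" idea from the abstract: route real
# inputs to a biased model f and suspected explainer perturbations to an
# innocuous model psi. Not the authors' code; names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class ScaffoldedClassifier:
    """Wraps a biased model f and an innocuous model psi.

    A detector is trained to tell real data points apart from the kind of
    synthetic perturbations LIME/SHAP generate. Real inputs go to f (so
    deployed predictions stay biased); perturbed inputs go to psi (so the
    post hoc explanation looks innocuous).
    """

    def __init__(self, f, psi):
        self.f, self.psi = f, psi
        self.detector = RandomForestClassifier(n_estimators=100)

    def fit_detector(self, X_real, noise_scale=0.3):
        # Label real rows 0 and noisy copies 1. Gaussian noise is a crude
        # stand-in (an assumption) for the explainers' perturbation scheme.
        idx = np.random.randint(0, len(X_real), size=len(X_real))
        X_fake = X_real[idx] + np.random.normal(0, noise_scale, X_real.shape)
        X = np.vstack([X_real, X_fake])
        y = np.concatenate([np.zeros(len(X_real)), np.ones(len(X_fake))])
        self.detector.fit(X, y)

    def predict(self, X):
        is_perturbed = self.detector.predict(X).astype(bool)
        out = np.empty(len(X), dtype=int)
        if (~is_perturbed).any():
            out[~is_perturbed] = self.f(X[~is_perturbed])   # biased path
        if is_perturbed.any():
            out[is_perturbed] = self.psi(X[is_perturbed])   # innocuous path
        return out

# Hypothetical usage: f decides entirely from a sensitive feature
# (column 0), while psi uses an uncorrelated one (column 1).
rng = np.random.default_rng(0)
X_real = rng.normal(size=(1000, 5))
f = lambda X: (X[:, 0] > 0).astype(int)
psi = lambda X: (X[:, 1] > 0).astype(int)

adv = ScaffoldedClassifier(f, psi)
adv.fit_detector(X_real)
# Explaining adv.predict with LIME/SHAP would now tend to surface
# column 1 rather than the sensitive column 0.
```

In this sketch the attack succeeds to the extent that the detector separates on-manifold points from explainer perturbations; the paper's evaluation on datasets like COMPAS measures exactly that gap.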

10 mentions: @hima_lakkaraju @ai4life_harvard @berilsirmacek @pm_girl @CarlRioux
Keywords: lime shap
Date: 2019/11/08 02:21

Referring Tweets

@hima_lakkaraju Wondering if you can game explainability methods (e.g. LIME/SHAP) to say whatever you want to? Turns out you can! More details in our recent research: t.co/ihFHTCkH4E t.co/pU3QIOBMFg
@ai4life_harvard Wondering if you can game explainability methods (e.g. LIME/SHAP) to say whatever you want to? Turns out you can! You should not miss our recent research: t.co/kJQBzL5Rjl
@hima_lakkaraju Want to know how adversaries can game explainability techniques? Our latest research - "How can we fool LIME and SHAP? Adversarial Attacks on Explanation Methods" has answers: t.co/Bcx2geO3mv. Joint work with the awesome team: @dylanslack20, Sophie, Emily, @sameer_

Related Entries

Interpreting machine learning prediction results with LIME - Qiita
0 users, 0 mentions 2018/09/06 09:24
GitHub - marcotcr/lime: Lime: Explaining the predictions of any machine learning classifier
16 users, 3 mentions 2019/11/05 05:20
Understanding how LIME explains predictions – Towards Data Science
1 user, 0 mentions 2018/12/28 10:56
GitHub - limexp/xgbfir: XGBoost Feature Interactions Reshaped
5 users, 0 mentions 2018/12/21 10:45