I will say that I implicitly agree that adversarial examples do not (necessarily) feel like compelling explanations (especially if they are disfluent; perhaps also depending on the degree of minimality). But, under the conditions for a CE and ACE, shouldn't TAPs be valid?
Now, I don't closely follow the adversarial literature, but my understanding is that in vision it is possible to synthesize targeted adversarial attacks (especially with glassbox access). Perhaps, given that adversarial attacks are harder to generate in NLP, the emphasis you are placing is on the shortcomings of current adversarial methods (but you believe that sufficiently improved TAP generators are valid ACE generators). If so, I should note that this was at least very unclear to me; my reading was that you viewed these as separate but related.
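For concreteness, the kind of glassbox targeted attack I have in mind from vision is something like one-step targeted FGSM. A minimal PyTorch sketch, assuming a differentiable image classifier `model` (hypothetical) with pixel values in [0, 1]:

```python
import torch
import torch.nn.functional as F

def targeted_fgsm(model, x, target_class, eps=0.03):
    """One-step targeted FGSM: nudge x so the model predicts target_class.

    Descends the cross-entropy loss toward the chosen target label, i.e.
    the reverse of the untargeted attack, which ascends the loss on the
    true label.
    """
    x_adv = x.clone().detach().requires_grad_(True)  # x: batch of images
    target = torch.tensor([target_class] * x.shape[0])
    loss = F.cross_entropy(model(x_adv), target)
    loss.backward()
    # Step *against* the gradient to lower the loss on the target class.
    x_adv = x_adv - eps * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in valid range
```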
Independent of this, I thought I would point to this recent work by Golan et al. (PNAS 2020; pnas.org/content/117/47), which also proposes (in my mind) a different way of generalizing adversarial examples to better understand models (here, to adjudicate which of two models is more human-like).
Hi Rishi, thanks for your question! You are right that TAPs and ACEs both satisfy constraints 1-3 but not 4. However, it does not follow that methods that generate TAPs are also good generators of ACEs; I'll try to clarify why here.
Firstly, many current methods build the assumption that the true label should stay the same (constraint 4) into their search by using semantics-preserving operations such as word substitutions (see the Related Work section of aclweb.org/anthology/2020 for a longer discussion of this).
This means that many ACEs we’d want our methods to generate would not count as TAPs and thus would be excluded by adversarial methods.
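To show how constraint 4 gets baked in mechanically, here is a toy sketch of such a word-substitution search (the `predict` function and `SYNONYMS` table are hypothetical; this is not any specific published attack). Because candidates come only from a synonym table, edits that would change the human label are never even proposed:

```python
# Hypothetical synonym table; real methods use embeddings or thesauri.
SYNONYMS = {"terrible": ["awful", "dreadful"], "movie": ["film", "picture"]}

def tap_search(predict, tokens, orig_label):
    """Greedy search for a label-preserving (constraint 4) adversarial edit.

    Candidates are drawn only from semantics-preserving substitutions, so
    the true label is held fixed by construction; an edit like
    "terrible" -> "great", which could change the human label, is never
    generated, even if it would be a useful ACE.
    """
    for i, tok in enumerate(tokens):
        for sub in SYNONYMS.get(tok, []):
            edited = tokens[:i] + [sub] + tokens[i + 1:]
            if predict(edited) != orig_label:  # model fooled, meaning kept
                return edited
    return None  # no semantics-preserving edit flips the prediction
```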
To make this difference more concrete, imagine a model originally makes a correct prediction, and an ACE edits the input so that the model changes its output to another label, one that a human would also give for the edited input.
The first example in Table 5 of our appendix is an edit of this kind. It would not qualify as a TAP, given that the human/true label for the edited input also changes with the edit; thus, TAP methods would not generate it.
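Put as a check (a toy sketch with hypothetical inputs; the human labels for edited inputs would come from annotation, not from the model):

```python
def classify_edit(model_orig, model_edit, human_orig, human_edit):
    """Toy check of where an edit falls under constraint 4.

    A TAP requires the model's label to flip while the human (true) label
    stays fixed; an edit like the first example in Table 5 flips both, so
    it can be a useful ACE while failing to qualify as a TAP.
    """
    flipped = model_orig != model_edit
    label_preserved = human_orig == human_edit
    if flipped and label_preserved:
        return "TAP"
    if flipped and not label_preserved:
        return "ACE only (true label moved with the edit)"
    return "neither (model prediction unchanged)"
```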
As another example, for the input in Table 5, if we saw that only editing “3/10” -> “9/10” led to the contrast prediction, this edit would be an ACE that highlights a dataset artifact: a reliance on the numerical rating.
However, this edit would not be a good TAP, since it is unclear what the true label for this edited input is, given its mixed signals. Thus, TAP methods may be designed to exclude edits with mixed signals, even though such examples are of interest to ACE generation methods.
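The probe itself is mechanically simple; a toy sketch, again with a hypothetical `predict` function:

```python
def probe_rating_artifact(predict, review, old="3/10", new="9/10"):
    """Swap only the numeric rating and compare the two predictions.

    A flip from this single edit suggests the model leans on the rating
    token rather than the surrounding text (a dataset artifact), even
    though the edited review now carries mixed signals and so has no
    clear true label.
    """
    edited = review.replace(old, new)
    return predict(review), predict(edited)
```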
Secondly, constraint 4 points to a larger difference in the goals of TAPs and ACEs: adversarial examples are meant to deceive models, not to interpret them. (This chapter offers a nice discussion of this difference: christophm.github.io/interpretable-).
This goal differs from the goal of ACEs, which is to explain. For explanation purposes, the ACE in Table 5 is still useful, even though it did not deceive the model, as it allows us to verify that the model got the initial prediction right for the right reasons.

