Independent of this, I thought I would point to this recent work of Golan et al. (PNAS 2020; pnas.org/content/117/47), which also proposes (in my mind) a different way of generalizing adversarial examples to better understand models (here, to adjudicate which is more human-like).
Hi Rishi, thanks for your question! You are right that TAPs and ACEs both satisfy constraints 1-3 but not 4. However, it does not follow that methods to generate TAPs are also good generators of ACEs. I’ll try to clarify here why.
Firstly, many current methods build in the assumption that the true label should stay the same (constraint 4) by using semantics-preserving operations such as word substitutions (see Related Work in aclweb.org/anthology/2020 for a longer discussion of this).
This means that many ACEs we’d want our methods to generate would not count as TAPs and thus would be excluded by adversarial methods.
To make this difference more concrete, imagine a model makes a correct prediction originally, and an ACE results in an input for which the model changes its output to another label that a human would also give for that edited input.
An example of this kind of edit is the first example in Table 5 in our appendix. This edit would not qualify as a TAP, given that the human/true label for the edited input would also change with the edit. Thus, TAP methods would not generate this edit.
As another example, for the input in Table 5, if we saw that only editing “3/10” -> “9/10” led to the contrast prediction, this edit would be an ACE that highlights a dataset artifact: a reliance on the numerical rating.
However, this edit would not be a good TAP, since it’s unclear what the true label for this edited input is due to its mixed signals. Thus, TAP methods may be designed to exclude edits w/ mixed signals, though such examples are of interest to ACE generation methods.
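To make the distinction above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `Edit` record, the function names, and the "negative"/"positive" labels are hypothetical and not taken from the paper, and the sketch encodes only the prediction flip and constraint 4 (the true label stays the same), since the other constraints are not spelled out in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edit:
    """Hypothetical record of one edit and the labels around it."""
    original_pred: str           # model prediction on the original input
    edited_pred: str             # model prediction on the edited input
    original_true: str           # human label for the original input
    edited_true: Optional[str]   # human label for the edited input; None if unclear


def is_ace_candidate(edit: Edit, contrast_label: str) -> bool:
    # An ACE only needs the model to move to the contrast label;
    # it places no requirement on the human label of the edited input.
    return edit.original_pred != contrast_label and edit.edited_pred == contrast_label


def is_tap_candidate(edit: Edit) -> bool:
    # A TAP additionally requires constraint 4: the true label must be
    # known and unchanged, so the flipped prediction is a model error.
    flipped = edit.original_pred != edit.edited_pred
    label_kept = edit.edited_true is not None and edit.edited_true == edit.original_true
    return flipped and label_kept


# Assumed example values, for illustration only.
label_changing = Edit("negative", "positive", "negative", "positive")  # human label changes too
mixed_signals = Edit("negative", "positive", "negative", None)         # true label unclear
classic_attack = Edit("negative", "positive", "negative", "negative")  # true label preserved

for name, e in [("label_changing", label_changing),
                ("mixed_signals", mixed_signals),
                ("classic_attack", classic_attack)]:
    print(name, is_ace_candidate(e, "positive"), is_tap_candidate(e))
```

In this sketch, the label-changing edit and the mixed-signals edit both count as ACE candidates but fail the TAP check, which is the gap described above: a generator that enforces constraint 4 would never propose them.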
Secondly, constraint 4 points to a larger difference in the goals of TAPs and ACEs: adversarial examples are meant to deceive, not interpret, models. (This chapter offers a nice discussion of this difference: christophm.github.io/interpretable-).
This goal differs from the goal of ACEs, which is to explain. For explanation purposes, the ACE in Table 5 is still useful, even though it did not deceive the model, as it allows us to verify that the model got the initial prediction right for the right reasons.
This larger difference in goals may also influence methodology in additional ways, which point to interesting directions for future work: 1) Unlike in work on adv. examples, the goal of research on contrastive edits is to achieve strict minimality, as we discuss in Sect. 5.
2) Another goal of ACEs is to create edits that are immediately understandable to people, which could require different, more human-centered definitions of minimality.
For instance, we may want our edits to be minimal in the sense of being contiguous, since connected edits are more understandable than edits that alter disconnected parts of the input. This is also not of interest to work on adversarial examples.
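One way to picture the contiguity notion is the rough Python sketch below. It is not the method from the paper: it assumes whitespace-tokenized inputs, uses the standard library's `difflib.SequenceMatcher` to align them, and the function names and example sentences are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def edited_spans(original: List[str], edited: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges of the original that were changed."""
    matcher = SequenceMatcher(a=original, b=edited, autojunk=False)
    # Every opcode that is not "equal" marks one changed region.
    return [(i1, i2) for tag, i1, i2, _j1, _j2 in matcher.get_opcodes() if tag != "equal"]


def is_contiguous_edit(original: List[str], edited: List[str]) -> bool:
    # Under this assumed definition, an edit is contiguous when all
    # changes fall inside exactly one changed region.
    return len(edited_spans(original, edited)) == 1


# Hypothetical example sentences.
original = "the movie was bad and the acting was dull".split()
one_span = "the movie was really great and the acting was dull".split()
two_spans = "the movie was great and the acting was superb".split()

print(is_contiguous_edit(original, one_span))
print(is_contiguous_edit(original, two_spans))
```

A human-centered minimality criterion could then prefer `one_span` over `two_spans` even when both flip the prediction, whereas a plain count of changed tokens would not separate them in the same way.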

