This paper manually annotates a small text simplification dataset, focusing on sentence splitting. They had four people each simplify 359 sentences from the test set of an existing dataset.
They then evaluated how well human judgments on this data correlated with BLEU scores. I had a bit of a hard time following the methodology, but I _think_ the following is correct.
They took the manual simplification, and the output of various systems, and calculated BLEU against the reference set from the original data. They then compared this with human judgments of grammaticality, meaning preservation, and simplicity.
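For anyone who wants to see what "compare BLEU with human judgments" amounts to in practice, here is a minimal sketch. It assumes a rank correlation (Spearman's); the thread does not say which coefficient the paper actually uses. The scores below are invented toy numbers, not results from the paper, and the helper names (`ranks`, `pearson`, `spearman`) are my own.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data, invented for illustration only (NOT from the paper):
# one BLEU score and one human simplicity rating per system output.
bleu_scores = [0.90, 0.75, 0.60, 0.40]
simplicity_ratings = [1.0, 2.0, 3.0, 4.0]

# A negative value means higher BLEU goes with lower rated simplicity.
print(spearman(bleu_scores, simplicity_ratings))
```

The same function would be called once per human criterion (grammaticality, meaning preservation, simplicity) against the same list of BLEU scores.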
The conclusion: BLEU correlates _negatively_ with simplicity, and hardly at all with grammaticality or meaning. Even worse, their new manual annotations perform _worse_ according to BLEU than _the input sentences_ (i.e., with no simplification).
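A toy sketch of how an unsimplified input can outscore a split version under n-gram overlap. Everything here is an assumption for illustration: the sentences are invented, the reference is assumed to stay close to the input, and `toy_bleu` is a simplified stand-in (clipped unigram and bigram precision with a brevity penalty), not full BLEU-4 and not the paper's setup.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU over whitespace tokens (hypothetical helper)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: only candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Invented sentences, for illustration only.
reference = "the man who lives next door bought a car"
source = "the man who lives next door purchased a car"   # input, left unsimplified
split = "the man lives next door he bought a car"        # hand-split version

print("input copied:", toy_bleu(source, reference))
print("split version:", toy_bleu(split, reference))
```

In this toy case the split version loses bigram matches at the points where the sentence was restructured, so it scores below the untouched input.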
They get similarly poor correlations with human evaluations when using their new manual annotations as a reference set (instead of what came with the original data).
The upshot: don't use BLEU for evaluating simplification; it's worse than useless. Someone needs to do a similar kind of study for question answering and other places where BLEU is used outside of MT (I think it's been done for summarization, but don't have a reference handy).
Interesting, will need to read. It kind of makes sense for tasks as specific as summarization / paraphrasing. There has been work by Reiter et al 2018 that basically came to the conclusion that BLEU correlates with human evaluation only on MT and doesn’t on NLG.
Though he tested mostly on MT and NLG, and there were too few tests on the other tasks imho.