Benchmarks can be very misleading, says Douwe Kiela at Facebook AI Research, who led the team behind the tool. Focusing too much on benchmarks can mean losing sight of wider goals. The test can become the task.
“You end up with a system that is better at the test than humans are but not better at the overall task,” he says. “It’s very deceiving, because it makes it look like we’re much further than we actually are.”
Kiela thinks that’s a particular problem with NLP right now. A language model like GPT-3 appears intelligent because it is so good at