We highlighted something similar in the multi-objective optimisation literature. Unfortunately, it looks like comparing benchmark scores between papers can be unreliable.
- _Algorithm A_ implemented by _Researcher A_ performs differently from _Algorithm A_ implemented by _Researcher B_.
- _Algorithm A_ outperforms _Algorithm B_ in _Researcher A's_ study.
- _Algorithm B_ outperforms _Algorithm A_ in _Researcher B's_ study.
That's a simple case... and it can come down to many different factors which are often omitted in the publication. It can drive PhD students mad as they try to reproduce results and understand why theirs don't match!
I like presenting the correlation of results between different benchmarks - I'd be interested in hearing to what extent this problem exists in more traditional benchmarking. One difference is that ML has this accuracy/quality component where in the past we've been more concerned with performance. Unfortunately this paper doesn't really address the long history of non-ML benchmarking, and I find it hard to believe no one has previously addressed the fragility of benchmark results.
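For anyone who wants to try the correlation idea on their own numbers, here is a rough sketch of a Spearman rank correlation between two benchmarks. It is not taken from the paper: the function names and the scores are made up for illustration, and it assumes each list holds the scores of the same algorithms in the same order.

```python
from statistics import mean


def ranks(values):
    """Rank values from 1 (smallest) upward, averaging ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of tied values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either list is constant.
    """
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for the same four algorithms on two benchmarks.
benchmark_1 = [0.71, 0.64, 0.80, 0.55]
benchmark_2 = [0.62, 0.70, 0.75, 0.50]

rho = spearman(benchmark_1, benchmark_2)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would suggest the two benchmarks rank the algorithms the same way; a value near 0 or below would suggest that a win on one says little about the other. With only a handful of algorithms the estimate is very noisy, so treat it as a sanity check rather than evidence.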
>> Thus when using a benchmark, we should also think about and clarify answers to several related questions: Do improvements on the benchmark correspond to progress on the original problem? (...) How far will we get by gaming the benchmark rather than making progress towards solving the original problem?
But what is the "original problem" and how do we measure progress towards solving it? Obviously there's not just one such problem - each community has a few of its own.
But in general, the reason that we waste so much time and effort on benchmarks in AI research (and that's AI in general, not just machine learning this time) is that nobody can really answer this fundamental question: how do we measure the progress of AI research?
And that in turn is because AI research is not guided by a scientific theory: an epistemic object that can explain current and past observations according to current and past knowledge, and make predictions of future observations. We do not have such a theory of artificial intelligence. Therefore, we do not know what we are doing, we do not know where we are going and we do not even know where we are.
This is the sad, sad state of AI research. If AI research has been reduced, time and again, to a spectacle, a race to the bottom of pointless benchmarks, that's because AI research has never stopped to take its bearings, figure out its goals (there are no commonly accepted goals of AI research) and establish itself as a science, with a theory - rather than a constantly shifting trip from demonstration to demonstration. 70 years of demonstrations!
I think the paper above manages to go on about benchmarks for 34 pages and still miss the real limitation of empirical-only evaluations in a field without a theoretical basis: no matter what benchmarks you choose and how, without a theoretical basis, you'll never know what you're doing.
Well... none of the papers are up to the standards of the Money Laundering crowd.
For a minute I thought someone scooped our SIGBOVIK paper: https://madaan.github.io/res/papers/sigbovik_real_lottery.pd...!