5 charts showing large AI model progress vs. human benchmarks

David Hall
6 min read · Mar 26, 2021


Benchmarks are a key indicator of progress in the AI field, and great progress has been made.

Large models are compared against large test sets of questions, images, and tasks. As a control, humans take these tests to set a benchmark for comparison against AI models. Over time, one of the easiest ways to demonstrate industry advancement is progress on these benchmarks. In many cases AIs are not only meeting, but far exceeding, the human performance line, leading researchers to develop newer and harder benchmarks.

Thus, an even more interesting signal of AI industry progress is the gradual raising of the bar by which we measure these models. Natural language is a field where models have consistently evolved to beat human performance benchmarks on multiple, increasingly hard tests (SQuAD, SQuAD 2.0, GLUE, SuperGLUE), which forced researchers to continually find harder tasks for the models to solve.

Raising the bar on AI is nothing new: we have seen computer models go from beating the world's best chess player in the 90s (IBM Deep Blue) to a single model from Google DeepMind mastering chess, Go, and a suite of Atari games.

(Yes, but can it beat my brother at Monopoly knowing that he cheats?).

Benchmark examples in 5 charts

Following this trend, I have pulled some of the key benchmark charts out of Stanford's annual AI Index report, which demonstrate progress across different sectors of AI.

1. ImageNet

What: ImageNet is a large database of labeled images. A computer vision model is asked to state what is in the presented picture.

Wow: Vision models are not just good; they are well beyond human-level accuracy, and fast. While this is a narrow AI task, that accuracy makes vision models easy to fold into industry (recognizing defects or process issues).
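To make the score on these charts concrete, here is a minimal sketch (plain Python, with made-up labels and predictions, not real benchmark data) of how top-5 accuracy, the usual ImageNet metric, is computed from a model's ranked guesses:

```python
# Illustrative sketch of ImageNet-style top-5 accuracy scoring.
# The labels and predictions below are invented for demonstration.

def top5_accuracy(true_labels, ranked_predictions):
    """Fraction of images whose true label appears in the model's top 5 guesses."""
    hits = sum(
        1 for truth, guesses in zip(true_labels, ranked_predictions)
        if truth in guesses[:5]
    )
    return hits / len(true_labels)

true_labels = ["tabby cat", "forklift", "golden retriever"]
ranked_predictions = [
    ["tabby cat", "tiger cat", "lynx", "Persian cat", "Egyptian cat"],      # hit
    ["tow truck", "forklift", "crane", "trailer truck", "snowplow"],        # hit
    ["Labrador retriever", "kuvasz", "cocker spaniel", "beagle", "dingo"],  # miss
]

print(top5_accuracy(true_labels, ranked_predictions))  # 2/3 ≈ 0.67
```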

2. SuperGLUE

What: SuperGLUE is an evolution of the GLUE (General Language Understanding Evaluation) benchmark, which was previously bested by an AI model. GLUE is a collection of nine language understanding tasks that measure common understanding of language, such as judging sentiment ("is this positive or negative?"), answering questions about a passage, and deciding whether one sentence follows from another.
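As an illustration of the simplest of these tasks, the snippet below runs a GLUE-style sentiment check with an off-the-shelf pretrained model via the Hugging Face transformers pipeline. The library, its default model, and the example sentences are my own illustrative choices, not anything the benchmark itself prescribes.

```python
# Minimal sketch of a GLUE-style sentiment task ("is this positive or negative?")
# using the Hugging Face transformers library; requires `pip install transformers torch`.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a small pretrained model

print(classifier("The benchmark results this year were genuinely impressive."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
print(classifier("The model still fails on questions a child could answer."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```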

Wow: The questions on this test are at SAT reading comprehension level, so they are relatively broad and challenging. More importantly, the GLUE benchmark was released in 2018, and after human parity was achieved, the harder SuperGLUE benchmark was released in 2019. Human parity has now been achieved twice, and a new benchmark is being developed to raise the bar again.

3. SQuAD

What: The Stanford Question Answering Dataset v1.1 is a collection of 100,000 crowdsourced reading comprehension question/answer pairs drawn from Wikipedia. SQuAD 2.0, introduced in 2018, builds on this with 50,000 unanswerable questions designed to look like answerable ones. To perform well, the NLP model must determine when the correct answer is not available. (Examples of questions: SQuAD — the Stanford Question Answering Dataset (rajpurkar.github.io))
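To show what "determine when the correct answer is not available" means in practice, here is a hypothetical, simplified scoring sketch (not the official SQuAD evaluation script, which also normalizes answers and computes F1): a prediction only counts when it matches a gold answer, and an unanswerable question is only credited when the model explicitly abstains.

```python
# Illustrative sketch of SQuAD 2.0-style exact-match scoring, including the
# unanswerable case. The examples below are invented for demonstration.

def exact_match(prediction, gold_answers):
    if not gold_answers:          # unanswerable question
        return prediction == ""   # credit only if the model abstains
    return prediction in gold_answers

examples = [
    {"question": "Where is the Eiffel Tower?", "gold": ["Paris"], "prediction": "Paris"},
    {"question": "Who was the first person on Mars?", "gold": [], "prediction": ""},                # correct abstention
    {"question": "Who was the first person on Mars?", "gold": [], "prediction": "Neil Armstrong"},  # fooled by a trick question
]

score = sum(exact_match(e["prediction"], e["gold"]) for e in examples) / len(examples)
print(f"Exact match: {score:.2f}")  # 0.67
```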

Wow: Note the slow progress after human parity is achieved. In practice, the final few points of accuracy are very expensive and often require massive amounts of compute to train.

4. Visual Commonsense Reasoning

What: The VCR dataset is 250K questions based on pictures that require some amount of basic inference. In the example from Swingers above, it seems obvious that Jon Favreau is pointing because Ron Livingston got the pancakes. While there is general human agreement on the answers, the rationale can rest on cultural norms learned through experience ("you had to be there").
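For a sense of the format, the model must first pick the right answer and then pick the right rationale, both from multiple choices. The item below is invented to mirror the Swingers example; it is not taken from the actual dataset.

```python
# Hypothetical illustration of the VCR multiple-choice format (answer + rationale).
vcr_item = {
    "image": "diner_scene.jpg",
    "question": "Why is person1 pointing at person2?",
    "answer_choices": [
        "He is telling person2 to leave the diner.",
        "He is indicating that person2 ordered the pancakes.",
        "He is introducing person2 to the waitress.",
        "He is showing person2 something outside the window.",
    ],
    "answer_label": 1,  # the model must first pick the right answer...
    "rationale_choices": [
        "The waitress is holding a plate of pancakes and looking between the two men.",
        "person2 is wearing a chef's uniform.",
        "The diner is empty and the lights are off.",
        "person1 is holding a menu upside down.",
    ],
    "rationale_label": 0,  # ...and then justify it with the right rationale.
}

print(vcr_item["answer_choices"][vcr_item["answer_label"]])
```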

Wow: Models do a poor, but improving, job on this dataset. The dataset demonstrates a distinctly human capability: playing out a scene from a still photo based on experience. This kind of dreaming and general understanding remains an incredibly challenging problem for AI models.

5. Visual Question Answering

What: The VQA test set is a series of questions about images that require external knowledge not present in the picture. A sufficient model needs a broad range of general or prior knowledge to accomplish the task.
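For a sense of the task format, here is a hedged sketch using a pretrained visual question answering model from the Hugging Face hub. The specific model, image path, and question are my own illustrative choices, not something taken from the benchmark or the AI Index report.

```python
# Minimal sketch of the VQA task format using the Hugging Face transformers
# library; requires `pip install transformers torch pillow`. Replace the image
# path with a real file of your own before running.
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

result = vqa(image="street_scene.jpg", question="What season does this look like?")
print(result)  # e.g. [{'answer': 'winter', 'score': 0.8...}, ...]
```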

Wow: The human benchmark has not yet been reached on this dataset, demonstrating the complexity of general understanding and prior knowledge. Currently, the best way to make a model more general is to throw a ton of compute and pre-training at the problem. Cracking this benchmark will be an expensive and challenging proposition.

Financial impact

On narrow tasks like image recognition or translation, AI models have surpassed human capability and are moving toward 100% accuracy on benchmarks. In these fields, we've seen a lot of industry adoption. Examples include vision systems that recognize defects on manufacturing lines, or automated translation and transcription in your live Teams meetings. The ROI here is large and well defined.

As benchmarks continue to raise the bar on AI models, they increasingly require more general understanding and prior knowledge to solve tasks. Humans retain an advantage here. However, on language tasks, models like GPT-3 have begun to demonstrate human-level SAT reading comprehension. In the future, these more generalized models will be able to augment human capability on more complex tasks like taking medical notes or summarizing a meeting.

Conclusion

The market for AI models doing these narrow tasks is large and well defined because we know the direct replacement costs of things like quality assurance on a manufacturing line or the dollars lost from defects. This defines the total addressable market of “narrow” AI well.

However, the more generalized models have a much less defined market size. How much productivity do we gain back from well-summarized meetings when no human needs to waste time writing them up? None. Lots. Hard to say.

While we can intuit that there is value there, these more general AI models may actually be creating new, uncosted markets and generating value from productivity rather than replacing individual, measurable tasks. I believe this value, while nebulous, is certainly larger than the well-defined market of narrow AI tasks, and it is how the large investments in large-model AI (mostly compute cost) will pay out as more benchmark levels are achieved.

Originally published by me at David Hall | LinkedIn; for this and other AI Finance posts, please follow or connect with me on LinkedIn.

