The latest in a slew of speculative AI research papers is making some pretty outlandish claims about how deep learning models have some subtle, unrealized cognitive abilities akin to, or even surpassing, those of humans. Though researchers found a modern generative pre-trained transformer model does well at multiple-choice tests that don’t necessarily require language, they still have no honest idea if the AI is just basing its answers off its opaque training data.
University of California, Los Angeles researchers tested “analogical tasks” on the GPT-3 large language model and found it was at or above “human capabilities” for resolving complex reasoning problems. UCLA was fast to make rather outlandish claims about the research in its press release Monday, raising the question of whether the AI was “using a fundamentally new kind of cognitive process.”
That’s an inherently biased question that relies on a sensational view of AI systems, but let’s look a bit deeper. The UCLA psychology postdoc researcher Taylor Webb and professors Keith Holyoak and Hongjing Lu published their paper in the journal Nature Human Behaviour. They compared the AI’s answers to those of 40 undergrad students and found the bot performed at the higher end of humans’ scores, and that it even made a few of the same mistakes.
In particular, the researchers based their tests on the non-verbal test called Raven’s Progressive Matrices developed all the way back in 1939. It’s a list of 60 multiple-choice questions that get harder as they go along, and they mostly require test takers to identify a pattern. Some have extrapolated Raven’s to measure IQ as a score for general cognitive ability, especially since some proponents say it doesn’t hold many ethnic or cultural biases compared to other, inherently biased intelligence tests.
Thankfully, the paper doesn’t try to ascribe a bunk IQ score to the AI. The researchers also asked the bot to solve a set of SAT analogy questions that involved word pairs. Say “vegetable” is related to “cabbage.” By the same relation, “insect” pairs with “beetle,” and so on. The researchers claimed that, to their knowledge, the questions had not appeared on the internet and that it was “unlikely” they would have been gobbled up as part of GPT-3’s training data. Again, the AI performed at a level slightly above the average meat bag.
There are several problems the AI sucks at, or perhaps it’s just more of a STEM kid than a humanities student. It was much less capable of solving analogy problems based on short stories, though the newer, more expansive GPT-4 was overall better at the task. Asked to use a bunch of household objects to transfer gumballs from one room to another, the AI came up with “bizarre solutions.”
Webb and his fellows have been working on this problem for close to half a year, and since their initial preprint they’ve added more tests to the model. All these tests led them to start openly theorizing about how GPT-3 could be forming some kind of “mapping process” similar to how humans are theorized to tackle such problems. The researchers jumped at the idea that AI could have developed some alternate type of machine intelligence.
The “spatial” part of the tests would often involve shapes, and it required the AI to guess the correct shape or diagram based on previous, similar shapes. The study authors went on to further draw comparisons to flesh and blood test takers, saying that the AI shared many similar features of “human analogical reasoning.” Essentially, the researchers said that the AI was reasoning in the same ways that humans did through having a sense of the comparison of shapes.
Webb and his colleagues first released a preprint of the paper in December. There, the researchers claimed GPT-3 didn’t have “any training” on these tests or related tasks.
There is a fundamental problem with anybody trying to claim that there’s something the AI isn’t trained on. Is it possible there’s absolutely nothing language-based on the Raven’s test in the full 45 terabytes of training data used by the AI? Perhaps, but GPT-3-creator OpenAI has not released a full list of what’s contained inside the data set that its LLM learned from. This is for a few reasons. One is to keep its proprietary AI under lock and key to better sell its services. The second is to keep even more people from suing the company for copyright infringement.
Previously, Google CEO Sundar Pichai claimed in an interview that somehow, Google’s Bard chatbot learned Bengali on its own. The thing is, researchers found Bengali and other overlapping languages already existed in the training data. Most of AI’s data is centered on English and the “West,” but its learning is so broad and covers such a vast range of information that there’s a chance some example of language-less problem-solving slipped in there.
The UCLA release even mentions that the researchers have no idea how or why the AI does any of this since they don’t have access to OpenAI’s secret sauce. What this paper and others like it do is create even more hysteria about the AI containing some form of actual “intelligence.” OpenAI CEO Sam Altman has run on at length about the concerns of Artificial General Intelligence, a kind of computer system that’s actually smart. But what that means in practice is nebulous. Altman described GPT-4 as an “alien intelligence” in an interview with The Atlantic, where he also described the AI writing computer code it wasn’t explicitly programmed to write.
But it’s also a shell game. Altman won’t release what’s in the AI’s training data, and because it’s a big black box, the company, AI proponents, and even well-meaning researchers can get suckered into the hype with claims the language models are breaking free from the digital cage containing them.