Study Reveals a Decrease in ChatGPT’s Accuracy

Stanford and UC Berkeley researchers found that the LLM’s ability to generate computer code deteriorated in just a few months.

Two new studies raise concerns about OpenAI’s ChatGPT large language model. On one hand, multiple studies and sources report that it can generate text nearly as good as what humans write.

But on the other hand, it seems to be becoming less accurate as time goes on. What’s more troubling is that no one has a clear explanation for why this is happening.

In a study published on Tuesday, researchers from Stanford and UC Berkeley found that ChatGPT’s behavior has changed over time, and not for the better. They remain somewhat puzzled as to why the quality of its responses is declining.

To check how consistent ChatGPT’s underlying models, GPT-3.5 and GPT-4, are, the researchers ran tests to see whether the quality and accuracy of the AI’s answers varied over time. They also checked how well it could follow instructions.

The researchers asked both GPT-3.5 and GPT-4 to do things like solve math problems, answer sensitive and risky questions, perform visual reasoning, and generate computer code.

In their evaluation, the team discovered that the behavior of the same LLM service can change significantly in a relatively short period, underscoring the importance of continuously monitoring LLM quality. For instance, in March 2023, GPT-4 showed an impressive 98 percent accuracy in identifying prime numbers, but by June, its accuracy dropped drastically to less than 3 percent for the same task.
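To make that concrete, here is a minimal sketch of the kind of drift check the paper describes: pose the same yes/no primality questions to a model snapshot at two points in time and compare accuracy against ground truth. The query_model helper below is a hypothetical stand-in for a real chat-completion call, not the researchers’ actual harness.

```python
import random

def is_prime(n: int) -> bool:
    """Ground-truth primality via trial division (fine for small n)."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def query_model(question: str) -> str:
    """Hypothetical stand-in for an LLM API call; should return 'yes' or 'no'."""
    raise NotImplementedError("wire this up to your chat-completion client")

def primality_accuracy(numbers: list[int]) -> float:
    """Fraction of numbers the model classifies correctly as prime or not."""
    correct = 0
    for n in numbers:
        answer = query_model(f"Is {n} a prime number? Answer yes or no.")
        if answer.strip().lower().startswith("yes") == is_prime(n):
            correct += 1
    return correct / len(numbers)

# Score the same fixed sample against the March and June snapshots;
# the gap between the two scores quantifies the drift.
sample = [random.randrange(1_001, 20_000, 2) for _ in range(100)]
```

Running an identical, fixed question set against each dated snapshot is what makes scores like 98 percent and 3 percent comparable in the first place.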

Meanwhile, GPT-3.5’s June 2023 version was better at identifying prime numbers than its March 2023 version. When it came to generating computer code, however, both models performed worse in June than in March.
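“Performed worse” at code generation can also be made measurable. One plausible check, sketched below under the assumption that an answer only counts if it runs without manual cleanup, is to test whether the raw model output even parses as code; chatty preambles or stray Markdown fences would fail it. The function name is illustrative, not taken from the paper.

```python
def is_directly_executable(completion: str) -> bool:
    """Return True only if the raw LLM output parses as Python as-is.

    Extra prose or Markdown fences wrapped around otherwise-correct code
    make it fail, which is exactly the kind of regression this would catch.
    """
    try:
        compile(completion, "<llm-output>", "exec")
        return True
    except SyntaxError:
        return False

print(is_directly_executable("def add_one(x):\n    return x + 1\n"))            # True
print(is_directly_executable("```python\ndef add_one(x):\n    return x\n```"))  # False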

These variations could have real-world consequences, especially in fields like healthcare. A recent paper by researchers from NYU revealed that ChatGPT’s responses to healthcare-related queries were almost indistinguishable from those of human medical professionals in terms of tone and phrasing.

Participants had difficulty telling whether responses came from a human healthcare provider or the OpenAI language model, raising concerns about AI’s ability to handle medical data privacy and its potential to provide inaccurate information.

Academics and users have also taken note of ChatGPT’s declining performance. OpenAI’s developer forum has been discussing the LLM’s progress, or lack thereof, with users expressing their disappointment and requesting an official response to address the issue.

One user even compared the LLM’s decline to going from being a great sous-chef to a dishwasher, highlighting the frustration among paying customers.

OpenAI’s approach of keeping its LLM research and development closed off from external review has faced significant criticism and pushback from industry experts and users. Matei Zaharia, one of the co-authors of the ChatGPT quality review paper, expressed his difficulty in understanding why these issues are occurring.

Zaharia, an associate professor of computer science at UC Berkeley and CTO of Databricks, suggested that reinforcement learning from human feedback (RLHF) and fine-tuning may be hitting their limits, though he also acknowledged that system bugs could be a factor.

While ChatGPT may demonstrate proficiency on basic Turing Test benchmarks, its inconsistent quality still raises significant challenges and concerns for the public. All the while, there are few barriers to its continued use and integration into everyday life.
