In recent times, there have been an increasing number of reports and discussions about a decline in the quality of ChatGPT responses. To investigate this matter, a team of researchers from Stanford and UC Berkeley conducted a study to quantify the extent of this degradation. The study confirmed that the drop in ChatGPT quality was real.
ChatGPT’s mathematical accuracy drops to a shocking 2% as response quality deteriorates
The research paper, titled “How does ChatGPT behavior change over time?”, was written by three prominent scholars: Matei Zaharia, Lingjiao Chen, and James Zou. Zaharia, a professor of Computer Science at UC Berkeley, shared the findings on Twitter, revealing a startling fact: GPT-4’s success rate on certain problems dropped dramatically from 97.6% in March to 2.4% in June.
GPT-4, recently released and hailed as OpenAI’s most advanced model, was eagerly awaited by developers for its potential to drive innovative AI products. However, the study results showed disappointing performance, especially in handling simple queries.
The research team designed tasks to assess the quality of responses from the GPT-4 and GPT-3.5 large language models (LLMs). These tasks covered areas such as solving mathematical problems, answering sensitive questions, generating code, and visual reasoning. A graph in the paper provides an overview of the performance of both models in their March and June 2023 releases.
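As a rough illustration of how accuracy on such a task can be scored over time, here is a minimal sketch in Python. It is not the authors’ actual evaluation harness: the numbers and the stored yes/no answers below are entirely hypothetical, standing in for model responses to a prompt like “Is n prime? Answer yes or no.”

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check, adequate for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def accuracy(responses: dict[int, str]) -> float:
    """Fraction of yes/no answers that match the ground truth."""
    correct = 0
    for n, answer in responses.items():
        expected = "yes" if is_prime(n) else "no"
        correct += answer.strip().lower() == expected
    return correct / len(responses)


# Hypothetical answer snapshots from two model versions.
march_answers = {97: "yes", 99: "no", 100: "no", 101: "yes"}
june_answers = {97: "no", 99: "no", 100: "no", 101: "no"}

# On this toy set, the March snapshot scores 1.0 and the June snapshot 0.5.
print(accuracy(march_answers), accuracy(june_answers))
```

Comparing the same fixed question set against snapshots of a model taken at different dates is the basic idea behind the study’s longitudinal comparison.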
The data clearly illustrated that the same LLM service gave different responses over time, showing significant differences in performance within this short period. It remains uncertain how these LLMs are updated and whether changes that improve one aspect of their performance could degrade others. In particular, the June version of GPT-4 performed worse than the March version in three of the test categories, with only a slight improvement in visual reasoning.
While some may not be concerned about varying quality within the “same versions” of these LLMs, it is crucial to recognize that both GPT-4 and GPT-3.5 have been widely adopted by individual users and enterprises due to the popularity of ChatGPT. As such, the information generated by these models can have a significant impact on people’s lives.
The researchers intend to continue evaluating GPT versions in a larger study. They suggest that OpenAI consider monitoring and publishing regular quality checks for its paying customers. Otherwise, businesses and government organizations may need to track basic quality metrics of these LLMs themselves to avoid adverse research and commercial impacts.