Authors:
Paper: https://arxiv.org/abs/2408.06752
Introduction
Evaluating the quality of academic research is a critical yet time-consuming task, essential for national research evaluation exercises, appointments, promotions, and tenure decisions. This paper investigates whether Large Language Models (LLMs), specifically ChatGPT, can assist in this process. The study examines which inputs (full text without tables, figures, and references; title and abstract; title only) produce better quality score estimates and how different ChatGPT models and system prompts affect these scores.
Research Questions
The study aims to answer the following research questions:
1. What is the optimal input for ChatGPT-based post-publication research quality assessment: full text, title and abstract, or title only?
2. What is the relationship between the number of ChatGPT iterations averaged and the usefulness (correlation with human judgment) of its predictions?
3. Does ChatGPT model choice affect post-publication research quality assessment?
4. Are simpler system prompts more effective than complex system prompts?
Methods
The research design involved running a series of experiments on a set of 51 articles with known quality scores through the ChatGPT API. Each experiment was repeated 30 times to reveal the relationship between the number of iterations averaged and the usefulness of the resulting score predictions. The study used different inputs (full text, title and abstract, title only) and several ChatGPT models (3.5-turbo, 4o, 4o-mini) to assess their effectiveness.
Data
The dataset consisted of 51 information science journal articles, either published or prepared for submission, all written by the author. The articles were scored using the REF quality scale of 1, 2, 3, or 4. The texts were processed and cleaned to remove unnecessary elements like references and tables, and converted into a format suitable for input into the ChatGPT API.
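A rough sketch of the kind of pre-processing described here, assuming plain-text article files; the heading names and regular expressions are illustrative assumptions, not the paper's actual cleaning pipeline:

```python
import re

def clean_article(text: str) -> str:
    """Strip reference lists and table captions from a plain-text article (illustrative heuristic)."""
    # Drop everything from a 'References' heading onward (assumed heading format).
    text = re.split(r"\n\s*References\s*\n", text, maxsplit=1, flags=re.IGNORECASE)[0]
    # Remove lines that look like table captions (crude heuristic).
    lines = [ln for ln in text.splitlines()
             if not re.match(r"\s*Table\s+\d+", ln, flags=re.IGNORECASE)]
    return "\n".join(lines).strip()
```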
GPT Prompt Setup
The main system prompt was based on the REF guidelines for the relevant research area, adapted into scoring instructions for ChatGPT. Six variations of this basic REF prompt were also tested to assess whether simpler prompts might be more effective. Each score request was a single API call specifying a ChatGPT model and including the system instructions.
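A minimal sketch of such a single-call score request, assuming the openai Python package; the prompt wording and the score-parsing logic are illustrative assumptions rather than the paper's exact settings:

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative REF-style system prompt (the study used prompts adapted from the REF guidelines).
SYSTEM_PROMPT = (
    "You are an expert assessor of information science research. "
    "Score the submitted article for originality, significance and rigour "
    "on the REF quality scale from 1 to 4, and explain your reasoning."
)

def request_score(article_text: str, model: str = "gpt-4o") -> float:
    """One API call returning a single quality score extracted from the reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": article_text},
        ],
    )
    reply = response.choices[0].message.content
    match = re.search(r"\b([1-4])\b", reply)  # crude: first integer from 1 to 4 in the reply
    return float(match.group(1)) if match else float("nan")
```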
Results
Input and Averaging Length Comparisons
The study found that averaging more iterations of ChatGPT 3.5-turbo increases the correlation with human scores, with the rate of increase diminishing as the number of iterations grows. The optimal input for ChatGPT 3.5-turbo was the article title and abstract, which produced the highest correlation with human scores.
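A sketch of how this iterations-versus-correlation relationship can be examined, assuming a matrix of per-article ChatGPT scores and an array of human scores; the variable names and layout are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

def correlation_by_iterations(chatgpt_scores: np.ndarray, human_scores: np.ndarray) -> dict:
    """Spearman correlation with human scores after averaging the first k runs, for k = 1..n_runs.

    chatgpt_scores: shape (n_articles, n_runs), one row per article, one column per repeated run.
    human_scores:   shape (n_articles,), the human quality scores for the same articles.
    """
    results = {}
    for k in range(1, chatgpt_scores.shape[1] + 1):
        averaged = chatgpt_scores[:, :k].mean(axis=1)
        rho, _ = spearmanr(averaged, human_scores)
        results[k] = rho
    return results
```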
Model Comparison
The most powerful model, ChatGPT 4o, gave the best predictions, but the difference between models was not large. Given the cost differences, cheaper models like ChatGPT 3.5-turbo and 4o-mini are good alternatives unless the highest accuracy is needed.
Prompt Strategies
All seven system prompts produced positive results, but the most complex strategy was the most effective. The least effective was the variant that did not ask ChatGPT to include written feedback in its report, suggesting that requesting an analysis alongside the score helps ChatGPT judge quality.
Individual Score Level Accuracy
Despite the positive correlations, the predicted scores were, on average, closer to an incorrect score than to the correct one. However, rescaling the predictions with linear regression improved their accuracy, making them 31% closer to the correct value than baseline guesses.
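A minimal sketch of this kind of regression rescaling, assuming averaged ChatGPT scores and human scores as NumPy arrays; the baseline choice (always guessing the mean human score) is an illustrative assumption, not necessarily the paper's exact procedure:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rescale_with_regression(avg_chatgpt: np.ndarray, human: np.ndarray) -> np.ndarray:
    """Map averaged ChatGPT scores onto the human 1-4 scale with a simple linear fit."""
    model = LinearRegression().fit(avg_chatgpt.reshape(-1, 1), human)
    return model.predict(avg_chatgpt.reshape(-1, 1))

def accuracy_gain(avg_chatgpt: np.ndarray, human: np.ndarray) -> float:
    """Fraction by which rescaled predictions beat a mean-score baseline in mean absolute error."""
    rescaled = rescale_with_regression(avg_chatgpt, human)
    mae_rescaled = np.mean(np.abs(rescaled - human))
    mae_baseline = np.mean(np.abs(human.mean() - human))
    return 1 - mae_rescaled / mae_baseline
```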
Discussion
The study has limitations, including the small number of articles and the restriction to a single author's work. The results could differ for other fields and for more varied sets of articles. Future, larger models or more precise system instructions might improve the results. The study confirms that titles and abstracts are better inputs than full texts and that averaging multiple runs is more effective than relying on a single run.
Conclusion
The optimal strategy for estimating the REF-like research quality of an academic journal article is to use a system prompt derived from the unabbreviated REF2021 instructions, feed the article title and abstract into ChatGPT 4o thirty times, and apply a regression formula to transform the average score onto the human scale. This approach gives a high rank correlation with human evaluations but is still too inaccurate for high-stakes applications such as peer review or promotion decisions. More analysis is needed to understand the biases in ChatGPT's predictions.
The study demonstrates that ChatGPT can assist in research quality evaluation, but its predictions are not yet reliable enough for critical decisions. Further research and improvements in LLMs are necessary to enhance their accuracy and applicability in academic evaluations.