Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path

Authors:

Xinnan Dai、Qihao Wen、Yifei Shen、Hongzhi Wen、Dongsheng Li、Jiliang Tang、Caihua Shan

Paper:

Introduction

Background

Large Language Models (LLMs) have demonstrated remarkable success in various reasoning tasks, including mathematical problem-solving, commonsense reasoning, and symbolic problem-solving. However, their ability to handle graph reasoning tasks remains under scrutiny. Graph reasoning is crucial for applications such as question-answering systems, autonomous planning, and robot navigation. Despite theoretical studies suggesting that LLMs can handle graph reasoning tasks, empirical evaluations reveal numerous failures.

Problem Statement

This study aims to revisit the graph reasoning ability of LLMs by focusing on three fundamental graph tasks: graph description translation, graph connectivity, and the shortest path problem. The goal is to uncover the limitations of LLMs in understanding graph structures through text descriptions and to propose possible explanations for the observed discrepancies between theoretical capabilities and practical performance.

Related Work

Evaluation on Graph Reasoning Tasks

Recent efforts have been made to evaluate LLMs on graph reasoning tasks. Studies like NLGraph and GraphInstruct have extended the benchmarks for graph reasoning, suggesting that LLMs have preliminary graph reasoning abilities. VisionGraph has even provided a multimodal version of the graph reasoning task benchmark.

Graph Connectivity in Theory

Theoretical studies have shown that LLMs, through their transformer architecture, can simulate every step in a graph algorithm. This suggests that LLMs are capable of handling graph connectivity tasks, such as breadth-first search (BFS) and Dijkstra’s algorithm for the shortest path problem.

LLMs for Graphs in Applications

Despite their theoretical capabilities, LLMs often fail in practical graph reasoning tasks. Recent studies have validated the use of additional tools to enhance LLMs’ comprehension of graphs. For example, GraphEmb employs an encoding function to augment prompts with explicit structured information, and GraphWiz fine-tunes LLMs using graph reasoning datasets.

Theoretical Support for Graph Reasoning Tasks

Theoretical results suggest that LLMs can solve fundamental graph reasoning tasks by deconstructing them into subtasks. For example, graph connectivity and shortest-path tasks can be formulated as decision-making problems solvable by LLMs.

Research Methodology

Graph Description Translation

Graph Descriptions

Three types of graph structure description methods are widely used: Adjacency Matrix, Node List, and Edge List. These methods describe the graph’s structure in different ways, as shown in .

Translations on Graph Descriptions

To verify LLMs’ ability to understand graph structures, a graph translation description task was designed. This task requires LLMs to use the input graph description to generate various descriptions. The accuracy of these translations was evaluated using GPT-4 and LLAMA3.0-70B.

Graph Connectivity

Connectivity Types

Connectivity types were categorized into K-hops, Singleton, Isolated Components, and Asymmetric, as shown in . The distribution of these types in baseline datasets was analyzed, and a balanced dataset was constructed for a more comprehensive evaluation.

Dataset Construction

Graphs were generated with varying numbers of nodes and edges, and samples were collected for different connectivity types. The dataset was divided into Easy, Medium, and Hard levels based on the number of nodes.

Evaluation Metrics

Two novel metrics, FidelityAcc (Facc) and Path Consistency Ratio (PCR), were introduced to evaluate the correctness of reasoning paths and the consistency of paths selected by LLMs.

Shortest Path Problem

The shortest path problem was studied using both unweighted and weighted graphs. The performance of LLMs was evaluated on these datasets to understand their ability to handle more complex graph reasoning tasks.

Experimental Design

Planning and Designing the Experiment

Experiments were conducted using GPT-4, GPT-3, and LLAMA 3 with zero-shot settings. The datasets included both directed and undirected graphs with varying levels of difficulty.

Preparing the Data and Conditions

Graphs were randomly generated with node numbers ranging from 5 to 25. The datasets were divided into Easy, Medium, and Hard levels. Different naming methods for nodes were also evaluated to understand their impact on LLM performance.

Results and Analysis

Graph Description Translation

The results indicated that LLMs struggle with graph description translations, achieving reliable accuracy only when the source and target descriptions are identical. Performance declined significantly when translating between different types of descriptions, as shown in .

Graph Connectivity

Undirected Graph Results

GPT-4 demonstrated better reasoning ability compared to GPT-3 and LLAMA 3 across all cases. The difficulty of reasoning increased with the path length, and Node Lists generally performed better than Edge Lists, as shown in .

Directed Graph Results

LLMs had lower performance on directed graphs, particularly in the Asymmetric dataset. Descriptions using Node Lists outperformed those using Edge Lists, as shown in .

Shortest Path Problem

The performance of LLMs on the shortest path problem aligned with observations from graph connectivity tasks. Performance diminished as the path length increased, and undirected graphs consistently outperformed directed graphs, as shown in .

Analysis of Other Factors

Impact of Algorithm Prompts

In-context learning approaches, such as Chain-of-Thought (CoT), were revisited. Few-shot examples helped LLMs recognize isolated components, and the Dijkstra-CoT method improved performance in the shortest path cases, as shown in .

Influence of Node Names

Different naming methods for nodes yielded varied results. Sequentially ordered node names enhanced LLM performance, suggesting that LLMs could leverage memory recognition to predict connectivity more effectively, as shown in .

Overall Conclusion

This study revisited the graph reasoning ability of LLMs by focusing on three fundamental tasks: graph description translation, graph connectivity, and the shortest path problem. The results revealed that LLMs often fail in practical graph reasoning tasks, despite their theoretical capabilities. The study also highlighted the importance of balanced datasets and the influence of input context on LLM performance. Future research will focus on incorporating fine-tuning strategies to enhance the graph reasoning ability of LLMs.

Datasets:

GraphInstruct

What's Hot

AAAI.2024 – Humans and AI

How Diffusion Models Learn to Factorize and Compose

Temporal Fairness in Decision Making Problems

Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path

AAAI.2024 – Humans and AI

How Diffusion Models Learn to Factorize and Compose

Temporal Fairness in Decision Making Problems

Our Picks

AAAI.2024 – Humans and AI

How Diffusion Models Learn to Factorize and Compose

Temporal Fairness in Decision Making Problems

Subscribe to Updates

What's Hot

Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path

Authors:

Paper:

Introduction

Background

Problem Statement

Related Work

Evaluation on Graph Reasoning Tasks

Graph Connectivity in Theory

LLMs for Graphs in Applications

Theoretical Support for Graph Reasoning Tasks

Research Methodology

Graph Description Translation

Graph Descriptions

Translations on Graph Descriptions

Graph Connectivity

Connectivity Types

Dataset Construction

Evaluation Metrics

Shortest Path Problem

Experimental Design

Planning and Designing the Experiment

Preparing the Data and Conditions

Results and Analysis

Graph Description Translation

Graph Connectivity

Undirected Graph Results

Directed Graph Results

Shortest Path Problem

Analysis of Other Factors

Impact of Algorithm Prompts

Influence of Node Names

Overall Conclusion

Datasets:

Related Posts