"What's important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces
About
Large language models (LLMs) that have been trained on a corpus that includes a large amount of code exhibit a remarkable ability to understand HTML code. As web interfaces are primarily constructed using HTML, we design an in-depth study to see how LLMs can be used to retrieve and locate important elements for a user-given query (i.e., a task description) in a web interface. In contrast with prior works, which primarily focused on autonomous web navigation, we decompose the problem into a more atomic operation: can LLMs identify the important information in a web page for a user-given query? This decomposition enables us to scrutinize the current capabilities of LLMs and uncover the opportunities and challenges they present. Our empirical experiments show that while LLMs exhibit a reasonable level of performance in retrieving important UI elements, there is still substantial room for improvement. We hope our investigation will inspire follow-up works in overcoming the current challenges in this domain.
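To make the atomic operation concrete, the following is a minimal sketch of how a web page and a user query might be turned into an element-retrieval prompt. It is not the paper's pipeline: the set of candidate tags, the prompt wording, and the helper names `extract_candidates` and `build_prompt` are all assumptions made for illustration, and the call to an actual LLM is deliberately left out.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as candidate UI elements for illustration.
CANDIDATE_TAGS = {"a", "button", "input", "select", "textarea"}
# Tags without a closing tag never receive inner text.
VOID_TAGS = {"input"}


class CandidateCollector(HTMLParser):
    """Collects candidate elements with their attributes and inner text."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._open = []  # indices of candidates whose closing tag is pending

    def handle_starttag(self, tag, attrs):
        if tag in CANDIDATE_TAGS:
            self.candidates.append({"tag": tag, "attrs": dict(attrs), "text": ""})
            if tag not in VOID_TAGS:
                self._open.append(len(self.candidates) - 1)

    def handle_endtag(self, tag):
        if self._open and self.candidates[self._open[-1]]["tag"] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Attach text only to the innermost open candidate element.
        if self._open:
            self.candidates[self._open[-1]]["text"] += data


def extract_candidates(html):
    """Return the candidate elements of an HTML string, in document order."""
    collector = CandidateCollector()
    collector.feed(html)
    collector.close()
    for cand in collector.candidates:
        cand["text"] = " ".join(cand["text"].split())  # normalize whitespace
    return collector.candidates


def build_prompt(query, candidates):
    """Format a (hypothetical) prompt asking which elements matter for the query."""
    lines = [f"Task: {query}", "Candidate elements:"]
    for idx, cand in enumerate(candidates):
        attrs = " ".join(f'{k}="{v}"' for k, v in cand["attrs"].items())
        lines.append(f"[{idx}] <{cand['tag']} {attrs}> {cand['text']}".rstrip())
    lines.append("Which element indices are important for this task?")
    return "\n".join(lines)


page = (
    '<div><a href="/flights">Flights</a>'
    '<input type="text" name="destination" placeholder="Where to?">'
    '<button id="search">Search</button><p>Ad text</p></div>'
)
candidates = extract_candidates(page)
prompt = build_prompt("Search for flights to Paris", candidates)
print(prompt)
```

The resulting prompt lists each candidate with an index, so a model's answer can be mapped back to concrete elements and compared against labeled ones.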
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Web agent tasks | Mind2Web Cross-Task | Element Accuracy | 58 | 49 |
| Conversational web navigation | MT-Mind2Web (Cross-Website) | Element Accuracy | 46.2 | 12 |
| Conversational web navigation | MT-Mind2Web Cross-Subdomain | Element Accuracy | 47.4 | 12 |
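
For readers unfamiliar with the metric, here is a rough sketch of how an element-accuracy score could be computed. It assumes the metric is the percentage of steps for which the predicted element is among the acceptable ground-truth elements; the official evaluation scripts of these benchmarks may differ in detail. The function name `element_accuracy` and the toy element IDs are hypothetical.

```python
def element_accuracy(predictions, acceptable):
    """Percentage of steps whose predicted element is an acceptable target.

    predictions: list of predicted element IDs, one per step.
    acceptable:  list of sets of ground-truth element IDs, one set per step.
    """
    if len(predictions) != len(acceptable):
        raise ValueError("predictions and acceptable must have the same length")
    if not predictions:
        return 0.0
    hits = sum(1 for pred, ok in zip(predictions, acceptable) if pred in ok)
    return 100.0 * hits / len(predictions)


# Toy data (hypothetical): two of the three predictions hit an acceptable element.
score = element_accuracy(
    ["btn-1", "link-2", "input-9"],
    [{"btn-1"}, {"link-2", "link-3"}, {"input-4"}],
)
print(round(score, 1))
```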