
Webshop

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Web Navigation and Shopping | Webshop | Success Rate: 82.8 | 81 |
| E-commerce Navigation and Search | WebShop semantic shift Hidden drift | Score: 100 | 63 |
| Interactive web-based shopping tasks | WebShop | Score: 92.2 | 60 |
| Web-based Agent Interaction | WebShop (test) | Success Rate: 73 | 42 |
| Web-based Agent Interaction | WebShop | CoT Match Rate: 100 | 41 |
| Interactive Decision Making | WebShop | Success Rate: 84.02 | 36 |
| Web-based Agent Interaction | WebShop (val) | Success Rate: 84.4 | 31 |
| Agent Task | WebShop | Success Rate: 99 | 30 |
| Interactive Decision Making | WebShop (test) | Score: 93.1 | 28 |
| Web Navigation | WebShop Source | Success Rate: 100 | 27 |
| Interactive Decision-making | WebShop | Real: 39 | 24 |
| Prompt-level Targeted Bit-flip Attack | WebShop | CDA: 100 | 24 |
| Internal-trigger targeted bit-flip attack | WebShop (test) | CDA: 0.95 | 24 |
| Web Task | WebShop | Average Reward: 69.2 | 24 |
| Online Shopping | Webshop | LLM Score: 0.63 | 22 |
| World Modeling | Webshop (test) | Search: 100 | 20 |
| Web-based Reasoning | WebShop | Average Reasoning Length (tokens): 34.8 | 18 |
| Web Navigation | WebShop Drift II | Success Rate: 95 | 18 |
| Web Navigation | WebShop Drift I | Success Rate: 95 | 18 |
| Online Shopping | WebShop Source | Score: 100 | 18 |
| Web Navigation | WebShop Drift II - Semantic Shift | Success Rate: 95 | 18 |
| Web Navigation | WebShop Drift I - Semantic Shift | Success Rate: 95 | 18 |
| E-commerce Navigation and Search | WebShop semantic shift Source | Score: 1 | 18 |
| Agent Behavior Adaptation | WebShop (WS) (test) | Loop Ratio: 36.7 | 17 |
| Next-state prediction | WebShop (WS) | EM Accuracy: 79.05 | 16 |
Showing 25 of 59 rows