SOTA Policy Optimization benchmarks and papers with code | Wizwand
Policy Optimization
Benchmarks
| Dataset Name | SOTA Method | Metric | Value | Results | Last Updated |
|---|---|---|---|---|---|
| Office World MAP0 | QR-MAXRM | Avg Training Steps | 4,150 | 18 | 1mo ago |
| Pandemic | Linear Max-Min | True Performance | 3.65 | 8 | 4d ago |
| Traffic | ORPO | True Outcome | 16.91 | 8 | 4d ago |
| Multi-Armed Bandits | Log-barrier | Sample Complexity | -7 | 8 | 1mo ago |
| Office World Map 3, Exp 5 | QR-MAXRM | Average Training Steps | 5,806 | 7 | 1mo ago |
| Office World Map 2 Exp 5 | QR-MAXRM | Average Training Steps | 3,767 | 7 | 1mo ago |
| Office World Map 4 Exp 6 | QR-MAXRM | Average Training Steps | 5,630 | 7 | 1mo ago |
| Office World Map 1, Exp 5 | QR-MAXRM | Average Training Steps | 3,125 | 7 | 1mo ago |
| Office World MAP4 | QR-MAXRM | Average Training Steps | 5,630 | 7 | 1mo ago |
| Office World MAP1 | QR-MAXRM | Avg Training Steps | 3,125 | 7 | 1mo ago |
| Glucose | ORPO* | True Outcome | 6.3 | 6 | 4d ago |
| 10 agents, random subsets of warehouses (test) | max-quantile | Gini Index | 0.0625 | 6 | 1mo ago |
| 5 symmetric agents, one per warehouse (test) | max-quantile | Gini Index | 0.0188 | 6 | 1mo ago |
| RLHF | ORPO | True Score | 8.3 | 5 | 4d ago |
| MuJoCo Suite Summary | MAX-RETURN | Average Normalized Performance | 100 | 5 | 1mo ago |
| MuJoCo HalfCheetah H=40 | MAX-RETURN | Return | 49.1 | 5 | 1mo ago |
| MuJoCo HalfCheetah H=20 | MAX-RETURN | Return | 13.3 | 5 | 1mo ago |
| MuJoCo HalfCheetah H=10 | OFF-SL | Return | 2.8 | 5 | 1mo ago |
| MuJoCo Walker2d H=40 | MAX-RETURN | Return | 221.1 | 5 | 1mo ago |
| MuJoCo Walker2d H=20 | MAX-RETURN | Return | 60.7 | 5 | 1mo ago |
| MuJoCo Hopper H=40 | MAX-RETURN | Return | 71 | 5 | 1mo ago |
| Policy Action Space | Policy gradient | Preprocessing Time | 0 | 1 | 1mo ago |
| s-rectangular Robust MDP Discounted Reward | - | - | - | 0 | 1mo ago |
| (s, a)-rectangular Robust MDP Discounted Reward | - | - | - | 0 | 1mo ago |
| Non-rectangular Robust MDP Average Reward | - | - | - | 0 | 1mo ago |
Showing 25 of 26 rows