Introduction
grid world with cliff:
| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 5 | 6 | 7 | 8 | 9 |
| 10 | 11 | 12 | 13 | 14 |
| s | cliff | cliff | cliff | goal |
Problem: given a policy π = a(s), what is the corresponding Q(s, a(s))?
on-policy: the same policy a(s) is used both to act and to learn during training
off-policy: the policy followed during learning differs from the policy being estimated (e.g. act with a non-optimal behavior policy while estimating the optimal policy)
ϵ-greedy
ε-greedy (epsilon-greedy): mostly exploit, sometimes explore
exploit (greedy, p = 1 − ε): A_t = argmax_a Q_t(a)
explore (p = ε): A_t = a random action
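As a minimal sketch (the function and array names here are my own, not from the post), ε-greedy action selection over a vector of action-value estimates can look like this:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if np.random.rand() < epsilon:              # explore with probability ε
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))             # exploit with probability 1 − ε

# example: action-value estimates Q_t(a) for 4 actions
print(epsilon_greedy_action(np.array([0.0, 0.5, 0.2, -0.1]), epsilon=0.1))
```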
On-policy TD Control
A purely random policy is not suitable here; instead we ask: under my own behavior policy a(s), how good is this state?
The learning target of the TD algorithm shifts from V(s) to Q(s, a).
TD learning:
V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) − V(S_t) ]
On-policy TD control (SARSA):
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) ]
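As a minimal sketch of this update rule (the function name and Q-table layout are my own assumptions, not the post's code), one SARSA step on a tabular Q looks like this:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """One on-policy TD (SARSA) update on a Q-table of shape (n_states, n_actions).

    The TD target uses the action a_next actually chosen by the behavior policy
    in s_next, which is what makes the update on-policy.
    """
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```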
Code implementation
```python
# environment: grid of size m*n with a goal, cliff cells, and a start point (bottom-left corner)
import numpy as np

# Environment setup: Sutton book Example 6.6, cliff walking
```
The reward grid printed by the environment setup (cliff cells are −100, the goal is +1, everything else 0):

```
[[   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.]
 [   0. -100. -100. -100. -100. -100. -100. -100. -100. -100. -100.    1.]]
```
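The environment code itself did not survive in the post; below is a minimal sketch, with names of my own (CliffWalkEnv, reset, step), of a cliff-walking environment that would produce a reward grid like the one above. In this sketch both the cliff and the goal end the episode, which is one common simplification:

```python
import numpy as np

class CliffWalkEnv:
    """Hypothetical m x n cliff-walking grid: start bottom-left, goal bottom-right,
    the rest of the bottom row is the cliff."""

    def __init__(self, m=4, n=12):
        self.m, self.n = m, n
        self.rewards = np.zeros((m, n))
        self.rewards[m - 1, 1:n - 1] = -100.0  # cliff cells
        self.rewards[m - 1, n - 1] = 1.0       # goal cell
        self.start = (m - 1, 0)

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        # actions: 0 = up, 1 = right, 2 = down, 3 = left
        dr, dc = [(-1, 0), (0, 1), (1, 0), (0, -1)][action]
        r = min(max(self.pos[0] + dr, 0), self.m - 1)
        c = min(max(self.pos[1] + dc, 0), self.n - 1)
        self.pos = (r, c)
        reward = self.rewards[r, c]
        done = reward == -100.0 or (r, c) == (self.m - 1, self.n - 1)
        return self.pos, reward, done
```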
```python
# learning setup
```
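The learning-setup code is also missing; a sketch under the same assumptions (the hyperparameter values are illustrative, not taken from the post):

```python
# Q-table: one entry per grid cell per action (shapes are my assumption).
env = CliffWalkEnv(m=4, n=12)
n_actions = 4
Q = np.zeros((env.m, env.n, n_actions))

alpha = 0.1      # learning rate
gamma = 1.0      # discount factor
epsilon = 0.1    # exploration rate for the ε-greedy behavior policy
n_episodes = 200

def epsilon_greedy(Q, state, epsilon):
    r, c = state
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[r, c]))            # exploit
```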
```python
# start learning
```
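The training loop is likewise truncated; continuing the hypothetical setup above, a SARSA loop could look like this:

```python
# SARSA training loop (sketch, continuing the setup above).
for episode in range(n_episodes):
    state = env.reset()
    action = epsilon_greedy(Q, state, epsilon)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(Q, next_state, epsilon)
        r, c = state
        nr, nc = next_state
        # on-policy TD target: uses the action actually selected in the next state
        td_target = reward + gamma * Q[nr, nc, next_action] * (not done)
        Q[r, c, action] += alpha * (td_target - Q[r, c, action])
        state, action = next_state, next_action
```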
After about 200 training episodes, the agent has essentially settled on the conservative route along the top of the grid.