Introduction
grid world with cliff:
| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 5 | 6 | 7 | 8 | 9 |
| 10 | 11 | 12 | 13 | 14 |
| s | cliff | cliff | cliff | goal |
Problem: given a policy π = a(s), what is the corresponding Q(s, a(s))?
on-policy: the same policy a(s) is used both to act and to learn during training
off-policy: the policy followed during learning differs from the policy being estimated (e.g. act with a non-optimal behavior policy while estimating the optimal policy)
ϵ-greedy
ε-greedy (epsilon-greedy): mostly exploit, sometimes explore
exploit (greedy, p = 1 − ε): A_t = argmax_a Q_t(a)
explore (p = ε): A_t = a random action
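As a minimal sketch (the function and array names here are my own, not from the post), ε-greedy action selection over a vector of action-value estimates can look like this:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if np.random.rand() < epsilon:              # explore with probability ε
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))             # exploit with probability 1 − ε

# example: action-value estimates Q_t(a) for 4 actions
print(epsilon_greedy_action(np.array([0.0, 0.5, 0.2, -0.1]), epsilon=0.1))
```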
On-policy TD Control
A purely random policy is not suitable here; instead we ask: under my own behavior policy a(s), how good is this state?
The learning target of the TD algorithm shifts from V(s) to Q(s, a).
TD learning:
V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) − V(S_t) ]
On-policy TD control (SARSA):
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) ]
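As a minimal sketch of this update rule (the function name and Q-table layout are my own assumptions, not the post's code), one SARSA step on a tabular Q looks like this:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """One on-policy TD (SARSA) update on a Q-table of shape (n_states, n_actions).

    The TD target uses the action a_next actually chosen by the behavior policy
    in s_next, which is what makes the update on-policy.
    """
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```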
Code implementation
```python
# environment: grid of size m*n with a goal, cliff cells, and a start point (bottom-left corner)
import numpy as np

# Environment setup: Sutton book Example 6.6, cliff walking
```
The reward grid printed by the environment setup (cliff cells are −100, the goal is +1, everything else 0):

```
[[   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.]
 [   0. -100. -100. -100. -100. -100. -100. -100. -100. -100. -100.    1.]]
```
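The environment code itself did not survive in the post; below is a minimal sketch, with names of my own (CliffWalkEnv, reset, step), of a cliff-walking environment that would produce a reward grid like the one above. In this sketch both the cliff and the goal end the episode, which is one common simplification:

```python
import numpy as np

class CliffWalkEnv:
    """Hypothetical m x n cliff-walking grid: start bottom-left, goal bottom-right,
    the rest of the bottom row is the cliff."""

    def __init__(self, m=4, n=12):
        self.m, self.n = m, n
        self.rewards = np.zeros((m, n))
        self.rewards[m - 1, 1:n - 1] = -100.0  # cliff cells
        self.rewards[m - 1, n - 1] = 1.0       # goal cell
        self.start = (m - 1, 0)

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        # actions: 0 = up, 1 = right, 2 = down, 3 = left
        dr, dc = [(-1, 0), (0, 1), (1, 0), (0, -1)][action]
        r = min(max(self.pos[0] + dr, 0), self.m - 1)
        c = min(max(self.pos[1] + dc, 0), self.n - 1)
        self.pos = (r, c)
        reward = self.rewards[r, c]
        done = reward == -100.0 or (r, c) == (self.m - 1, self.n - 1)
        return self.pos, reward, done
```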
```python
# learning setup
```
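The learning-setup code is also missing; a sketch under the same assumptions (the hyperparameter values are illustrative, not taken from the post):

```python
# Q-table: one entry per grid cell per action (shapes are my assumption).
env = CliffWalkEnv(m=4, n=12)
n_actions = 4
Q = np.zeros((env.m, env.n, n_actions))

alpha = 0.1      # learning rate
gamma = 1.0      # discount factor
epsilon = 0.1    # exploration rate for the ε-greedy behavior policy
n_episodes = 200

def epsilon_greedy(Q, state, epsilon):
    r, c = state
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[r, c]))            # exploit
```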
```python
# start learning
```
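The training loop is likewise truncated; continuing the hypothetical setup above, a SARSA loop could look like this:

```python
# SARSA training loop (sketch, continuing the setup above).
for episode in range(n_episodes):
    state = env.reset()
    action = epsilon_greedy(Q, state, epsilon)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(Q, next_state, epsilon)
        r, c = state
        nr, nc = next_state
        # on-policy TD target: uses the action actually selected in the next state
        td_target = reward + gamma * Q[nr, nc, next_action] * (not done)
        Q[r, c, action] += alpha * (td_target - Q[r, c, action])
        state, action = next_state, next_action
```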
After about 200 training episodes, the agent has essentially settled on the conservative route along the top of the grid.