Why does Q-learning diverge?

The state values of my Q-learning algorithm keep diverging to infinity, which means my weights are diverging too. I use a neural network for my value mapping.

I have tried:

Clipping "reward + discount * maximum action value" (max/min set to 50/-50)
Setting a low learning rate (0.00001; I use classic backpropagation to update the weights)
Decreasing the values of the rewards
Increasing the exploration rate
Normalizing the inputs to 1~100 (previously 0~1)
Changing the discount rate
Reducing the number of layers in the neural network (just for validation)
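
For concreteness, the clipped target in the first item can be sketched as below. This is an illustrative sketch, not the project code: the function name `clipped_td_target` and its parameters are assumptions, with the 50/-50 bounds taken from the list above.

```python
import numpy as np


def clipped_td_target(reward, next_q_values, discount=0.75, clip=50.0):
    """One-step Q-learning target, clipped to the range [-clip, clip].

    All names and defaults here are hypothetical, chosen for illustration.
    """
    target = reward + discount * np.max(next_q_values)
    return float(np.clip(target, -clip, clip))


# Example: 50 + 0.75 * 80 = 110, which is clipped down to the upper bound.
next_q = np.array([10.0, 80.0, -5.0])
print(clipped_td_target(50.0, next_q))  # 50.0
```

Clipping bounds the target of a single update, but it does not by itself remove the feedback between the network's own estimates and its targets.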

I have heard that Q-learning is known to diverge on non-linear input, but is there anything else I can try to stop the weights from diverging?

Update #1 on August 14th, 2017:

I have decided to add some specific details on what I am doing right now, by request.

I am currently trying to make an agent learn how to fight in a top-down view shooting game. The opponent is a simple bot that moves randomly.

Each character has 9 actions to choose from on each turn:

move up
move down
move left
move right
shoot a bullet upwards
shoot a bullet downwards
shoot a bullet to the left
shoot a bullet to the right
do nothing

The rewards are:

if the agent hits the bot with a bullet, +100 (I have tried many different values)
if the agent gets hit by a bullet shot by the bot, -50 (again, I have tried many different values)
if the agent tries to fire a bullet while a bullet cannot be fired (e.g. when the agent has just fired a bullet), -25 (not necessary, but I wanted the agent to be more efficient)
if the bot tries to leave the arena, -20 (not really necessary, but I wanted the agent to be more efficient)

The inputs to the neural network are:

Distance between the agent and the bot on the X axis, normalized to 0~100
Distance between the agent and the bot on the Y axis, normalized to 0~100
Agent's x and y positions
Bot's x and y positions
Bot's bullet position. If the bot has not fired a bullet, the parameters are set to the x and y positions of the bot.

I have also fiddled with the inputs; I tried adding new features such as the x value of the agent's position (not the distance but the actual position) and the position of the bot's bullet. None of them worked.

Here is the code:

from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm


#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1

#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)

#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

#Setup character dimensions
character_size = 50
character_move_speed = 25

#Initialize character stats
character_init_health = 100

#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100

#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))

#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)

initialize = tf.global_variables_initializer()

jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)

#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()

def param_init():
    """Initializes parameters"""
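    # The beam sizes must be declared global too, otherwise the reset below only creates locals
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, \
    agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y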

    agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
    bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
    agent_hp = bot_hp = character_init_health
    agent_beam_fire = bot_beam_fire = False
    agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
    agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0


def screen_blit():
    global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x, \
    agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue, \
    agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width

    disp.fill(aqua)
    draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y /
                            2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
    draw.rect(disp, green, (disp_x / 2 - arena_x / 2,
                            disp_y / 2 - arena_y / 2, arena_x, arena_y))
    # Draw each beam using the coordinates of the character that fired it
    if bot_beam_fire == True:
        draw.rect(disp, green_yellow, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
        bot_beam_fire = False
    if agent_beam_fire == True:
        draw.rect(disp, energy_blue, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
        agent_beam_fire = False

    draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
    draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))

    draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 +
                            border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
    draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 +
                            border + 1, float(bot_hp) / float(character_init_health) * 100, 14))


def bot_take_action():
    return random.randint(1, 9)

def beam_hit_detector(player):
    global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x, \
    bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y, \
    bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size

    if player == "bot":
        if bot_current_action == 1:
            if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 2:
            if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
                return True
            else:
                return False
        elif bot_current_action == 3:
            if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 4:
            if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
                return True
            else:
                return False
    else:
        if agent_current_action == 1:
            if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
                return True
            else:
                return False
        elif agent_current_action == 2:
            if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
                return True
            else:
                return False
        elif agent_current_action == 3:
            if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
                return True
            else:
                return False
        elif agent_current_action == 4:
            if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
                return True
            else:
                return False


def mapping(maximum, number):
    return number#int(number * maximum)

def action(agent_action, bot_action):
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, \
    bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, \
    agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size

    agent_current_action = agent_action; bot_current_action = bot_action
    reward = 0; cont = True; successful = False; winner = ""
    if 1 <= bot_action <= 4:
        bot_beam_fire = True
        if bot_action == 1:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
            bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
        elif bot_action == 2:
            bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
        elif bot_action == 3:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
            bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
        elif bot_action == 4:
            bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width

    elif 5 <= bot_action <= 8:
        if bot_action == 5:
            bot_y -= character_move_speed
            if bot_y <= disp_y/2 - arena_y/2:
                bot_y = disp_y/2 - arena_y/2
            elif agent_y <= bot_y <= agent_y + character_size:
                bot_y = agent_y + character_size
        elif bot_action == 6:
            bot_x += character_move_speed
            if bot_x >= disp_x/2 + arena_x/2 - character_size:
                bot_x = disp_x/2 + arena_x/2 - character_size
            elif agent_x <= bot_x + character_size <= agent_x + character_size:
                bot_x = agent_x - character_size
        elif bot_action == 7:
            bot_y += character_move_speed
            if bot_y + character_size >= disp_y/2 + arena_y/2:
                bot_y = disp_y/2 + arena_y/2 - character_size
            elif agent_y <= bot_y + character_size <= agent_y + character_size:
                bot_y = agent_y - character_size
        elif bot_action == 8:
            bot_x -= character_move_speed
            if bot_x <= disp_x/2 - arena_x/2:
                bot_x = disp_x/2 - arena_x/2
            elif agent_x <= bot_x <= agent_x + character_size:
                bot_x = agent_x + character_size

    if bot_beam_fire == True:
        if beam_hit_detector("bot"):
            #print "Agent Got Hit!"
            agent_hp -= beam_damage
            reward += -50
            bot_beam_size_x = bot_beam_size_y = 0
            bot_beam_x = bot_beam_y = beam_ob
            if agent_hp <= 0:
                cont = False
                winner = "Bot"

    if 1 <= agent_action <= 4:
        agent_beam_fire = True
        if agent_action == 1:
            if agent_y > disp_y/2 - arena_y/2:
                agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
                agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
            else:
                reward += -25
        elif agent_action == 2:
            if agent_x + character_size < disp_x/2 + arena_x/2:
                agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
            else:
                reward += -25
        elif agent_action == 3:
            if agent_y + character_size < disp_y/2 + arena_y/2:
                agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
                agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
            else:
                reward += -25
        elif agent_action == 4:
            if agent_x > disp_x/2 - arena_x/2:
                agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
            else:
                reward += -25

    elif 5 <= agent_action <= 8:
        if agent_action == 5:
            agent_y -= character_move_speed
            if agent_y <= disp_y/2 - arena_y/2:
                agent_y = disp_y/2 - arena_y/2
                reward += -5
            elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y + character_size
                reward += -2
        elif agent_action == 6:
            agent_x += character_move_speed
            if agent_x + character_size >= disp_x/2 + arena_x/2:
                agent_x = disp_x/2 + arena_x/2 - character_size
                reward += -5
            elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x - character_size
                reward += -2
        elif agent_action == 7:
            agent_y += character_move_speed
            if agent_y + character_size >= disp_y/2 + arena_y/2:
                agent_y = disp_y/2 + arena_y/2 - character_size
                reward += -5
            elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y - character_size
                reward += -2
        elif agent_action == 8:
            agent_x -= character_move_speed
            if agent_x <= disp_x/2 - arena_x/2:
                agent_x = disp_x/2 - arena_x/2
                reward += -5
            elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x + character_size
                reward += -2
    if agent_beam_fire == True:
        if beam_hit_detector("agent"):
            #print "Bot Got Hit!"
            bot_hp -= beam_damage
            reward += 50
            agent_beam_size_x = agent_beam_size_y = 0
            agent_beam_x = agent_beam_y = beam_ob
            if bot_hp <= 0:
                successful = True
                cont = False
                winner = "Agent"
    return reward, cont, successful, winner

def bot_beam_dir_detector():
    global bot_current_action
    if bot_current_action == 1:
        bot_beam_dir = 2
    elif bot_current_action == 2:
        bot_beam_dir = 4
    elif bot_current_action == 3:
        bot_beam_dir = 3
    elif bot_current_action == 4:
        bot_beam_dir = 1
    else:
        bot_beam_dir = 0
    return bot_beam_dir

#Parameters
y = 0.75
e = 0.3
num_episodes = 10000
batch_size = 10
complexity = 100
with tf.Session() as sess:
    sess.run(initialize)
    success = 0
    for i in tqdm(range(1, num_episodes)):
        #print "Episode #", i
        rAll = 0; d = False; c = True; j = 0
        param_init()
        samples = []
        while c == True:
            j += 1
            current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y) / float(arena_y)),
                                        mapping(complexity, float(bot_x) / float(arena_x)),
                                        mapping(complexity, float(bot_y) / float(arena_y)),
                                        #mapping(complexity, float(agent_hp) / float(character_init_health)),
                                        #mapping(complexity, float(bot_hp) / float(character_init_health)),
                                        mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                        bot_beam_dir
                                        ]])
            b = bot_take_action()
            if np.random.rand(1) < e or i <= 5:
                a = random.randint(0, 8)
            else:
                a, _ = sess.run([predict, Q],feed_dict={input_layer : current_state})
            r, c, d, winner = action(a + 1, b)
            bot_beam_dir = bot_beam_dir_detector()
            next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y) / float(arena_y)),
                                        mapping(complexity, float(bot_x) / float(arena_x)),
                                        mapping(complexity, float(bot_y) / float(arena_y)),
                                        #mapping(complexity, float(agent_hp) / float(character_init_health)),
                                        #mapping(complexity, float(bot_hp) / float(character_init_health)),
                                        mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                        bot_beam_dir
                                        ]])
            samples.append([current_state, a, r, next_state])
            if len(samples) > 10:
                for count in range(batch_size):
                    [batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
                    batch_allQ = sess.run(Q, feed_dict={input_layer : batch_current_state})
                    batch_Q1 = sess.run(Q, feed_dict = {input_layer : batch_next_state})
                    batch_maxQ1 = np.max(batch_Q1)
                    batch_targetQ = batch_allQ
                    # Update the entry for the sampled step's action, not the current step's action `a`
                    batch_targetQ[0][action_taken] = reward + y * batch_maxQ1
                    sess.run([updateModel], feed_dict={input_layer : batch_current_state, next_Q : batch_targetQ})
            rAll += r
            screen_blit()
            if d == True:
                e = 1. / ((i / 50) + 10)
                success += 1
                break
            #print agent_hp, bot_hp
            display.update()

        jList.append(j)
        rList.append(rAll)
        print(winner)

I am pretty sure that if you have pygame, TensorFlow and matplotlib installed in a Python environment, you should be able to see the animation of the bot and the agent "fighting".

I digressed in the update, but it would be great if somebody could also address my specific problem along with the original general problem.

Thanks!

Update #2 on August 18th, 2017:

Based on the advice of @NeilSlater, I have implemented experience replay in my model. The algorithm has improved, but I am going to look for further improvements that offer convergence.

Update #3 on August 22nd, 2017:

I have noticed that if the agent hits the bot with a bullet on a turn, and the action the bot took on that turn was not "fire a bullet", then the wrong actions are given credit. I have therefore turned the bullets into beams, so that the bot or agent takes damage on the same turn the beam is fired.

— Sắt

Are you using experience replay and bootstrapping values from a "frozen" copy of a recent network? These are the approaches used in DQN. They are not guarantees, but they may be necessary for stability. Are you using a Q($\lambda$) algorithm, or just single-step Q-learning? Can you give some indication of what your environment and reward scheme are like? Single-step Q-learning will do badly when rewards are sparse, e.g. a final +1 or -1 reward at the end of a long episode.

— Neil Slater

OK, from your update, I would immediately suggest that you need experience replay and maybe also alternating networks for bootstrapping, because these are stabilising influences on reinforcement learning with non-linear approximators. I am happy to go into detail on that and look at your project code to give an example, but it might take a day or two to get back to you with that level of detail.

— Neil Slater

I have got the code running, and if I understand it correctly, the bullets can be "steered" by the agent's choice among actions 1-4 each turn, i.e. a bullet can be moved in any direction while the agent stays still. Is that deliberate? The bot does not do this, because it only fires when aligned on the grid with the agent, and it always chooses the same direction when it does.

— Neil Slater

Almost correct, but you do not store the bootstrapping value; instead, you re-calculate it when the step is later sampled. For each action taken, you store four things: State, Action, Next State, Reward. Later you take a mini-batch (1 step is fine, but e.g. 10 is typical) from this list and, for Q-learning, calculate the new max action and its value to create the supervised-learning mini-batch (also called the TD target).

— Neil Slater
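
A minimal sketch of the replay scheme described in the comment above, assuming a plain NumPy setup. `ReplayBuffer`, `td_targets`, `q_fn` and the extra `done` flag are hypothetical names introduced for illustration and are not taken from the question's code. The point it shows is that only raw steps are stored, and the bootstrapped value is recomputed from the current estimator each time a step is sampled.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size store of raw (state, action, reward, next_state, done) steps."""

    def __init__(self, capacity=10000):
        self.steps = deque(maxlen=capacity)  # oldest steps are dropped first

    def add(self, state, action, reward, next_state, done):
        self.steps.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored steps.
        return random.sample(list(self.steps), min(batch_size, len(self.steps)))


def td_targets(batch, q_fn, discount=0.75):
    """Build supervised targets, recomputing the bootstrap value at sample time.

    q_fn(state) is assumed to return a 1-D array with one value per action.
    """
    targets = []
    for state, action, reward, next_state, done in batch:
        target = q_fn(state).copy()  # leave the other actions' values unchanged
        bootstrap = 0.0 if done else discount * np.max(q_fn(next_state))
        target[action] = reward + bootstrap
        targets.append(target)
    return np.array(targets)


# Usage with a toy linear estimator (2 inputs, 3 actions).
weights = np.full((2, 3), 0.1)
buffer = ReplayBuffer(capacity=100)
buffer.add(np.array([1.0, 0.0]), 2, 50.0, np.array([0.0, 1.0]), False)
batch = buffer.sample(10)
print(td_targets(batch, lambda s: s @ weights).shape)  # (1, 3)
```

Storing a `done` flag is an addition to the four items listed in the comment; it lets terminal steps skip the bootstrap term.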

That should be "a frozen copy of the approximator (i.e. the neural network)" (if the quote is from one of my comments or answers, please point me to it and I will correct it). It is really simple: just keep two copies of the parameters $\mathbf{w}$, a "live" one that you update and a recent "old" one that you copy from the "live" one every few hundred updates. When you calculate the TD target, e.g. $R + \gamma \max_{a'} \hat{q}(S', a', \mathbf{w})$, use the "old" copy to calculate $\hat{q}$, but then train the "live" one with those values.

— Neil Slater
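
The two-copy idea in the comment above can be sketched with a toy linear approximator in NumPy. This is an illustration under stated assumptions, not the commenter's code: the class `LinearQ`, its attribute names, and the default `sync_every=200` are all hypothetical.

```python
import numpy as np


class LinearQ:
    """Toy approximator q(s, a, w) = (s @ w)[a] with a live and a frozen copy of w."""

    def __init__(self, n_inputs, n_actions, lr=0.01, sync_every=200, seed=0):
        rng = np.random.default_rng(seed)
        self.w_live = rng.uniform(0.0, 0.1, size=(n_inputs, n_actions))
        self.w_old = self.w_live.copy()  # frozen copy used only for TD targets
        self.lr = lr
        self.sync_every = sync_every
        self.updates = 0

    def q_live(self, state):
        return state @ self.w_live

    def q_old(self, state):
        return state @ self.w_old

    def update(self, state, action, reward, next_state, discount=0.75):
        # The TD target is computed from the frozen ("old") parameters.
        target = reward + discount * np.max(self.q_old(next_state))
        td_error = target - self.q_live(state)[action]
        # Gradient step on 0.5 * td_error**2 for the taken action's column only.
        self.w_live[:, action] += self.lr * td_error * state
        self.updates += 1
        # Every `sync_every` updates, refresh the frozen copy from the live one.
        if self.updates % self.sync_every == 0:
            self.w_old = self.w_live.copy()
        return td_error


# Usage: one update on a 7-input, 9-action problem.
model = LinearQ(n_inputs=7, n_actions=9)
model.update(np.ones(7), action=2, reward=50.0, next_state=np.zeros(7))
```

Between syncs the targets do not move when the live weights change, which is the stabilising effect the comment refers to.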

If your weights are diverging, then your optimizer or your gradients are not behaving well. A common reason for diverging weights is exploding gradients, which can result from:

too many layers, or
too many recurrent cycles if you are using an RNN.

You can verify whether you have exploding gradients as follows:

grad_magnitude = tf.reduce_sum([tf.reduce_sum(g**2)
                                for g in tf.gradients(loss, weights_list)])**0.5

Some approaches to solving the exploding-gradient problem are:

Use ReLU or ELU activations
Use Xavier initialization
Use a Deep Residual architecture. This keeps the gradient from being cut down by subsequent layers.
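
As an illustration of the second item, Xavier (Glorot) uniform initialization draws weights from a range that depends on the layer's fan-in and fan-out. The sketch below uses plain NumPy rather than any TensorFlow initializer; the helper name `xavier_uniform` is an assumption made for this example.

```python
import numpy as np


def xavier_uniform(fan_in, fan_out, seed=None):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit).

    limit = sqrt(6 / (fan_in + fan_out)), the Glorot/Xavier uniform bound.
    """
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


# Example: a 7-input, 9-output layer.
weights = xavier_uniform(7, 9, seed=0)
print(weights.shape)  # (7, 9)
```

Unlike a one-sided range such as `[0, 0.1]`, this range is symmetric around zero and scaled by layer size.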

— Hinh ảnh mặc định