[arXiv 2013] Playing Atari with Deep Reinforcement Learning

Paper url: https://arxiv.org/pdf/1312.5602.pdf

Author and affiliation

Figure 01: paper snapshot

Introduction

As noted in the Q-learning section, table-based reinforcement learning is a poor fit for high-dimensional state-action spaces: it is either inapplicable or demands a large amount of memory. Methods therefore emerged that, instead of computing each q-value and storing it in a table, approximate it with a function (e.g. linear combination, decision tree, support vector machine). Many function classes can approximate q-values, but around the time this paper was published neural networks, which were excelling at many tasks (computer vision in particular), became the usual choice, and this brought the idea of deep reinforcement learning to a wide audience.

This paper first appeared as an arXiv preprint in 2013 and, with further details added, was published in Nature in 2015 [1]. The deep q-network (DQN) it proposes has become a representative algorithm of value-based reinforcement learning. Figure 02 shows a DQN agent playing Atari Breakout; after about 400 training episodes it plays about as well as a human. More surprisingly, an agent with 600 episodes of training picks up a strategy that even people rarely think of: it tunnels through the bricks on one side and sends the ball above them, breaking many bricks with minimal movement.

Figure 02: playing Atari Breakout

So how can we build such an agent? Before explaining DQN, let's look at the most basic algorithm that approximates q-values with a neural network.

Q-network algorithm

DQN was not the first algorithm to combine reinforcement learning with a neural network. There had been several earlier neural-network-based attempts, and the prevailing view was that neural networks were not well suited to reinforcement learning. What was wrong with the earlier methods? First, the network that approximates q-values takes a state as input and outputs a q-value for each action, as illustrated in Figure 03.

Figure 03: q-network architecture

Training a neural network requires the model's prediction and a ground-truth label. The prediction comes from feeding the input state forward through the network, but what is the label? There is no need to overthink it: keep in mind that the q-network algorithm is nothing more than the q-learning algorithm covered earlier, carried over to a neural network. Q-learning updates q-values with the following rule.

$$Q(S_t,A_t)=Q(S_t,A_t)+\alpha\left[R_{t+1} + \gamma\max_{a^\prime}Q(S_{t+1},a^\prime)-Q(S_t,A_t)\right]$$

In this rule the update target (q-target) is $R_{t+1} + \gamma\max_{a^\prime}Q(S_{t+1},a^\prime)$, and the q-network algorithm uses the same target. In the notation of the q-network algorithm, the learning target (q-target) is written as follows.

$$y_{t}=\left\{\begin{array}{cl} r_{t} & \text{for terminal } \phi_{t+1} \\ r_{t}+\gamma \max_{a^{\prime}} Q\left(\phi_{t+1}, a^{\prime} ; \theta\right) & \text{for non-terminal } \phi_{t+1} \end{array}\right.$$

Reading the q-target notation: when the next state is terminal, the target is just the reward; when it is non-terminal, the target is the reward plus the discounted maximum q-value of the next state, exactly as in q-learning. The one difference is that the maximum q-value inside the q-target is itself an estimate produced by the q-network.
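The case split in the q-target can be sketched in a few lines of Python. This is only an illustration: the function name `q_target`, its arguments, and the example numbers are assumptions made here, and `next_q_values` stands for the q-network's outputs $Q(\phi_{t+1}, a^\prime; \theta)$ for every action.

```python
def q_target(reward, next_q_values, terminal, gamma=0.99):
    """Return y_t for a single transition.

    next_q_values holds Q(phi_{t+1}, a'; theta) for every action a',
    as estimated by the q-network itself (hypothetical inputs here).
    """
    if terminal:
        # Nothing follows a terminal state, so the target is the reward alone.
        return reward
    # Otherwise bootstrap from the best estimated value of the next state.
    return reward + gamma * max(next_q_values)


next_q = [0.5, 2.0, -1.0]  # made-up network outputs for three actions
y_terminal = q_target(1.0, next_q, terminal=True)
y_nonterminal = q_target(1.0, next_q, terminal=False, gamma=0.9)
print(y_terminal, y_nonterminal)
```

With these made-up values the terminal target equals the reward, while the non-terminal target adds the discounted value of the best next action.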

The q-network algorithm trains to shrink the gap between the q-target and the prediction obtained by feeding the state forward. This is the same as minimizing the mean squared error (MSE) in an ordinary regression task. The objective function of the q-network algorithm is as follows.

$$J(\theta)=\left(y_t-Q(\phi_t,a_t;\theta)\right)^2$$
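To make the objective concrete, here is a minimal sketch of one gradient-descent step on $J(\theta)$. It assumes a linear q-function (one weight vector per action) as a stand-in for the neural network, and it treats the target $y_t$ as a constant during the step; the helper names and the numbers are hypothetical.

```python
def q_values(theta, phi):
    """Linear stand-in for the q-network: one weight vector per action."""
    return [sum(w * x for w, x in zip(weights, phi)) for weights in theta]


def squared_td_error(theta, phi, action, y):
    """J(theta) = (y_t - Q(phi_t, a_t; theta))^2 for one transition."""
    return (y - q_values(theta, phi)[action]) ** 2


def sgd_step(theta, phi, action, y, lr=0.1):
    """One gradient-descent step on J, holding the target y constant."""
    error = y - q_values(theta, phi)[action]
    # dJ/dtheta[action] = -2 * error * phi, so descent adds lr * 2 * error * phi.
    theta[action] = [w + lr * 2.0 * error * x for w, x in zip(theta[action], phi)]
    return theta


theta = [[0.0, 0.0], [0.0, 0.0]]  # 2 actions, 2 features (made-up sizes)
phi, action, y = [1.0, 0.5], 1, 2.0
loss_before = squared_td_error(theta, phi, action, y)
theta = sgd_step(theta, phi, action, y)
loss_after = squared_td_error(theta, phi, action, y)
print(loss_before, loss_after)
```

Only the weights of the action that was actually taken change, which mirrors how the loss involves only $Q(\phi_t, a_t; \theta)$.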

Apart from the fact that both the current q-value and the q-target for the next step are outputs of a neural network, the procedure is identical to that of a q-learning agent. The pseudo code of the q-network algorithm is as follows.

Figure 04: pseudo code of q-network algorithm

The pseudo code also shows that the q-network agent follows an $\epsilon$-greedy policy: it takes a random action with probability $\epsilon$ and the action with the maximum q-value with probability $(1-\epsilon)$.
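A minimal sketch of $\epsilon$-greedy action selection follows. The function name `epsilon_greedy`, the `rng` argument, and the q-values are illustrative assumptions rather than part of the paper's pseudo code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)  # seeded only to make the demo repeatable
q = [0.1, 0.7, 0.3]     # made-up q-values for three actions
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)
explored = [epsilon_greedy(q, epsilon=1.0, rng=rng) for _ in range(200)]
print(greedy_action, sorted(set(explored)))
```

With `epsilon=0.0` the agent always exploits, and with `epsilon=1.0` it always explores; in practice the value lies in between and is often decayed over training.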

This algorithm changes only one part of q-learning, approximating q-values with a neural network. What are its drawbacks? The weaknesses of the q-network algorithm, which are also those of neural-network-based reinforcement learning before DQN, fall into three groups: 1) insufficient training data and low data efficiency, 2) high correlation between samples, and 3) the non-stationary target problem.

  • Insufficient training data and low data efficiency

Neural networks are generally effective for supervised and unsupervised learning, where training data is plentiful. In reinforcement learning with temporal difference (TD) learning, a transition used for one update is never used again. This makes it hard to gather enough data to train a neural network, and good past transitions are simply lost.

  • High correlation between samples

Transitions, the training data of reinforcement learning, are highly correlated across time steps, because the action taken in the previous transition determines the current and following transitions. It is much like how what you had for lunch, or for dinner yesterday, influences what you choose for dinner tonight. When samples are highly correlated, the network learns little relative to the number of transitions it has seen (similar samples, similar actions, low error), and the samples that follow become dependent on the actions chosen in early transitions. As a result, training risks getting stuck in a local optimum instead of reaching the global optimum.

  • Non-stationary target problem

Consider updating the q-network at every time step. As explained above, the q-network algorithm uses one and the same network to approximate both the q-target and the q-value. We update the q-network to approximate the optimal q-values, but because the same network is used for both, the approximated q-target changes every time the network is updated. In other words, the q-target we are trying to reach is not fixed; its value shifts with each update. This is called the non-stationary target problem, and it is often likened to shooting an arrow at a moving target.
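The moving target can be seen in a toy example. The sketch below assumes a linear q-function in place of a neural network, with made-up weights and features; it replays the same transition five times and recomputes the q-target from the shared weights before each update.

```python
def q_values(theta, phi):
    """Linear stand-in for the q-network: one weight vector per action."""
    return [sum(w * x for w, x in zip(weights, phi)) for weights in theta]


def target(theta, reward, phi_next, gamma):
    """q-target computed from the same weights that are being trained."""
    return reward + gamma * max(q_values(theta, phi_next))


gamma, lr = 0.9, 0.1
theta = [[0.5, 0.5], [0.2, 0.1]]  # shared by the prediction and the target
phi, action, reward = [1.0, 0.0], 0, 1.0
phi_next = [0.9, 0.1]             # the next state resembles the current one

targets = []
for _ in range(5):
    y = target(theta, reward, phi_next, gamma)
    targets.append(y)
    error = y - q_values(theta, phi)[action]
    # Gradient step on the squared error (the factor 2 is folded into lr).
    theta[action] = [w + lr * error * x for w, x in zip(theta[action], phi)]
print(targets)
```

Because the current and next states share features, every step that moves the prediction toward the target also moves the target itself, so the printed values keep drifting instead of staying put.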

These three problems were regarded as limitations not only of the q-network algorithm but of neural networks in reinforcement learning as a whole. DQN resolved them with relatively simple ideas, which the next section covers.

Deep q-network

In keeping with the purpose of RL-background, we skip the detailed preprocessing steps and focus on how DQN addressed the three drawbacks.

Experience replay

Of the three drawbacks above, DQN addresses 1) insufficient training data and low data efficiency and 2) high correlation between samples by introducing a replay buffer. The technique is called experience replay: at every time step the agent stores the transition it obtained, a <state, action, reward, next state> tuple, in a buffer and draws from the buffer later. The replay buffer is filled in first-in-first-out order, and at a user-defined training interval transitions stored in the buffer are sampled at random to train the network. Figure 05 illustrates the scheme.

Figure 05: experience replay

In Figure 05, the large arrows mark the path of interaction with the environment, and the solid-line arrows mark the path of the transitions used to train the CNN.

Storing the transitions gathered from the environment in the replay buffer makes up for the shortage of training data, since each transition can be reused. Training the CNN on transitions sampled at random from the buffer also breaks up the high correlation between samples.
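A replay buffer of this kind can be sketched with the Python standard library. The class name `ReplayBuffer`, its methods, and the extra `done` flag (marking terminal transitions) are assumptions for illustration; the text only specifies FIFO storage of transition tuples and uniform random sampling.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) silently drops the oldest item once it is full.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement ignores the time ordering.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=5, seed=0)
for t in range(8):  # integers stand in for real game frames
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=3)
print(len(buffer), batch)
```

After eight pushes into a buffer of capacity five, only the five most recent transitions remain, and each sampled batch mixes transitions from different time steps.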

Fixed q-target

The non-stationary target problem arises because the q-target and the q-value are estimated by the same network. The authors resolved it by keeping two networks, one for each estimate (the separate target network appears in the 2015 Nature version [1]). The network that estimates the q-target is called the target network ($\bar\theta$); at a fixed interval it receives a copy of the weights of the network that approximates the q-value ($\theta$). For example, if the q-value network is updated on every batch and its weights are copied to the target network every 32 batches, training takes the form shown in Figure 06.

Figure 06: fixed q-target

Keeping two networks, one approximating the q-target and the other the q-value, holds the target fixed between weight copies and thereby alleviates the non-stationary target problem.
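The copy schedule can be sketched as follows. This is a deliberately stripped-down illustration: each "network" is a dictionary with a single hypothetical weight `w`, and the gradient update is replaced by a placeholder increment, so only the timing of the weight copy is shown. The period of 32 batches is the example value from the text.

```python
import copy

SYNC_EVERY = 32  # copy period in batches (example value from the text)


def train(num_batches, sync_every=SYNC_EVERY):
    online = {"w": 0.0}             # theta: updated on every batch
    target = copy.deepcopy(online)  # theta-bar: frozen between copies
    history = []
    for step in range(1, num_batches + 1):
        online["w"] += 1.0          # placeholder for one gradient update
        if step % sync_every == 0:
            # Periodic hard copy of the online weights into the target network.
            target = copy.deepcopy(online)
        history.append((step, online["w"], target["w"]))
    return history


history = train(num_batches=70)
print(history[30], history[31], history[69])
```

Between copies the target weights stay constant, so every q-target computed in that window comes from the same frozen network.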

To summarize, DQN improves on the q-network algorithm by introducing a replay buffer (experience replay) and by using a q-target network that is separate from the network estimating the q-value. Doesn't it seem that a complex problem was solved with relatively simple means? The pseudo code of DQN is as follows.

Figure 07: pseudo code of DQN

Unlike the q-network pseudo code, the network that estimates the target differs from the network that approximates the q-value, and the weights are copied to the target network at a fixed interval.

Results

DQN matched or exceeded human performance on most games in the Atari environment. Figure 08 is the performance chart reported in the 2015 Nature paper.

Figure 08: DQN's performance on Atari games

As stated above, DQN achieves its high performance through experience replay and the fixed q-target. How much does each of the two contribute? The ablation study in Figure 09 answers this.

Figure 09: ablation study

The rightmost column corresponds to the original q-network algorithm. On Breakout, the fixed q-target alone improves the score by about 3 times over the q-network algorithm, and experience replay alone by about 80 times. DQN, which combines the two, improves it by about 100 times.

As mentioned, DQN solves a complex problem with relatively simple means. It also opened the door to deep reinforcement learning, a contribution significant enough to be published in Nature. Some eight years have passed since DQN first appeared, and many algorithms that improve on it have been proposed.

Having read this far, which aspects of DQN do you think could be improved further? It is worth pausing to think about it before moving on to the next chapter.

Reference

[1] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.

