[arXiv 2013] Playing Atari with Deep Reinforcement Learning

Paper url: https://arxiv.org/pdf/1312.5602.pdf

Author and affiliation

Figure 01: paper snapshot

Introduction

Q-learning ģ‚¬ė”€ģ—ģ„œ ģ–øźø‰ķ•˜ģ˜€ė“Æģ“ ź³ ģ°Øģ›ģ˜ state-action spaceģ—ģ„œ table źø°ė°˜ģ˜ ź°•ķ™”ķ•™ģŠµģ€ ģ ģš©ķ•  수 ģ—†ź±°ė‚˜ ė§Žģ€ ģ–‘ģ˜ memory resource넼 ķ•„ģš”ė”œķ•˜źø° ė•Œė¬øģ— ģ ķ•©ķ•˜ģ§€ ģ•Šė‹¤. ģ“ė„¼ ģœ„ķ•“ ģ‹¤ģ œ q-value넼 ė„ģ¶œ 후 table에 ģ €ģž„ķ•˜ėŠ” ė°©ģ‹ģ„ ģ‚¬ģš©ķ•˜ģ§€ ģ•Šź³  ķŠ¹ģ • ķ•Øģˆ˜(e.g. linear combination, decision tree, support vector machine)넼 ģ“ģš©ķ•˜ģ—¬ ź·¼ģ‚¬ķ•˜ėŠ” ė°©ė²•ģ“ ė“±ģž„ķ•˜ģ˜€ė‹¤. ė‹¤ģ–‘ķ•œ ķ•Øģˆ˜ė„¼ ķ†µķ•“ģ„œ q-value넼 근사할 수 ģžˆģœ¼ė‚˜ ė³ø ė…¼ė¬øģ“ ė°œķ‘œė˜ėŠ” ģ‹œģ ź³¼ ė§žė¬¼ė ¤ ģ—¬ėŸ¬ task (ķŠ¹ķžˆ computer vision)ģ—ģ„œ ė‘ź°ģ„ ė³“ģ“ė˜ neural networkģ“ 주딜 ģ‚¬ģš©ė˜ė©° ģ“ėŠ” 심층 ź°•ķ™”ķ•™ģŠµ(deep reinforcement learning)ģ˜ ź°œė…ģ„ 널리 ģ•Œė¦¬ź²Œ ė˜ģ—ˆė‹¤.

ė³ø ė…¼ė¬øģ€ 2013ė…„ arXiv preprint 후 세부 ė‚“ģš©ģ„ ģ¶”ź°€ķ•˜ģ—¬ 2015ė…„ Nature에 ź²Œģž¬ [1]된 ė°” ģžˆė‹¤. ė…¼ė¬øģ—ģ„œ ģ œģ•ˆėœ deep q-network (DQN)ģ€ ź°€ģ¹˜ 기반 ź°•ķ™”ķ•™ģŠµ(value-based reinforcement learning)ģ˜ ėŒ€ķ‘œģ ģø ģ•Œź³ ė¦¬ģ¦˜ģœ¼ė”œ ģžė¦¬ ģž”ź³  ģžˆė‹¤. Figure 02ėŠ” Atari-Breakoutģ„ ķ”Œė ˆģ“ķ•˜ėŠ” DQN agentģ˜ ėŖØģŠµģ„ ė³¼ 수 ģžˆģœ¼ė©° 약 400ė²ˆģ˜ ķ•™ģŠµ 횟수(episode) ķ›„ģ—ėŠ” ģ‚¬ėžŒģ“ ķ•˜ėŠ” 것과 ģœ ģ‚¬ķ•œ ģ„±ėŠ„ģ„ ė³“ģøė‹¤. ė” ė†€ė¼ģš“ ģ ģ€ 600ė²ˆģ˜ ķ•™ģŠµģ„ 가진 agentėŠ” ģ‚¬ėžŒė„ ģƒź°ķ•˜źø° ķž˜ė“  ģ „ėžµģ„ ķ„°ė“ķ•˜ėŠ”ė°, ķ•œģŖ½ģ˜ ė²½ėŒģ„ ėš«ģ€ 후 ź³µģ„ ź·ø ģœ„ė”œ ė³“ė‚“ģ„œ ģµœģ†Œķ•œģ˜ ģ›€ģ§ģž„ģœ¼ė”œ ė§Žģ€ ė²½ėŒģ„ ź¹ØėŠ” ė°©ė²•ģ“ ź·øź²ƒģ“ė‹¤.

Figure 02: playing Atari Breakout

그렇다멓 ģ“ėŸ¬ķ•œ agentėŠ” ģ–“ė–»ź²Œ ė§Œė“¤ 수 ģžˆģ„ź¹Œ? DQNģ„ ģ„¤ėŖ…ķ•˜źø°ģ— ģ•žģ„œ neural network넼 ģ‚¬ģš©ķ•˜ģ—¬ q-value넼 ź·¼ģ‚¬ķ•˜ėŠ” ź°€ģž„ źø°ģ“ˆģ ģø ģ•Œź³ ė¦¬ģ¦˜ģ„ ģ•Œģ•„ė³“ģž.

Q-network algorithm

DQNģ€ ź°•ķ™”ķ•™ģŠµģ— neural network넼 ģ ‘ėŖ©ģ‹œķ‚Ø ģµœģ“ˆģ˜ ģ•Œź³ ė¦¬ģ¦˜ģ“ ģ•„ė‹ˆė‹¤. 기씓에 neural network źø°ė°˜ģ˜ ģ—¬ėŸ¬ ģ‹œė„ė“¤ģ“ ģžˆģ—ˆź³  ź°•ķ™”ķ•™ģŠµģ—ėŠ” ģ‹ ź²½ė§ģ“ ģ ķ•©ķ•˜ģ§€ ģ•Šė‹¤ėŠ” ģ˜ź²¬ģ“ ė‹¤ģˆ˜ģ˜€ė‹¤. źø°ģ”“ģ˜ ė°©ė²•ģ€ ģ–“ė–¤ ė¬øģ œģ ģ“ ģžˆģ—ˆģ„ź¹Œ? ģš°ģ„  q-value넼 ź·¼ģ‚¬ķ•˜ėŠ” networkėŠ” state넼 input으딜 받고 action에 ėŒ€ķ•œ q-value넼 output으딜 ģ‚¼ėŠ”ė‹¤. 그림으딜 ķ‘œķ˜„ķ•˜ė©“ Figure 03ź³¼ 같다.

Figure 03: q-network architecture

Neural network넼 ķ•™ģŠµķ•˜źø° ģœ„ķ•“ģ„œėŠ” ėŖØėøģ˜ ģ˜ˆģø”ź°’ź³¼ ģ •ė‹µģ“ ķ•„ģš”ķ•˜ė‹¤. ģ˜ˆģø”ź°’ģ€ input state넼 feed forwardķ•˜ė©“ 구할 수 ģžˆź³  ģ •ė‹µģ€ ė¬“ģ—‡ģ“ 될까? ė³µģž”ķ•˜ź²Œ ģƒź°ķ•  ķ•„ģš”ź°€ 없다. Q-network algorithmģ€ ģ•žģ—ģ„œ 배욓 q-learning ģ•Œź³ ė¦¬ģ¦˜ģ„ neural network딜 ģ˜®źø°ėŠ” 과정에 ė¶ˆź³¼ķ•˜ė‹¤ ź²ƒģ„ ģžŠģœ¼ė©“ ģ•ˆėœė‹¤. Q-learningģ—ģ„œ q-value넼 갱신할 ė•Œ ė‹¤ģŒ ģ‹ģ„ ģ‚¬ģš©ķ•˜ģ˜€ė‹¤.

Q(St,At)=Q(St,At)+α[Rt+1+γmax⁔a′Q(St+1,a′)āˆ’Q(St,At)]Q(S_t,A_t)=Q(S_t,A_t)+\alpha[R_{t+1} + \gamma\max_{a^\prime}Q(S_{t+1},a^\prime)-Q(S_t,A_t)]

ģœ„ ģ‹ģ—ģ„œ ź°±ģ‹ ģ˜ ėŒ€ģƒ(q-target)ģ€ Rt+1+γmax⁔a′Q(St+1,a′)R_{t+1} + \gamma\max_{a^\prime}Q(S_{t+1},a^\prime)ģ“ė©° q-network algorithmė„ ģ“ģ™€ ė™ģ¼ķ•˜ė‹¤. Q-network ģ•Œź³ ė¦¬ģ¦˜ģ˜ ķ•™ģŠµ ėŒ€ģƒ(q-target)ģ˜ notationģ€ ė‹¤ģŒź³¼ 같다.

yt={rtĀ forĀ terminalĀ Ļ•t+1rt+γmax⁔a′Q(Ļ•t+1,a′;Īø))Ā forĀ non-terminalĀ Ļ•t+1y_{t}=\left\{\begin{array}{cl} r_{t} & \text { for terminal } \phi_{t+1} \\ \left.r_{t}+\gamma \max _{a^{\prime}} Q\left(\phi_{t+1}, a^{\prime} ; \theta\right)\right) & \text { for non-terminal } \phi_{t+1} \end{array}\right.

Q-target에 ėŒ€ķ•œ notationģ„ ķ™•ģøķ•“ė³“ė©“, next stateź°€ ģ¢…ė£Œ 상태(terminal state)ģ¼ ź²½ģš°ėŠ” ė³“ģƒģ„ target으딜, non-terminal stateė¼ė©“ q-learningź³¼ ė™ģ¼ķ•˜ź²Œ 볓상 + ź°ź°€ėœ ė‹¤ģŒ ģƒķƒœģ˜ maximum q-value넼 target으딜 ģ‚¼ėŠ” ź²ƒģ„ ķ™•ģøķ•  수 ģžˆė‹¤. ķ•œ 가지 다넸 ģ ģ€ q-targetģ˜ maximum q-value ė˜ķ•œ q-network딜 ģ¶”ģ •ėœ ź°’ģ“ė¼ėŠ” ģ‚¬ģ‹¤ģ“ė‹¤.

Q-network ģ•Œź³ ė¦¬ģ¦˜ģ€ q-targetź³¼ state넼 feed forward ķ–ˆģ„ ė•Œ ģ¶œė „ė˜ėŠ” ģ˜ˆģø”ź°’ ģ‚¬ģ“ģ˜ ģ°Øģ“ė„¼ ģ¤„ģ“ėŠ” ė°©ķ–„ģœ¼ė”œ ķ•™ģŠµģ„ ģ§„ķ–‰ķ•œė‹¤. ģ“ėŠ” ģ¼ė°˜ģ ģø regression task넼 ķ’€ ė•Œģ™€ ė™ģ¼ķ•˜ź²Œ mean squared error (MSE)넼 ģ¤„ģ“ėŠ” 것과 ė™ģ¼ķ•˜ė‹¤. Q-network ģ•Œź³ ė¦¬ģ¦˜ģ˜ ėŖ©ģ ķ•Øģˆ˜ė„¼ 적으멓 ė‹¤ģŒź³¼ 같다.

J(Īø)=(ytāˆ’Q(Ļ•t,at;Īø))2J(\theta)=(y_t-Q(\phi_t,a_t;\theta))^2

ķ˜„ģž¬ ģ‹œģ ģ˜ Q-value와 ė‹¤ģŒ ģ‹œģ ģ˜ q-targetģ“ neural networkģ˜ outputģ“ė¼ėŠ” ģ‚¬ģ‹¤ģ„ ģ œģ™øķ•˜ė©“ q-learning agent와 100% ė™ģ¼ķ•˜ė‹¤. Q-network ģ•Œź³ ė¦¬ģ¦˜ģ˜ pseudo codeėŠ” ė‹¤ģŒź³¼ 같다.

Figure 04: pseudo code of q-network algorithm

Q-network agent ė˜ķ•œ ϵ\epsilonģ˜ ķ™•ė„ ė”œ random actionģ„, (1āˆ’Ļµ)(1-\epsilon)ģ˜ ķ™•ė„ ė”œ maximum q-value ķ–‰ė™ģ„ ģ·Øķ•˜ėŠ” Ļµāˆ’\epsilon-greedy policy넼 ė”°ė„øė‹¤ėŠ” ź²ƒģ„ ģ•Œ 수 ģžˆė‹¤.

기씓 Q-learning ģ•Œź³ ė¦¬ģ¦˜ģ—ģ„œ q-value넼 ģ‹ ź²½ė§ģœ¼ė”œ ź·¼ģ‚¬ķ•˜ėŠ” ė¶€ė¶„ė§Œ ė³€ź²½ėœ ģ“ ģ•Œź³ ė¦¬ģ¦˜ģ€ ģ–“ė– ķ•œ ė‹Øģ ģ“ ģžˆģ„ź¹Œ? Q-network ģ•Œź³ ė¦¬ģ¦˜ģ˜ ė‹Øģ ģ“ģž, DQN ģ“ģ „ģ˜ neural network źø°ė°˜ģ˜ ź°•ķ™”ķ•™ģŠµ ė‹Øģ ģ€ 크게 3ź°€ģ§€ė”œ 1) ė¶€ģ”±ķ•œ ķ•™ģŠµ ė°ģ“ķ„° ė° ė‚®ģ€ ė°ģ“ķ„° ķšØģœØģ„±, 2) high correlation between samples, 3) non-stationary target problemģ„ 들 수 ģžˆė‹¤.

  • ė¶€ģ”±ķ•œ ķ•™ģŠµ ė°ģ“ķ„° ė° ė‚®ģ€ ė°ģ“ķ„° ķšØģœØģ„±

ģ¼ė°˜ģ ģœ¼ė”œ neural networkėŠ” ķ•™ģŠµ ė°ģ“ķ„°ź°€ ė§Žģ€ supervised, unsupervised learning에 ķšØģœØģ ģø ė°©ė²•ģ“ė‹¤. ź°•ķ™”ķ•™ģŠµģ—ģ„œ ģ‹œź°„ ģ°Ø ķ•™ģŠµ(temporal difference learning, TD)ģ˜ 경우 ķ•œė²ˆ update에 ģ‚¬ģš©ķ•œ transitionģ€ ė‹¤ģ‹œ ģ‚¬ģš©ķ•˜ģ§€ ģ•ŠėŠ”ė‹¤. ģ“ ė•Œė¬øģ— neural network넼 ķ•™ģŠµģ‹œķ‚¤źø°ģ— ģ¶©ė¶„ķ•œ ė°ģ“ķ„°ė„¼ ķ™•ė³“ķ•˜źø° ķž˜ė“¤ė©° ź³¼ź±°ģ˜ ģ¢‹ģ€ transitionģ“ ķœ˜ė°œė˜ėŠ” ė‹Øģ ģ“ ģžˆė‹¤.

  • High correlation between samples

ź°•ķ™”ķ•™ģŠµģ˜ ķ•™ģŠµ ė°ģ“ķ„°ģ— ķ•“ė‹¹ķ•˜ėŠ” transitionģ€ time-step ź°„ correlationģ“ źµ‰ģž„ģ“ 높다. ģ“ėŠ” 직전 transitionģ—ģ„œ action에 ģ˜ķ•“ ķ˜„ģž¬ ė° ė‹¤ģŒ transitionģ“ ź²°ģ •ė˜źø° ė•Œė¬øģ“ė‹¤. ģ¼ģƒģƒķ™œģ—ģ„œ 저녁 메뉓넼 ź³ ė„“ėŠ” ė° ģžˆģ–“ 점심, ķ˜¹ģ€ ģ–“ģ œ 저녁에 ėØ¹ģ—ˆė˜ 메뉓가 ģ˜ķ–„ģ„ ģ£¼ėŠ” 것과 ģœ ģ‚¬ķ•˜ė‹¤. Sample ź°„ correlationģ“ ė†’ģ„ 경우 networkź°€ 거쳐온 transition 양에 비핓 ģœ ģ˜ėÆøķ•œ ķ•™ģŠµėŸ‰ģ“ ė§Žģ§€ ģ•Šź±°ė‹ˆģ™€(ģœ ģ‚¬ķ•œ sample, ė¹„ģŠ·ķ•œ action, ė‚®ģ€ error), 쓈기 transitionģ—ģ„œ ģ„ ķƒķ•œ action에 ģ˜ķ•“ ģ•žģœ¼ė”œģ˜ sampleė“¤ģ“ ģ¢…ģ†ė˜ėŠ” ķ˜„ģƒģ“ ė°œģƒķ•œė‹¤. ė”°ė¼ģ„œ global optimum에 ė„ė‹¬ķ•˜ģ§€ ėŖ»ķ•˜ź³  local minimum에 빠질 ģœ„ķ—˜ģ“ ė°œģƒķ•œė‹¤.

  • Non-stationary target problem

매 time-step ė§ˆė‹¤ q-network넼 updateķ•˜ėŠ” ģƒķ™©ģ„ ģƒź°ķ•“ė³“ģž. ģ•žģ„œ q-network ģ•Œź³ ė¦¬ģ¦˜ģ€ ė™ģ¼ķ•œ network넼 ģ“ģš©ķ•˜ģ—¬ q-targetź³¼ q-value넼 ź·¼ģ‚¬ķ•œė‹¤ź³  ģ„¤ėŖ…ķ•˜ģ˜€ė‹¤. ģš°ė¦¬ėŠ” ģµœģ ģ˜ q-value넼 ź·¼ģ‚¬ķ•˜źø° ģœ„ķ•“ģ„œ q-network넼 ź°±ģ‹ ķ•˜ź²Œ ė˜ėŠ”ė°, ė™ģ¼ķ•œ network넼 ģ‚¬ģš©ķ•˜ėŠ” ģ•Œź³ ė¦¬ģ¦˜ģ˜ ķŠ¹ģ„±ģƒ network 갱신 주기에 ė”°ė¼ģ„œ q-targetģ˜ 근사값 ė˜ķ•œ ė³€ķ•˜ź²Œ ėœė‹¤. 즉 ģ—…ė°ģ“ķŠø ėŒ€ģƒģ“ ė˜ėŠ” q-targetģ“ ź³ ģ •ė˜ģ–“ ģžˆģ§€ ģ•Šź³  갱신할 ė•Œė§ˆė‹¤ ź°’ģ“ ė‹¬ė¼ģ§€ėŠ” ķ˜„ģƒģ“ ė°œģƒķ•œė‹¤. ģ“ė„¼ non-stationary target problemģ“ė¼ ģ¹­ķ•˜ė©° ķ”ķžˆ ģ›€ģ§ģ“ėŠ” 과녁에 ķ™”ģ‚“ģ„ ė§žģ¶”ėŠ” 것으딜 ė¹„ģœ ķ•œė‹¤.

ģœ„ 3가지 ė¬øģ œėŠ” 비단 q-network ģ•Œź³ ė¦¬ģ¦˜ģ˜ ķ•œź³„ 뿐만 ģ•„ė‹ˆė¼ ź°•ķ™”ķ•™ģŠµģ—ģ„œ neural networkģ˜ ķ•œź³„ė”œ 받아들여지고 ģžˆģ—ˆė‹¤. DQNģ€ 비교적 ź°„ė‹Øķ•œ idea넼 통핓 ģ“ė„¼ ķ•“ź²°ķ•˜ģ˜€ėŠ”ė° ė‹¤ģŒ ģ ˆģ„ ķ†µķ•“ģ„œ ģ•Œģ•„ė³“ė„ė” ķ•˜ģž.

Deep q-network

RL-backgroundģ˜ 취지에 ė§žģ¶”ģ–“ ģ„øė¶€ģ ģø preprocessing ź³¼ģ •ģ€ ģƒėžµķ•˜ź³  3가지 ė‹Øģ ģ„ ģ–“ė–»ź²Œ ź°œģ„ ķ•˜ģ˜€ėŠ”ģ§€ ģ¤‘ģ ģ ģœ¼ė”œ ģ•Œģ•„ė³“ģž.

Experience replay

DQNģ€ ģ•žģ„œ ģ–øźø‰ķ•œ 3가지 단점 중 1) ė¶€ģ”±ķ•œ ķ•™ģŠµ ė°ģ“ķ„°ģ™€ ė‚®ģ€ ė°ģ“ķ„° ķšØģœØģ„±, 2) high correlation between samples 문제넼 replay bufferģ˜ ė„ģž…ģœ¼ė”œ ķ•“ź²°ķ•˜ģ˜€ė‹¤. ģ“ė„¼ experience replayė¼ź³  ė¶€ė„“ėŠ”ė°, agentź°€ 매 time-step ė§ˆė‹¤ ķšė“ķ•œ transition <state, action, reward, next state> tupleģ„ buffer에 ģ €ģž„ 후 źŗ¼ė‚“ģ“°ėŠ” ė°©ģ‹ģ“ė‹¤. Replay bufferėŠ” first in first out ķ˜•ķƒœė”œ źø°ė”ė˜ė©° ģ‚¬ģš©ģžź°€ ģ„¤ģ •ķ•œ ķ•™ģŠµ ģ£¼źø°ė§ˆė‹¤ buffer에 ģ €ģž„ėœ transitionģ„ random samplingķ•˜ģ—¬ network넼 ķ•™ģŠµķ•œė‹¤. ė„ģ‹ķ™” ķ•˜ė©“ Figure 05와 같다.

Figure 05: experience replay

Figure 05ģ—ģ„œ 큰 ķ™”ģ‚“ķ‘œģ˜ 경우 environment와 ģƒķ˜øģž‘ģš©ķ•˜ėŠ” path넼 ėœ»ķ•˜ź³  실선 ķ™”ģ‚“ķ‘œģ˜ 경우 CNN ķ•™ģŠµģ— ģ‚¬ģš©ė˜ėŠ” transition path넼 ėœ»ķ•œė‹¤.

Environment와 ģƒķ˜øģž‘ģš©ķ•˜ėŠ” transitionė“¤ģ„ replay buffer에 ģ €ģž„ķ•˜ėŠ” ź²ƒģ„ 통핓 ė¶€ģ”±ķ•œ ķ•™ģŠµ ė°ģ“ķ„° ģ–‘ģ„ 볓충할 수 ģžˆė‹¤. ė˜ķ•œ CNNģ„ ķ•™ģŠµķ•  ė•Œ buffer딜 부터 random sampled transitionģ„ ķ™œģš©ķ•Øģœ¼ė”œģØ sampleź°„ ė†’ģ€ correlation 문제넼 ķ•“ź²°ķ•˜ģ˜€ė‹¤.

Fixed q-target

Non-stationary target problemģ€ q-targetź³¼ q-value넼 ė™ģ¼ķ•œ network딜 ģ¶”ģ •ķ–ˆźø° ė•Œė¬øģ— ė°œģƒķ•˜ģ˜€ė‹¤. ģ €ģžė“¤ģ€ ģ“ė„¼ ė¶„ė¦¬ķ•˜ģ—¬ ź°ź°ģ„ ė”°ė”œ ģ¶”ģ •ķ•˜ė„ė” 두 ź°œģ˜ network넼 ė‘ ģœ¼ė”œģØ ķ•“ź²°ķ•˜ģ˜€ė‹¤. Q-targetģ„ ģ¶”ģ •ķ•˜ėŠ” network넼 target network (θˉ\bar\theta) 딜 부넓며 ģ¼ģ • ģ£¼źø°ė§ˆė‹¤ q-value넼 ź·¼ģ‚¬ķ•˜ėŠ” network (Īø\theta)ģ˜ ź°€ģ¤‘ģ¹˜ė„¼ ė³µģ‚¬ķ•˜ģ—¬ ģ‚¬ģš©ķ•œė‹¤. ė§Œģ•½ 매 batchė§ˆė‹¤ q-value넼 ź·¼ģ‚¬ķ•˜ėŠ” network넼 ź°±ģ‹ ķ•˜ź³ , 32 batch ė§ˆė‹¤ target network딜 ź°€ģ¤‘ģ¹˜ė„¼ ė³µģ‚¬ķ•˜ė©“ Figure 06ź³¼ ź°™ģ€ ķ˜•ķƒœė„¼ ėˆė‹¤.

Figure 06: fixed q-target

ģ“ģ™€ ź°™ģ“ 두 ź°œģ˜ network넼 두고 각각 q-targetź³¼ q-value넼 ź·¼ģ‚¬ķ•Øģœ¼ė”œģØ non-stationary target problemģ„ ķ•“ź²°ķ•  수 ģžˆė‹¤.

ģ •ė¦¬ķ•˜ė©“ DQNģ€ replay bufferģ˜ ė„ģž…(experience replay)ź³¼ q-value넼 ģ¶”ģ •ķ•˜ėŠ” network와 ė…ė¦½ėœ q-target network넼 ģ‚¬ģš©ķ•Øģœ¼ė”œģØ 기씓 q-network ģ•Œź³ ė¦¬ģ¦˜ģ˜ ė‹Øģ ģ„ ź°œģ„ ķ•  수 ģžˆģ—ˆė‹¤. ė³µģž”ķ•œ 문제넼 비교적 ź°„ė‹Øķ•œ ė°©ė²•ģœ¼ė”œ ķ•“ź²°ķ–ˆė‹¤ėŠ” ģƒź°ģ“ 들지 ģ•ŠėŠ”ź°€? DQNģ˜ pseudo codeėŠ” ė‹¤ģŒź³¼ 같다.

Figure 07: pseudo code of DQN

Q-network pseudo code와 달리 targetģ„ ģ¶”ģ •ķ•˜ėŠ” network와 q-value넼 ź·¼ģ‚¬ķ•˜ėŠ” networkź°€ 다넸 ź²ƒģ„ ķ™•ģøķ•  수 ģžˆģœ¼ė©° ģ¼ģ • ģ£¼źø°ė§ˆė‹¤ target network딜 ź°€ģ¤‘ģ¹˜ė„¼ ė³µģ‚¬ķ•˜ėŠ” ėŖØģŠµģ„ ė³¼ 수 ģžˆė‹¤.

Results

DQNģ€ Atari ķ™˜ź²½ģ˜ ėŒ€ė¶€ė¶„ģ˜ ź²Œģž„ģ—ģ„œ ģ‚¬ėžŒģ˜ ģ„±ėŠ„ģ„ ė„˜ź±°ė‚˜ 그에 ģ¤€ķ•˜ėŠ” ģ„±ėŠ„ģ„ ė³“ģ˜€ė‹¤. Figure 08ģ€ 2015 Nature에 ź²Œģž¬ėœ 논문에 źø°ė”ėœ ģ„±ėŠ„ ė„ķ‘œģ“ė‹¤.

Figure 08: DQN's performance on Atari games

ģ•žģ„œ DQNģ€ experience replay와 fixed q-targetģ„ 통핓 ė†’ģ€ ģ„±ėŠ„ģ„ 달성할 수 ģžˆģ—ˆė‹¤ź³  ė°ķ˜”ė‹¤. 그렇다멓 ģ“ 두 가지 ė°©ė²•ģ˜ contributionģ€ ģ–“ėŠģ •ė„ 될까? ģ“ģ— ėŒ€ķ•œ ablation studyėŠ” Figure 09와 같다.

Figure 09: ablation sutdy

ģ œģ¼ ģš°ģø”ģ“ źø°ģ”“ģ˜ q-network algorithm에 ķ•“ė‹¹ķ•œė‹¤. Breakoutģ„ źø°ģ¤€ģœ¼ė”œ fixed q-targetģ€ q-network algorithm ėŒ€ė¹„ 약 3ė°°ģ˜ ģ„±ėŠ„ ķ–„ģƒģ„, experience replayėŠ” 약 80ė°°ģ˜ ģ„±ėŠ„ ķ–„ģƒģ„ 볓여준다. ė§ˆģ§€ė§‰ģœ¼ė”œ ģ“ ė‘˜ģ„ 모두 ķ•©ķ•œ DQNģ€ 약 100ė°°ģ˜ ģ„±ėŠ„ ķ–„ģƒģ„ 볓여준다.

DQNģ€ ģ•žģ„œ ģ–øźø‰ķ•˜ģ˜€ė“Æ ė³µģž”ķ•œ 문제넼 비교적 ź°„ė‹Øķ•œ ė°©ė²•ģœ¼ė”œ ķ•“ź²°ķ•œ ģ•Œź³ ė¦¬ģ¦˜ģ“ė‹¤. ė˜ķ•œ deep reinforce-ment learning (심층 ź°•ķ™”ķ•™ģŠµ)ģ˜ ķ¬ė¬øģ„ ģ—“ģ–“ģ –ķžŒ ģ•Œź³ ė¦¬ģ¦˜ģœ¼ė”œ źø°ģ—¬ķ•˜ėŠ” 바가 커 Natureģ—ė„ ė“±ģž¬ė˜ģ—ˆė‹¤. DQNģ“ ģ²˜ģŒ ė‚˜ģ˜Ø ģ§€ 8ė…„ģ—¬ģ˜ ģ‹œź°„ģ“ ķė„ø 만큼 ģ“ė„¼ ź°œģ„ ķ•œ ģ•Œź³ ė¦¬ģ¦˜ģ“ ė§Žģ“ ė“±ģž„ķ•˜ģ˜€ė‹¤.

ģ—¬źø°ź¹Œģ§€ ģ½ģ€ ė…ģžė“¤ģ€ 과연 DQNģ˜ ģ–“ė–¤ ģ ģ„ ė” ź°œģ„ ķ•  수 ģžˆģ„ź²ƒģœ¼ė”œ ė³“ģ“ėŠ”ź°€? ė‹¤ģŒ ģž„ģœ¼ė”œ ė„˜ģ–“ź°€źø° 전에 ģž ź¹ģ˜ ģƒź°ģ„ 가져볓멓 ģ¢‹ģ„ 것 같다.

Reference

[1] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. nature, 518(7540), 529-533.

ķŒØķ‚¤ģ§€ ģ—†ģ“ R딜 źµ¬ķ˜„ķ•˜ėŠ” 심층 ź°•ķ™”ķ•™ģŠµ [yes24]arrow-up-right [교볓문고]arrow-up-right

Last updated