[arXiv 2013] Playing Atari with Deep Reinforcement Learning
Paper url: https://arxiv.org/pdf/1312.5602.pdf
Author and affiliation

Introduction
As mentioned in the Q-learning example, table-based reinforcement learning is ill-suited to high-dimensional state-action spaces: it either cannot be applied at all or requires a large amount of memory. To address this, methods emerged that approximate the q-value with a function (e.g. linear combination, decision tree, support vector machine) instead of computing the actual q-value and storing it in a table. The q-value can be approximated with many kinds of functions, but around the time this paper was published, neural networks, which were standing out on many tasks (especially computer vision), became the main choice, and this made the concept of deep reinforcement learning widely known.
This paper appeared as an arXiv preprint in 2013 and, with additional details, was published in Nature in 2015 [1]. The deep q-network (DQN) proposed in the paper has become a representative algorithm of value-based reinforcement learning. Figure 02 shows a DQN agent playing Atari Breakout; after about 400 training episodes it performs similarly to a human player. More surprisingly, an agent trained for 600 episodes learns a strategy that even people find hard to come up with: it breaks through the bricks on one side and sends the ball above the wall, breaking many bricks with minimal movement.
So how can such an agent be built? Before explaining DQN, let us look at the most basic algorithm that uses a neural network to approximate the q-value.
Q-network algorithm
DQN is not the first algorithm to combine reinforcement learning with neural networks. There had been several earlier neural-network-based attempts, and the prevailing opinion was that neural networks are not suitable for reinforcement learning. What was wrong with the earlier methods? First, a network that approximates the q-value takes the state as input and outputs a q-value for each action. Figure 03 illustrates this.
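The input-output shape described above (one state in, one q-value per action out) can be sketched with a tiny fully connected network. This is only an illustration: the layer sizes, the names `init_layer`, `dense`, and `q_values`, and the example state are all assumptions, and the paper itself uses a CNN over game frames.

```python
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer, relu=False):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out

def q_values(state, params):
    """Forward pass: a state vector in, one q-value per action out."""
    hidden = dense(state, params[0], relu=True)
    return dense(hidden, params[1])

rng = random.Random(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 8, 3   # hypothetical sizes
params = [init_layer(STATE_DIM, HIDDEN, rng), init_layer(HIDDEN, N_ACTIONS, rng)]

state = [0.1, -0.2, 0.05, 0.3]           # hypothetical state vector
q = q_values(state, params)              # list of N_ACTIONS q-values
greedy_action = max(range(N_ACTIONS), key=lambda a: q[a])
```

A single forward pass yields the q-values of every action at once, so picking the greedy action needs only one evaluation of the network.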

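Training a neural network requires the model's prediction and a ground-truth label. The prediction is obtained by feeding the input state forward, but what is the label? There is no need to overthink it. Keep in mind that the Q-network algorithm simply carries the q-learning algorithm learned earlier over to a neural network. Q-learning updates the q-value with the following equation.
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]$$
In the equation above, the update target (q-target) is $R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$, and the same holds for the q-network algorithm. The learning target (q-target) of the Q-network algorithm is written as follows.
$$y_t = \begin{cases} R_{t+1} & \text{if } S_{t+1} \text{ is terminal} \\ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta) & \text{otherwise} \end{cases}$$
Looking at this notation, when the next state is a terminal state the target is the reward alone, and when it is non-terminal the target is, just as in q-learning, the reward plus the discounted maximum q-value of the next state. The one difference is that the maximum q-value inside the q-target is itself a value estimated by the q-network.
The Q-network algorithm trains in the direction that reduces the difference between the q-target and the prediction produced by feeding the state forward. This is the same as minimizing the mean squared error (MSE) in an ordinary regression task. The objective function of the Q-network algorithm is as follows.
$$L(\theta) = \mathbb{E}\left[ \left( y_t - Q(S_t, A_t; \theta) \right)^2 \right]$$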
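The q-target and the MSE objective described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names `q_target` and `mse_loss`, the discount factor of 0.9, and the two-transition batch are hypothetical and do not come from the paper.

```python
def q_target(reward, next_q_values, done, gamma=0.99):
    """Reward alone at a terminal next state, else reward + discounted max next q-value."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def mse_loss(targets, predictions):
    """Mean squared error between q-targets and predicted q-values."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)

# Hypothetical batch: (reward, q-values of next state, done flag, predicted Q(s, a))
batch = [
    (1.0, [0.5, 2.0, 1.0], False, 1.5),
    (-1.0, [0.3, 0.1, 0.2], True, 0.0),
]

targets = [q_target(r, next_q, done, gamma=0.9) for r, next_q, done, _ in batch]
predictions = [pred for *_, pred in batch]
loss = mse_loss(targets, predictions)
```

In the first transition the target bootstraps from the largest next-state q-value; in the second the episode has ended, so the target is just the reward.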
Apart from the fact that both the current q-value and the q-target for the next time step are outputs of a neural network, this is 100% identical to the q-learning agent. The pseudo code of the Q-network algorithm is as follows.

The Q-network agent also follows an $\epsilon$-greedy policy: it takes a random action with probability $\epsilon$ and the action with the maximum q-value with probability $(1-\epsilon)$.
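The $\epsilon$-greedy rule can be sketched as follows. The function name `epsilon_greedy`, the example q-values, and the seed are assumptions made for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the max-q action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q = [0.2, 1.3, -0.4]   # hypothetical q-values for three actions

always_greedy = [epsilon_greedy(q, 0.0, rng) for _ in range(100)]
always_random = [epsilon_greedy(q, 1.0, rng) for _ in range(100)]
```

With `epsilon=0.0` every choice is the greedy action (index 1 here); with `epsilon=1.0` the choice ignores the q-values entirely.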
What are the drawbacks of this algorithm, which changes only one part of the original Q-learning algorithm by approximating the q-value with a neural network? The drawbacks of the Q-network algorithm, which are also the drawbacks of neural-network-based reinforcement learning before DQN, fall into three broad categories: 1) insufficient training data and low data efficiency, 2) high correlation between samples, and 3) the non-stationary target problem.
Insufficient training data and low data efficiency
In general, neural networks work well for supervised and unsupervised learning, where training data is plentiful. In temporal difference (TD) learning, however, a transition used for one update is never used again. This makes it hard to gather enough data to train a neural network, and good transitions from the past are lost.
High correlation between samples
Transitions, the training data of reinforcement learning, are very highly correlated across time-steps, because the action taken in the previous transition determines the current and next transitions. It is similar to how what you ate for lunch, or for dinner the day before, influences tonight's dinner choice. When samples are highly correlated, the network learns little relative to the number of transitions it has gone through (similar samples, similar actions, low error), and the actions chosen in early transitions constrain the samples that follow. As a result, the network risks getting stuck in a local minimum instead of reaching the global optimum.
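A toy experiment can make the correlation claim concrete. The sketch below assumes a one-dimensional random walk as a stand-in for a sequence of states (this is not an Atari environment) and compares the correlation between neighbouring samples when they are consumed in time order versus in shuffled order. The helper `pearson` and all constants are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)

# Each state depends on the previous one, as in consecutive transitions.
states = [0.0]
for _ in range(4999):
    states.append(states[-1] + rng.gauss(0.0, 1.0))

# Correlation between each sample and the one consumed right after it.
corr_sequential = pearson(states[:-1], states[1:])

shuffled = states[:]
rng.shuffle(shuffled)
corr_shuffled = pearson(shuffled[:-1], shuffled[1:])
```

In time order, neighbouring samples are almost perfectly correlated, while after shuffling the correlation is close to zero.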
Non-stationary target problem
Consider updating the q-network at every time-step. As explained above, the q-network algorithm uses the same network to approximate both the q-target and the q-value. We update the q-network in order to approximate the optimal q-value, but because the same network is used for both, the approximated q-target also changes every time the network is updated. In other words, the q-target that the update aims at is not fixed; its value changes with every update. This is called the non-stationary target problem and is often likened to shooting an arrow at a moving target.
These three problems were seen not only as limitations of the q-network algorithm but as limitations of neural networks in reinforcement learning in general. DQN solved them with relatively simple ideas, which the next section covers.
Deep q-network
In keeping with the purpose of RL-background, we skip the detailed preprocessing steps and focus on how the three drawbacks were addressed.
Experience replay
Of the three drawbacks above, DQN solves 1) insufficient training data and low data efficiency and 2) high correlation between samples by introducing a replay buffer. This is called experience replay: the transition tuple <state, action, reward, next state> that the agent obtains at every time-step is stored in a buffer and drawn from later. The replay buffer is filled in first in first out order, and at a user-defined training interval, transitions stored in the buffer are randomly sampled to train the network. Figure 05 shows this schematically.

In Figure 05, the large arrows denote the path of interaction with the environment, and the small arrows denote the path of transitions used to train the CNN.
Storing the transitions from interacting with the environment in the replay buffer makes up for the lack of training data. In addition, training the CNN on transitions randomly sampled from the buffer resolves the problem of high correlation between samples.
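A replay buffer with first in first out eviction and uniform random sampling can be sketched as below. The class name `ReplayBuffer`, its method names, and the extra `done` flag stored with each transition are assumptions for illustration; the text only specifies the <state, action, reward, next state> tuple.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks the time ordering of transitions
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage with hypothetical integer states: capacity 3, five transitions pushed
buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.push(t, 0, 0.0, t + 1, False)

stored_states = [transition[0] for transition in buf.buffer]
batch = buf.sample(2)
```

After five pushes into a buffer of capacity 3, only the three most recent transitions remain, and each sampled batch is drawn from those regardless of the order in which they were collected.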
Fixed q-target
The non-stationary target problem arises because the q-target and the q-value are estimated by the same network. The authors solved it by separating the two and keeping two networks, one for each estimate. The network that estimates the q-target is called the target network ($\hat{\theta}$); at a fixed interval it copies the weights of the network that approximates the q-value ($\theta$). For example, if the q-value network is updated on every batch and its weights are copied to the target network every 32 batches, the result looks like Figure 06.

With two networks approximating the q-target and the q-value separately, the non-stationary target problem can be resolved.
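The periodic weight copy can be sketched as follows, using the 32-batch period from the example in the text. The dictionaries `online` and `target` are hypothetical stand-ins for the two networks' weights, and the "update" is a placeholder increment rather than a real gradient step.

```python
import copy

SYNC_EVERY = 32   # copy period from the example in the text

online = {"w": [0.0, 0.0]}        # stand-in for the q-value network weights
target = copy.deepcopy(online)    # target network starts as an exact copy

sync_steps = []
for step in range(1, 101):
    # Placeholder for one gradient update of the q-value network
    online["w"] = [w + 0.01 for w in online["w"]]

    # Between syncs the target weights stay frozen, so the q-target is fixed
    if step % SYNC_EVERY == 0:
        target = copy.deepcopy(online)
        sync_steps.append(step)
```

After 100 batches the copy has happened at steps 32, 64, and 96, so the target weights reflect the q-value network as it was at step 96 while the q-value network itself has moved on.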
In summary, DQN overcomes the drawbacks of the original q-network algorithm by introducing a replay buffer (experience replay) and by using a q-target network that is independent of the network estimating the q-value. Doesn't it feel like a complex problem was solved with relatively simple methods? The pseudo code of DQN is as follows.

Unlike the Q-network pseudo code, the network that estimates the target differs from the network that approximates the q-value, and the weights are copied to the target network at a fixed interval.
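The pieces above (replay buffer, separate target weights, periodic copy, $\epsilon$-greedy action selection) can be combined into one small training loop. This is a sketch under heavy assumptions, not the paper's implementation: the environment is a hypothetical 5-state chain, the "network" is a table with one weight per state-action pair standing in for the CNN, and every constant is chosen only for illustration.

```python
import copy
import random
from collections import deque

N_STATES, N_ACTIONS = 5, 2          # toy chain: action 1 moves right, 0 moves left
GAMMA, LR, EPSILON = 0.9, 0.1, 0.2
BATCH_SIZE, SYNC_EVERY, CAPACITY = 16, 32, 500

def step(state, action):
    """Hypothetical environment: reward 1 for reaching the right end (terminal)."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def greedy(q_row, rng):
    """Max-q action, breaking ties at random so early exploration is unbiased."""
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])

rng = random.Random(0)
online = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # q-value weights
target = copy.deepcopy(online)                          # target weights
buffer = deque(maxlen=CAPACITY)                         # FIFO replay buffer
updates = 0

for episode in range(300):
    state = 0
    for _ in range(50):                                 # cap on episode length
        if rng.random() < EPSILON:
            action = rng.randrange(N_ACTIONS)
        else:
            action = greedy(online[state], rng)
        nxt, reward, done = step(state, action)
        buffer.append((state, action, reward, nxt, done))
        state = nxt

        if len(buffer) >= BATCH_SIZE:
            for s, a, r, s2, d in rng.sample(list(buffer), BATCH_SIZE):
                # q-target comes from the frozen target weights
                y = r if d else r + GAMMA * max(target[s2])
                # gradient step on the squared error for a one-weight-per-pair model
                online[s][a] += LR * (y - online[s][a])
            updates += 1
            if updates % SYNC_EVERY == 0:
                target = copy.deepcopy(online)
        if done:
            break

greedy_policy = [max(range(N_ACTIONS), key=lambda a: online[s][a])
                 for s in range(N_STATES - 1)]
```

In this toy chain the learned greedy policy should move right in every non-terminal state, since that is the only way to reach the reward.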
Results
DQN exceeded or came close to human performance on most games in the Atari environment. Figure 08 is the performance chart reported in the paper published in Nature in 2015.

As stated earlier, DQN achieved its high performance through experience replay and the fixed q-target. How much does each of the two contribute? Figure 09 shows the ablation study.

The rightmost column corresponds to the original q-network algorithm. On Breakout, the fixed q-target alone improves performance about 3 times over the q-network algorithm, and experience replay alone about 80 times. Finally, DQN, which combines both, improves performance about 100 times.
As mentioned, DQN is an algorithm that solved a complex problem with relatively simple methods. It also opened the door to deep reinforcement learning, a contribution significant enough that it was published in Nature. In the eight or so years since DQN first appeared, many algorithms improving on it have emerged.
Having read this far, which aspects of DQN do you think could be improved further? It is worth pausing for a moment before moving on to the next chapter.
Reference
[1] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
패키지 없이 R로 구현하는 심층 강화학습 (Deep Reinforcement Learning Implemented in R Without Packages) [yes24] [교보문고]