
State–action–reward–state–action

State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was proposed by Rummery and Niranjan in a technical note[1] with the name "Modified Connectionist Q-Learning" (MCQ-L). The alternative name SARSA, proposed by Rich Sutton, was only mentioned as a footnote.

This name reflects the fact that the main function for updating the Q-value depends on the current state of the agent "S1", the action the agent chooses "A1", the reward "R2" the agent gets for choosing this action, the state "S2" that the agent enters after taking that action, and finally the next action "A2" the agent chooses in its new state. The acronym for the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) is SARSA.[2] Some authors use a slightly different convention and write the quintuple (S_t, A_t, R_t, S_{t+1}, A_{t+1}), depending on which time step the reward is formally assigned. The rest of the article uses the former convention.

Algorithm

Q_new(S_t, A_t) ← (1 − α) Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1})]

A SARSA agent interacts with the environment and updates the policy based on the actions it actually takes, hence this is known as an on-policy learning algorithm. The Q value for a state–action pair is updated by an error term, adjusted by the learning rate α. Q values represent the possible reward received in the next time step for taking action a in state s, plus the discounted future reward received from the next state–action observation.
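As a rough illustration of the update above, the following is a minimal tabular SARSA sketch in Python. The environment interface (reset(), step(), an actions list) and the ε-greedy action selection are assumptions made for the example, not part of the original description.

    import random
    from collections import defaultdict

    def sarsa(env, episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Minimal tabular SARSA sketch. `env` is assumed to expose
        # reset() -> state, step(action) -> (next_state, reward, done), and env.actions.
        Q = defaultdict(float)  # Q[(state, action)], initialized to 0

        def epsilon_greedy(state):
            if random.random() < epsilon:
                return random.choice(env.actions)
            return max(env.actions, key=lambda a: Q[(state, a)])

        for _ in range(episodes):
            s = env.reset()
            a = epsilon_greedy(s)
            done = False
            while not done:
                s_next, r, done = env.step(a)
                a_next = epsilon_greedy(s_next)
                # On-policy target: uses the action the agent will actually take next.
                target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s_next, a_next
        return Q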

Watkins's Q-learning updates an estimate of the optimal state–action value function Q*(s, a) based on the maximum estimated value of the actions available in the next state. While SARSA learns the Q values associated with the policy it follows itself, Watkins's Q-learning learns the Q values associated with the optimal policy while following an exploration/exploitation policy.
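To make the contrast concrete, the two methods differ only in the bootstrap term of the update target. A minimal sketch, reusing the hypothetical Q table, env.actions, and variable names from the example above:

    # SARSA (on-policy): bootstrap from the action the behaviour policy actually takes next.
    sarsa_target = r + gamma * Q[(s_next, a_next)]

    # Watkins's Q-learning (off-policy): bootstrap from the greedy action in the next state,
    # regardless of which action the behaviour policy will actually take.
    q_learning_target = r + gamma * max(Q[(s_next, a)] for a in env.actions)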

Some optimizations of Watkins's Q-learning may be applied to SARSA.[3]

Hyperparameters

Learning rate (alpha)

The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information.
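A quick numerical illustration of the two extremes (the current estimate 2.0 and the update target 5.0 are made-up numbers):

    Q_old, target = 2.0, 5.0                  # hypothetical current estimate and update target
    for alpha in (0.0, 0.5, 1.0):
        Q_new = (1 - alpha) * Q_old + alpha * target
        print(alpha, Q_new)                   # 0.0 -> 2.0 (no learning), 0.5 -> 3.5, 1.0 -> 5.0 (only new info)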

Discount factor (gamma)

The discount factor determines the importance of future rewards. A discount factor of 0 makes the agent "opportunistic", or "myopic",[4] in that it considers only current rewards, while a factor approaching 1 makes it strive for a high long-term reward. If the discount factor meets or exceeds 1, the Q values may diverge.
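For intuition, the quantity being discounted is the return G_t, written in the same notation as the update rule above:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k≥0} γ^k R_{t+k+1}

Assuming, purely for illustration, a constant reward of 1 per step, this geometric series sums to 1/(1 − γ) when γ < 1, but grows without bound when γ ≥ 1, which is why the Q values can diverge in that regime.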

Initial conditions (Q(S_0, A_0))

Since SARSA is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. A high (infinite) initial value, also known as "optimistic initial conditions",[5] can encourage exploration: no matter which action is taken, the update rule causes its value to drop below those of the untried alternatives, thus increasing their choice probability. In 2013 it was suggested that the first reward r could be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value of Q. This allows immediate learning in case of fixed deterministic rewards. This resetting-of-initial-conditions (RIC) approach appears to be consistent with human behavior in repeated binary choice experiments.[6]
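As a rough sketch of the two initialization strategies just described (the optimistic value 10.0, the Q-table layout, and the function name are illustrative assumptions, not taken from the cited work):

    from collections import defaultdict

    # Optimistic initial conditions: start every Q value well above any achievable reward,
    # so each tried action gets pulled down below the untried ones, encouraging exploration.
    Q = defaultdict(lambda: 10.0)

    # Resetting of initial conditions (RIC): the first observed reward for a state-action
    # pair overwrites the initial value, giving immediate learning for fixed deterministic rewards.
    visited = set()

    def ric_sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
        if (s, a) not in visited:
            Q[(s, a)] = r        # first visit: reset the initial condition to the first reward
            visited.add((s, a))
        else:                    # subsequent visits: ordinary SARSA update
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])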

See also

Prefrontal cortex basal ganglia working memory
Sammon mapping
Constructing skill trees
Q-learning
Temporal difference learning
Reinforcement learning

References

  1. ^ Rummery, G. A.; Niranjan, M. (1994). "On-line Q-Learning Using Connectionist Systems". Technical report, Cambridge University Engineering Department.
  2. ^ Sutton, Richard S.; Barto, Andrew G. Reinforcement Learning: An Introduction (chapter 6.4).
  3. ^ Wiering, Marco; Schmidhuber, Jürgen (1998-10-01). "Fast Online Q(λ)" (PDF). Machine Learning. 33 (1): 105–115. doi:10.1023/A:1007562800292. ISSN 0885-6125. S2CID 8358530.
  4. ^ "Arguments against myopic training". Retrieved 17 May 2023.
  5. ^ "2.7 Optimistic Initial Values". incompleteideas.net. Retrieved 2018-02-28.
  6. ^ Shteingart, H; Neiman, T; Loewenstein, Y (May 2013). "The Role of First Impression in Operant Learning" (PDF). J Exp Psychol Gen. 142 (2): 476–88. doi:10.1037/a0029550. PMID 22924882.
