The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).

Date of Patent:
Jan. 09, 2024

Filed:
Jun. 28, 2018

Applicant:
Deepmind Technologies Limited, London, GB;

Inventors:
Olivier Claude Pietquin, Lille, FR;
Martin Riedmiller, Balgheim, DE;
Wang Fumin, London, GB;
Bilal Piot, London, GB;
Mel Vecerik, London, GB;
Todd Andrew Hester, Seattle, WA (US);
Thomas Rothoerl, London, GB;
Thomas Lampe, London, GB;
Nicolas Manfred Otto Heess, London, GB;
Jonathan Karl Scholz, London, GB;

Assignee:
Attorney:
Primary Examiner:
Int. Cl.
CPC ...
G06N 3/02 (2006.01); G06N 3/08 (2023.01); G06N 3/045 (2023.01); G06N 3/047 (2023.01);
U.S. Cl.
CPC ...
G06N 3/08 (2013.01); G06N 3/045 (2023.01); G06N 3/047 (2023.01);
Abstract

An off-policy reinforcement learning actor-critic neural network system configured to select actions from a continuous action space to be performed by an agent interacting with an environment to perform a task. An observation defines environment state data and reward data. The system has an actor neural network which learns a policy function mapping the state data to action data. A critic neural network learns an action-value (Q) function. A replay buffer stores tuples of the state data, the action data, the reward data and new state data. The replay buffer also includes demonstration transition data comprising a set of the tuples from a demonstration of the task within the environment. The neural network system is configured to train the actor neural network and the critic neural network off-policy using stored tuples from the replay buffer comprising tuples both from operation of the system and from the demonstration transition data.
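
The replay buffer described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the names `Transition` and `MixedReplayBuffer`, the choice never to evict demonstration tuples, the oldest-first eviction of agent tuples, and the uniform sampling are all assumptions made for illustration. The abstract states only that the buffer holds (state, action, reward, new state) tuples from both the demonstration and the system's own operation, and that training draws on both.

```python
import random
from collections import deque
from typing import Iterable, List, NamedTuple, Optional, Sequence


class Transition(NamedTuple):
    """One (state, action, reward, new state) tuple, as listed in the abstract."""
    state: Sequence[float]
    action: Sequence[float]
    reward: float
    next_state: Sequence[float]


class MixedReplayBuffer:
    """Hypothetical buffer holding demonstration tuples and agent tuples together."""

    def __init__(
        self,
        demonstrations: Iterable[Transition],
        agent_capacity: int = 10_000,
        seed: Optional[int] = None,
    ) -> None:
        # Assumption: demonstration tuples are kept for the whole run.
        self._demos: List[Transition] = list(demonstrations)
        # Assumption: agent tuples are evicted oldest-first at capacity.
        self._agent: deque = deque(maxlen=agent_capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        """Store a tuple produced by the agent's own interaction."""
        self._agent.append(transition)

    def __len__(self) -> int:
        return len(self._demos) + len(self._agent)

    def sample(self, batch_size: int) -> List[Transition]:
        """Draw a batch without replacement from both sources combined.

        Assumption: sampling is uniform over all stored tuples.
        """
        pool = self._demos + list(self._agent)
        return self._rng.sample(pool, min(batch_size, len(pool)))
```

An off-policy learner would call `sample` at each update step and use the returned batch to update both the critic and the actor, so that demonstration tuples and the agent's own tuples contribute to the same training batches.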

