The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).

Date of Patent:
May 30, 2023

Filed:

Sep. 27, 2019
Applicant:

Deepmind Technologies Limited, London, GB;

Inventors:

Scott Ellison Reed, New York, NY (US);

Yusuf Aytar, London, GB;

Ziyu Wang, St. Albans, GB;

Tom Paine, London, GB;

Sergio Gomez Colmenarejo, London, GB;

David Budden, London, GB;

Tobias Pfaff, London, GB;

Aaron Gerard Antonius van den Oord, London, GB;

Oriol Vinyals, London, GB;

Alexander Novikov, London, GB;

Assignee:
Attorney:
Primary Examiner:
Int. Cl.
CPC ...
G06N 3/006 (2023.01); G06F 17/16 (2006.01); G06N 3/08 (2023.01); G06F 18/22 (2023.01); G06N 3/045 (2023.01); G06N 3/048 (2023.01); G06V 10/764 (2022.01); G06V 10/77 (2022.01); G06V 10/82 (2022.01);
U.S. Cl.
CPC ...
G06N 3/006 (2013.01); G06F 17/16 (2013.01); G06F 18/22 (2023.01); G06N 3/045 (2023.01); G06N 3/048 (2023.01); G06N 3/08 (2013.01); G06V 10/764 (2022.01); G06V 10/7715 (2022.01); G06V 10/82 (2022.01);
Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training an action selection policy neural network, wherein the action selection policy neural network is configured to process an observation characterizing a state of an environment to generate an action selection policy output, wherein the action selection policy output is used to select an action to be performed by an agent interacting with an environment. In one aspect, a method comprises: obtaining an observation characterizing a state of the environment subsequent to the agent performing a selected action; generating a latent representation of the observation; processing the latent representation of the observation using a discriminator neural network to generate an imitation score; determining a reward from the imitation score; and adjusting the current values of the action selection policy neural network parameters based on the reward using a reinforcement learning training technique.
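The training loop described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the abstract does not specify the encoder, the discriminator architecture, the reward formula, or the reinforcement learning technique. The sketch assumes a linear encoder, a logistic discriminator, the GAIL-style reward `-log(1 - score)`, and a REINFORCE update on a linear softmax policy. All function names and the `env_step` callback are hypothetical, and training of the discriminator itself is omitted.

```python
import math
import random


def encode(observation, encoder_weights):
    """Hypothetical linear encoder: observation -> latent representation."""
    return [sum(w * x for w, x in zip(row, observation)) for row in encoder_weights]


def imitation_score(latent, disc_weights, disc_bias=0.0):
    """Hypothetical logistic discriminator: returns a score in (0, 1)."""
    logit = sum(w * z for w, z in zip(disc_weights, latent)) + disc_bias
    return 1.0 / (1.0 + math.exp(-logit))


def reward_from_score(score, eps=1e-8):
    """Assumed reward shape (not from the abstract): -log(1 - score)."""
    return -math.log(max(1.0 - score, eps))


def policy_probs(observation, policy_weights):
    """Linear softmax policy: one weight row per discrete action."""
    logits = [sum(w * x for w, x in zip(row, observation)) for row in policy_weights]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(policy_weights, observation, action, reward, lr=0.1):
    """One REINFORCE step: weights += lr * reward * grad log pi(action | obs)."""
    probs = policy_probs(observation, policy_weights)
    for k, row in enumerate(policy_weights):
        coeff = (1.0 if k == action else 0.0) - probs[k]
        for j, x in enumerate(observation):
            row[j] += lr * reward * coeff * x
    return policy_weights


def training_step(policy_weights, encoder_weights, disc_weights,
                  observation, env_step, rng):
    """Select an action, observe the result, score it, and update the policy."""
    probs = policy_probs(observation, policy_weights)
    action = rng.choices(range(len(probs)), weights=probs)[0]
    # Observation of the environment state after the agent acts.
    next_observation = env_step(observation, action)
    latent = encode(next_observation, encoder_weights)
    score = imitation_score(latent, disc_weights)
    reward = reward_from_score(score)
    reinforce_update(policy_weights, observation, action, reward)
    return action, reward, next_observation


if __name__ == "__main__":
    rng = random.Random(0)
    policy = [[0.0, 0.0], [0.0, 0.0]]      # two actions, two observation features
    encoder = [[1.0, 0.0], [0.0, 1.0]]     # identity encoder for illustration
    discriminator = [0.5, -0.5]
    toy_env = lambda obs, a: [obs[0] + a, obs[1]]  # hypothetical toy dynamics
    print(training_step(policy, encoder, discriminator, [1.0, 0.5], toy_env, rng))
```

In this sketch the reward is always positive, so each update raises the probability of the action that was just taken. A practical system would typically add a baseline or use a different reinforcement learning algorithm.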