The patent badge is an abbreviated version of the USPTO patent document. It lists the patent number, issue date, filing date, title, applicant, inventors, assignee, attorney firm, primary examiner, assistant examiner, CPC classifications, and abstract. The badge also links to the full patent document (PDF).

Patent No.:

US 12169793 B1

Date of Patent:

Dec. 17, 2024

Filed:

Nov. 16, 2020

Approximate value iteration with complex returns by bounding

Applicant:

The Research Foundation for The State University of New York, Binghamton, NY (US);

Inventors:

Robert Wright, Sherrill, NY (US);

Lei Yu, Vestal, NY (US);

Steven Loscalzo, Vienna, VA (US);

Assignee:

The Research Foundation for The State University of New York, Binghamton, NY (US);

Attorneys:

Hoffberg & Associates

Steven M. Hoffberg

Primary Examiner:

Kamran Afshar

Assistant Examiner:

Brian J Hales

Int. Cl.

CPC ...

G06N 7/01 (2023.01); G05B 13/02 (2006.01); G05B 15/02 (2006.01); G06N 20/00 (2019.01);

U.S. Cl.

CPC ...

G06N 7/01 (2023.01); G05B 13/0265 (2013.01); G05B 15/02 (2013.01); G06N 20/00 (2019.01); Y02B 10/30 (2013.01);

Abstract

A system and method for controlling a system, comprising estimating an optimal control policy for the system; receiving data representing sequential states and associated trajectories of the system, comprising off-policy states and associated off-policy trajectories; improving the estimate of the optimal control policy by performing at least one approximate value iteration, comprising: estimating a value of operation of the system dependent on the estimated optimal control policy; using a complex return of the received data, biased by the off-policy states, to determine a bound dependent on at least the off-policy trajectories, and using the bound to improve the estimate of the value of operation of the system according to the estimated optimal control policy; and updating the estimate of the optimal control policy, dependent on the improved estimate of the value of operation of the system. The control system may produce an output signal to control the system directly, or output the optimized control policy. The system preferably is a reinforcement learning system which continually improves.
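
The abstract's central step, using a complex return to bound a value estimate, can be illustrated with a small tabular sketch. This is only one plausible reading and not the patented method: it assumes the "complex return" is a set of n-step returns computed along a recorded off-policy trajectory, that these returns act as lower bounds on the optimal value, and that the bound is applied by taking the larger of the one-step target and the best n-step return. The function names, the `(state, action, reward, next_state)` tuple layout, and the table `q[state][action]` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical layout: (state, action, reward, next_state)
Transition = Tuple[int, int, float, int]


def n_step_return(traj: List[Transition], t: int, n: int,
                  q: List[List[float]], gamma: float) -> float:
    """Sum n observed discounted rewards from step t, then bootstrap from q."""
    g = 0.0
    for i in range(n):
        g += (gamma ** i) * traj[t + i][2]
    tail_state = traj[t + n - 1][3]  # state reached after the n-th step
    return g + (gamma ** n) * max(q[tail_state])


def bounded_targets(traj: List[Transition], q: List[List[float]],
                    gamma: float, max_n: int) -> List[float]:
    """For each step, raise the one-step target to the best n-step return.

    Assumption: off-policy n-step returns underestimate the optimal value,
    so the largest of them serves as a lower bound on the target.
    """
    targets = []
    for t in range(len(traj)):
        horizon = min(max_n, len(traj) - t)  # do not run past the trajectory
        returns = [n_step_return(traj, t, n, q, gamma)
                   for n in range(1, horizon + 1)]
        one_step = returns[0]
        lower_bound = max(returns)
        targets.append(max(one_step, lower_bound))
    return targets


def value_iteration_sweep(traj: List[Transition], q: List[List[float]],
                          gamma: float = 0.9, max_n: int = 3) -> List[List[float]]:
    """One sweep: write each bounded target into a copy of the table.

    A tabular overwrite stands in for the function-approximation fit;
    if a (state, action) pair repeats, the later target wins.
    """
    targets = bounded_targets(traj, q, gamma, max_n)
    new_q = [row[:] for row in q]
    for (s, a, _, _), y in zip(traj, targets):
        new_q[s][a] = y
    return new_q


if __name__ == "__main__":
    # Two-step trajectory ending in state 2, whose values stay at zero.
    trajectory = [(0, 0, 0.0, 1), (1, 0, 1.0, 2)]
    q_table = [[0.0, 0.0] for _ in range(3)]
    print(value_iteration_sweep(trajectory, q_table, gamma=0.5, max_n=2))
```

In this toy run the one-step target for the first transition is 0 because the table starts at zero, while the two-step return already sees the later reward, so the bound lifts the target in a single sweep. A greedy policy read from the updated table would correspond to the abstract's final policy-update step.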
