The patent badge is an abbreviated version of the USPTO patent document. It covers the following fields: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in PDF format).

Patent No.:

US 12288380 B1

Date of Patent:

Apr. 29, 2025

Filed:

May 16, 2022

Systems and methods for unified vision-language understanding and generation

Applicant:

Salesforce, Inc., San Francisco, CA (US);

Inventors:

Junnan Li, Singapore, SG;

Chu Hong Hoi, Singapore, SG;

Assignee:

Salesforce, Inc., San Francisco, CA (US);

Attorney:

Haynes and Boone, LLP

Primary Examiner:

Michael Robert Cammarata

Int. Cl.

CPC ...

G06V 10/774 (2021.12); G06F 40/126 (2019.12); G06F 40/284 (2019.12); G06T 9/00 (2005.12); G06V 10/764 (2021.12); G06V 10/80 (2021.12);

U.S. Cl.

CPC ...

G06V 10/774 (2021.12); G06F 40/126 (2019.12); G06F 40/284 (2019.12); G06T 9/00 (2012.12); G06V 10/764 (2021.12); G06V 10/803 (2021.12);

Abstract

Embodiments described herein provide systems, methods, and devices for generating enhanced vision-language training data. A method may include: receiving, from a communication interface, a first training dataset of image-text pairs and a second training dataset of annotated image-text pairs; fine-tuning an image-grounded text decoder and an image-grounded text encoder using the second training dataset of annotated image-text pairs; generating, by the fine-tuned image-grounded text decoder, a predicted text based on a training image from the first training dataset; generating, by the fine-tuned image-grounded text encoder, a filtering decision based on the training image and the predicted text; adding the training image and the predicted text to form a third training dataset of image-text pairs depending on the filtering decision; and training a vision-language model using the third training dataset of image-text pairs.
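
The caption-then-filter step in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `build_third_dataset`, the `captioner` and `keep` callables, and the toy stand-ins below are all hypothetical, and the fine-tuning and final training steps are assumed to happen outside this snippet. Images are represented as plain strings purely for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Image = str              # hypothetical stand-in for real image data (e.g. an ID)
Pair = Tuple[Image, str]  # (image, text)


def build_third_dataset(
    first_dataset: Iterable[Pair],
    captioner: Callable[[Image], str],
    keep: Callable[[Image, str], bool],
) -> List[Pair]:
    """Form a new dataset from images in the first dataset.

    captioner: plays the role of the fine-tuned image-grounded text decoder.
    keep:      plays the role of the fine-tuned image-grounded text encoder,
               returning the filtering decision for an (image, text) pair.
    """
    third: List[Pair] = []
    for image, _original_text in first_dataset:
        predicted = captioner(image)      # generate a predicted text
        if keep(image, predicted):        # filtering decision
            third.append((image, predicted))
    return third


# Toy stand-ins (assumptions, not real models).
def toy_captioner(image: Image) -> str:
    return f"a photo of {image}"


def toy_filter(image: Image, text: str) -> bool:
    # Accept when the text mentions the image label, except for "noise".
    return image in text and image != "noise"


first = [("dog", "buy now!!!"), ("noise", "img_0042.jpg"), ("cat", "my pet")]
third = build_third_dataset(first, toy_captioner, toy_filter)
```

In this sketch the third dataset would then be passed to whatever routine trains the vision-language model; that routine is left out because the abstract does not specify it.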