The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also links to the full patent document (in Adobe Acrobat format, i.e. PDF).

Patent No.:

US 12375766 B1

Date of Patent:

Jul. 29, 2025

Filed:

Sep. 30, 2022

Video synthesis via multimodal conditioning

Applicants:

Francesco Barbieri, Marina del Rey, CA (US);

Ligong Han, Edison, NJ (US);

Hsin-Ying Lee, San Jose, CA (US);

Shervin Minaee, Bellevue, WA (US);

Kyle Olszewski, Los Angeles, CA (US);

Jian Ren, Hermosa Beach, CA (US);

Sergey Tulyakov, Santa Monica, CA (US);

Inventors:

Francesco Barbieri, Marina del Rey, CA (US);

Ligong Han, Edison, NJ (US);

Hsin-Ying Lee, San Jose, CA (US);

Shervin Minaee, Bellevue, WA (US);

Kyle Olszewski, Los Angeles, CA (US);

Jian Ren, Hermosa Beach, CA (US);

Sergey Tulyakov, Santa Monica, CA (US);

Assignee:

Snap Inc., Santa Monica, CA (US);

Attorneys:

CM Law

Stephen J. Weed

Primary Examiner:

Mishawn N Hunter

Int. Cl.

CPC ...

H04N 21/472 (2011.01); G06T 11/00 (2006.01);

U.S. Cl.

CPC ...

H04N 21/47205 (2013.01); G06T 11/00 (2013.01);

Abstract

A multimodal video generation framework (MMVID) that benefits from text and images provided jointly or separately as input. Quantized representations of videos are utilized with a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. A new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens are used to improve video quality and consistency. Text augmentation is utilized to improve the robustness of the textual representation and the diversity of generated videos. The framework incorporates various visual modalities, such as segmentation masks, drawings, and partially occluded images. In addition, the MMVID extracts visual information as suggested by a textual prompt.
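
The abstract mentions sampling video tokens with a mask-prediction algorithm but does not describe it. As a rough illustration only, the sketch below shows generic iterative mask-predict decoding over discrete tokens: every position starts masked, a predictor fills in the masked positions, and the least confident positions are re-masked on a linear schedule. This is not the patent's improved algorithm. The names `MASK`, `toy_predictor`, and `mask_predict`, the linear schedule, and the random stand-in for the bidirectional transformer are all assumptions made for illustration.

```python
import random

MASK = -1  # hypothetical sentinel marking a masked token position


def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for a bidirectional transformer.

    Returns one (token_id, confidence) pair per position. A real model
    would condition on the unmasked tokens and on text/image inputs;
    this stub ignores them and samples at random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict(num_tokens, vocab_size, steps, seed=0):
    """Generic mask-predict decoding (assumed linear re-masking schedule)."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    confidences = [0.0] * num_tokens

    for step in range(steps):
        predictions = toy_predictor(tokens, vocab_size, rng)

        # Fill in only the positions that are currently masked.
        for i, (token_id, confidence) in enumerate(predictions):
            if tokens[i] == MASK:
                tokens[i] = token_id
                confidences[i] = confidence

        # Linear schedule: fewer positions are re-masked at each step,
        # and none are re-masked after the final step.
        num_to_mask = num_tokens * (steps - step - 1) // steps
        if num_to_mask == 0:
            break

        # Re-mask the positions the predictor was least confident about.
        by_confidence = sorted(range(num_tokens), key=lambda i: confidences[i])
        for i in by_confidence[:num_to_mask]:
            tokens[i] = MASK

    return tokens


if __name__ == "__main__":
    print(mask_predict(num_tokens=16, vocab_size=1024, steps=4))
```

In a real system the predictor would be the trained transformer and the resulting token grid would be decoded back into video frames; both are outside the scope of this sketch.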
