The patent badge is an abbreviated version of the USPTO patent document. It covers the following fields: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in PDF format).

Date of Patent:
Feb. 21, 2023

Filed:
Apr. 02, 2021

Applicant:

Baidu USA LLC, Sunnyvale, CA (US);

Inventors:

Sibo Zhang, San Jose, CA (US);

Jiahong Yuan, Cherry Hill, NJ (US);

Miao Liao, San Jose, CA (US);

Liangjun Zhang, Cupertino, CA (US);

Assignee:

Baidu USA LLC, Sunnyvale, CA (US);

Attorney:
Primary Examiner:
Int. Cl.
CPC ...
G10L 13/08 (2013.01); G10L 13/02 (2013.01); G10L 15/187 (2013.01); G06F 16/783 (2019.01); G06F 16/78 (2019.01); G06F 40/242 (2020.01); G10L 13/027 (2013.01); G06N 3/04 (2023.01); G06N 3/08 (2023.01);
U.S. Cl.
CPC ...
G10L 13/08 (2013.01); G06F 16/7834 (2019.01); G06F 16/7867 (2019.01); G06F 40/242 (2020.01); G06N 3/04 (2013.01); G06N 3/08 (2013.01); G10L 13/027 (2013.01); G10L 15/187 (2013.01);
Abstract

Presented herein are novel approaches to synthesize video of speech from text. In a training phase, embodiments build a phoneme-pose dictionary and train a generative neural network model using a generative adversarial network (GAN) to generate video from interpolated phoneme poses. In deployment, the trained generative neural network, in conjunction with the phoneme-pose dictionary, converts an input text into a video of a person speaking the words of the input text. Compared to audio-driven video generation approaches, the embodiments herein have a number of advantages: 1) they only need a fraction of the training data used by an audio-driven approach; 2) they are more flexible and not subject to vulnerability due to speaker variation; and 3) they significantly reduce the preprocessing, training, and inference times.
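
The deployment path described in the abstract (text, then phonemes, then key poses looked up in a phoneme-pose dictionary, then interpolated poses) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the toy word-to-phoneme table, the two-number pose vectors, the function names, and the choice of linear interpolation. The GAN that would render video frames from the interpolated poses is omitted.

```python
from typing import Dict, List

# Hypothetical pose representation: a short vector of mouth-shape values.
Pose = List[float]

# Hypothetical toy pronunciation table; a real system would use a full lexicon.
WORD_TO_PHONEMES: Dict[str, List[str]] = {
    "hi": ["HH", "AY"],
    "me": ["M", "IY"],
}

# Hypothetical phoneme-pose dictionary: one key pose per phoneme.
PHONEME_POSES: Dict[str, Pose] = {
    "HH": [0.0, 0.2],
    "AY": [1.0, 0.8],
    "M": [0.0, 0.0],
    "IY": [0.5, 0.2],
}


def text_to_phonemes(text: str) -> List[str]:
    """Map each known word of the input text to its phoneme sequence."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(WORD_TO_PHONEMES[word])  # KeyError on unknown words
    return phonemes


def interpolate_poses(key_poses: List[Pose], steps: int) -> List[Pose]:
    """Linearly interpolate between consecutive key poses.

    Emits `steps` frames per transition (starting at the earlier key pose),
    followed by the final key pose.
    """
    if not key_poses:
        return []
    frames: List[Pose] = []
    for start, end in zip(key_poses, key_poses[1:]):
        for i in range(steps):
            t = i / steps
            frames.append([a + t * (b - a) for a, b in zip(start, end)])
    frames.append(list(key_poses[-1]))
    return frames


def text_to_pose_sequence(text: str, steps: int = 2) -> List[Pose]:
    """Text -> phonemes -> key poses -> interpolated pose sequence."""
    key_poses = [PHONEME_POSES[p] for p in text_to_phonemes(text)]
    return interpolate_poses(key_poses, steps)
```

In a full system of the kind the abstract describes, the resulting pose sequence would be passed to the trained generative network to produce video frames; this sketch stops at the pose sequence.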

