The patent badge is an abbreviated version of the USPTO patent document. The patent badge covers the following: Patent number, Date patent was issued, Date patent was filed, Title of the patent, Applicant, Inventor, Assignee, Attorney firm, Primary examiner, Assistant examiner, CPCs, and Abstract. The patent badge contains a link to the full patent document (in Adobe Acrobat format, i.e., PDF).

Date of Patent:
Dec. 02, 2025

Filed:
Jan. 24, 2024

Applicant:

Google LLC, Mountain View, CA (US);

Inventors:

Isaac Elias, Mountain View, CA (US);

Jonathan Shen, Mountain View, CA (US);

Yu Zhang, Mountain View, CA (US);

Ye Jia, Mountain View, CA (US);

Ron J. Weiss, New York, NY (US);

Yonghui Wu, Fremont, CA (US);

Byungha Chun, Tokyo, JP;

Assignee:

Google LLC, Mountain View, CA (US);

Attorneys:
Primary Examiner:
Int. Cl.
CPC ...
G10L 13/08 (2013.01); G06F 40/126 (2020.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/048 (2023.01); G06N 3/08 (2023.01); G06N 3/088 (2023.01); G10L 13/047 (2013.01);
U.S. Cl.
CPC ...
G10L 13/08 (2013.01); G06F 40/126 (2020.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06N 3/088 (2013.01); G10L 13/047 (2013.01); G06N 3/048 (2023.01);
Abstract

A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.
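The abstract names two training signals: a phoneme duration loss (predicted vs. reference phoneme durations) and a final spectrogram loss computed from one or more predicted mel-frequency spectrogram sequences against a reference. The following is a minimal, hypothetical sketch of how such losses might be computed and combined, written in plain Python so it runs without dependencies. It is not the patented implementation. The choice of L2 for durations, L1 for spectrograms, summing the per-sequence spectrogram losses, the `duration_weight` parameter, and all function names are illustrative assumptions. The sketch omits the encoders, the variational embedding, and the duration predictor, and assumes the predicted spectrograms already have the same number of frames as the reference.

```python
from typing import List, Sequence

# Hypothetical types: a mel frame is a list of per-bin values, and a
# spectrogram sequence is a list of frames.
MelFrame = Sequence[float]
MelSequence = Sequence[MelFrame]


def phoneme_duration_loss(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean squared error between predicted and reference phoneme durations.

    Assumption: an L2 loss; the abstract does not name the distance.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need one duration per phoneme")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)


def spectrogram_l1(predicted: MelSequence, reference: MelSequence) -> float:
    """Mean absolute error over every frame and mel bin (assumed L1 loss)."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("predicted and reference must have the same frame count")
    total, count = 0.0, 0
    for p_frame, r_frame in zip(predicted, reference):
        if len(p_frame) != len(r_frame):
            raise ValueError("mel bin counts differ")
        for p, r in zip(p_frame, r_frame):
            total += abs(p - r)
            count += 1
    if count == 0:
        raise ValueError("frames contain no mel bins")
    return total / count


def final_spectrogram_loss(predictions: List[MelSequence], reference: MelSequence) -> float:
    """Aggregates the loss over one or more predicted mel sequences.

    Assumption: the per-sequence losses are simply summed; the abstract only
    says the final loss is based on the predicted sequences and the reference.
    """
    if not predictions:
        raise ValueError("need at least one predicted sequence")
    return sum(spectrogram_l1(pred, reference) for pred in predictions)


def training_loss(
    predictions: List[MelSequence],
    reference_mel: MelSequence,
    predicted_durations: Sequence[float],
    reference_durations: Sequence[float],
    duration_weight: float = 1.0,  # hypothetical weighting knob
) -> float:
    """Hypothetical combined objective: spectrogram loss + weighted duration loss."""
    return final_spectrogram_loss(predictions, reference_mel) + duration_weight * phoneme_duration_loss(
        predicted_durations, reference_durations
    )


if __name__ == "__main__":
    ref_mel = [[0.0, 1.0], [2.0, 3.0]]  # toy reference: 2 frames x 2 mel bins
    preds = [
        [[0.5, 1.0], [2.0, 2.5]],  # first predicted sequence (toy values)
        [[0.0, 1.0], [2.0, 3.5]],  # second predicted sequence (toy values)
    ]
    print(final_spectrogram_loss(preds, ref_mel))  # 0.375
    print(training_loss(preds, ref_mel, [2.0, 3.0, 1.0], [2.0, 4.0, 1.0]))
```

In a real training loop these scalars would be computed with a differentiable framework so that gradients flow back into the model; the plain-Python version only illustrates the arithmetic of combining the two losses.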

