The patent badge is an abbreviated version of the USPTO patent document. It lists the patent number, issue date, filing date, title, applicant, inventors, assignee, attorney firm, primary examiner, assistant examiner, CPC classifications, and abstract. The patent badge also links to the full patent document (in PDF format).

Patent No.:

US 10896669 B1

Date of Patent:

Jan. 19, 2021

Filed:

May 8, 2018

Systems and methods for multi-speaker neural text-to-speech

Applicant:

Baidu USA LLC, Sunnyvale, CA (US);

Inventors:

Sercan O. Arik, San Francisco, CA (US);

Gregory Diamos, San Jose, CA (US);

Andrew Gibiansky, Mountain View, CA (US);

John Miller, Palo Alto, CA (US);

Kainan Peng, Sunnyvale, CA (US);

Wei Ping, Sunnyvale, CA (US);

Jonathan Raiman, Palo Alto, CA (US);

Yanqi Zhou, San Jose, CA (US);

Assignee:

Baidu USA LLC, Sunnyvale, CA (US);

Attorney:

North Weber & Baugh LLP

Primary Examiner:

Qi Han

Int. Cl.

CPC ...

G10L 13/02 (2013.01); G10L 13/08 (2013.01); G10L 15/04 (2013.01); G10L 15/06 (2013.01); G10L 25/30 (2013.01);

U.S. Cl.

CPC ...

G10L 13/08 (2013.01); G10L 15/04 (2013.01); G10L 15/063 (2013.01); G10L 25/30 (2013.01);

Abstract

Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.
