The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, issue date, filing date, title, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The badge also links to the full patent document (in Adobe Acrobat format, i.e. PDF).

Date of Patent:

Feb. 06, 2024

Filed:

Sep. 22, 2022

Applicant:

Google LLC, Mountain View, CA (US);

Inventors:

Inbar Mosseri, Raanana, IL;

Michael Rubinstein, Natick, MA (US);

Ariel Ephrat, Efrat, IL;

William Freeman, Acton, MA (US);

Oran Lang, Givatayim, IL;

Kevin William Wilson, Cambridge, MA (US);

Tali Dekel, Arlington, MA (US);

Avinatan Hassidim, Petah Tikva, IL;

Assignee:

Google LLC, Mountain View, CA (US);

Attorney:
Primary Examiner:
Int. Cl.
CPC ...
G10L 25/57 (2013.01); G10L 15/16 (2006.01); G10L 21/10 (2013.01); G10L 21/18 (2013.01); G06V 20/40 (2022.01); G06V 40/16 (2022.01); G10L 15/25 (2013.01); G06F 18/214 (2023.01); G10L 17/18 (2013.01);
U.S. Cl.
CPC ...
G10L 25/57 (2013.01); G06F 18/214 (2023.01); G06V 20/41 (2022.01); G06V 40/161 (2022.01); G10L 15/16 (2013.01); G10L 15/25 (2013.01); G10L 17/18 (2013.01); G10L 21/10 (2013.01); G10L 21/18 (2013.01);
Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
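
The last two steps of the abstract (a spectrogram mask per speaker, then an isolated speech spectrogram per speaker) can be illustrated with a small sketch. It assumes that each mask is applied to the mixture spectrogram by element-wise multiplication; the abstract does not state how the isolated spectrogram is derived from the mask, so this is one plausible reading and not the patented method. The function names, speaker labels, and toy values are hypothetical, and the masks are hand-written here, where the described method would determine them from the audio-visual embedding.

```python
from typing import Dict, List

# A spectrogram is modelled as rows of time frames, each holding one
# magnitude value per frequency bin (a simplification for illustration).
Spectrogram = List[List[float]]


def apply_mask(mixture: Spectrogram, mask: Spectrogram) -> Spectrogram:
    """Return the element-wise product of a mixture spectrogram and a mask.

    Assumption: masking is multiplicative. The abstract does not specify this.
    """
    same_rows = len(mixture) == len(mask)
    if not same_rows or any(len(m) != len(k) for m, k in zip(mixture, mask)):
        raise ValueError("mask and mixture must have the same shape")
    return [
        [value * weight for value, weight in zip(mix_row, mask_row)]
        for mix_row, mask_row in zip(mixture, mask)
    ]


def isolate_speakers(
    mixture: Spectrogram, masks: Dict[str, Spectrogram]
) -> Dict[str, Spectrogram]:
    """Apply each speaker's mask to the same mixture spectrogram."""
    return {name: apply_mask(mixture, mask) for name, mask in masks.items()}


# Toy mixture: 2 time frames x 2 frequency bins.
mixture = [[4.0, 2.0], [1.0, 3.0]]

# Hypothetical hand-written masks for two speakers.
masks = {
    "speaker_a": [[1.0, 0.0], [0.5, 0.0]],
    "speaker_b": [[0.0, 1.0], [0.5, 1.0]],
}

isolated = isolate_speakers(mixture, masks)
```

In this toy case the two masks sum to one in every time-frequency cell, so the isolated spectrograms add back up to the mixture. That property comes from the example values and is not a requirement stated in the abstract.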

