The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).

Date of Patent:

Nov. 24, 2020

Filed:

Jun. 27, 2018

Applicant:

Microsoft Technology Licensing, LLC, Redmond, WA (US);

Inventors:

Eyal Krupka, Redmond, WA (US);

Xiong Xiao, Bothell, WA (US);

Assignee:
Attorney:
Primary Examiner:
Int. Cl.
CPC ...
G01L 17/00 (2006.01); G06K 9/00 (2006.01); G10L 25/84 (2013.01); G10L 21/0232 (2013.01); G06T 7/70 (2017.01); H04R 1/40 (2006.01); G10L 17/18 (2013.01); H04N 5/247 (2006.01); G10L 17/00 (2013.01);
U.S. Cl.
CPC ...
G10L 17/005 (2013.01); G06K 9/00288 (2013.01); G06T 7/70 (2017.01); G10L 21/0232 (2013.01); G10L 25/84 (2013.01); G06T 2207/30201 (2013.01); G10L 17/18 (2013.01); H04N 5/247 (2013.01); H04R 1/406 (2013.01);
Abstract

Multi-modal speech localization is achieved using image data captured by one or more cameras, and audio data captured by a microphone array. Audio data captured by each microphone of the array is transformed to obtain a frequency domain representation that is discretized in a plurality of frequency intervals. Image data captured by each camera is used to determine a positioning of each human face. Input data is provided to a previously-trained, audio source localization classifier, including: the frequency domain representation of the audio data captured by each microphone, and the positioning of each human face captured by each camera in which the positioning of each human face represents a candidate audio source. An identified audio source is indicated by the classifier based on the input data that is estimated to be the human face from which the audio data originated.
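
The data flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names (`band_magnitudes`, `build_input`, `identify_source`), the use of azimuth angles in degrees as face positions, the averaging of magnitudes per frequency interval, and the dummy classifier are all assumptions made for illustration. The trained classifier itself is not reproduced; any callable that returns one score per candidate face stands in for it.

```python
import cmath
import math
from typing import Callable, Dict, List, Sequence


def band_magnitudes(frame: Sequence[float], n_bands: int) -> List[float]:
    """Naive DFT of one microphone frame, averaged into n_bands frequency intervals."""
    n = len(frame)
    n_bins = n // 2 + 1  # non-negative frequencies only
    if not 1 <= n_bands <= n_bins:
        raise ValueError("n_bands must be between 1 and len(frame) // 2 + 1")
    mags = [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n_bins)
    ]
    bands = []
    for b in range(n_bands):
        lo = b * n_bins // n_bands
        hi = (b + 1) * n_bins // n_bands
        bands.append(sum(mags[lo:hi]) / (hi - lo))  # mean magnitude in this interval
    return bands


def build_input(
    mic_frames: Sequence[Sequence[float]],
    face_positions: Sequence[float],
    n_bands: int = 4,
) -> Dict[str, list]:
    """Combine per-microphone spectra with candidate face positions (hypothetical layout)."""
    return {
        "audio": [band_magnitudes(f, n_bands) for f in mic_frames],
        "candidates": list(face_positions),
    }


def identify_source(
    inputs: Dict[str, list],
    classifier: Callable[[Dict[str, list]], Sequence[float]],
) -> int:
    """Return the index of the candidate face with the highest classifier score."""
    scores = classifier(inputs)
    if len(scores) != len(inputs["candidates"]):
        raise ValueError("classifier must return one score per candidate face")
    return max(range(len(scores)), key=lambda i: scores[i])


# Two microphones hearing the same 16-sample tone; two detected faces (azimuths assumed).
frame = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
inputs = build_input([frame, frame], face_positions=[-30.0, 45.0], n_bands=3)
best = identify_source(inputs, classifier=lambda _: [0.1, 0.9])  # dummy scores, not a trained model
```

In this sketch the classifier is passed in as a plain callable so that the feature preparation stays separate from whatever trained model produces the scores.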

