The patent badge is an abbreviated version of the USPTO patent document. The patent badge does contain a link to the full patent document.
The patent badge is an abbreviated version of the USPTO patent document. The patent badge covers the following: Patent number, Date patent was issued, Date patent was filed, Title of the patent, Applicant, Inventor, Assignee, Attorney firm, Primary examiner, Assistant examiner, CPCs, and Abstract. The patent badge does contain a link to the full patent document (in Adobe Acrobat format, aka pdf). To download or print any patent click here.
Patent No.:
Date of Patent:
Apr. 16, 2013
Filed:
Dec. 03, 2010
Sorin Gherman, Kirkland, WA (US);
Kunal Mukerjee, Redmond, WA (US);
Sorin Gherman, Kirkland, WA (US);
Kunal Mukerjee, Redmond, WA (US);
Microsoft Corporation, Redmond, WA (US);
Abstract
The present invention extends to methods, systems, and computer program products for identifying key phrases within documents. Embodiments of the invention include using a tag index to determine what a document primarily relates to. For example, an integrated data flow and extract-transform-load pipeline, crawls, parses and word breaks large corpuses of documents in database tables. Documents can be broken into tuples. The tuples can be sent to a heuristically based algorithm that uses statistical language models and weight+cross-entropy threshold functions to summarize the document into its 'top N' most statistically significant phrases. Accordingly, embodiments of the invention scale efficiently (e.g., linearly) and (potentially large numbers of) documents can be characterized by salient and relevant key phrases (tags).