The patent badge is an abbreviated version of the USPTO patent document. The patent badge does contain a link to the full patent document.

The patent badge is an abbreviated version of the USPTO patent document. The patent badge covers the following: Patent number, Date patent was issued, Date patent was filed, Title of the patent, Applicant, Inventor, Assignee, Attorney firm, Primary examiner, Assistant examiner, CPCs, and Abstract. The patent badge does contain a link to the full patent document (in Adobe Acrobat format, aka pdf). To download or print any patent click here.

Date of Patent:
Jun. 10, 2003

Filed:

Jun. 28, 2000
Applicant:
Inventors:

Raman Chandrasekar, Redmond, WA (US);

David W. Steinkraus, Redmond, WA (US);

Assignee:

Microsoft Corporation, Redmond, WA (US);

Attorney:
Primary Examiner:
Assistant Examiner:
Int. Cl.
CPC ...
G06F 1/730 ;
U.S. Cl.
CPC ...
G06F 1/730 ;
Abstract

Text classification has become an important aspect of information technology. Present text classification techniques range from simple text matching to more complex clustering methods. Clustering describes a process of discovering structure in a collection of characters. The invention automatically analyzes a text string and either updates an existing cluster or creates a new cluster. To that end, the invention may use a character n-gram matching process in addition to other heuristic-based clustering techniques. In the character n-gram matching process, each text string is first normalized using several heuristics. It is then divided into a set of overlapping character n-grams, where n is the number of adjacent characters. If the commonality between the text string and the existing cluster members satisfies a pre-defined threshold, the text string is added to the cluster. If, on the other hand, the commonality does not satisfy the pre-defined threshold, a new cluster may be created. Each cluster may have a selected topic name. The topic name allows whole clusters to be compared in a similar way to the individual clusters, and merged when a predetermined level of commonality exists between the subject clusters. The topic name also may be used as a suggested alternative to the text string. In this instance, the topic name of the cluster to which the text string was added may be outputted as an alternative to the text string.


Find Patent Forward Citations

Loading…