The patent badge is an abbreviated version of the USPTO patent document. The patent badge does contain a link to the full patent document.

The patent badge is an abbreviated version of the USPTO patent document. The patent badge covers the following: Patent number, Date patent was issued, Date patent was filed, Title of the patent, Applicant, Inventor, Assignee, Attorney firm, Primary examiner, Assistant examiner, CPCs, and Abstract. The patent badge does contain a link to the full patent document (in Adobe Acrobat format, aka pdf). To download or print any patent click here.

Date of Patent:
Jun. 19, 2012

Filed:

May. 18, 2007
Applicants:

Surajit Chaudhuri, Redmond, WA (US);

Venkatesh Ganti, Redmond, WA (US);

Shriraghav Kaushik, Redmond, WA (US);

Anish Das Sarma, Palo Alto, CA (US);

Inventors:

Surajit Chaudhuri, Redmond, WA (US);

Venkatesh Ganti, Redmond, WA (US);

Shriraghav Kaushik, Redmond, WA (US);

Anish Das Sarma, Palo Alto, CA (US);

Assignee:

Microsoft Corporation, Redmond, WA (US);

Attorney:
Primary Examiner:
Assistant Examiner:
Int. Cl.
CPC ...
G06F 17/30 (2006.01);
U.S. Cl.
CPC ...
Abstract

A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication is accomplished using only as many of these constraints that are satisfied rather than be imposed inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of groupwise aggregation constraints for deduplication all SQL (structured query language) aggregates are allowed, including summation.


Find Patent Forward Citations

Loading…