The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e., PDF).

Date of Patent:
Apr. 13, 2010

Filed:
Apr. 20, 2007

Applicants:

Amit Sasturkar, Santa Clara, CA (US);

Rajat Ahuja, San Jose, CA (US);

Shanmugasundaram Ravikumar, Berkeley, CA (US);

Vladimir Ofitserov, Foster City, CA (US);

Inventors:

Amit Sasturkar, Santa Clara, CA (US);

Rajat Ahuja, San Jose, CA (US);

Shanmugasundaram Ravikumar, Berkeley, CA (US);

Vladimir Ofitserov, Foster City, CA (US);

Assignee:

Yahoo! Inc., Sunnyvale, CA (US);

Attorney:
Primary Examiner:
Assistant Examiner:
Int. Cl.
CPC ...
G06F 17/00 (2006.01);
U.S. Cl.
CPC ...
Abstract

Techniques are disclosed for detecting web pages with duplicate content. In one embodiment, a set of shingles is computed for each page of a group of pages. An aggregate set of shingles is determined based on the sets of shingles computed for the group of pages. A first subset from the aggregate set of shingles is determined by selecting, from the aggregate set, shingles whose frequencies in the aggregate set exceed a specified threshold. A modified set of shingles is generated for each page of the group of pages by removing, from the set of shingles for that page, any shingle included in the first subset. One or more duplicate pages in the group of pages are determined based at least in part on the modified sets of shingles generated for the group of pages.
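
The steps in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation. It makes several assumptions the abstract does not state: shingles are word-level k-grams, a shingle's frequency is the number of pages that contain it, and pages are compared by Jaccard similarity of their modified shingle sets. The function names and the parameters `k`, `max_freq`, and `min_similarity` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word k-grams) in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as dissimilar."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicates(pages, k=3, max_freq=2, min_similarity=0.8):
    """pages maps a page id to its text; returns a list of (id, id) pairs."""
    # Step 1: compute a set of shingles for each page in the group.
    per_page = {pid: shingles(text, k) for pid, text in pages.items()}

    # Step 2: aggregate the shingles across the group.
    # Assumption: frequency = number of pages containing the shingle.
    freq = Counter(s for page_set in per_page.values() for s in page_set)

    # Step 3: select the shingles whose frequency exceeds the threshold.
    frequent = {s for s, n in freq.items() if n > max_freq}

    # Step 4: build a modified set per page by removing frequent shingles.
    modified = {pid: page_set - frequent for pid, page_set in per_page.items()}

    # Step 5: report page pairs whose modified sets are similar enough.
    # Assumption: Jaccard similarity is the comparison measure.
    return [
        (a, b)
        for a, b in combinations(sorted(modified), 2)
        if jaccard(modified[a], modified[b]) >= min_similarity
    ]
```

Under these assumptions, removing high-frequency shingles keeps text that is shared across many pages, such as navigation menus or footers, from inflating the similarity between pages whose main content differs.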

