The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e., PDF).

Date of Patent: Jul. 01, 2014

Filed: Jan. 05, 2010

Applicants:

Ashwin Tengli, Karnataka, IN;

Rajeev Rastogi, Karnataka, IN;

Jeyashankher Ramamirtham, Karnataka, IN;

Srinivasan H Sengamedu, Karnataka, IN;

Sandeepkumar Bhuramal Satpal, Karnataka, IN;

Inventors:

Ashwin Tengli, Karnataka, IN;

Rajeev Rastogi, Karnataka, IN;

Jeyashankher Ramamirtham, Karnataka, IN;

Srinivasan H Sengamedu, Karnataka, IN;

Sandeepkumar Bhuramal Satpal, Karnataka, IN;

Assignee:

Yahoo! Inc., Sunnyvale, CA (US);

Attorney:
Primary Examiner:
Int. Cl.: G06F 7/00 (2006.01); G06F 17/30 (2006.01)
U.S. Cl. (CPC): G06F 17/30854 (2013.01)
Abstract

Web pages are efficiently categorized in a data processor without analyzing the content of the web pages. According to at least one embodiment, data is maintained that represents sample URLs grouped into a plurality of clusters. The sample URLs of a cluster are used to produce a URL regular expression pattern ('URL-regex') that differentiates the sample URLs of the cluster from the sample URLs of other clusters and that covers at least a specified percentage of the sample URLs in the cluster. The process of producing a URL-regex is repeated for each of the clusters producing a URL-regex for each cluster. Web pages are then categorized into one of the clusters by determining which of the URL-regex patterns produced for the clusters match URLs that refer to the web pages. Thus, a web page may be categorized based on a URL that refers to the web page without having to obtain and analyze the content of the web page.
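
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not say how a URL-regex is derived, so the generalization rule here (keep the host and literal path segments, replace all-digit segments with `\d+`) is an assumption made for illustration. The function names, the `min_coverage` parameter, and the `shop.example.com` sample URLs are likewise hypothetical, and the sample URLs are assumed to carry no query strings.

```python
import re
from urllib.parse import urlsplit


def generalize(url):
    """Turn one URL into a candidate regex (assumed rule, not from the patent):
    literal host and path segments are kept, all-digit segments become \\d+."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    pieces = [r"\d+" if re.fullmatch(r"\d+", s) else re.escape(s) for s in segments]
    return (r"^https?://" + re.escape(parts.netloc)
            + "".join("/" + p for p in pieces) + r"/?$")


def build_url_regex(cluster, others, min_coverage=0.8):
    """Pick the candidate that matches no URL from other clusters and covers
    at least min_coverage of this cluster's sample URLs; None if none qualifies."""
    best, best_cov = None, 0.0
    for candidate in sorted({generalize(u) for u in cluster}):
        rx = re.compile(candidate)
        if any(rx.match(u) for u in others):
            continue  # fails to differentiate this cluster from the others
        cov = sum(1 for u in cluster if rx.match(u)) / len(cluster)
        if cov >= min_coverage and cov > best_cov:
            best, best_cov = candidate, cov
    return best


def build_all(clusters, min_coverage=0.8):
    """Repeat the regex construction for every cluster."""
    patterns = {}
    for name, urls in clusters.items():
        others = [u for n, us in clusters.items() if n != name for u in us]
        pattern = build_url_regex(urls, others, min_coverage)
        if pattern is not None:
            patterns[name] = re.compile(pattern)
    return patterns


def categorize(url, patterns):
    """Return the first cluster whose URL-regex matches the URL, else None.
    The page content is never fetched or inspected."""
    for name, rx in patterns.items():
        if rx.match(url):
            return name
    return None


# Hypothetical sample URLs grouped into two clusters.
clusters = {
    "product": [
        "http://shop.example.com/item/101",
        "http://shop.example.com/item/2045",
        "http://shop.example.com/item/7",
    ],
    "review": [
        "http://shop.example.com/review/101",
        "http://shop.example.com/review/88",
    ],
}
patterns = build_all(clusters)
print(categorize("http://shop.example.com/item/999", patterns))
```

In this sketch an unseen URL such as `http://shop.example.com/item/999` is assigned to the `product` cluster purely from its URL, while a URL that matches no pattern is left uncategorized (`None`). A fuller implementation would need a richer generalization step and a policy for URLs that match more than one pattern.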

