The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).
Patent No.:
Date of Patent:
Oct. 09, 2001
Filed:
Nov. 02, 1999
Inventors:
Marc Alexander Najork, Palo Alto, CA (US);
Clark Allan Heydon, San Francisco, CA (US);
Assignee:
Alta Vista Company, Palo Alto, CA (US);
Abstract
A web crawler stores fixed-length representations of document addresses in first and second caches and a disk file. When the web crawler downloads a document from a host computer, it identifies URLs (document addresses) in the downloaded document. Each identified URL is converted into a fixed-size numerical representation. The numerical representation is systematically compared to numerical representations in the caches and disk file. If the representation is not found in the caches and disk file, the document corresponding to the representation is scheduled for downloading, and the representation is stored in the second cache. If the representation is not found in the caches but is found in the disk file, the representation is added to the first cache. When the second cache is full, it is merged with the disk file and the second cache is reset to an initial state. When the first cache is full, one or more representations are evicted in accordance with an eviction policy. The representations include a prefix that is a function of a host component of the corresponding URLs, and the representations are stored in the disk file in sorted order. When the web crawler searches for a representation in the disk file, an index of the disk file is searched to identify a single block of the disk file, and only that single block of the disk file is searched for the representation.
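The lookup flow described in the abstract can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the names `fingerprint` and `UrlSeenTest`, the 64-bit layout (16 host-derived bits plus 48 URL-derived bits), the use of SHA-1, the LRU eviction policy, and the sorted Python list standing in for the disk file are all assumptions made for illustration.

```python
import bisect
import hashlib
from collections import OrderedDict
from urllib.parse import urlsplit


def fingerprint(url: str) -> int:
    """Fixed-size (64-bit) value whose top 16 bits depend only on the host.

    The bit widths and the hash function are assumptions of this sketch.
    """
    host = urlsplit(url).hostname or ""
    host_bits = int.from_bytes(hashlib.sha1(host.encode()).digest()[:2], "big")
    url_bits = int.from_bytes(hashlib.sha1(url.encode()).digest()[:6], "big")
    return (host_bits << 48) | url_bits


class UrlSeenTest:
    """Two caches in front of a sorted, block-indexed 'disk file'."""

    def __init__(self, first_capacity=1024, second_capacity=1024, block_size=64):
        self.first = OrderedDict()  # fingerprints known to be on disk (LRU, assumed)
        self.second = set()         # new fingerprints not yet merged to disk
        self.disk = []              # sorted list standing in for the disk file
        self.index = []             # first fingerprint of each disk block
        self.first_capacity = first_capacity
        self.second_capacity = second_capacity
        self.block_size = block_size

    def _on_disk(self, fp: int) -> bool:
        # Use the index to pick exactly one block, then search only that block.
        block_no = bisect.bisect_right(self.index, fp) - 1
        if block_no < 0:
            return False
        start = block_no * self.block_size
        block = self.disk[start:start + self.block_size]
        i = bisect.bisect_left(block, fp)
        return i < len(block) and block[i] == fp

    def _add_to_first(self, fp: int) -> None:
        # Evict the least recently used entry when the first cache is full.
        if len(self.first) >= self.first_capacity:
            self.first.popitem(last=False)
        self.first[fp] = None

    def _merge_second_into_disk(self) -> None:
        # Merge the second cache into the sorted file, reset it, rebuild the index.
        self.disk = sorted(set(self.disk) | self.second)
        self.second.clear()
        self.index = self.disk[::self.block_size]

    def check_and_add(self, url: str) -> bool:
        """Return True if the URL is new and should be scheduled for download."""
        fp = fingerprint(url)
        if fp in self.first:
            self.first.move_to_end(fp)  # refresh LRU position
            return False
        if fp in self.second:
            return False
        if self._on_disk(fp):
            self._add_to_first(fp)
            return False
        self.second.add(fp)
        if len(self.second) >= self.second_capacity:
            self._merge_second_into_disk()
        return True
```

Because the host-derived bits form the high-order prefix, fingerprints of URLs on the same host sort next to each other in the file. In this sketch that means consecutive lookups for links within one site tend to land in the same block.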