The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).

Date of Patent: Feb. 01, 2000

Filed: Nov. 14, 1997

Applicant:

Inventors: James V Mahoney, Los Angeles, CA (US); William J Rucklidge, Mountain View, CA (US)

Assignee: Xerox Corporation, Stamford, CT (US)

Attorney:
Primary Examiner:
Assistant Examiner:
Int. Cl.
CPC ...
G06T / ; G05B / ;
U.S. Cl.
CPC ...
358-114 ; 358-12 ; 358-11 ;
Abstract

A method and apparatus for compressing a corpus of document images into a collective tokenized representation. Initially, documents in the corpus are individually compressed into a document tokenized format. A document image in the document tokenized format is represented using a symbol table and a table of positions. Each symbol in the symbol table is a shape in the original document image. The positions in the table of positions indicate where the symbols in the symbol table are placed to form the document image. Subsequently, the individual symbol tables of each document in the corpus are assembled to form clusters of similar shapes. These clusters are then analyzed to identify the degree of interrelationship between the symbols in the individual symbol tables. Individual document symbol tables with a large number of recurring symbols are grouped together. For each of the groups of symbol tables, a collective symbol table is computed. The collective symbol table improves the compression ratio of a corpus by eliminating redundant shapes appearing in the individual document symbol tables. The collective symbol table also identifies groupings of documents in the corpus that are related because a significant number of similar shapes are used in each of the documents.
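
As a rough illustration of the data structures the abstract describes, the following Python sketch models a tokenized document as a symbol table plus a table of positions, then merges several symbol tables into one collective table. It is not the patented method: the names TokenizedDoc, similar, and build_collective, the string-based bitmap encoding, the pixel-difference similarity test, and the greedy first-match merging are all assumptions made for this example. The step in which documents are first grouped by how many symbols they share is omitted.

```python
from dataclasses import dataclass

# A shape is a small bitmap stored as rows of '#' (ink) and '.' (blank).
# This encoding is an assumption chosen to keep the example readable.
Shape = tuple[str, ...]


@dataclass
class TokenizedDoc:
    symbols: list[Shape]                   # symbol table: one entry per distinct shape
    positions: list[tuple[int, int, int]]  # table of positions: (symbol index, x, y)


def similar(a: Shape, b: Shape, max_diff: int = 0) -> bool:
    """Hypothetical similarity test: same size, at most max_diff differing pixels."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return False
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff <= max_diff


def build_collective(docs: list[TokenizedDoc], max_diff: int = 0):
    """Merge per-document symbol tables into one shared table.

    Returns the collective symbol table and, for each document, its table of
    positions rewritten to index into the collective table.
    """
    collective: list[Shape] = []
    remapped: list[list[tuple[int, int, int]]] = []
    for doc in docs:
        local_to_shared: list[int] = []
        for shape in doc.symbols:
            # Greedy first match: reuse the first shared shape that is similar.
            for idx, representative in enumerate(collective):
                if similar(shape, representative, max_diff):
                    local_to_shared.append(idx)
                    break
            else:
                # No similar shape yet, so this one becomes a new shared symbol.
                collective.append(shape)
                local_to_shared.append(len(collective) - 1)
        remapped.append([(local_to_shared[s], x, y) for s, x, y in doc.positions])
    return collective, remapped


# Two toy documents that share one shape.
doc_a = TokenizedDoc(
    symbols=[("##", "#."), ("..", ".#")],
    positions=[(0, 0, 0), (1, 2, 0), (0, 4, 0)],
)
doc_b = TokenizedDoc(
    symbols=[("..", ".#"), ("#.", "#.")],
    positions=[(0, 0, 0), (1, 2, 0)],
)
collective, remapped = build_collective([doc_a, doc_b])
print(len(collective), "shared symbols instead of",
      len(doc_a.symbols) + len(doc_b.symbols))
```

With max_diff set to 0 only identical shapes are merged, so each document can be rebuilt exactly from the collective table and its remapped positions. Raising max_diff merges near-identical shapes and shrinks the table further, at the cost of making the reconstruction lossy.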

