The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in PDF format).

Patent No.:

US 6542635 B1

Date of Patent:

Apr. 01, 2003

Filed:

Sep. 08, 1999

Method for document comparison and classification using document image layout

Applicant:

Inventors:

Jianying Hu, Cranford, NJ (US);

Ramanujan S. Kashi, Bridgewater, NJ (US);

Gordon Thomas Wilfong, Gillette, NJ (US);

Assignee:

Lucent Technologies Inc., Murray Hill, NJ (US);

Attorney:

Primary Examiner:

Phuoc Tran

Assistant Examiner:

Amir Alavi

Int. Cl.

CPC ...

G06K 9/34 ; G06K 9/46 ; G06K 9/48 ; G06K 9/62 ; G06K 9/68 ;

U.S. Cl.

CPC ...

G06K 9/34 ; G06K 9/46 ; G06K 9/48 ; G06K 9/62 ; G06K 9/68 ;

Abstract

Document type comparison and classification using layout classification is accomplished by first segmenting a document page into blocks of text and white space. A grid of rows and columns, forming bins, is created on the page to intersect the blocks. Layout information is identified using a unique fixed-length interval vector to represent each row on the segmented document. By computing the Manhattan distance between interval vectors of all rows of two document pages and performing a warping function to determine the row-to-row correspondence, two documents may be compared by their layout. Furthermore, interval vectors may be grouped into N clusters with a cluster center, defined as the median of the interval vectors of the cluster, replacing each interval vector in its cluster. Using Hidden Markov Models, documents can be compared to document type models comprising rows represented by cluster centers and identified as belonging to one or more document types. In addition, documents stored in a database may be retrieved, deleted, or otherwise managed by type, using their corresponding vector sets without requiring expensive OCR of the document. Furthermore, based on the classification, it is a simple matter to locate which blocks of data contain certain information. Where only that information is desired, it is not necessary to perform OCR on the entire document. Rather, OCR may be limited to those blocks where the particular information is expected based on the document type.
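
The row comparison described in the abstract can be illustrated with a short Python sketch. It is not the patented implementation, and it rests on several assumptions: the abstract does not define the interval encoding, so each grid row is represented here by a simplified fixed-length 0/1 vector (1 where a bin overlaps a text block); the "warping function" is interpreted as standard dynamic time warping; and the names `row_vectors`, `manhattan`, and `layout_distance` and the `(left, top, right, bottom)` block format are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical text-block bounding box: (left, top, right, bottom) in pixels.
Block = Tuple[int, int, int, int]


def row_vectors(blocks: Sequence[Block], width: int, height: int,
                n_rows: int, n_cols: int) -> List[List[int]]:
    """Lay an n_rows x n_cols grid over the page and return one vector per row.

    Assumption: a bin is 1 if it overlaps any text block, else 0. This is a
    simplified stand-in for the interval vector named in the abstract.
    """
    bin_w = width / n_cols
    bin_h = height / n_rows
    rows = [[0] * n_cols for _ in range(n_rows)]
    for left, top, right, bottom in blocks:
        for r in range(n_rows):
            # Skip grid rows that do not overlap the block vertically.
            if r * bin_h >= bottom or (r + 1) * bin_h <= top:
                continue
            for c in range(n_cols):
                # Mark bins that overlap the block horizontally.
                if c * bin_w < right and (c + 1) * bin_w > left:
                    rows[r][c] = 1
    return rows


def manhattan(u: Sequence[int], v: Sequence[int]) -> int:
    """Manhattan (L1) distance between two equal-length row vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def layout_distance(page_a: Sequence[Sequence[int]],
                    page_b: Sequence[Sequence[int]]) -> float:
    """Dynamic time warping over rows, with Manhattan distance as local cost.

    Assumption: both pages have at least one row.
    """
    n, m = len(page_a), len(page_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = manhattan(page_a[i - 1], page_b[j - 1])
            # A row may match one row, or be stretched over several rows.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


# Two 100x100 pages on a 4x4 grid: the same full-width text band at two heights.
page_a = row_vectors([(0, 0, 100, 25)], 100, 100, 4, 4)
page_b = row_vectors([(0, 0, 100, 50)], 100, 100, 4, 4)
print(layout_distance(page_a, page_b))
```

Because the warping step lets one row of the first page align with several rows of the second, a text band that merely grows taller adds no cost in this sketch, whereas a band that changes width does.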
