The patent badge is an abbreviated version of the USPTO patent document and contains a link to the full patent document.

G06K 9/00 (2022.01); G06F 16/56 (2019.01); G06F 16/583 (2019.01); G06F 17/27 (2006.01); G06K 9/46 (2006.01); G06V 30/414 (2022.01); G06F 40/284 (2020.01); G06V 10/40 (2022.01); G06V 30/413 (2022.01); G06V 30/416 (2022.01);

U.S. Cl.

CPC ...

G06V 30/414 (2022.01); G06F 16/56 (2019.01); G06F 16/5846 (2019.01); G06F 40/284 (2020.01); G06V 10/40 (2022.01); G06V 30/413 (2022.01); G06V 30/416 (2022.01);

Abstract

Described systems and methods allow the automatic extraction of structured information from images of structured text documents such as invoices and receipts. Some embodiments employ optical character recognition (OCR) technology to extract individual text tokens (e.g., words) and token bounding boxes from a document image. A feature vector of each text token comprises a first part determined according to a character content of the text token, and a second part determined according to an image content of the token's bounding box. A neural network classifier produces a label indicative of a type of information (e.g., 'billing address', 'due date', etc.) carried by each text token. In some embodiments, documents are linearized by ordering text tokens in a sequence according to a reading order of the natural language (e.g., English, Arabic) in which the respective document is written. Token feature vectors are fed to the classifier in the order indicated by the token sequence.
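
The linearization and two-part feature vector described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Token` class, the line-grouping tolerance `line_tol`, the particular character statistics, and the pixel statistics are all hypothetical choices made for illustration, and the neural network classifier is omitted. The image is assumed to be a grayscale raster given as a list of rows of 0-255 intensities.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Token:
    """Hypothetical OCR output: token text plus bounding box in pixels."""
    text: str
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def linearize(tokens: Sequence[Token], rtl: bool = False,
              line_tol: float = 0.5) -> List[Token]:
    """Order tokens line by line, then along the reading direction.

    Tokens whose vertical centers differ by at most line_tol * height
    from the first token of the current line are treated as the same line
    (an assumed heuristic).
    """
    lines: List[dict] = []
    for tok in sorted(tokens, key=lambda t: t.y + t.h / 2):
        cy = tok.y + tok.h / 2
        if lines and abs(cy - lines[-1]["cy"]) <= line_tol * tok.h:
            lines[-1]["toks"].append(tok)
        else:
            lines.append({"cy": cy, "toks": [tok]})
    ordered: List[Token] = []
    for line in lines:
        # Left-to-right for e.g. English, right-to-left for e.g. Arabic.
        ordered.extend(sorted(line["toks"], key=lambda t: t.x, reverse=rtl))
    return ordered


def text_features(text: str) -> List[float]:
    """First part: simple statistics of the token's character content."""
    n = len(text)
    if n == 0:
        return [0.0, 0.0, 0.0, 0.0]
    digits = sum(c.isdigit() for c in text)
    alphas = sum(c.isalpha() for c in text)
    return [float(n), digits / n, alphas / n, (n - digits - alphas) / n]


def image_features(image: Sequence[Sequence[int]], tok: Token) -> List[float]:
    """Second part: statistics of the pixels inside the bounding box."""
    pixels = [p for row in image[tok.y:tok.y + tok.h]
              for p in row[tok.x:tok.x + tok.w]]
    if not pixels:
        return [0.0, 0.0]
    mean = sum(pixels) / len(pixels) / 255.0
    dark = sum(p < 128 for p in pixels) / len(pixels)
    return [mean, dark]


def feature_vector(image: Sequence[Sequence[int]], tok: Token) -> List[float]:
    """Concatenate the text-derived and image-derived parts."""
    return text_features(tok.text) + image_features(image, tok)


# Example: four tokens on two lines of a blank (all-white) 100x100 image.
image = [[255] * 100 for _ in range(100)]
tokens = [
    Token("Total", 10, 50, 40, 10),
    Token("Invoice", 10, 10, 50, 10),
    Token("#123", 70, 11, 30, 10),
    Token("42.00", 60, 50, 40, 10),
]
sequence = linearize(tokens)
vectors = [feature_vector(image, t) for t in sequence]
```

In this sketch, `vectors` holds one feature vector per token in reading order, which is the order in which the abstract says they are fed to the classifier. Passing `rtl=True` reverses the within-line order for right-to-left scripts.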
