The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).

Date of Patent:
Nov. 03, 2020

Filed:
Feb. 08, 2019

Applicant:

International Business Machines Corporation, Armonk, NY (US);

Inventors:

Peter Willem Jan Staar, Wadenswil, CH;

Michele Dolfi, Zurich, CH;

Christoph Auer, Zurich, CH;

Aleksandros Sobczyk, Zurich, CH;

Konstantinos Bekas, Horgen, CH;

Attorneys:
Primary Examiner:
Assistant Examiner:
Int. Cl.
CPC ...
G06F 17/21 (2006.01); G06F 40/103 (2020.01); G06N 20/00 (2019.01); G06F 40/20 (2020.01); G06F 40/123 (2020.01);
U.S. Cl.
CPC ...
G06F 40/103 (2020.01); G06F 40/123 (2020.01); G06F 40/20 (2020.01); G06N 20/00 (2019.01);
Abstract

A method of collecting training data of a document component may be provided. The documents have a structure and are coded in the typesetting language TeX. The method comprises receiving a TeX source file, compiling it into a PDF file and a related sync file, and analyzing the PDF file, thereby determining a non-text-only document component. The method also comprises determining first coordinates of the non-text-only document component and a corresponding page number, determining a typesetting command relating to a non-text-only document component and determining second coordinates of a bounding box and a corresponding page number from the sync file, determining text elements in the non-text-only document component of the PDF file for which the first coordinates and the second coordinates overlap, and combining the determined text elements and linking them to a type of a non-text document component determined in the non-text-only document component in the TeX source file.
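
The coordinate-matching step in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Box`, `TextElement`, `overlaps`, and `collect_training_sample` are hypothetical, and the sketch assumes that both the PDF analysis and the sync file yield axis-aligned bounding boxes in the same coordinate system, each tagged with a page number. Text elements whose boxes overlap the box reported for a typesetting command are joined and labeled with the component type taken from the TeX source.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box on one page (hypothetical layout)."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class TextElement:
    """A piece of text found in the PDF, with its bounding box."""
    text: str
    box: Box


def overlaps(a: Box, b: Box) -> bool:
    """Return True if both boxes are on the same page and intersect."""
    if a.page != b.page:
        return False
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def collect_training_sample(
    component_type: str,
    sync_box: Box,
    pdf_elements: List[TextElement],
) -> Dict[str, Union[str, int]]:
    """Join the PDF text elements that overlap the sync-file box and
    link them to the component type named in the TeX source."""
    matched = [e for e in pdf_elements if overlaps(e.box, sync_box)]
    return {
        "type": component_type,
        "page": sync_box.page,
        "text": " ".join(e.text for e in matched),
    }


# Illustrative data only: a box assumed to come from the sync file for a
# table environment, and four text elements assumed to come from the PDF.
sync_box = Box(page=1, x0=100, y0=200, x1=300, y1=400)
elements = [
    TextElement("Table", Box(1, 110, 210, 150, 220)),
    TextElement("1", Box(1, 155, 210, 160, 220)),
    TextElement("outside", Box(1, 400, 210, 450, 220)),
    TextElement("other-page", Box(2, 110, 210, 150, 220)),
]
sample = collect_training_sample("table", sync_box, elements)
print(sample)
```

A strict inequality is used so that boxes that merely touch along an edge do not count as overlapping; a real pipeline might instead apply a tolerance or a minimum overlap ratio.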

