The patent badge is an abbreviated version of the USPTO patent document. It covers the following fields: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).

Date of Patent:
Dec. 12, 2017

Filed:

May 12, 2016
Applicant:

International Business Machines Corporation, Armonk, NY (US);

Inventors:

Charles E. Beller, Baltimore, MD (US);

Michael Drzewucki, Chantilly, VA (US);

Christopher Phipps, Arlington, VA (US);

Kristen M. Summers, Takoma Park, MD (US);

Julie T. Yu, Chantilly, VA (US);

Attorneys:
Primary Examiner:
Int. Cl.
CPC ...
G06F 17/00 (2006.01); G06F 17/24 (2006.01); G06F 17/28 (2006.01); G06F 17/30 (2006.01);
U.S. Cl.
CPC ...
G06F 17/241 (2013.01); G06F 17/28 (2013.01); G06F 17/30734 (2013.01);
Abstract

A mechanism is provided in a data processing system for identifying nonsense passages in documents being ingested into a corpus. A natural language processing pipeline configured to execute in the data processing system receives an input document to be ingested into a corpus. The natural language processing pipeline divides the input document into a plurality of input passages. A filter component of the natural language processing pipeline identifies whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts. The natural language processing pipeline filters each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages. The natural language processing pipeline adds the filtered plurality of input passages into the corpus.
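The flow described in the abstract (divide the document into passages, compute a metric from feature counts, filter, add to the corpus) can be illustrated with a minimal sketch. The abstract does not say which features are counted, how the metric is computed, or what threshold is applied, so everything specific here is an assumption made for illustration: the blank-line passage split, the two counts (all tokens and plain alphabetic tokens), the ratio used as the metric, the 0.5 threshold, and all function names. It is not the patented method.

```python
import re
from typing import Dict, List


def split_into_passages(document: str) -> List[str]:
    """Divide a document into passages (assumption: blank-line-separated paragraphs)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def feature_counts(passage: str) -> Dict[str, int]:
    """Count features of a passage (assumption: total tokens and plain-word tokens)."""
    tokens = passage.split()
    # A "plain word" is letters only, optionally followed by one punctuation mark.
    word_like = sum(1 for t in tokens if re.fullmatch(r"[A-Za-z]+[.,;:!?]?", t))
    return {"tokens": len(tokens), "alphabetic_tokens": word_like}


def nonsense_metric(counts: Dict[str, int]) -> float:
    """Hypothetical metric: fraction of tokens that are not plain words."""
    if counts["tokens"] == 0:
        return 1.0
    return 1.0 - counts["alphabetic_tokens"] / counts["tokens"]


def is_nonsense(passage: str, threshold: float = 0.5) -> bool:
    """Flag a passage as nonsense when the metric exceeds the (assumed) threshold."""
    return nonsense_metric(feature_counts(passage)) > threshold


def ingest(document: str, corpus: List[str], threshold: float = 0.5) -> List[str]:
    """Split, filter out nonsense passages, and add the remainder to the corpus."""
    passages = split_into_passages(document)
    kept = [p for p in passages if not is_nonsense(p, threshold)]
    corpus.extend(kept)
    return kept


if __name__ == "__main__":
    corpus: List[str] = []
    doc = (
        "The quick brown fox jumps over the lazy dog.\n\n"
        "0x3f 77 ## 12.5 |||| 9 @@\n\n"
        "Patents describe inventions in detail."
    )
    for passage in ingest(doc, corpus):
        print(passage)
```

In this sketch the second passage contains no plain-word tokens, so its metric is 1.0 and it is dropped, while the other two passages are added to the corpus. A real filter would likely combine more feature counts and tune the threshold on labeled data.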

