The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).

Date of Patent:

Mar. 28, 2017

Filed:

Dec. 17, 2014

Applicant:

Amazon Technologies, Inc., Seattle, WA (US);

Inventors:

Sivaranjini Dharmalingam, Seattle, WA (US);

Nathan Thomas Close, Seattle, WA (US);

Shantanu Shailendrakumar Fauji, Seattle, WA (US);

Sean Gwizdak, Edmonds, WA (US);

Jiahui Jiang, Urbana, IL (US);

Yohan Mammen, Renton, WA (US);

Roshan Rammohan, Seattle, WA (US);

Assignee:

Amazon Technologies, Inc., Seattle, WA (US);

Attorney:
Primary Examiner:
Int. Cl.
CPC ...
G06F 17/30 (2006.01);
U.S. Cl.
CPC ...
G06F 17/30324 (2013.01);
Abstract

Technologies are disclosed for mapping documents to candidate duplicate documents in a document corpus. A bitset optimized inverted index is created for a document corpus. A document is received for which candidate duplicate documents in the document corpus are to be identified. The document is tokenized using adaptive tokenization. A determination is made as to whether tokens in the document are represented in the bitset optimized inverted index. A list of candidate duplicate documents is created for tokens represented in the optimized inverted index utilizing in-memory bitsets that map tokens to documents that contain the tokens in the document corpus.
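
The flow described in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the class name `BitsetInvertedIndex`, the `candidates` method, and the `min_shared` threshold are hypothetical, and a plain lowercase whitespace split stands in for the adaptive tokenization, which the abstract does not specify. Each token maps to a Python integer used as an in-memory bitset, where bit i is set when document i contains the token.

```python
from collections import defaultdict


def tokenize(text):
    """Stand-in tokenizer (assumption): lowercase and split on whitespace.

    The abstract's "adaptive tokenization" is not defined there, so this
    simple split is used only to make the sketch runnable.
    """
    return text.lower().split()


class BitsetInvertedIndex:
    """Hypothetical index: token -> int bitset of documents containing it."""

    def __init__(self, corpus):
        self.num_docs = len(corpus)
        self.bitsets = defaultdict(int)
        for doc_id, text in enumerate(corpus):
            for token in set(tokenize(text)):
                # Set bit doc_id in this token's bitset.
                self.bitsets[token] |= 1 << doc_id

    def candidates(self, text, min_shared=1):
        """Return ids of documents sharing at least min_shared distinct
        tokens with text, ordered by shared-token count (descending)."""
        counts = [0] * self.num_docs
        for token in set(tokenize(text)):
            bits = self.bitsets.get(token)
            if bits is None:
                continue  # token is not represented in the index
            doc_id = 0
            while bits:
                if bits & 1:
                    counts[doc_id] += 1
                bits >>= 1
                doc_id += 1
        ranked = sorted(range(self.num_docs), key=lambda d: (-counts[d], d))
        return [d for d in ranked if counts[d] >= min_shared]


corpus = [
    "the quick brown fox",
    "quick brown dogs",
    "unrelated text here",
]
index = BitsetInvertedIndex(corpus)
result = index.candidates("the quick brown fox jumps", min_shared=2)
```

Here `result` holds the ids of the two corpus documents that share at least two tokens with the incoming document. The shared-token threshold is one simple way to turn per-token bitsets into a candidate list; the abstract does not say how candidates are scored or filtered.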

