The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).

Date of Patent: Nov. 23, 1999
Filed: Apr. 22, 1998
Applicant:
Inventor: Richard Allen Shaner, Seabrook, MD (US);
Attorney:
Primary Examiner:
Assistant Examiner:
Int. Cl.: G06F / ;
U.S. Cl.: 704-9 ; 704-1 ; 707530 ;
Abstract

A method of identifying the types of data contained in an electronic file of unknown data type by gathering exemplary files of each data type of interest; counting the number of unique n-grams within each exemplary file; determining a weight for each unique n-gram; listing the unique n-grams in the exemplary files of a particular data type by descending magnitude of weight for each data type of interest; selecting the top m weighted n-grams and their associated weights; establishing a threshold for each data type of interest; selecting a length of data from the electronic file; listing every n-gram in the data selected; giving each listed n-gram, that was also selected, the weight that that n-gram was given for each data type of interest; summing the weights given to each n-gram according to data type; comparing the sums to the thresholds established in order to determine the types, if any, of the selected data; recording the location of the selected data if it is of a data type of interest; stopping if the number of selected lengths of data reached a user-definable number, otherwise selecting another length of data from the file that is the same length as that selected previously, where the newly selected data overlaps with the previously selected data by at least one position; and repeating the steps from listing every n-gram to stopping using the newly selected data.
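The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not say how weights or thresholds are computed, so the sketch assumes that an n-gram's weight is its share of all n-grams in the exemplary files of that type, that every n-gram occurrence in a window contributes its weight, and that thresholds are supplied by the caller. The function names (`ngrams`, `train`, `scan`) and the `step` parameter are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from collections import Counter


def ngrams(data: bytes, n: int):
    """Yield every length-n byte sequence in data, in order."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def train(exemplars: dict[str, list[bytes]], n: int, m: int) -> dict[str, dict[bytes, float]]:
    """Return, for each data type, its top-m n-grams mapped to their weights.

    Assumed weighting (not given in the abstract): the n-gram's count
    divided by the total n-gram count across that type's exemplary files.
    """
    profiles = {}
    for dtype, files in exemplars.items():
        counts = Counter()
        for content in files:
            counts.update(ngrams(content, n))
        total = sum(counts.values())
        # most_common(m) lists n-grams by descending count, which is also
        # descending weight because every weight shares the same divisor.
        profiles[dtype] = {gram: c / total for gram, c in counts.most_common(m)}
    return profiles


def scan(data: bytes, profiles: dict[str, dict[bytes, float]],
         thresholds: dict[str, float], n: int, window: int,
         step: int, max_windows: int) -> list[tuple[int, str]]:
    """Slide a fixed-length window over data and record (offset, type) hits."""
    # step < window guarantees consecutive windows overlap by at least one position.
    assert 1 <= step < window
    hits = []
    start = 0
    examined = 0
    while examined < max_windows and start + window <= len(data):
        chunk = data[start:start + window]
        sums = {dtype: 0.0 for dtype in profiles}
        for gram in ngrams(chunk, n):
            for dtype, weights in profiles.items():
                # n-grams outside the selected top-m set contribute nothing.
                sums[dtype] += weights.get(gram, 0.0)
        for dtype, total in sums.items():
            if total >= thresholds[dtype]:
                hits.append((start, dtype))
        examined += 1
        start += step
    return hits
```

In this sketch, `train` would be called once on the exemplary files, and `scan` would then be run on the file of unknown type. `max_windows` plays the role of the user-definable stopping count, and the returned offsets are the recorded locations of data whose summed weights met a type's threshold.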

