The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).
Patent No.:
Date of Patent:
Sep. 14, 2021
Filed:
Oct. 31, 2019
Applicant:
EMC IP Holding Company LLC, Hopkinton, MA (US);
Inventors:
Adriana Bechara Prado, Niterói, BR;
Vitor Silva Sousa, Niterói, BR;
Marcia Lucas Pesce, Rio de Janeiro, BR;
Paulo de Figueiredo Pires, Rio de Janeiro, BR;
Fábio André Machado Porto, Petrópolis, BR;
Altobelli de Brito Mantuan, Niterói, BR;
Rodolpho Rosa da Silva, Rio de Janeiro, BR;
Wagner dos Santos Vieira, Rio de Janeiro, BR;
Assignee:
EMC IP Holding Company LLC, Hopkinton, MA (US);
Abstract
Techniques are provided for data discovery and data integration in a data lake. One method comprises obtaining data files from a data lake, wherein each data file comprises multiple records having multiple fields; selecting multiple candidate fields from a data file based on a record type; determining a relevance score for each candidate field from the data file based on multiple features extracted from the data file; and clustering the scored candidate fields into clusters of similar domains using a hashing algorithm, wherein a given cluster comprises candidate fields, wherein multiple data files can be integrated based on a domain of the candidate fields in the given cluster. The relevance score for each candidate field is based on multiple features comprising, for example, features that take into account a morphological or semantic similarity between file name, file metadata and/or file records and features that consider statistics of candidate fields in a data file.
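
The steps named in the abstract (select candidate fields, score them, cluster them by hashing) can be illustrated with a minimal Python sketch. The abstract does not name the hashing algorithm, the record-type rule, or the exact features, so everything specific here is an assumption for illustration only: MinHash with LSH banding stands in for "a hashing algorithm", string-valued fields stand in for the record-type selection, and the three features with their weights (name similarity, fill ratio, distinct ratio) are hypothetical. The function names `candidate_fields`, `relevance`, `minhash`, `cluster`, and `discover` are likewise invented for this sketch and are not taken from the patent.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher

NUM_HASHES = 32  # MinHash signature length (assumed value)
BANDS = 16       # LSH bands, i.e. 2 signature rows per band (assumed value)


def candidate_fields(records):
    """Pick fields whose non-empty values are all strings (assumed selection rule)."""
    names = sorted({key for record in records for key in record})
    selected = {}
    for name in names:
        values = [r[name] for r in records if r.get(name) not in (None, "")]
        if values and all(isinstance(v, str) for v in values):
            selected[name] = values
    return selected


def relevance(file_name, field_name, values, total_records):
    """Combine a name-similarity feature with two field statistics (hypothetical weights)."""
    name_sim = SequenceMatcher(None, file_name.lower(), field_name.lower()).ratio()
    fill_ratio = len(values) / total_records
    distinct_ratio = len(set(values)) / len(values)
    return 0.4 * name_sim + 0.3 * fill_ratio + 0.3 * distinct_ratio


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash(values):
    """MinHash signature over the set of normalized field values."""
    tokens = {v.strip().lower() for v in values}
    return tuple(min(_seeded_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES))


def cluster(signatures):
    """Group keys whose signatures agree on at least one LSH band (union-find merge)."""
    parent = {key: key for key in signatures}

    def find(key):
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for key, sig in signatures.items():
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].append(key)
    for keys in buckets.values():
        for other in keys[1:]:
            parent[find(other)] = find(keys[0])

    groups = defaultdict(set)
    for key in signatures:
        groups[find(key)].add(key)
    return list(groups.values())


def discover(files, min_score=0.0):
    """files maps a file name to a list of record dicts; returns (clusters, scores)."""
    signatures, scores = {}, {}
    for file_name, records in files.items():
        for field, values in candidate_fields(records).items():
            score = relevance(file_name, field, values, len(records))
            if score >= min_score:  # min_score is a hypothetical cutoff
                scores[(file_name, field)] = score
                signatures[(file_name, field)] = minhash(values)
    return cluster(signatures), scores
```

In this sketch, two fields from different files that hold largely the same values land in the same cluster, which marks them as candidates for joining those files. A production system would likely use richer features (for example semantic similarity against file metadata) and a tuned signature length and band count.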