The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).
Patent No.:
Date of Patent: Aug. 13, 2024
Filed: Apr. 22, 2021
Applicant: Oracle International Corporation, Redwood Shores, CA (US)
Inventor: Philip Ogren, Boulder, CO (US)
Assignee: Oracle International Corporation, Redwood Shores, CA (US)
Abstract
A natural language identity classifier system is described, which employs a supervised machine learning (ML) model to perform language identity classification on input text. The ML model takes, as input, non-lexicalized features of target text derived from subword tokenization of the text. Specifically, these non-lexicalized features are generated based on statistics determined for tokens identified for the input text. According to an embodiment, at least some of the non-lexicalized features are based on natural language-specific summary statistics that indicate how often tokens were found within a corpus for each natural language. Use of such summary statistics allows for generation of natural language specific conditional probability-based features. Because of the inherent interpretability of a trained non-lexicalized ML model as described herein, it is possible to modify behavior of the trained ML model by adjusting summary statistics maintained for natural language tokens and/or by adjusting data for the subword tokenizers.
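To make the idea of non-lexicalized, conditional probability-based features concrete, here is a minimal Python sketch. It is not the patented method: the character-trigram `tokenize` function is only a stand-in for a real subword tokenizer, the two-language toy corpus is invented, and the names `build_stats`, `features`, `mean_logp_*`, and `seen_frac_*` are hypothetical. The point it illustrates is that the features are aggregate statistics over tokens (one small set per language) rather than one feature per token.

```python
import math
from collections import Counter

def tokenize(text, n=3):
    """Stand-in for a subword tokenizer (assumption): overlapping character n-grams."""
    s = text.lower()
    return [s[i:i + n] for i in range(max(len(s) - n + 1, 1))]

def build_stats(corpus_by_lang):
    """Per-language summary statistics: how often each token occurs in that language's corpus."""
    return {
        lang: Counter(tok for doc in docs for tok in tokenize(doc))
        for lang, docs in corpus_by_lang.items()
    }

def features(text, stats):
    """Non-lexicalized features: aggregates over the tokens, never the token identities."""
    tokens = tokenize(text)
    feats = {}
    for lang, counts in stats.items():
        total = sum(counts.values())
        vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens
        # Add-one smoothed estimate of P(token | language), averaged in log space.
        logps = [math.log((counts[t] + 1) / (total + vocab)) for t in tokens]
        feats[f"mean_logp_{lang}"] = sum(logps) / len(logps)
        # Fraction of the text's tokens that were ever seen in this language's corpus.
        feats[f"seen_frac_{lang}"] = sum(counts[t] > 0 for t in tokens) / len(tokens)
    return feats

# Toy corpus (invented for illustration only).
corpus = {
    "en": ["the quick brown fox", "this is the house"],
    "de": ["der schnelle braune fuchs", "das ist das haus"],
}
stats = build_stats(corpus)
f = features("the house is brown", stats)
print(f)
```

In a full system, a feature vector like `f` would be fed to a supervised classifier. Because each feature is tied directly to the maintained counts, editing an entry in `stats` (for example, raising the count of a token for one language) changes the resulting features in a predictable way, which is one reading of the interpretability claim in the abstract.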