The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).

Date of Patent:
Jun. 09, 2015

Filed:
Jun. 13, 2012

Applicants:

Boyang Cai, Hangzhou, CN;

Qi Qiang, Hangzhou, CN;

Inventors:

Boyang Cai, Hangzhou, CN;

Qi Qiang, Hangzhou, CN;

Assignee:

Alibaba Group Holding Limited, Grand Cayman, KY;

Attorney:
Primary Examiner:
Int. Cl.
CPC ...
G06F 17/00 (2006.01); G06F 17/30 (2006.01); G06F 17/21 (2006.01);
U.S. Cl.
CPC ...
G06F 17/30908 (2013.01); G06F 17/211 (2013.01);
Abstract

A method of extracting web page information includes analyzing a document object model (DOM) structure of a sample page to obtain a position of information to be extracted. A node corresponding to the position of the information to be extracted is rendered in the DOM structure as a target node. Starting from the target node, relative position information is traversed recursively until the root node is found to create candidate paths. The candidate paths are rendered as a path set. A DOM structure of a page to be extracted is analyzed, information is located in the DOM structure of the page starting from the root node in the path set, and an extracted node candidate set is obtained. A node having the highest robustness from the extracted node candidate set is selected to be a final extracted node, and extracted information is obtained using the extracted node.
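
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: the abstract does not define "relative position information" or "robustness", so the sketch records each node's tag, its index among same-tag siblings, and its class attribute, builds three hypothetical candidate paths from them, and treats the number of candidate paths that agree on a node as its robustness. It also walks up to the root with a loop rather than recursion. All function names and the two sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE_PAGE = ("<html><body><div class='nav'>Home</div><div class='item'>"
               "<span class='title'>Widget</span><span class='price'>$10</span>"
               "</div></body></html>")
NEW_PAGE = ("<html><body><div class='ad'>Sale!</div><div class='nav'>Home</div>"
            "<div class='item'><span class='title'>Gadget</span>"
            "<span class='price'>$25</span></div></body></html>")


def build_path_set(root, target):
    """Walk from the target node up to the root and derive candidate paths."""
    parents = {child: parent for parent in root.iter() for child in parent}
    steps, node = [], target
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        # Assumed "relative position information": tag, sibling index, class.
        steps.append((node.tag, same_tag.index(node), node.get("class")))
        node = parent
    steps.reverse()  # paths are followed from the root downwards

    by_index = [("index", tag, idx) for tag, idx, _ in steps]
    by_class = [("class", tag, cls) if cls else ("index", tag, idx)
                for tag, idx, cls in steps]
    # Loose path: any ancestor with the right tag, strict match on the leaf only.
    loose = [("tag", tag, None) for tag, _, _ in steps[:-1]] + by_class[-1:]
    return [by_index, by_class, loose]


def follow(root, path):
    """Return every node reached by following one candidate path from the root."""
    nodes = [root]
    for kind, tag, value in path:
        reached = []
        for node in nodes:
            children = [c for c in node if c.tag == tag]
            if kind == "index":
                reached.extend(children[value:value + 1])
            elif kind == "class":
                reached.extend(c for c in children if c.get("class") == value)
            else:  # "tag": accept every child with this tag
                reached.extend(children)
        nodes = reached
    return nodes


def extract(root, path_set):
    """Pick the candidate node that the most paths agree on (assumed robustness)."""
    votes = Counter()
    for path in path_set:
        for node in follow(root, path):
            votes[node] += 1
    return votes.most_common(1)[0][0] if votes else None


sample_root = ET.fromstring(SAMPLE_PAGE)
# Assumption: the target node on the sample page is located by its known text.
target = next(n for n in sample_root.iter() if n.text == "$10")
path_set = build_path_set(sample_root, target)

result = extract(ET.fromstring(NEW_PAGE), path_set)
print(result.text)  # prints: $25
```

In this example the new page has an extra leading block, so the purely index-based path lands on the wrong branch and finds nothing, while the two class-based paths both reach the price node. Keeping several candidate paths and choosing by agreement is one simple way to tolerate such layout shifts; other robustness measures are equally plausible readings of the abstract.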

