The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).
Patent No.:
Date of Patent:
Jan. 22, 2019
Filed:
Jan. 09, 2015
Applicants:
Beijing Jingdong Shangke Information Technology Co., Ltd., Haidian District, Beijing, CN;
Beijing Jingdong Century Trading Co., Ltd., Beijing, CN;
Inventors:
Yaohua Liao, Beijing, CN;
Xiaowei Li, Beijing, CN;
Assignees:
BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY CO., LTD., Beijing, CN;
BEIJING JINGDONG CENTURY TRADING CO., LTD., Beijing, CN;
Abstract
A method and a system for scheduling web crawlers according to keyword search. The method comprises: a scheduling end receives a task request command sent by a crawling node; the scheduling end acquires a secondary download link address from a priority bucket, generates tasks, and adds the generated tasks to a task list; the scheduling end acquires keyword link addresses from a dynamic bucket, derives derivative link addresses for the number of pages corresponding to each keyword link address, generates one task per derivative link address, and adds those tasks to the task list; the scheduling end acquires a keyword link address from a basic bucket, generates tasks, and adds the generated tasks to the task list; and the scheduling end returns the task list to the crawling node. By adjusting the number of tasks allowed to be added from a virtual bucket, the numbers of scheduled link addresses of different types can be flexibly adjusted. In addition, crawling popular keywords more frequently prevents missed data, and repeated crawls of unpopular keywords are reduced.
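The scheduling flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The class name `Scheduler`, the per-bucket quota fields, the `&page=N` pattern for derivative link addresses, and the `example.com` URLs are all hypothetical choices made for illustration. The sketch also assumes a quota counts bucket entries rather than generated tasks, which the abstract does not specify.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass
class Scheduler:
    """Hypothetical scheduling end that fills a task list from three buckets."""

    # Assumed quotas: how many entries may be taken from each bucket per request.
    priority_quota: int = 2
    dynamic_quota: int = 1
    basic_quota: int = 1
    # Priority bucket: secondary download link addresses.
    priority_bucket: Deque[str] = field(default_factory=deque)
    # Dynamic bucket: (keyword link address, number of pages) pairs.
    dynamic_bucket: Deque[Tuple[str, int]] = field(default_factory=deque)
    # Basic bucket: keyword link addresses.
    basic_bucket: Deque[str] = field(default_factory=deque)

    def handle_task_request(self) -> List[str]:
        """Build the task list returned to a crawling node for one request."""
        tasks: List[str] = []

        # 1. Secondary download links from the priority bucket.
        for _ in range(min(self.priority_quota, len(self.priority_bucket))):
            tasks.append(self.priority_bucket.popleft())

        # 2. Keyword links from the dynamic bucket, expanded to one
        #    derivative link per result page (URL pattern is an assumption).
        for _ in range(min(self.dynamic_quota, len(self.dynamic_bucket))):
            url, pages = self.dynamic_bucket.popleft()
            tasks.extend(f"{url}&page={p}" for p in range(1, pages + 1))

        # 3. Keyword links from the basic bucket.
        for _ in range(min(self.basic_quota, len(self.basic_bucket))):
            tasks.append(self.basic_bucket.popleft())

        return tasks


if __name__ == "__main__":
    scheduler = Scheduler(priority_quota=1, dynamic_quota=1, basic_quota=1)
    scheduler.priority_bucket.append("https://example.com/item/1")
    scheduler.dynamic_bucket.append(("https://example.com/search?q=phone", 3))
    scheduler.basic_bucket.append("https://example.com/search?q=lamp")
    for task in scheduler.handle_task_request():
        print(task)
```

Raising or lowering a quota changes how many link addresses of that type enter each task list, which is one way to read the abstract's point about flexible adjustment. How keywords are classified as popular and placed in the dynamic bucket is outside this sketch.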