The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventor, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in Adobe Acrobat format, i.e. PDF).

Date of Patent:

Jun. 30, 2020

Filed:

Apr. 18, 2018

Applicant:

EMC IP Holding Company LLC, Hopkinton, MA (US);

Inventors:

Junping Zhao, Beijing, CN;

Dragan Savic, Brookline, MA (US);

Assignee:

EMC IP Holding Company LLC, Hopkinton, MA (US);

Attorney:
Primary Examiner:
Int. Cl.
CPC ...
G06F 11/00 (2006.01); G06F 11/14 (2006.01); G06T 1/20 (2006.01); G06K 9/62 (2006.01); G06F 9/46 (2006.01); G06T 1/60 (2006.01); G06F 9/48 (2006.01); G06N 20/00 (2019.01);
U.S. Cl.
CPC ...
G06F 11/1407 (2013.01); G06F 9/461 (2013.01); G06F 9/4881 (2013.01); G06K 9/6257 (2013.01); G06N 20/00 (2019.01); G06T 1/20 (2013.01); G06T 1/60 (2013.01);
Abstract

Systems and methods are provided to optimize checkpoint operations for deep learning (DL) model training tasks. For example, a distributed DL model training process is executed to train a DL model using multiple accelerator devices residing on one or more server nodes, and a checkpoint operation is performed to generate and store a checkpoint of an intermediate DL model. A checkpoint operation includes compressing a checkpoint of an intermediate DL model stored in memory of a given accelerator device to generate a compressed checkpoint, and scheduling a time to perform a memory copy operation to transfer a copy of the compressed checkpoint from the memory of the given accelerator device to a host system memory. The scheduling is performed based on information regarding bandwidth usage of a communication link to be utilized to transfer the compressed checkpoint to perform the memory copy operation, wherein the memory copy operation is performed at the scheduled time.
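The two steps named in the abstract, compressing the checkpoint and then choosing when to copy it based on link bandwidth usage, can be illustrated with a minimal sketch. This is not the patented implementation. The function names, the use of zlib as the compressor, the representation of bandwidth usage as a list of per-slot utilization fractions, and the 0.5 threshold are all assumptions made for illustration only.

```python
import zlib
from dataclasses import dataclass


@dataclass
class ScheduledCopy:
    start_slot: int    # index of the time slot chosen for the memory copy
    compressed: bytes  # compressed checkpoint payload to transfer


def compress_checkpoint(checkpoint: bytes) -> bytes:
    """Compress the checkpoint bytes (zlib is a stand-in for any codec)."""
    return zlib.compress(checkpoint)


def pick_copy_slot(link_usage: list[float], threshold: float = 0.5) -> int:
    """Pick a time slot for the copy from per-slot link utilization (0.0 to 1.0).

    Returns the earliest slot whose utilization is below the (hypothetical)
    threshold; if none qualifies, falls back to the least-utilized slot.
    """
    for slot, usage in enumerate(link_usage):
        if usage < threshold:
            return slot
    return min(range(len(link_usage)), key=link_usage.__getitem__)


def schedule_checkpoint_copy(checkpoint: bytes,
                             link_usage: list[float]) -> ScheduledCopy:
    """Compress first, then schedule the device-to-host copy."""
    compressed = compress_checkpoint(checkpoint)
    return ScheduledCopy(start_slot=pick_copy_slot(link_usage),
                         compressed=compressed)


if __name__ == "__main__":
    # Hypothetical utilization forecast for the next four time slots.
    plan = schedule_checkpoint_copy(b"\x00" * 4096, [0.9, 0.8, 0.3, 0.1])
    print(plan.start_slot, len(plan.compressed))
```

In a real system the copy itself would be an accelerator-to-host transfer issued at the chosen time; the sketch only models the decision, and it expects a non-empty utilization list.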

