The patent badge is an abbreviated version of the USPTO patent document. It covers the following: patent number, date the patent was issued, date the patent was filed, title of the patent, applicant, inventors, assignee, attorney firm, primary examiner, assistant examiner, CPCs, and abstract. The patent badge also contains a link to the full patent document (in PDF format).
Patent No.:
Date of Patent:
Jun. 17, 2025
Filed:
Mar. 10, 2022
Applicant:
Nvidia Corporation, Santa Clara, CA (US);
Inventors:
Greg Palmer, Cedar Park, TX (US);
Gentaro Hirota, San Jose, CA (US);
Ronny Krashinsky, Portola Valley, CA (US);
Ze Long, San Jose, CA (US);
Brian Pharris, Cary, NC (US);
Rajballav Dash, San Jose, CA (US);
Jeff Tuckey, Saratoga, CA (US);
Jerome F. Duluk, Jr., Palo Alto, CA (US);
Lacky Shah, Los Altos Hills, CA (US);
Luke Durant, San Jose, CA (US);
Jack Choquette, Palo Alto, CA (US);
Eric Werness, San Jose, CA (US);
Naman Govil, Sunnyvale, CA (US);
Manan Patel, San Jose, CA (US);
Shayani Deb, Seattle, WA (US);
Sandeep Navada, San Jose, CA (US);
John Edmondson, Arlington, MA (US);
Prakash Bangalore Prabhakar, San Jose, CA (US);
Wish Gandhi, Sunnyvale, CA (US);
Ravi Manyam, San Ramon, CA (US);
Apoorv Parle, San Jose, CA (US);
Olivier Giroux, Santa Clara, CA (US);
Shirish Gadre, Fremont, CA (US);
Steve Heinrich, Madison, AL (US);
Assignee:
NVIDIA Corporation, Santa Clara, CA (US);
Abstract
A new level (or levels) of hierarchy, Cooperative Group Arrays (CGAs), and an associated new hardware-based work distribution/execution model are described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.
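
The co-scheduling guarantee in the abstract can be illustrated with a small software model. The sketch below is not the patented hardware mechanism: it is a toy Python simulation that assumes each hardware domain has a fixed number of CTA slots and that a CGA launches only if all of its CTAs fit in one domain at the same time. The names `Domain`, `launch_cga`, and `launch_grid`, and the slot counts, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """A hypothetical hardware partition with a fixed number of CTA slots."""
    name: str
    slots: int
    resident: list = field(default_factory=list)  # (cga_id, cta_index) pairs

    def free(self) -> int:
        """Number of CTA slots still unoccupied in this domain."""
        return self.slots - len(self.resident)


def launch_cga(domains, cga_id, num_ctas):
    """Place every CTA of one CGA in a single domain, or place none.

    Returns the chosen domain's name, or None if the CGA has to wait.
    """
    for d in domains:
        if d.free() >= num_ctas:
            d.resident.extend((cga_id, i) for i in range(num_ctas))
            return d.name
    return None  # all-or-nothing: a CGA is never partially launched


def launch_grid(domains, num_cgas, ctas_per_cga):
    """Try to launch each CGA of a grid and record where it landed."""
    return {cga: launch_cga(domains, cga, ctas_per_cga)
            for cga in range(num_cgas)}


# Assumed example sizes: two domains, three CGAs of three CTAs each.
domains = [Domain("domain0", slots=4), Domain("domain1", slots=5)]
placement = launch_grid(domains, num_cgas=3, ctas_per_cga=3)
print(placement)  # {0: 'domain0', 1: 'domain1', 2: None}
```

In this run the third CGA is held back even though three slots remain free in total, because they are split across two domains (one in `domain0`, two in `domain1`). Under the model's assumptions, that is the trade-off the abstract describes: CTAs of a CGA are guaranteed to run concurrently and close to each other, at the cost of occasionally leaving slots idle.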