Some materials about Zen3, AMD EPYC 7763 and 7773x

Updated: Oct 25th, 2023

CCD in Zen3 CPU

Like Zen 2, a Zen 3 desktop processor is composed of up to 2 core complex dies (CCDs) along with a separate I/O die containing the I/O components; EPYC packages such as the 7763 carry up to 8 CCDs. A Zen 3 CCD is composed of a single core complex (CCX) containing 8 CPU cores and 32 MB of shared L3 cache. This is in contrast to Zen 2, where each CCD is composed of 2 CCXs, each containing 4 cores and 16 MB of L3 cache. The new configuration allows all 8 cores of the CCX to communicate directly with each other and with the L3 cache, instead of having to go through the I/O die over the Infinity Fabric. (ref)

Zen 3 - Wikipedia

Zen 3 - Microarchitectures - AMD - WikiChip

source

This slide indicates that the 32 MB L3 cache in the CCD can be accessed by all 8 cores. Therefore, if the cores need to share 8 MB of data, that data occupies only 8 MB of the L3 cache in a Zen 3 CCD, and the remaining 24 MB can be used for per-core data. In contrast, in the Zen 2 layout each of the two CCXs must hold its own 8 MB copy of the shared data in its 16 MB L3, which leaves only 16 MB per CCD for per-core data.
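The capacity argument above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function name `private_l3_capacity_mb` is hypothetical, and it assumes the shared working set is replicated once in every CCX's L3.

```python
def private_l3_capacity_mb(l3_per_ccx_mb, ccx_per_ccd, shared_mb):
    """L3 left per CCD after every CCX holds its own copy of the shared data.

    Assumption: the shared working set is duplicated in each CCX's L3 slice.
    """
    if shared_mb > l3_per_ccx_mb:
        raise ValueError("shared working set does not fit in one CCX's L3")
    return (l3_per_ccx_mb - shared_mb) * ccx_per_ccd


# Zen 3: one 8-core CCX with 32 MB of L3 per CCD.
zen3 = private_l3_capacity_mb(l3_per_ccx_mb=32, ccx_per_ccd=1, shared_mb=8)
# Zen 2: two 4-core CCXs with 16 MB of L3 each per CCD.
zen2 = private_l3_capacity_mb(l3_per_ccx_mb=16, ccx_per_ccd=2, shared_mb=8)

print(f"Zen 3 CCD: {zen3} MB, Zen 2 CCD: {zen2} MB")
```

With 8 MB of shared data this gives 24 MB for Zen 3 and 16 MB for Zen 2, matching the figures in the paragraph.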

Digital Launch of 3rd Gen AMD EPYC™ Processors :: Advanced Micro Devices, Inc. (AMD)

Cache

for each CCD in 7763

L1i: 32 KiB per core

L1d: 32 KiB per core

L2: 512 KiB per core

L3: 32MB per CCD

Zen 3 - Microarchitectures - AMD - WikiChip

AMD EPYC 7763 Specs | TechPowerUp CPU Database

Architecture view

A long look at AMD’s Zen 3 core and chips - SemiAccurate

EPYC 7763

Package View (source)

The package view of the 7763 is above. It has one large I/O die with 8 CCDs around it.

(source)

core to core latency

Topology, Memory Subsystem & Latency - AMD 3rd Gen EPYC Milan Review: A Peak vs Per Core Performance Balance

Memory Subsystem Topology

NPS 4

source

(source)

EPYC 7763 - 2 Socket

In a two-socket system, the two EPYC 7763 processors communicate over 4 xGMI-2 (Infinity Fabric Gen 2) links.

(source)

Review: AMD Epyc 7763 2P (Milan) - CPU - HEXUS.net - Page 2

Cross socket interconnect for 7763

4-Link xGMI-2
Max Speed : 18GT/s
Maximum Link Width : x16

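Theoretical maximum bandwidth for cross-socket communication:

\(4 \times 18\,\text{GT/s} \times 16\,\text{bits/transfer} \times \tfrac{1}{8}\,\text{bytes/bit} = 144\,\text{GB/s}\)

At an assumed efficiency of 60%:

\(144\,\text{GB/s} \times 60\% = 86.4\,\text{GB/s}\)

For how this is calculated, refer to AMD Rome Processors - HECC Knowledge Base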

For EPYC 7742

In each Rome node configured with the HPE Apollo 9000 system architecture, there are 3 xGMI links using a total of 48 PCIe lanes. With the xGMI link speed set at 16 GT/s, the theoretical throughput for each direction is 96 GB/s (3 links x 16 GT/s x 2 bytes/transfer) without factoring in the encoding for xGMI, since there is no publication from AMD available. However, the expected efficiencies are 66–75%, so the sustained bandwidth per direction will be 63.5–72 GB/s.
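The same link arithmetic applies to both the 7763 figures above and the HECC numbers for the 7742. The sketch below is only a restatement of that formula; the helper name `xgmi_bandwidth_gbs` is hypothetical, and link encoding overhead is ignored, as in the HECC text.

```python
def xgmi_bandwidth_gbs(links, gt_per_s, width_bits=16):
    """Theoretical bandwidth in GB/s for a set of xGMI links.

    links * GT/s * bits per transfer / 8 bits per byte, with no allowance
    for encoding overhead.
    """
    return links * gt_per_s * width_bits / 8


milan_peak = xgmi_bandwidth_gbs(links=4, gt_per_s=18)  # EPYC 7763: 4 links at 18 GT/s
rome_peak = xgmi_bandwidth_gbs(links=3, gt_per_s=16)   # EPYC 7742 (HECC): 3 links at 16 GT/s

print(f"peak: Milan {milan_peak:.0f} GB/s, Rome {rome_peak:.0f} GB/s")

# Efficiency factors: 60% is the figure used in these notes,
# 66% and 75% bracket the range quoted by HECC.
for eff in (0.60, 0.66, 0.75):
    print(f"{eff:.0%}: Milan {milan_peak * eff:.1f} GB/s, "
          f"Rome {rome_peak * eff:.1f} GB/s")
```

The efficiency factors are estimates rather than measured values, so the scaled numbers should be read as rough expectations.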

some references:

source

EPYC 7773x

The 7773x is similar to the 7763, except for the difference in L3 cache size.

96 MB of L3 cache per CCD (32 MB + 64 MB 3D V-Cache).
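To put the per-CCD figure in package terms, the following sketch totals the L3 for both parts. It assumes 8 CCDs per package for the 7763 and the 7773x alike; the names `CCDS_PER_PACKAGE` and `package_l3_mb` are made up for the illustration.

```python
CCDS_PER_PACKAGE = 8  # assumption: both parts use the 8-CCD package layout


def package_l3_mb(l3_per_ccd_mb, ccds=CCDS_PER_PACKAGE):
    """Total L3 per package, given the L3 capacity of one CCD in MB."""
    return l3_per_ccd_mb * ccds


l3_7763 = package_l3_mb(32)        # 32 MB per CCD
l3_7773x = package_l3_mb(32 + 64)  # 32 MB base + 64 MB 3D V-Cache per CCD

print(f"7763:  {l3_7763} MB L3 per package")
print(f"7773x: {l3_7773x} MB L3 per package")
```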

AMD EPYC 7773X Specs | TechPowerUp CPU Database

AMD “Zen 3” 3D Vertical Cache Detailed Some More | TechPowerUp

Deep Diving Zen 3 V-Cache – Chips and Cheese

Some test results

AMD 7773x NPS2 mode

[Image: test results in NPS2 mode]

AMD 7773x NPS4 mode

$ cat nps4.log 
Intel(R) Memory Latency Checker - v3.10
*** Unable to modify prefetchers (try executing 'modprobe msr')
*** So, enabling random access for latency measurements
Measuring idle latencies for random access (in ns)...
        Numa node
Numa node        0       1       2       3       4       5       6       7  
       0      84.7    93.4   101.8   104.6   172.8   177.4   171.7   172.3  
       1      92.8    85.7   104.7   100.5   179.0   183.0   176.7   178.0  
       2     101.0   104.8    85.9    92.6   173.3   179.9   175.6   176.1  
       3     104.4   101.6    93.3    84.5   177.1   183.4   182.1   183.2  
       4     173.3   177.5   175.6   176.6    86.1    93.3   101.7   104.6  
       5     179.3   183.2   181.2   181.2    92.6    85.7   104.7   100.5  
       6     169.2   174.6   175.6   177.7   100.5   104.7    85.7    92.6  
       7     175.9   180.7   181.2   187.0   104.4   101.6    93.3    86.1  

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :  356352.3    
3:1 Reads-Writes :  322368.2    
2:1 Reads-Writes :  318636.4    
1:1 Reads-Writes :  310333.6    
Stream-triad like:  323440.8    

Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
        Numa node
Numa node        0       1       2       3       4       5       6       7  
       0    44538.4 44004.2 43184.1 42758.6 24522.6 23993.0 23872.8 24716.7 
       1    43988.9 44562.7 42721.3 43210.6 24560.3 34338.6 24718.8 24684.9 
       2    43160.7 42725.0 44546.2 44019.6 24520.7 24844.4 24690.3 23863.1 
       3    42704.5 43131.0 43962.0 44506.2 23519.0 23690.0 24464.4 24403.7 
       4    24397.7 24461.7 24317.2 23554.7 44582.6 43972.4 43163.3 42734.7 
       5    24680.3 34208.9 24887.2 24376.8 43964.5 44516.6 42696.0 43200.9 
       6    23491.7 23893.6 24539.3 24437.4 43211.4 42732.6 44517.6 43985.0 
       7    24575.8 23864.0 23453.3 24472.6 42720.5 43142.2 43930.1 44468.4 

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  920.83   356069.6
 00002  924.85   355460.4
 00008  920.38   356391.8
 00015  909.86   356797.4
 00050  891.06   357606.7
 00100  887.14   357646.8
 00200  497.61   357489.2
 00300  143.45   279058.4
 00400  130.96   216713.9
 00500  125.18   177043.2
 00700  119.40   129561.3
 01000  115.38    92492.1
 01300  112.84    71951.3
 01700  104.98    55619.7
 02500  102.25    38317.6
 03500  101.50    27687.9
 05000  100.67    19648.4
 09000  100.28    11244.0
 20000  100.07     5425.4

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency    23.8
Local Socket L2->L2 HITM latency    27.1
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
            Reader Numa Node
Writer Numa Node     0       1       2       3       4       5       6       7  
            0        -   107.1   113.6   117.4   185.3   191.5   181.2   187.9  
            1    106.4       -   117.7   113.7   189.2   195.4   185.1   191.7  
            2    114.0   117.4       -   106.2   186.9   193.1   186.6   193.2  
            3    117.6   113.7   106.2       -   186.7   193.5   190.7   197.3  
            4    185.9   192.3   186.5   190.7       -   106.8   114.3   118.7  
            5    189.1   195.4   189.7   193.8   106.9       -   117.5   114.4  
            6    182.4   188.8   186.7   194.6   114.5   117.5       -   106.9  
            7    185.3   191.6   189.5   197.4   118.8   114.3   106.9       -  
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
            Reader Numa Node
Writer Numa Node     0       1       2       3       4       5       6       7  
            0        -   107.5   116.9   121.8   188.9   197.8   184.2   194.0  
            1    107.6       -   122.0   116.8   194.0   202.8   189.3   199.0  
            2    117.1   121.8       -   107.4   189.8   201.1   190.1   199.8  
            3    122.0   116.7   107.4       -   190.8   202.8   195.5   205.2  
            4    189.5   197.8   190.6   196.7       -   108.0   117.4   123.1  
            5    193.8   202.2   194.9   201.0   108.1       -   121.9   117.4  
            6    185.3   193.7   190.3   201.3   117.6   121.9       -   108.0  
            7    189.5   197.9   194.4   205.4   123.2   117.5   108.0       -
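A quick way to read the idle-latency matrix from the log is to average it by distance class. The sketch below copies that matrix and groups the entries into local, same-socket and remote-socket accesses. It assumes NUMA nodes 0-3 sit on one socket and nodes 4-7 on the other, which is inferred from the latency jump in the log rather than stated by the tool; all names are illustrative.

```python
# Idle latencies (ns) copied from the NPS4 log; the first column is the source node.
IDLE_LATENCY_NS = """
0   84.7   93.4  101.8  104.6  172.8  177.4  171.7  172.3
1   92.8   85.7  104.7  100.5  179.0  183.0  176.7  178.0
2  101.0  104.8   85.9   92.6  173.3  179.9  175.6  176.1
3  104.4  101.6   93.3   84.5  177.1  183.4  182.1  183.2
4  173.3  177.5  175.6  176.6   86.1   93.3  101.7  104.6
5  179.3  183.2  181.2  181.2   92.6   85.7  104.7  100.5
6  169.2  174.6  175.6  177.7  100.5  104.7   85.7   92.6
7  175.9  180.7  181.2  187.0  104.4  101.6   93.3   86.1
"""

NODES_PER_SOCKET = 4  # assumption: nodes 0-3 on socket 0, nodes 4-7 on socket 1

# Drop the leading node index on each row and keep the latency values.
rows = [[float(x) for x in line.split()[1:]]
        for line in IDLE_LATENCY_NS.strip().splitlines()]


def classify(src, dst):
    """Label a (source node, target node) pair by how far apart the nodes are."""
    if src == dst:
        return "local"
    if src // NODES_PER_SOCKET == dst // NODES_PER_SOCKET:
        return "same socket"
    return "remote socket"


buckets = {}
for src, row in enumerate(rows):
    for dst, ns in enumerate(row):
        buckets.setdefault(classify(src, dst), []).append(ns)

# Mean latency per distance class.
summary = {kind: sum(vals) / len(vals) for kind, vals in buckets.items()}
for kind, mean_ns in summary.items():
    print(f"{kind:>13}: {mean_ns:6.1f} ns (n={len(buckets[kind])})")
```

The same grouping could be applied to the bandwidth and cache-to-cache matrices by swapping in their values.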

Reference

some reference data:
