How to achieve cross-host cache coherence using CXL 3.0 HDM-DB
Updated: Nov 15th, 2023
With Back-Invalidate support in CXL 3.0, a CXL 3.0 memory device can invalidate cachelines in the host cache. To ensure cache coherence, the memory device should have a device coherency engine (DCOH) and implement an inclusive snoop filter within it. This inclusive snoop filter stores information about the cachelines held in host caches, including their state (A/S/I) and their owner or sharers.
Inclusive Snoop Filter in DCOH
The snoop filter in the DCOH is a directory-like structure that maintains a 2-bit state per tracked cacheline, allowing it to encode up to 4 states.
“Inclusive” means that whenever a cacheline is in a host cache, there is also an entry in the snoop filter indicating its status. If a host wants to place a cacheline into its own cache and there is no entry for this line in the snoop filter (SF miss), an entry has to be allocated (SF alloc) so that other hosts can learn the status of the line from the snoop filter. The size of the snoop filter is therefore one of the key aspects of designing a DCOH. Because the directory has a limited size, once it becomes full we need to pick a victim entry (SF victim), invalidate its corresponding cacheline in the host cache, and reuse the space for the new entry. This is what Figure 3-20 illustrates.
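The SF miss / SF alloc / SF victim flow described above can be sketched with a small model. This is only a toy under stated assumptions: the class name, the `back_invalidates` log, and the LRU victim policy are all hypothetical choices made for illustration, and arbitration between competing hosts is left out.

```python
from collections import OrderedDict


class InclusiveSnoopFilter:
    """Toy inclusive snoop filter with a fixed number of entries (LRU victims assumed)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # address -> (state, set of host ids); dict order doubles as LRU order
        self.entries = OrderedDict()
        # (address, hosts) pairs that had to be back-invalidated to free an entry
        self.back_invalidates = []

    def lookup(self, addr):
        """Return (state, holders); no entry means the line is in I (SF miss)."""
        return self.entries.get(addr, ("I", frozenset()))

    def track(self, addr, host, state):
        """Record that `host` caches `addr` in `state` ('S' or 'A')."""
        if addr in self.entries:
            _, holders = self.entries.pop(addr)      # SF hit: refresh LRU position
        else:
            if len(self.entries) >= self.capacity:   # filter is full
                self._evict_victim()                 # SF victim
            holders = set()                          # SF alloc
        # Conflicts between hosts are not modeled: a real DCOH would snoop the
        # other holders before granting A to a new owner.
        holders = {host} if state == "A" else holders | {host}
        self.entries[addr] = (state, holders)

    def _evict_victim(self):
        """Drop the least recently used entry and invalidate its cached copies."""
        victim_addr, (_, holders) = self.entries.popitem(last=False)
        self.back_invalidates.append((victim_addr, frozenset(holders)))


sf = InclusiveSnoopFilter(capacity=2)
sf.track(0x00, host=0, state="S")
sf.track(0x40, host=1, state="A")
sf.track(0x80, host=0, state="S")   # filter full: 0x00 becomes the victim
print(sf.lookup(0x00))              # ('I', frozenset())
print(sf.back_invalidates)          # [(0, frozenset({0}))]
```

The third `track` call is the interesting one: host 0 loses its cached copy of line 0x00 only because the filter ran out of entries, which is why the filter size matters so much.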
ASI Protocol?
We usually describe cache coherence protocols as MSI, MESI, MOESI, or MESIF, and none of them has an A state. Why does CXL 3.0 use an A state that merges the M and E states?
In my opinion, on the host side, the host still uses a traditional protocol with M and E states to ensure cache coherence. It is the host’s responsibility to write dirty data back to the shared memory. The snoop filter, however, does not need to distinguish the E state from the M state:
- The snoop filter needs to record which host is able to modify a specific cacheline. When a host gets ownership of a cacheline, the snoop filter records the line in the A state, and the host can then manipulate the line without any further cache coherence transactions. Only one cache coherence transaction is needed. Even when the host drops the cacheline, it does not need a cache coherence transaction.
- However, if the snoop filter had to record the E and M states separately, each modification would need two cache coherence transactions (I->E and E->M).
- It is the host’s responsibility to write dirty data back to the shared memory, so the host needs to distinguish the E state from the M state in order to write back only dirty data and reduce overhead. The DCOH, however, always has to query the host or invalidate the cacheline when another host wants to read or write it, no matter whether the line is in E or M. Recording E/M separately in the snoop filter therefore brings no performance improvement.
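The argument in the list above can be checked with a small counting sketch. It assumes a simplified model in which the filter records a permission level per line and the host must send a request only when it needs more permission than the filter has recorded. The two tables and the `count_requests` helper are hypothetical names for illustration, and the write-back of dirty data is not counted.

```python
# Permission level the filter records for each host-side MESI state.
ASI_TRACKING = {"I": 0, "S": 1, "E": 2, "M": 2}   # E and M are both recorded as A
MESI_TRACKING = {"I": 0, "S": 1, "E": 2, "M": 3}  # hypothetical filter that separates E from M


def count_requests(host_states, tracking):
    """Count host transitions that exceed the level the filter has recorded."""
    recorded = 0      # the filter starts with the line in I
    requests = 0
    for state in host_states:
        level = tracking[state]
        if level > recorded:      # host needs more permission than recorded
            requests += 1
            recorded = level
        # Drops and downgrades are silent in this model: the filter simply
        # keeps a stale, conservative entry.
    return requests


# A host fetches a line for ownership, modifies it, then drops it.
write_path = ["E", "M", "I"]
print(count_requests(write_path, ASI_TRACKING))   # 1
print(count_requests(write_path, MESI_TRACKING))  # 2
```

With A-state tracking the E->M upgrade stays local to the host, so the whole sequence costs one request instead of two.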
Interaction between Host and Memory Device
3.3.2 in CXL 3.0 specification
Format of Request
When we want to read a cacheline, we need to send an M2S Request. The format of the M2S Request is shown below:
The most important fields are MetaField and MetaValue.
- The MemOpCode field determines the desired operation for this request.
- The available options include MemRd, MemRdData, MemInv, and others as listed in Table 3-29.
- MetaField has 2 bits, so it can encode 4 kinds of metadata for different purposes. (Currently only 00b and 11b are used.)
- The state of the cacheline in the snoop filter is then set to the Meta0-State value carried in MetaValue.
For example, if we send a MemRd request with MetaField (00b) and MetaValue (11b), we are asking to read the cacheline and to record “Shared” as its state in the snoop filter. As another example, if we send a MemRd request with MetaField (00b) and MetaValue (10b), we read the cacheline and take ownership of it, and its state in the snoop filter becomes “A”.
Details of the interaction
“Appendix C Memory Protocol Tables” in the CXL 3.0 specification describes the A/S/I interactions in a table (Table C-2).
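The two examples above can be written as a small decoding sketch. The 10b and 11b MetaValue encodings come from the text. The 00b -> I entry and the reading of MetaField 11b as "do not update the state" are assumptions based on my reading of the spec tables, so check them against the specification. The table and function names are hypothetical, and no flit bit layout is implied.

```python
# MetaField selects what MetaValue means; only 00b and 11b are in use.
META0_STATE = 0b00       # MetaValue carries a Meta0-State value
NO_OP = 0b11             # assumption: metadata is left unchanged

# Meta0-State encodings. 10b and 11b are from the text, 00b is an assumption.
META0_STATE_VALUES = {0b00: "I", 0b10: "A", 0b11: "S"}


def new_sf_state(meta_field, meta_value, current_state):
    """Return the snoop filter state after an M2S Request is processed."""
    if meta_field != META0_STATE:
        return current_state                 # no state update requested
    return META0_STATE_VALUES[meta_value]    # KeyError for unused encodings


print(new_sf_state(0b00, 0b11, "I"))  # S: read and share the line
print(new_sf_state(0b00, 0b10, "I"))  # A: read and take ownership
print(new_sf_state(0b11, 0b00, "S"))  # S: state is left as it was
```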
TL;DR: it depends. I don’t know.
3.3.3 in CXL Specification
Detailed performance implications of the implementation of an Inclusive Snoop Filter are beyond the scope of the specification, but high-level considerations are captured here:
• The number of cachelines that are tracked in an Inclusive Snoop Filter is determined based on host-processor caching of the address space. This is a function of the use model and the cache size in the host processor with upsizing of 4x or more. The 4x is based on an imprecise estimation of the unknowns in future host implementations and mismatch in Host cache ways/sectors as compared to Snoop-Filter ways/sectors.
• Device should have the capability to process Snoop Filter capacity evictions without immediately blocking the new M2S Req channel, and should ensure that blocking of the M2S Requests is a rare event.
• The state per cacheline could be implemented as 2 states or 3 states. For 2 states, it would track the host in I vs. A, where A-state would represent “Any” possible MESI state in the host. For 3 states, it would add the precision of S-state tracking in which the Host may have at most a shared copy of the cacheline.
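To get a feel for the 4x upsizing guidance, here is a back-of-the-envelope calculation. The 32 MiB host cache is an assumed example value, not a number from the specification, and the helper names are hypothetical. The storage figure counts only the 2-bit state and ignores address tags and owner/sharer tracking, so a real filter would be larger.

```python
LINE_BYTES = 64  # cacheline size


def snoop_filter_entries(host_cache_bytes, upsizing=4):
    """Entries needed to cover the host cache with the given upsizing factor."""
    return (host_cache_bytes // LINE_BYTES) * upsizing


def state_storage_bytes(entries, state_bits=2):
    """Bytes needed for the per-line state alone (no tags, no sharer list)."""
    return entries * state_bits // 8


host_cache = 32 * 1024 * 1024            # assumed 32 MiB host cache
entries = snoop_filter_entries(host_cache)
print(entries)                           # 2097152
print(state_storage_bytes(entries))      # 524288 (512 KiB)
```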
- How do we deal with the A state?
- I found a similar A state here (but the meaning of A is totally different):
- In the “Stale AtoS” BIOS option
The in-memory directory has three states: I, A, and S. I (invalid) state means the data is clean and does not exist in any other socket’s cache. The A (snoopAll) state means the data may exist in another socket in exclusive or modified state. S (Shared) state means the data is clean and may be shared across one or more socket’s caches. When doing a read to memory, if the directory line is in the A state we must snoop all the other sockets because another socket may have the line in modified state. If this is the case, the snoop will return the modified data. However, it may be the case that a line is read in A state and all the snoops come back a miss. This can happen if another socket read the line earlier and then silently dropped it from its cache without modifying it. If Stale AtoS feature is enabled, in the situation where a line in A state returns only snoop misses, the line will transition to S state. That way, subsequent reads to the line will encounter it in S state and not have to snoop, saving latency and snoop bandwidth. Stale AtoS may be beneficial in a workload where there are many cross-socket reads.
- The size of the snoop filter?
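The Stale AtoS behavior quoted above can be sketched as a single read step. This only models what the quoted description states: the function name and its arguments are hypothetical, and keeping the line in A when a snoop hits is my assumption, since the quote does not say what the directory does in that case.

```python
def directory_read(dir_state, snoop_hits, stale_atos=True):
    """Model one memory read against the in-memory directory.

    dir_state:  'I', 'A' or 'S'
    snoop_hits: one boolean per remote socket, True if that socket's snoop hit
    Returns (snoops_sent, new_dir_state).
    """
    if dir_state != "A":
        return 0, dir_state              # I and S reads do not snoop
    snoops_sent = len(snoop_hits)        # A state: snoop all other sockets
    if stale_atos and not any(snoop_hits):
        return snoops_sent, "S"          # stale A entry is downgraded to S
    return snoops_sent, "A"              # assumption: otherwise stay in A


print(directory_read("A", [False, False]))                    # (2, 'S')
print(directory_read("A", [False, False], stale_atos=False))  # (2, 'A')
print(directory_read("S", [False, False]))                    # (0, 'S')
```

After the first call the line sits in S, so a later read takes the third path and sends no snoops at all.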
Related work about cache coherence