Considering just a single core, we have 3 caches to consider. We have the L1 I cache, the L1 D cache, and the L2 "unified" cache. Of course we also have main memory.
When we add additional cores, they each have a private L1 I and D cache, but share the L2 "unified" cache with other cores.
Some systems (such as the 8 core Fire3 I have) have "clusters", typically of 4 cores, and each cluster has its own L2 unified cache.
Note that there is no "automatic" mechanism to ensure coherence between main memory and a D cache. This is expected in the case of device DMA. What may be surprising is that if one core has its D cache disabled and writes to main memory, the D cache of another core (with its D cache enabled) will not be informed. As far as the cache system is concerned, this is exactly the same as DMA writing to memory.
A variety of issues arise with self modifying code. I avoid it in my own programming, but debugging tools may modify code "on the fly" and require careful attention. Relocating code needs the same attention: the writes that copied the code will have gone through the D cache, so the I cache may be incoherent with what is actually in memory.
ARM documentation avoids the term "flush" (whose meaning varies from vendor to vendor) and defines the following operations:

Clean - write any dirty lines back to the next level of the memory system.
Invalidate - mark lines as invalid, discarding their contents.
Clean and invalidate - do both, in that order.

When a packet is to be transmitted by DMA, I do a clean on the buffer so that main memory holds what the CPU wrote.
In the other direction, when a packet has been received and deposited in main memory by DMA, I do an invalidate. It is possible to specify an address range for each action, so the entire cache does not need to get involved.
I don't believe I have ever had use for "clean and invalidate".
PoC is typically main memory (unless external caches exist, which is not the case on any system I work with). It is defined as the point at which all observers (every core and every DMA master) see the same copy of a memory location.
PoU can be (and usually is) more local. It is the point at which the I cache and D cache of a certain single core see the same copy. This will typically be the L2 cache on systems that have one (all of my systems have an L2 cache).
The inner/outer terminology typically distinguishes L1 (inner) from L2, along with L3 when it exists (outer). However there is more to it. Each of inner and outer cacheability can actually be one of 4 classifications:

Non-cacheable
Write-back, write-allocate
Write-through, no write-allocate
Write-back, no write-allocate
Fortunately my multi-cluster systems are all aarch64 based, so I won't have to deal with this muddled terminology. I can just view inner as L1 and outer as L2. Where I have run into this is when setting up MMU entries, where I am allowed to designate pages as being in one of these 4 classes.
This explanation is admittedly incomplete, as is my understanding at this point.
Tom's electronics pages / tom@mmto.org