Only Buffer When You Need To: Reducing On-chip GPU Traffic with Reconfigurable Local Atomic Buffers
atomicAdd(x,2)
operation atomicAdd(x,5)
operation atomicAdd(x,7)
). mem_order_comm atomic operations bypass the LAB.
loc = arr[tid];
// atomicAdd(&hist[loc], 1);
atomicAdd(&hist[loc], 1, mem_order_comm);
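The histogram update above can be mimicked on the host to show why the order of the increments does not affect the result. What follows is a minimal sketch in standard C++ rather than CUDA, so it runs without a GPU. The names `build_histogram`, `num_bins`, and `num_workers` are hypothetical and do not come from the paper. The three-argument `atomicAdd` in the snippet is not one of the standard CUDA atomic functions, so the sketch uses `std::memory_order_relaxed` purely as a stand-in. It is not equivalent to `mem_order_comm` and models nothing about the LAB hardware.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Host-side stand-in for the kernel body: each logical thread `tid`
// reads loc = arr[tid] and atomically increments hist[loc].
// ASSUMPTION: memory_order_relaxed is only an illustrative substitute
// for the paper's mem_order_comm; the two are not the same thing.
std::vector<int> build_histogram(const std::vector<int>& arr,
                                 std::size_t num_bins,
                                 unsigned num_workers) {
    assert(num_workers > 0);

    // Explicitly zero every bin before any worker starts.
    std::vector<std::atomic<int>> hist(num_bins);
    for (auto& bin : hist) bin.store(0, std::memory_order_relaxed);

    // Worker w handles tids w, w + num_workers, w + 2 * num_workers, ...
    auto worker = [&](unsigned w) {
        for (std::size_t tid = w; tid < arr.size(); tid += num_workers) {
            const int loc = arr[tid];
            assert(loc >= 0 && static_cast<std::size_t>(loc) < num_bins);
            // Increments commute, so any interleaving gives the same counts.
            hist[loc].fetch_add(1, std::memory_order_relaxed);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < num_workers; ++w) pool.emplace_back(worker, w);
    for (auto& t : pool) t.join();  // join() makes all increments visible

    // Copy the final counts into a plain vector for the caller.
    std::vector<int> result(num_bins);
    for (std::size_t b = 0; b < num_bins; ++b)
        result[b] = hist[b].load(std::memory_order_relaxed);
    return result;
}
```

Calling `build_histogram({0, 1, 1, 3, 3, 3}, 4, 3)` yields the counts 1, 2, 0, 3 regardless of how the three workers interleave, because each update is an atomic addition and additions commute.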
GPU Feature | Configuration (Size, Access Latency) |
---|---|
SMs | 80 |
# Registers / SM | 64 KB |
L1 Instruction Cache / SM | 128 KB |
L1 Data Cache / SM | 32 KB (max 128 KB), 28 cycles |
L2 Cache | 4.6 MB, 148 cycles |
MSHR | 256 (L1) and 192 (L2) entries |
Shared Memory Size / SM | 96 KB (max 128 KB), 19 cycles |
Memory | 16 GB HBM2, 248 cycles |