jun chen, 08/14/2025 04:50 PM
Chip design
Category | Goal | Non-goal |
---|---|---|
GPGPU targets | High-performance parallel computing, rendering, ray tracing | |
Standards | OpenCL 1.2, CUDA, Vulkan, Direct3D, OpenGL, OpenCL 3.0, SPIR-V | RISC-V, OpenCL 2 |
Performance | Performance | Area |
data model
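dem (data element): the smallest unit the GPGPU processes. A dem can be 8, 16, 32, or 64 bits wide.
dem vector: composed of 4 x dem8, 2 x dem16, 1 x dem32, or 1 x dem64.
dem array: composed of m dem vectors. An array may be 8 to 256 bits and is laid out as m rows by n columns, where n is the number of dems in each vector and m is the number of vectors.
dem fiber: distinguished by column? In a two-column dem16 dem array, the low 16 bits are the low dem and the high 16 bits are the high dem.
A single thread processes one dem vector, and one dem vector contains multiple dems.
During GPGPU processing, data (matrix/vector) is first stored in the TLRs and then computed on. A dem of 32 bits or fewer occupies one register; a dem64 occupies two registers.
Therefore an m-row, n-column dem array always fills one register per row, whether it uses dem8, dem16, or dem32.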
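The packing rules above can be illustrated with a minimal sketch. It assumes a 32-bit register, which is only inferred from the statement that dem32 and narrower fit in one register while dem64 needs two. The function names, the low-element-first bit order, and the low-register-first order for dem64 are illustrative assumptions, not part of the design.

```python
# Assumption: registers are 32 bits wide (inferred, not stated in the design).
REGISTER_BITS = 32
VALID_DEM_BITS = (8, 16, 32, 64)


def _check(dem_bits):
    if dem_bits not in VALID_DEM_BITS:
        raise ValueError(f"unsupported dem width: {dem_bits}")


def dems_per_vector(dem_bits):
    """n: 4 for dem8, 2 for dem16, 1 for dem32 and dem64."""
    _check(dem_bits)
    return max(1, REGISTER_BITS // dem_bits)


def registers_per_vector(dem_bits):
    """1 register for dem8/16/32, 2 registers for dem64."""
    _check(dem_bits)
    return max(1, dem_bits // REGISTER_BITS)


def pack_vector(dems, dem_bits):
    """Pack one dem vector into a list of register words.

    Assumption: element 0 sits in the lowest bits, and for dem64
    the low register comes first.
    """
    if len(dems) != dems_per_vector(dem_bits):
        raise ValueError("wrong number of dems for this width")
    dem_mask = (1 << dem_bits) - 1
    word = 0
    for i, dem in enumerate(dems):
        word |= (dem & dem_mask) << (i * dem_bits)
    reg_mask = (1 << REGISTER_BITS) - 1
    return [(word >> (r * REGISTER_BITS)) & reg_mask
            for r in range(registers_per_vector(dem_bits))]


def unpack_vector(registers, dem_bits):
    """Inverse of pack_vector."""
    word = 0
    for r, reg in enumerate(registers):
        word |= reg << (r * REGISTER_BITS)
    dem_mask = (1 << dem_bits) - 1
    return [(word >> (i * dem_bits)) & dem_mask
            for i in range(dems_per_vector(dem_bits))]


def registers_for_array(m, dem_bits):
    """Registers used by an m-row dem array (one vector per row)."""
    return m * registers_per_vector(dem_bits)
```

Under these assumptions, an array of m rows uses m registers for dem8, dem16, and dem32, and 2m registers for dem64.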
data processing hierarchy:
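The GPU system contains multiple compute nodes; each corresponds to a node-wide thread group (NTG), i.e. a virtual node.
Each compute node contains multiple devices; each corresponds to a device-wide thread group (DTG), i.e. a virtual device.
Each device contains multiple very wide compute units (VCUs); each corresponds to a VCU-wide thread group (VTG).
Each VCU contains multiple wide compute units (WCUs); each corresponds to a wide thread group (WTG).
Each WCU contains multiple basic compute units (CUs); each corresponds to a thread group (TG).
Each CU contains multiple execution units (EUs); each corresponds to one warp.
Each EU contains multiple processing lanes; each corresponds to one thread and processes one dem vector.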
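One way to read this hierarchy is as a mixed-radix address: a flat thread index decomposes into one coordinate per level, from node down to lane. The sketch below illustrates that reading only. The fan-out numbers in `EXAMPLE_FANOUT` are hypothetical placeholders, since the design does not state how many children each level has, and the function names are likewise assumptions.

```python
from math import prod

# Hierarchy levels from outermost to innermost, in the order listed above.
LEVELS = ("node", "device", "vcu", "wcu", "cu", "eu", "lane")

# Hypothetical fan-out (children per parent); these values are NOT from the design.
EXAMPLE_FANOUT = {
    "node": 2, "device": 4, "vcu": 4, "wcu": 2,
    "cu": 4, "eu": 8, "lane": 32,
}


def total_lanes(fanout):
    """Total number of processing lanes (threads) in the system."""
    return prod(fanout[level] for level in LEVELS)


def locate(flat_id, fanout):
    """Split a flat thread index into per-level coordinates."""
    if not 0 <= flat_id < total_lanes(fanout):
        raise ValueError("flat_id out of range")
    coords = {}
    # Peel off the innermost level first (lane), then move outward.
    for level in reversed(LEVELS):
        flat_id, coords[level] = divmod(flat_id, fanout[level])
    return {level: coords[level] for level in LEVELS}


def flatten(coords, fanout):
    """Inverse of locate: rebuild the flat thread index."""
    flat_id = 0
    for level in LEVELS:
        flat_id = flat_id * fanout[level] + coords[level]
    return flat_id
```

With the placeholder fan-out, `locate(33, EXAMPLE_FANOUT)` lands on lane 1 of EU 1 in the first CU, which mirrors how a warp maps onto the lanes of one EU.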
data memory model
register file level:
Thread-local registers (TLR)
Closed segment registers (x0)
Staging registers (g0-g15)
Warp scalar registers (WSR)
Constant scalar registers (CSR)
Group shared memory (GSM)
GEMM input buffer (GIB)
GEMM main buffer (GMB)
GEMM reduction buffer (GRB)
cache level
L1 cache
L2 cache
HBM level
buffer memory objects
image memory objects
matrix memory objects
global memory buffers (GLM)
constant buffers (CB)
Thread-local memory (TLM)
Updated by jun chen 7 days ago · 3 revisions