Revision 1 (jun chen, 08/14/2025 03:30 PM) → Revision 2/3 (jun chen, 08/14/2025 04:50 PM)
# Chip Design

| | Goal | Non-goal |
|--|--|--|
| GPGPU targets | High-performance parallel computing, rendering, ray tracing | |
| Standards | OpenCL 1.2, CUDA, Vulkan, Direct3D, OpenGL, OpenCL 3, SPIR-V | RISC-V, OpenCL 2 |
| Performance | Performance | Area |

## data model

dem (data element): the smallest unit the GPGPU processes. A dem can be 8, 16, 32, or 64 bits wide.

dem vector: consists of 4 x dem8, 2 x dem16, 1 x dem32, or 1 x dem64.

dem array: consists of m dem vectors. An array may be 8 to 256 bits and is organized as m rows by n columns, where n is the number of dems in each vector and m is the number of vectors.

dem fiber: distinguished by column? In the case of a two-column dem16 dem array, the low 16 bits are the low dem and the high 16 bits are the high dem.

A single thread processes one dem vector, and one dem vector contains multiple dems.

During GPGPU processing, data (matrix/vector) is first stored into the TLRs and then computed on. A dem of 32 bits or fewer occupies one register; a dem64 occupies two registers. Therefore, in an m-row, n-column dem array, each row fills exactly one register, whether the array uses dem8, dem16, or dem32.

## data process hierarchy

```
The GPU system contains multiple "compute nodes (nodes) -> corresponds to a node-wide thread group (NTG), virtual node"
Each compute node contains multiple "devices -> corresponds to a device-wide thread group (DTG), virtual device"
Each device contains multiple "very wide compute units (VCUs) -> corresponds to a VCU-wide thread group (VTG)"
Each VCU contains multiple "wide compute units (WCUs) -> corresponds to a wide thread group (WTG)"
Each WCU contains multiple "basic compute units (CUs) -> corresponds to a thread group (TG)"
Each CU contains multiple "execution units (EUs) -> corresponds to one warp"
Each EU contains multiple "processing lanes -> corresponds to one thread, which processes one dem vector"
```

## data memory model

register file level:

```
Thread-local registers (TLR)
Closed segment registers (x0)
Staging registers (g0-g15)
Warp scalar registers (WSR)
Constant scalar registers (CSR)
Group shared memory (GSM)
GEMM input buffer (GIB)
GEMM main buffer (GMB)
GEMM reduction buffer (GRB)
```

cache level:

```
L1 cache
L2 cache
```

HBM level:

```
buffer memory objects
image memory objects
matrix memory objects
global memory buffers (GLM)
constant buffers (CB)
Thread-local memory (TLM)
```
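To make the register rule in the data model section concrete, here is a minimal Python sketch of how one dem vector might be packed into registers. It rests on assumptions the page does not state: registers are taken to be 32 bits wide (inferred from 4 x dem8 = 2 x dem16 = 1 x dem32 filling one register), a dem64 is assumed to place its low word in the first of its two registers, and the names `pack_vector`, `unpack_vector`, and `registers_for_array` are hypothetical. Only the "low bits hold the low dem" ordering comes from the dem16 description.

```python
REG_BITS = 32  # assumed register width; not stated explicitly on this page
REG_MASK = (1 << REG_BITS) - 1

# Number of dems per dem vector for each dem width, as listed in the data model.
DEMS_PER_VECTOR = {8: 4, 16: 2, 32: 1, 64: 1}


def regs_per_vector(width):
    """Registers used by one dem vector: one for dem8/16/32, two for dem64."""
    if width not in DEMS_PER_VECTOR:
        raise ValueError(f"unsupported dem width: {width}")
    return max(1, width // REG_BITS)


def pack_vector(dems, width):
    """Pack one dem vector into a list of register words, low dem in the low bits."""
    count = regs_per_vector(width)
    if len(dems) != DEMS_PER_VECTOR[width]:
        raise ValueError(f"dem{width} vector needs {DEMS_PER_VECTOR[width]} dems")
    mask = (1 << width) - 1
    value = 0
    for lane, dem in enumerate(dems):
        value |= (dem & mask) << (lane * width)
    # Assumption: for dem64 the low 32 bits go into the first register.
    return [(value >> (i * REG_BITS)) & REG_MASK for i in range(count)]


def unpack_vector(regs, width):
    """Inverse of pack_vector: recover the dems from the register words."""
    if len(regs) != regs_per_vector(width):
        raise ValueError("wrong number of registers for this dem width")
    value = 0
    for i, reg in enumerate(regs):
        value |= (reg & REG_MASK) << (i * REG_BITS)
    mask = (1 << width) - 1
    return [(value >> (lane * width)) & mask for lane in range(DEMS_PER_VECTOR[width])]


def registers_for_array(m, width):
    """Registers needed to hold a dem array of m vectors (m rows)."""
    return m * regs_per_vector(width)
```

Under these assumptions, `pack_vector([0x1234, 0xABCD], 16)` places the low dem `0x1234` in bits 0 to 15 and the high dem `0xABCD` in bits 16 to 31, and an array of m rows needs m registers for dem8, dem16, or dem32 but 2m registers for dem64.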