jun chen, 08/14/2025 04:50 PM


Chip Design

Goal NoGoal
GPGPU target high-performance parallel computing, Rendering, Ray tracing
Standards OpenCL 1.2, CUDA, Vulkan, Direct3D, OpenGL, OpenCL 3. SPIR-V RISC-V, OpenCL 2
Performance Performance area

data model

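dem (data element): the smallest unit the GPGPU processes. A dem can be 8, 16, 32, or 64 bits.
dem vector: can consist of 4 x dem8, 2 x dem16, 1 x dem32, or 1 x dem64.
dem array: composed of m dem vectors. An array may be 8 to 256 bits, arranged as m rows by n columns, where n is the number of dems in each vector and m is the number of vectors.
dem fiber: distinguished by column? In a two-column dem16 dem array, the low 16 bits are the low dem and the high 16 bits are the high dem.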
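
The packing rules above can be illustrated with a minimal Python sketch. It assumes a 32-bit dem vector, which is inferred from the 4 x dem8, 2 x dem16, 1 x dem32 counts rather than stated outright, and it places column 0 in the least significant bits so that the low 16 bits hold the low dem. The names `VECTOR_BITS`, `dems_per_vector`, `pack_dems`, and `unpack_dems` are hypothetical and not part of the design.

```python
# Sketch only: VECTOR_BITS = 32 is an assumption inferred from the
# "4 x dem8, 2 x dem16, 1 x dem32" counts; all helper names are hypothetical.
DEM_WIDTHS = (8, 16, 32, 64)
VECTOR_BITS = 32


def dems_per_vector(width):
    """Number of dems in one dem vector (the n columns of a dem array)."""
    if width not in DEM_WIDTHS:
        raise ValueError(f"unsupported dem width: {width}")
    # A dem64 is wider than the assumed 32-bit vector, so it still counts as one dem.
    return max(1, VECTOR_BITS // width)


def pack_dems(dems, width):
    """Pack one dem vector into an integer, column 0 in the lowest bits."""
    if len(dems) != dems_per_vector(width):
        raise ValueError("wrong number of dems for this width")
    mask = (1 << width) - 1
    word = 0
    for column, dem in enumerate(dems):
        word |= (dem & mask) << (column * width)
    return word


def unpack_dems(word, width):
    """Split a packed dem vector back into its dems, lowest column first."""
    mask = (1 << width) - 1
    count = dems_per_vector(width)
    return [(word >> (column * width)) & mask for column in range(count)]
```

For example, `pack_dems([0x1234, 0xABCD], 16)` yields `0xABCD1234`, so `0x1234` is the low dem and `0xABCD` is the high dem.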

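A single thread processes one dem vector, and a dem vector contains multiple dems.
During GPGPU processing, data (matrix/vector) is first stored into the TLRs and then computed on. A dem of 32 bits or fewer occupies one register; a dem64 occupies two registers.
As a result, in an m-row, n-column dem array, each row fills exactly one register, whether the array uses dem8, dem16, or dem32.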
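
As a rough illustration of the register accounting, the sketch below counts the TLRs needed for a dem array of m rows. It assumes each row is one dem vector, and that a dem64 row (one dem64 per vector) takes two registers, which follows from the rules above but is not spelled out for arrays. The function names are hypothetical.

```python
# Sketch only: function names are hypothetical; the dem64 case for whole
# arrays is an assumption derived from "dem64 occupies two registers".
def registers_per_row(width):
    """Registers used by one row (one dem vector) of a dem array."""
    if width not in (8, 16, 32, 64):
        raise ValueError(f"unsupported dem width: {width}")
    # dem8, dem16 and dem32 rows each fill one register; a dem64 row needs two.
    return 2 if width == 64 else 1


def registers_for_array(m, width):
    """Total thread-local registers for a dem array with m rows."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return m * registers_per_row(width)
```

Under these assumptions an 8-row array costs 8 registers for dem8, dem16, or dem32, and 16 registers for dem64.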

data processing hierarchy:

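The GPU system contains multiple    "compute nodes (nodes)  -> corresponds to a node-wide thread group (NTG), virtual node"
Each compute node contains multiple "devices  ->  corresponds to a device-wide thread group (DTG), virtual device"
Each device contains multiple       "very wide compute units (VCUs)  -> corresponds to a VCU-wide thread group (VTG)"
Each VCU contains multiple          "wide compute units (WCUs) -> corresponds to a wide thread group (WTG)"
Each WCU contains multiple          "basic compute units (CUs)  -> corresponds to a thread group (TG)"
Each CU contains multiple           "execution units (EUs)  -> corresponds to one warp"
Each EU contains multiple           "processing lanes  ->  each corresponds to one thread, which processes one dem vector"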
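
One way to picture the hierarchy is as a seven-level coordinate system that flattens to a single global thread index. The sketch below is illustrative only: the level list mirrors the text, but the row-major ordering, the `flat_thread_index` name, and the per-level counts in the usage note are hypothetical and not taken from the design.

```python
# Sketch only: the ordering scheme and function name are assumptions.
# Each entry pairs a hardware level with the thread grouping it maps to.
LEVELS = [
    ("node", "NTG"),
    ("device", "DTG"),
    ("VCU", "VTG"),
    ("WCU", "WTG"),
    ("CU", "TG"),
    ("EU", "warp"),
    ("lane", "thread"),
]


def flat_thread_index(coords, counts):
    """Flatten per-level coordinates (outermost first) into one thread index.

    counts[i] is how many level-i units sit inside one unit of the level
    above it (counts[0] is the number of nodes in the system).
    """
    if len(coords) != len(LEVELS) or len(counts) != len(LEVELS):
        raise ValueError("expected one entry per hierarchy level")
    index = 0
    for (name, _group), coord, count in zip(LEVELS, coords, counts):
        if not 0 <= coord < count:
            raise ValueError(f"{name} coordinate {coord} out of range")
        # Row-major accumulation: outer levels vary slowest.
        index = index * count + coord
    return index
```

With hypothetical counts of `[2, 2, 2, 2, 2, 4, 32]`, lane 0 of EU 1 in the first CU has index 32, and the last lane in the system has index 4095.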

data memory model

register file level:

Thread-local registers (TLR)
Closed segment registers (x0)
Staging registers (g0-g15)
Warp scalar registers (WSR)
Constant scalar registers (CSR)
Group shared memory (GSM)
GEMM input buffer (GIB)
GEMM main buffer (GMB)
GEMM reduction buffer (GRB)

cache level

L1 cache
L2 cache

HBM level

buffer memory objects
image memory objects
matrix memory objects
global memory buffers (GLM)
constant buffers (CB)
Thread-local memory (TLM)
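
The three levels of the memory model can be captured in a small lookup table, for example to tag a resource with the level it lives at. This is a sketch of one possible encoding; the `Level` enum, `MEMORY_SPACES`, and `level_of` are hypothetical names, and entries without an abbreviation in the list above are keyed by their full name.

```python
from enum import Enum


# Sketch only: the enum and the lookup helper are hypothetical; the entries
# simply restate the lists in this section.
class Level(Enum):
    REGISTER_FILE = "register file"
    CACHE = "cache"
    HBM = "HBM"


MEMORY_SPACES = {
    Level.REGISTER_FILE: [
        "TLR", "x0", "g0-g15", "WSR", "CSR", "GSM", "GIB", "GMB", "GRB",
    ],
    Level.CACHE: ["L1 cache", "L2 cache"],
    Level.HBM: [
        "buffer memory objects", "image memory objects",
        "matrix memory objects", "GLM", "CB", "TLM",
    ],
}


def level_of(space):
    """Return the memory-model level that a named space belongs to."""
    for level, spaces in MEMORY_SPACES.items():
        if space in spaces:
            return level
    raise KeyError(f"unknown memory space: {space}")
```

For instance, `level_of("GSM")` returns `Level.REGISTER_FILE`, while `level_of("TLM")` returns `Level.HBM`.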

Updated by jun chen 7 days ago · 3 revisions