Version 3 - History - Chip Design - AI_ML - Redmine

jun chen, 08/14/2025 09:38 PM

+# Chip Design
+{{toc}}
 # spec
+|  | Goal | Non-goal |
 |--|--|--|
 | GPGPU goals | High-performance parallel computing, rendering, ray tracing |  |
 | Standards | OpenCL 1.2, CUDA, Vulkan, Direct3D, OpenGL, OpenCL 3, SPIR-V | RISC-V, OpenCL 2 |
 | Performance | Performance | area  |
 |  |  |  |
+# data
 ## Number representation
 **bf16, fp16, bf20?**
 fp4 (for low-precision training)
 https://mp.weixin.qq.com/s?__biz=Mzg2NzQ5Mzc5OA==&mid=2247488906&idx=1&sn=9d9d82388e5e0416ebe830feca38ab34&chksm=cf2b8c5a7f460730e32e10c24c08fcd341ad2cfcec6d4b43055db3a4f0321a6414185a0517b9&mpshare=1&scene=1&srcid=08138wKNQeeCkCgRIiW8Hr0j&sharer_shareinfo=183d24dc5d9c11986c8672ec7402b507&sharer_shareinfo_first=bef240b79088637f34baf6bb3c5e8870#rd
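
To make the bf16 and fp16 entries above concrete, here is a minimal sketch of how the two 16-bit encodings can be derived from a Python float. The function names are hypothetical. The bf16 conversion uses plain truncation of the float32 encoding (real hardware may round instead), and bf20 and fp4 are left out because this page does not define their bit layouts.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Return the bf16 bit pattern of x by truncating its float32 encoding."""
    (f32_bits,) = struct.unpack("<I", struct.pack("<f", x))
    # bf16 keeps the sign bit, the 8 exponent bits and the top 7 mantissa bits.
    return f32_bits >> 16


def float_to_fp16_bits(x: float) -> int:
    """Return the IEEE 754 half-precision (fp16) bit pattern of x."""
    (f16_bits,) = struct.unpack("<H", struct.pack("<e", x))
    return f16_bits
```

The two formats trade range for precision differently: bf16 keeps the float32 exponent width, while fp16 has a narrower exponent and a wider mantissa.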
+## data model
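 A dem (data element) is the smallest unit the GPGPU processes; a dem can be 8, 16, 32, or 64 bits wide.
 dem vector: consists of 4 x dem8, 2 x dem16, 1 x dem32, or 1 x dem64.
 dem array: made up of m dem vectors, arranged as m rows by n columns, where n is the number of dems in each vector and m is the number of vectors. An array may be 8 to 256 bits.
 dem fiber: distinguished by column? For a two-column dem16 array, the low 16 bits are the low dem and the high 16 bits are the high dem.
 A single thread processes one dem vector; a dem vector contains multiple dems.
 During GPGPU processing, data (matrix/vector) is first stored into TLRs and then computed on. A dem of 32 bits or fewer occupies one register; a dem64 occupies two registers.
 Therefore, for an m-row, n-column dem array, whether it uses dem8, dem16, or dem32, each row fills one register.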
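
The packing rule above can be sketched as follows. This is an illustration only: the function name is hypothetical, and it assumes that the first dem sits in the lowest bits (as the dem16 fiber note suggests) and that a dem64 is stored low word first, which the page does not state.

```python
def pack_dem_vector(dems, width):
    """Pack one dem vector into a list of 32-bit register words."""
    dems_per_vector = {8: 4, 16: 2, 32: 1, 64: 1}
    if width not in dems_per_vector:
        raise ValueError("dem width must be 8, 16, 32 or 64 bits")
    if len(dems) != dems_per_vector[width]:
        raise ValueError("wrong number of dems for this width")

    mask = (1 << width) - 1
    packed = 0
    for position, dem in enumerate(dems):
        # Assumption: dem 0 occupies the lowest bits of the register.
        packed |= (dem & mask) << (position * width)

    if width == 64:
        # A dem64 spans two registers (assumed order: low word, high word).
        return [packed & 0xFFFFFFFF, packed >> 32]
    return [packed]
```

For example, two dem16 values `0x1234` and `0xABCD` end up in one register with `0x1234` in the low half.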
 ## data process hierarchy:
 ```
+The GPU system contains multiple    "compute nodes (nodes)  -> maps to a node-wide thread group (NTG), a virtual node"
 Each compute node contains multiple "devices  ->  maps to a device-wide thread group (DTG), a virtual device"
 Each device contains multiple       "very wide compute units (VCUs)  -> maps to a VCU-wide thread group (VTG)"
 Each VCU contains multiple          "wide compute units (WCUs) -> maps to a wide thread group (WTG)"
 Each WCU contains multiple          "basic compute units (CUs)  -> maps to a thread group (TG)"
 Each CU contains multiple           "execution units (EUs)  -> maps to one warp"
 Each EU contains multiple           "processing lanes  ->  each maps to one thread, which processes one dem vector"
 ```
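
One way to read the hierarchy above is as a mixed-radix coordinate system: a thread is identified by its position at each of the seven levels. The sketch below flattens such a coordinate into a single global thread index. The function name and the fan-out values used in the example are hypothetical; the page does not give the number of units per level.

```python
# Hardware level and the thread grouping it maps to, outermost first.
HIERARCHY = [
    ("node", "NTG"),
    ("device", "DTG"),
    ("VCU", "VTG"),
    ("WCU", "WTG"),
    ("CU", "TG"),
    ("EU", "warp"),
    ("lane", "thread"),
]


def flat_thread_id(coords, fanout):
    """Flatten per-level coordinates into one global thread index."""
    if len(coords) != len(HIERARCHY) or len(fanout) != len(HIERARCHY):
        raise ValueError("need one coordinate and one fan-out per level")
    index = 0
    for coord, count in zip(coords, fanout):
        if not 0 <= coord < count:
            raise ValueError("coordinate out of range for its level")
        index = index * count + coord
    return index
```

With a hypothetical fan-out of `[2, 2, 2, 2, 2, 2, 4]`, the last lane of the last EU in the last CU maps to index 255, the total thread count minus one.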
 ## data memory model
+### register file level:
+Each register can hold one dem32 or two dem16s.
 Immediate numbers include 0 ~ 10, 1/2 ~ 10/1, pi, and so on; there are currently 32 in total.
 Mask registers: each thread uses the 1-bit flags m0 and m1 as its activate/deactivate status bits.
 Control registers
+```
+The per-thread m0/m1 bits are combined into the warp mask registers wm0 and wm1. m0 and m1 are set to 1 when a thread starts and to 0 when a warp starts.
 Descriptor offset registers (DORs): descriptor registers operated on by the instruction scheduler
 Warp-scalar descriptor registers (WSDRs): store resource descriptors
+```
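
The relationship between the per-thread mask bits and the warp mask registers can be sketched as below. This is a minimal illustration with hypothetical function names; it assumes that lane i maps to bit i of the warp mask, and the warp size is left to the caller because the page does not state it.

```python
def build_warp_mask(thread_bits):
    """Combine per-thread 1-bit flags (m0 or m1) into one warp mask word."""
    mask = 0
    for lane, bit in enumerate(thread_bits):
        # Assumption: lane i contributes bit i of the warp mask register.
        mask |= (bit & 1) << lane
    return mask


def active_lanes(warp_mask, warp_size):
    """List the lanes whose mask bit is set."""
    return [lane for lane in range(warp_size) if (warp_mask >> lane) & 1]
```

For a four-lane example with flags `[1, 0, 1, 1]`, the combined mask is `0b1101` and lanes 0, 2 and 3 are active.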
+Data registers
 ```
 Thread-local registers (TLR)   (32-bit, r0-r511, accessed via indirect addressing)
 Closed segment registers (x0)  (queue of four 32-bit entries per thread; temp registers held exclusively while the warp runs)
 Staging registers (g0-g15)     (hold the operands of LSA instructions; an LSA instruction executes only after g0-g15 are ready)
 Warp scalar registers (WSR)    (32-bit, q0-q63, indirect addressing). Note: data is stored inside the WSR in bf20 format and converted from/to bf16/fp16 on the way in and out.
 Constant scalar registers (CSR) (32-bit, c0-c1023). Note: data is stored inside the CSR in bf20 format.
 ```
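
As a quick check of the register ranges listed above, the following sketch decodes a register name into its bank and index. The bank sizes come from the list; the function name, the table name, and the textual name format (a letter followed by a decimal index) are assumptions made for illustration.

```python
import re

# Prefix letter -> (bank name, number of registers), taken from the list above.
REGISTER_BANKS = {
    "r": ("TLR", 512),
    "x": ("closed segment", 1),
    "g": ("staging", 16),
    "q": ("WSR", 64),
    "c": ("CSR", 1024),
}


def decode_register(name):
    """Return (bank, index) for a name such as 'r12', or raise ValueError."""
    match = re.fullmatch(r"([rxgqc])(\d+)", name)
    if match is None:
        raise ValueError(f"not a register name: {name!r}")
    bank, count = REGISTER_BANKS[match.group(1)]
    index = int(match.group(2))
    if index >= count:
        raise ValueError(f"{name!r} is outside the {bank} range")
    return bank, index
```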
+cache level
 ```
 L1 cache
 L2 cache
 ```
+HBM level (Memory level, can be accessed by program)
 Three categories: buffers (TLM, GSM, GLM, CB, and MBUF), addressed by byte address
 images, addressed by coordinates
 memory, accessed through descriptors (memory objects)
+```
+buffer memory objects (MBUF)
 image memory objects   store 1D images (M1D), 1D image arrays (M1DA), M2D, M2DA, M3D
+matrix memory objects
+global memory buffers (GBM)
 constant buffers (CB)        constant, read-only
 Thread-local memory (TLM)   used exclusively by a thread as its local memory
 Group shared memory (GSM)      (accessed exclusively by each thread group)
 GEMM input buffer (GIB)
 GEMM main buffer (GMB)
 GEMM reduction buffer (GRB)
 ```
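
To contrast byte addressing with coordinate addressing, here is a small sketch that resolves a 2D image coordinate to a byte address through a descriptor. Everything in it is hypothetical: the descriptor fields, the names, and the row-major layout are assumptions, since the page does not describe the descriptor format or the image layout.

```python
from dataclasses import dataclass


@dataclass
class Image2DDescriptor:
    """Hypothetical descriptor for an M2D image memory object."""
    base: int             # byte address of texel (0, 0)
    width: int            # texels per row
    height: int           # number of rows
    bytes_per_texel: int


def texel_byte_address(desc, x, y):
    """Map image coordinates to a byte address (assumed row-major layout)."""
    if not (0 <= x < desc.width and 0 <= y < desc.height):
        raise ValueError("coordinate outside the image")
    return desc.base + (y * desc.width + x) * desc.bytes_per_texel
```

A buffer access would skip this step and use the byte address directly.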
 # instruction
 Instructions are 64 or 128 bits wide.
 ## warp scalar instructions
 CT (control transfer), LSA (load-store-atomic) including fence/flush/invalidate,
 Warp scalar instruction: 64 bits. Bits 0~31 hold the flag bits, opcode, (destination register index), src1, and src2; bits 32~63 hold the immediate number.
 ALU opcodes include scalar nop, move, add, ... and, move, atomic operations, and so on; they currently use opcodes 0~31, 96, and 97.
 CT opcodes include JUMP, CALL, RET, BNZ, END, and so on; they currently use opcodes 32~63.
 LSA opcodes include ACK, FLUSH, invalidate cache, config, and so on; they use opcodes 64~82.
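
The opcode ranges and the 64-bit split described above can be sketched as follows. The function names are hypothetical. The sketch only separates the low 32-bit half from the immediate, because the exact bit positions of the flags, opcode, and register fields inside the low half are not given on this page.

```python
ALU_OPCODES = set(range(0, 32)) | {96, 97}
CT_OPCODES = set(range(32, 64))
LSA_OPCODES = set(range(64, 83))


def classify_opcode(opcode):
    """Return the instruction class of a warp scalar opcode."""
    if opcode in ALU_OPCODES:
        return "ALU"
    if opcode in CT_OPCODES:
        return "CT"
    if opcode in LSA_OPCODES:
        return "LSA"
    raise ValueError(f"opcode {opcode} is not assigned")


def split_scalar_instruction(word):
    """Split a 64-bit instruction into (low 32 bits, immediate)."""
    if not 0 <= word < (1 << 64):
        raise ValueError("instruction word must fit in 64 bits")
    low_half = word & 0xFFFFFFFF   # flags, opcode, dst, src1, src2
    immediate = word >> 32         # bits 32~63
    return low_half, immediate
```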
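 ## warp SIMT instructions
 ALU, LSA, SFU (special function unit)
 Instructions are tied to threads through the mask. A thread executes an instruction only after its mask is changed to enable it; otherwise it does not execute.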
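
Mask-gated execution can be illustrated with a per-lane add in which disabled lanes keep their previous destination value. This is a software sketch with a hypothetical function name; it assumes that bit i of the mask controls lane i.

```python
def simt_add(dst, a, b, mask):
    """Per-lane add: lanes with a set mask bit get a + b, others keep dst."""
    return [
        x + y if (mask >> lane) & 1 else old
        for lane, (old, x, y) in enumerate(zip(dst, a, b))
    ]
```

With mask `0b0101`, only lanes 0 and 2 are updated; lanes 1 and 3 retain whatever `dst` held before.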
 ### SFU operations
 Performs single-operand math, for example sine, cosine, log2, e^2, sigmoid, etc.
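
A software reference for a few of the single-operand functions named above could look like the sketch below. The operation names and the dispatch table are hypothetical, and the results are full double precision rather than whatever precision the hardware unit would deliver.

```python
import math

# Single-operand reference functions; the key names are hypothetical.
SFU_OPS = {
    "sine": math.sin,
    "cosine": math.cos,
    "log2": math.log2,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
}


def sfu(op, x):
    """Apply one single-operand special function to x."""
    try:
        function = SFU_OPS[op]
    except KeyError:
        raise ValueError(f"unknown SFU operation: {op!r}") from None
    return function(x)
```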
 ### After an ALU instruction completes, its result can be written back along these paths:
 ```
 TLR/x0/WSR/WMx (fused = 0, staging = 0: not written to the staging buffer and not forwarded for fusion)
 SFU (fused = 1, staging = 0: not written to the buffer; fused computation)
 LSA/TEX/MAT (staging = 1, shader stagingcnt > 0: written out to memory)
 FB (frame buffer), staging = 1, shader = FS (fragment shader)
 VB (vertex buffer), staging = 1, shader != FS
+```
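
The write-back conditions listed above can be expressed as a small selection function. This is one possible reading, not a confirmed rule: the function and parameter names are hypothetical, and the order in which the three staging = 1 cases are tested (stagingcnt first, then shader type) is an assumption, since the list does not say how they are prioritized.

```python
def select_writeback_path(fused, staging, stagingcnt=0, shader=None):
    """Pick the ALU write-back destination from the control bits."""
    if staging == 0:
        # No staging buffer involved: register write-back or fused forward.
        return "SFU" if fused == 1 else "TLR/x0/WSR/WMx"
    # staging == 1 from here on (assumed priority: stagingcnt, then shader).
    if stagingcnt > 0:
        return "LSA/TEX/MAT"
    if shader == "FS":
        return "FB"
    return "VB"
```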
