# Chip design

Version 3, jun chen, 08/14/2025 09:38 PM

{{toc}}

# spec

| | Goal | Non-goal |
|--|--|--|
| GPGPU goals | high-performance parallel computing, rendering, ray tracing | |
| Standards | OpenCL 1.2, CUDA, Vulkan, Direct3D, OpenGL, OpenCL 3, SPIR-V | RISC-V, OpenCL 2 |
| Performance | performance | area |

# data

## number representation

**bf16, fp16, bf20?**
fp4 (for low-precision training)

https://mp.weixin.qq.com/s?__biz=Mzg2NzQ5Mzc5OA==&mid=2247488906&idx=1&sn=9d9d82388e5e0416ebe830feca38ab34&chksm=cf2b8c5a7f460730e32e10c24c08fcd341ad2cfcec6d4b43055db3a4f0321a6414185a0517b9&mpshare=1&scene=1&srcid=08138wKNQeeCkCgRIiW8Hr0j&sharer_shareinfo=183d24dc5d9c11986c8672ec7402b507&sharer_shareinfo_first=bef240b79088637f34baf6bb3c5e8870#rd

## data model

A dem (data element) is the smallest unit the GPGPU processes. A dem can be 8, 16, 32, or 64 bits wide.
dem vector: composed of 4 x dem8, 2 x dem16, 1 x dem32, or 1 x dem64.
dem array: composed of m dem vectors. An array may be 8 to 256 bits and has m rows and n columns, where n is the number of dems in each vector and m is the number of vectors.
dem fiber: split by column? In a dem16 dem array with two columns, the low 16 bits are the low dem and the high 16 bits are the high dem.

A single thread processes one dem vector, and a dem vector contains one or more dems.
During GPGPU processing, data (matrix/vector) is first stored in TLRs and then computed on. A dem of 32 bits or fewer occupies one register; a dem64 occupies two registers.
Therefore each row of an m-row, n-column dem array fills one register, whether the array uses dem8, dem16, or dem32.
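
The packing rule above can be illustrated with a minimal Python sketch. It assumes dem 0 sits in the least-significant bits (consistent with the dem16 "low dem" note) and that the low half of a dem64 goes into the first of its two registers. Both orderings and all function names are assumptions for illustration, not part of the spec.

```python
# Sketch: pack a dem vector into one 32-bit register image.
# Assumption: dem 0 occupies the least-significant bits of the register.

def pack_dem_vector(dems, dem_bits):
    """Pack equally sized dems into a single 32-bit word, lowest dem first."""
    if dem_bits not in (8, 16, 32):
        raise ValueError("only dem8, dem16 and dem32 fit in one 32-bit register")
    if len(dems) != 32 // dem_bits:
        raise ValueError("a dem vector holds exactly 32 // dem_bits dems")
    mask = (1 << dem_bits) - 1
    word = 0
    for lane, value in enumerate(dems):
        word |= (value & mask) << (lane * dem_bits)
    return word


def unpack_dem_vector(word, dem_bits):
    """Split a 32-bit word back into its dems, lowest dem first."""
    mask = (1 << dem_bits) - 1
    return [(word >> (lane * dem_bits)) & mask for lane in range(32 // dem_bits)]


def split_dem64(value):
    """A dem64 occupies two registers; return (low_word, high_word).
    Assumption: the low 32 bits go into the first register."""
    return value & 0xFFFFFFFF, (value >> 32) & 0xFFFFFFFF
```

For example, `pack_dem_vector([0xBEEF, 0xDEAD], 16)` yields `0xDEADBEEF`: two dem16 values fill the same single register that four dem8 values or one dem32 would.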

## data process hierarchy:

```
The GPU system contains multiple compute nodes -> each maps to a node-wide thread group (NTG), a virtual node
Each compute node contains multiple devices -> each maps to a device-wide thread group (DTG), a virtual device
Each device contains multiple very wide compute units (VCUs) -> each maps to a VCU-wide thread group (VTG)
Each VCU contains multiple wide compute units (WCUs) -> each maps to a wide thread group (WTG)
Each WCU contains multiple basic compute units (CUs) -> each maps to a thread group (TG)
Each CU contains multiple execution units (EUs) -> each maps to one warp
Each EU contains multiple processing lanes -> each maps to one thread, which processes one dem vector
```
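
To make the nesting concrete, the sketch below decomposes a flat thread id into one index per level. The fan-out numbers are hypothetical placeholders, since this page gives no counts. The only value taken from the page is 32 lanes per EU, which follows from the 32 mask bits per warp mask register described under the register file level.

```python
# Hypothetical fan-out per level (children per parent), outermost level first.
# Only the 32 lanes per EU is derived from this page; the rest are placeholders.
LEVELS = [
    ("node", 2),    # compute nodes in the system
    ("device", 4),  # devices per node
    ("vcu", 4),     # VCUs per device
    ("wcu", 4),     # WCUs per VCU
    ("cu", 4),      # CUs per WCU
    ("eu", 4),      # EUs per CU, one warp each
    ("lane", 32),   # processing lanes per EU, one thread each
]


def total_threads(levels=LEVELS):
    """Number of hardware threads implied by the fan-out table."""
    total = 1
    for _, count in levels:
        total *= count
    return total


def locate(thread_id, levels=LEVELS):
    """Decompose a flat thread id into per-level indices, outermost level first."""
    if not 0 <= thread_id < total_threads(levels):
        raise ValueError("thread id out of range")
    coords = {}
    for name, count in reversed(levels):
        thread_id, coords[name] = divmod(thread_id, count)
    return {name: coords[name] for name, _ in levels}
```

With these placeholder counts, `locate(33)` lands in lane 1 of EU 1 inside the first CU, which shows how consecutive thread ids fill one warp before moving to the next EU.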

## data memory model

### register file level:

Each register can hold one dem32 or two dem16.
Immediate numbers include 0~10, 1/2 ~ 10/1, pi, and so on; there are currently 32 in total.
Mask registers: each thread uses the 1-bit registers m0 and m1 as activate/deactivate status flags.

Control registers
```
The m0/m1 bits of 32 threads are combined into the warp mask registers wm0 and wm1. m0 and m1 are set to 1 when a thread starts and set to 0 when the warp starts
Descriptor offset registers (DORs): descriptor registers operated on by the instruction scheduler
Warp-scalar descriptor registers (WSDRs): store resource descriptors

```
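
As a rough illustration of how 32 one-bit thread masks form a warp mask register, the following sketch assumes that thread i maps to bit i of the register. That bit order and the helper names are assumptions, not something the page specifies.

```python
WARP_SIZE = 32  # 32 per-thread mask bits form one warp mask register


def build_warp_mask(thread_bits):
    """Combine 32 one-bit thread masks (m0 or m1) into a 32-bit warp mask.
    Assumption: thread i maps to bit i."""
    if len(thread_bits) != WARP_SIZE:
        raise ValueError("expected one mask bit per thread in the warp")
    mask = 0
    for thread, bit in enumerate(thread_bits):
        if bit not in (0, 1):
            raise ValueError("mask bits must be 0 or 1")
        mask |= bit << thread
    return mask


def active_threads(warp_mask):
    """List the thread indices whose mask bit is set."""
    return [t for t in range(WARP_SIZE) if (warp_mask >> t) & 1]
```

Calling `build_warp_mask` once with every thread's m0 bit and once with every m1 bit would give the values held in wm0 and wm1 respectively.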

Data registers
```
Thread-local registers (TLR) (32-bit, r0-r511, accessed through indirect addressing)
Closed segment registers (x0) (queue of four 32-bit entries per thread; temp registers reserved for exclusive use while the warp runs)
Staging registers (g0-g15) (hold the operands of LSA; an LSA instruction executes only after g0-g15 are ready)
Warp scalar registers (WSR) (32-bit, q0-q63, indirect addressing). Note: data is stored inside the WSR in bf20 format and converted to/from bf16/fp16 on the way in and out
Constant scalar registers (CSR) (32-bit, c0-c1023). Note: data is stored inside the CSR in bf20 format.
```
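
The register counts above imply the following storage sizes. This is plain arithmetic on the listed ranges; the 32 threads per warp comes from the 32 mask bits per warp mask register, and treating WSR as one bank per warp and CSR as a single bank are assumptions.

```python
REG_BYTES = 4   # every register listed above is 32 bits wide
WARP_SIZE = 32  # assumption: 32 threads per warp, from the 32 mask bits in wm0/wm1


def register_footprint():
    """Return storage sizes in bytes implied by the register ranges above."""
    tlr_per_thread = 512 * REG_BYTES  # r0-r511
    x0_per_thread = 4 * REG_BYTES     # queue of four 32-bit entries
    return {
        "tlr_per_thread": tlr_per_thread,
        "tlr_per_warp": tlr_per_thread * WARP_SIZE,
        "x0_per_thread": x0_per_thread,
        "x0_per_warp": x0_per_thread * WARP_SIZE,
        "wsr_per_warp": 64 * REG_BYTES,  # q0-q63, assumed one bank per warp
        "csr_bank": 1024 * REG_BYTES,    # c0-c1023, scope not stated in the page
    }
```

These are upper bounds: the page does not say whether every thread can hold all 512 TLRs at the same time.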

### cache level
```
L1 cache
L2 cache
```

### HBM level (memory level, can be accessed by programs)

Memory falls into three classes:
buffers (TLM, GSM, GLM, CB, and MBUF), addressed by byte address
images, addressed by coordinates
memory objects, accessed through descriptors

```
buffer memory objects (MBUF)
image memory objects: store 1D images (M1D), 1D image arrays (M1DA), M2D, M2DA, M3D
matrix memory objects
global memory buffers (GBM)
constant buffers (CB): constant, read-only
Thread-local memory (TLM): used exclusively by one thread as its local memory
Group shared memory (GSM): accessed exclusively by one thread group
GEMM input buffer (GIB)
GEMM main buffer (GMB)
GEMM reduction buffer (GRB)
```

# instruction

Instructions are 64 or 128 bits wide.

## warp scalar instructions

Instruction classes: ALU, CT (control transfer), and LSA (load-store-atomic, including fence/flush/invalidate).
Warp scalar instruction: 64 bits. Bits 0~31 hold the flag bits, opcode, (destination register index), src1, and src2; bits 32~63 hold the immediate number.
ALU opcodes include scalar nop, move, add, ... and, move, atomic operations, and so on; they currently occupy opcodes 0~31, 96, and 97.
CT opcodes include JUMP, CALL, RET, BNZ, END, and so on; they currently occupy opcodes 32~63.
LSA opcodes include ACK, FLUSH, invalidate cache, config, and so on; they occupy opcodes 64~82.
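
The opcode ranges above can be turned into a small classifier, sketched below. Because the page does not give field widths inside bits 0~31, the sketch only separates the low word from the immediate and does not extract the opcode itself. Treating bit 0 as the least-significant bit and the "reserved" label for unlisted opcodes are assumptions.

```python
def split_instruction(word):
    """Split a 64-bit warp scalar instruction into (low32, imm32).
    Assumption: bit 0 is the least-significant bit, so bits 32~63 are the top half."""
    if not 0 <= word < (1 << 64):
        raise ValueError("instruction must fit in 64 bits")
    return word & 0xFFFFFFFF, word >> 32


def opcode_class(opcode):
    """Map an opcode number to its class using the ranges listed above."""
    if 0 <= opcode <= 31 or opcode in (96, 97):
        return "ALU"
    if 32 <= opcode <= 63:
        return "CT"
    if 64 <= opcode <= 82:
        return "LSA"
    return "reserved"  # hypothetical label for opcodes the page does not list
```

For instance, `opcode_class(96)` returns `"ALU"` even though 96 lies beyond the LSA range, matching the two extra ALU opcodes noted above.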

## warp SIMT instructions

ALU, LSA, SFU (special function unit)
Instructions are tied to threads through the mask. A thread executes an instruction only after its mask has been changed to allow it; otherwise the thread does not execute it.
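
A minimal sketch of mask-gated execution follows: one lane-wise add in which only enabled lanes update their destination. It assumes a mask bit of 1 means the thread executes, that thread i maps to bit i, and that results wrap at 32 bits. The page confirms none of these, and the function name is illustrative.

```python
def simt_add(dst, src1, src2, warp_mask):
    """Lane-wise add applied only to lanes whose mask bit is 1.
    Masked-off lanes keep their previous destination value.
    Assumptions: bit value 1 = execute, thread i = bit i, 32-bit wraparound."""
    return [
        (a + b) & 0xFFFFFFFF if (warp_mask >> lane) & 1 else old
        for lane, (old, a, b) in enumerate(zip(dst, src1, src2))
    ]
```

With four lanes and mask `0b0101`, lanes 0 and 2 receive the sum while lanes 1 and 3 keep their old contents.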

### SFU operations

Performs single-operand math, for example sine, cosine, log2, e^2, sigmoid, etc.

### After an ALU instruction completes, the result can be written back along these paths:
```
TLR/x0/WSR/WMx (fused = 0, staging = 0: not written to the staging buffer and not passed on for fusion)
SFU (fused = 1, staging = 0: not written to the buffer; fused computation)
LSA/TEX/MAT (staging = 1, shader stagingcnt > 0: written out to memory)
FB (frame buffer): staging = 1, shader = FS (fragment shader)
VB (vertex buffer): staging = 1, shader != FS

```
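
The write-back rules can be read as a small decision function, sketched here. The page does not say which rule wins when staging = 1 and several conditions hold, so the sketch assumes stagingcnt > 0 is checked before the shader type, and it ignores fused when staging = 1. The function name and the shader strings other than "FS" are hypothetical.

```python
def writeback_path(fused, staging, shader, stagingcnt=0):
    """Pick the write-back destination of an ALU result.
    Assumptions: stagingcnt > 0 takes priority over the shader type,
    and fused is ignored when staging = 1."""
    if not staging:
        return "SFU" if fused else "TLR/x0/WSR/WMx"
    if stagingcnt > 0:
        return "LSA/TEX/MAT"
    return "FB" if shader == "FS" else "VB"
```

For example, `writeback_path(fused=0, staging=1, shader="FS")` selects the frame buffer, while the same call with any other shader type selects the vertex buffer.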