Chip Design » History » Version 2

jun chen, 08/14/2025 04:50 PM

# Chip Design

|  | Goal | Non-goal |
|--|--|--|
| GPGPU targets | High-performance parallel computing, rendering, ray tracing |  |
| Standards | OpenCL 1.2, CUDA, Vulkan, Direct3D, OpenGL, OpenCL 3, SPIR-V | RISC-V, OpenCL 2 |
| Performance | Performance | Area |

## data model
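
A dem (data element) is the smallest unit of data the GPGPU processes. A dem can be 8, 16, 32, or 64 bits wide.

dem vector: composed of 4 x dem8, 2 x dem16, 1 x dem32, or 1 x dem64.

dem array: composed of m dem vectors. An array may be 8 to 256 bits in size and is laid out as m rows by n columns, where n is the number of dems in each vector and m is the number of vectors.

dem fiber: partitioned by column (open question). In a two-column dem16 dem array, the low 16 bits are the low dem and the high 16 bits are the high dem.
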
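The dem vector layout above can be sketched as bit packing. This is a minimal illustration, not the design's actual encoding: it assumes a 32-bit vector (so it covers dem8, dem16 and dem32 only) and assumes column 0 sits in the lowest bits, which matches the "low 16 bits are the low dem" remark. The names `VECTOR_BITS`, `pack_dems` and `unpack_dems` are hypothetical.

```python
VECTOR_BITS = 32  # assumption: 4 x dem8 = 2 x dem16 = 1 x dem32 = 32 bits


def pack_dems(dems, dem_bits):
    """Pack a list of dems into one integer, with dems[0] in the lowest bits."""
    if dem_bits not in (8, 16, 32):
        raise ValueError("this sketch covers dem8, dem16 and dem32 only")
    if len(dems) != VECTOR_BITS // dem_bits:
        raise ValueError("wrong number of dems for this dem width")
    mask = (1 << dem_bits) - 1
    word = 0
    for column, dem in enumerate(dems):
        # Each column occupies its own dem_bits-wide slice of the vector.
        word |= (dem & mask) << (column * dem_bits)
    return word


def unpack_dems(word, dem_bits):
    """Split a packed vector back into its dems, lowest column first."""
    if dem_bits not in (8, 16, 32):
        raise ValueError("this sketch covers dem8, dem16 and dem32 only")
    mask = (1 << dem_bits) - 1
    columns = VECTOR_BITS // dem_bits
    return [(word >> (c * dem_bits)) & mask for c in range(columns)]
```

For example, `pack_dems([0x1234, 0xABCD], 16)` places `0x1234` in the low 16 bits (the low dem) and `0xABCD` in the high 16 bits (the high dem).
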
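A single thread processes one dem vector; a dem vector contains one or more dems.

During GPGPU processing, data (matrix/vector) is first stored in the TLRs and then computed on. A dem of 32 bits or fewer occupies one register; a dem64 occupies two registers.

Therefore, in an m-row, n-column dem array, each row fills exactly one register, whether the array uses dem8, dem16, or dem32.
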
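The register accounting above can be checked with a small calculation. This is a sketch under one assumption not stated in the text: that a register is 32 bits wide, which is inferred from "dem64 occupies two registers". The names `REGISTER_BITS`, `columns_per_row` and `registers_for_array` are hypothetical.

```python
REGISTER_BITS = 32  # assumption inferred from "dem64 occupies two registers"


def columns_per_row(dem_bits):
    """Number of dems (n) in one row for dem8, dem16 or dem32."""
    if dem_bits not in (8, 16, 32):
        raise ValueError("dem64 spans two registers and is handled separately")
    return REGISTER_BITS // dem_bits


def registers_for_array(m_rows, dem_bits):
    """Registers needed to hold an m-row dem array in the TLRs."""
    if dem_bits not in (8, 16, 32, 64):
        raise ValueError("dem width must be 8, 16, 32 or 64 bits")
    # One register per row up to dem32; two registers per row for dem64.
    per_row = 2 if dem_bits == 64 else 1
    return m_rows * per_row
```

Under this assumption an 8-row array costs 8 registers for dem8, dem16 or dem32 alike, and 16 registers for dem64.
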
## data process hierarchy:

```
The GPU system contains multiple     "compute nodes  -> node-wide thread group (NTG), virtual node"
Each compute node contains multiple  "devices  -> device-wide thread group (DTG), virtual device"
Each device contains multiple        "very wide compute units (VCUs)  -> VCU-wide thread group (VTG)"
Each VCU contains multiple           "wide compute units (WCUs)  -> wide thread group (WTG)"
Each WCU contains multiple           "basic compute units (CUs)  -> thread group (TG)"
Each CU contains multiple            "execution units (EUs)  -> one warp"
Each EU contains multiple            "processing lanes  -> one thread, which processes one dem vector"
```

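The hierarchy above can be written down as an ordered table, which makes it easy to count how many threads one thread group covers. This is only a sketch: the level names are taken from the list above, but `HIERARCHY`, `total_threads`, `threads_per_group` and any unit counts passed in are hypothetical, since the document does not give actual counts.

```python
# (hardware level, matching thread-group level), outermost first.
HIERARCHY = [
    ("node", "NTG"),
    ("device", "DTG"),
    ("VCU", "VTG"),
    ("WCU", "WTG"),
    ("CU", "TG"),
    ("EU", "warp"),
    ("lane", "thread"),
]


def total_threads(counts):
    """Threads in the whole system; counts maps level -> units per parent."""
    total = 1
    for level, _group in HIERARCHY:
        total *= counts[level]
    return total


def threads_per_group(counts, group):
    """Threads covered by one thread group of the given kind."""
    groups = [g for _level, g in HIERARCHY]
    start = groups.index(group) + 1  # only levels nested inside the group count
    total = 1
    for level, _group in HIERARCHY[start:]:
        total *= counts[level]
    return total
```

With this layout, the size of a warp is the number of processing lanes per EU, and the size of a TG is the number of EUs per CU times the lanes per EU.
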
## data memory model

register file level:

```
Thread-local registers (TLR)
Closed segment registers (x0)
Staging registers (g0-g15)
Warp scalar registers (WSR)
Constant scalar registers (CSR)
Group shared memory (GSM)
GEMM input buffer (GIB)
GEMM main buffer (GMB)
GEMM reduction buffer (GRB)
```

cache level
```
L1 cache
L2 cache
```

HBM level
```
buffer memory objects
image memory objects
matrix memory objects
global memory buffers (GLM)
constant buffers (CB)
Thread-local memory (TLM)
```
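
The three levels listed above can be captured as a lookup table, for example to check which level a given storage class belongs to. This is a minimal sketch that only restates the lists; `MEMORY_LEVELS` and `level_of` are hypothetical names, and entries without an abbreviation in the text use `None` as their short name.

```python
# Level -> list of (abbreviation or None, full name), as listed in this section.
MEMORY_LEVELS = {
    "register file": [
        ("TLR", "Thread-local registers"),
        ("x0", "Closed segment registers"),
        ("g0-g15", "Staging registers"),
        ("WSR", "Warp scalar registers"),
        ("CSR", "Constant scalar registers"),
        ("GSM", "Group shared memory"),
        ("GIB", "GEMM input buffer"),
        ("GMB", "GEMM main buffer"),
        ("GRB", "GEMM reduction buffer"),
    ],
    "cache": [
        ("L1", "L1 cache"),
        ("L2", "L2 cache"),
    ],
    "HBM": [
        (None, "buffer memory objects"),
        (None, "image memory objects"),
        (None, "matrix memory objects"),
        ("GLM", "global memory buffers"),
        ("CB", "constant buffers"),
        ("TLM", "Thread-local memory"),
    ],
}


def level_of(abbreviation):
    """Return the memory level that lists the given abbreviation."""
    for level, entries in MEMORY_LEVELS.items():
        for short_name, _full_name in entries:
            if short_name == abbreviation:
                return level
    raise KeyError(abbreviation)
```

Note that TLR and TLM resolve to different levels: thread-local registers live in the register file, while thread-local memory is listed under HBM.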