4. (Microcode understanding) Read read_material.docx (attached to the email, or the paper copy given to you) and answer the questions below.
Code:
mrr r3, r1;
srai r10, r3, 0x1e8;
seqi r4, r10, 0x5;
ld r3, `LdPfpPrgmStrmSelect(r0);
srai r2, r3, 0x3c;
mrr r3, r1;
Questions on the code above (all registers except r1 are general registers):
a. Please describe the assembly code above in a high-level language (e.g. C/C++, Python, or another language).
b. If:
There is a FIFO as shown below (only 2 elements are left, and no more input will arrive), and every read of the source register r1 pops the next value from the FIFO:
in→
0x0fe00504
0x00000500
out→
And:
`LdPfpPrgmStrmSelect(r0) returns 1
What are the final values of r2, r3, r4, and r10?
5. Doc reading
Background: Software uses UBF Performance Monitor counters to count specific events that occur within the Unified Bus Fabric. Events are specified through UBF::UBF_PERF_CTL, and the count is read back from UBF::UBF_PERF_CTR.
UBF Performance Monitors are enabled for a single component instance at a time. Software specifies the desired component InstanceID by setting UBF::UBF_PERF_CTL[EventSelect[13:6]] as shown in Table 131 [UBF Performance Monitor Event Select to Component Map]. The component-specific event is selected by UBF::UBF_PERF_CTL[EventSelect[5:0]].
Table 131: UBF Performance Monitor Event Select to Component Map
EventSelect[13:6] (UBF Component InstanceID)    UBF Component
00h to 13h                                      CIS0 – CIS19
14h to 27h                                      KGV0 – KGV19
28h to 29h                                      CIC0 – CIC1
2Ah                                             MMTUS0
2Bh                                             IOMMU0
2Ch                                             GIE0
2Dh                                             GAHUB0
2Eh                                             AUREST0
2Fh                                             WIFIMS0
30h                                             POMIE0
31h to 46h                                      BUCDX0 – BUCDX21
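The split of EventSelect into an InstanceID part and a component-specific event part can be sketched as follows. This is a minimal illustration, assuming the InstanceID occupies EventSelect[13:6] and the event occupies EventSelect[5:0] as stated above; the helper name `ubf_event_select` is hypothetical and not part of any documented API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build the 14-bit EventSelect value.
 *   EventSelect[13:6] = UBF component InstanceID (Table 131)
 *   EventSelect[5:0]  = component-specific event
 */
static inline uint16_t ubf_event_select(uint8_t instance_id, uint8_t event_id)
{
    assert(event_id <= 0x3F); /* only 6 bits are available for the event */
    return (uint16_t)(((uint16_t)instance_id << 6) | event_id);
}
```

For example, event 1 on MMTUS0 (InstanceID 2Ah) gives `ubf_event_select(0x2A, 0x01)`, which evaluates to 0x0A81.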
(UBF::UBF_PERF_CTL)
Read-write. Reset: 0000_0000_0000_0000h.
UBF::UBF_PERF_CTL[63:0] is an alias of {UBF::PerfMonCtlHi[31:0],UBF::PerfMonCtlLo[31:0]}.
Software must program register UBF::PerfMonCtlHi before enabling the Performance Monitor via UBF::PerfMonCtlLo[En].
Bits     Description
63:61    Reserved.
60:59    EventSelect[13:12]: performance event select. Read-write. Reset: 0h.
58:36    Reserved.
35:32    EventSelect[11:8]: performance event select. Read-write. Reset: 0h. See EventSelect[7:0].
31:23    Reserved.
22       En: enable performance counter. Read-write. Reset: 0. 1=Performance event counter is enabled.
21:16    Reserved.
15:8     UnitMask: event qualification. Read-write. Reset: 00h. Each UnitMask bit further specifies or qualifies the event specified by EventSelect. All events selected by UnitMask are simultaneously monitored.
7:0      EventSelect[7:0]: event select. Read-write. Reset: 00h. This field, along with EventSelect[13:12] and EventSelect[11:8] above, combine to form the 14-bit event select field, EventSelect[13:0]. EventSelect specifies the event or event duration in a processor unit to be counted by the corresponding UBF_PERF_CTR[3:0] register. Some events are reserved; when a reserved event is selected, the results are undefined.
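The bit layout above scatters the 14-bit EventSelect across three fields, which is easy to get wrong by hand. The sketch below packs a 14-bit EventSelect, a UnitMask and the En bit into a 64-bit UBF_PERF_CTL image, using only the bit positions listed in the table. The function name `ubf_perf_ctl_pack` is a hypothetical illustration, not a documented interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack fields into the 64-bit UBF_PERF_CTL layout. */
static inline uint64_t ubf_perf_ctl_pack(uint16_t event_select,
                                         uint8_t unit_mask, int enable)
{
    assert(event_select <= 0x3FFF); /* EventSelect is a 14-bit field */

    uint64_t v = 0;
    v |= (uint64_t)(event_select & 0xFF);              /* bits 7:0   EventSelect[7:0]   */
    v |= (uint64_t)unit_mask << 8;                     /* bits 15:8  UnitMask           */
    v |= (uint64_t)(enable ? 1 : 0) << 22;             /* bit  22    En                 */
    v |= (uint64_t)((event_select >> 8) & 0xF) << 32;  /* bits 35:32 EventSelect[11:8]  */
    v |= (uint64_t)((event_select >> 12) & 0x3) << 59; /* bits 60:59 EventSelect[13:12] */
    return v;
}
```

The upper 32 bits of the result correspond to UBF::PerfMonCtlHi and the lower 32 bits to UBF::PerfMonCtlLo. Since the enable bit lives in PerfMonCtlLo, the Hi half would be written first, in line with the programming-order requirement stated above.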
UBF::UBF_PERF_CTR[63:0] is an alias of {UBF::PerfMonCtrHi[31:0],UBF::PerfMonCtrLo[31:0]}.
Since Performance Monitor counters are 48-bit counters, two 32-bit reads are required to get the entire value. The high bits are latched when the low bits are read. This means that software should read UBF::PerfMonCtrLo first, then read UBF::PerfMonCtrHi to ensure the proper value is read.
UBF::PerfMonCtrLo
Reset: 0000_0000h.
Lower Counter register for UBF Performance Monitors.
Bits     Description
31:0     CTR_31_0. Reset: 0000_0000h. Check: SKIP. CTR[47:0] = {UBF::PerfMonCtrHi[CTR_47_32],UBF::PerfMonCtrLo[CTR_31_0]}. Returns the current value of the event counter.
UBF::PerfMonCtrHi
Reset: 0000_0000h.
Upper Counter register for UBF Performance Monitors.
Bits     Description
31:16    Reserved.
15:0     CTR_47_32. Reset: 0000h. Check: SKIP.
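The Lo-then-Hi read order and the 48-bit concatenation described for the counter registers can be sketched as below. This is only an illustration under stated assumptions: the callback type `reg_read32_fn` and both function names are hypothetical stand-ins for whatever register accessor the platform provides, and the register addresses are left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit register read callback (stand-in for the platform accessor). */
typedef uint32_t (*reg_read32_fn)(uint32_t addr);

/* CTR[47:0] = {PerfMonCtrHi[CTR_47_32], PerfMonCtrLo[CTR_31_0]} */
static inline uint64_t ubf_ctr48_combine(uint32_t lo, uint32_t hi)
{
    /* Bits 31:16 of the Hi register are reserved, so mask them off. */
    return ((uint64_t)(hi & 0xFFFFu) << 32) | lo;
}

static inline uint64_t ubf_read_ctr48(reg_read32_fn rd,
                                      uint32_t lo_addr, uint32_t hi_addr)
{
    assert(rd != NULL);
    uint32_t lo = rd(lo_addr); /* read Lo first: this latches the high bits */
    uint32_t hi = rd(hi_addr); /* then read the latched Hi value            */
    return ubf_ctr48_combine(lo, hi);
}
```

Keeping the two reads as separate statements makes the order explicit; folding them into one expression would leave the evaluation order of the two calls unspecified in C.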
Questions:
1. Please provide an enumeration for Table 131:
enum UBFComponentInstanceID
{
};
2. Please fill in the structures/unions according to the register descriptions above:
union UBF_PerfMonCtlLo
{
uint32_t u32All;
struct
{
uint32_t EventSelect_7_0 : 8;
} bits;
};
union UBF_PerfMonCtlHi
{
uint32_t u32All;
struct
{
} bits;
};
3. To get the perf counter value for eventID=3 with UnitMask=0xf on UBF component KGV13,
Known API:
void register_write(reg_addr, reg_value);
void register_read(reg_addr, &reg_value);
please fill in the following functions:
void programPerfMonCtl_KGV13_3_0xf()
{
}
uint64_t dumpPerfMonCtr_KGV13_3_0xf()
{
}
8. The picture below shows a simple GPU.
The 8 Process Units operate at 2 GHz (sclk), the single L1 cache operates at 1 GHz (fclk), and the single memory operates at 512 MHz (mclk). As shown, the L1 cache read bandwidth is 16*64 Byte/cycle in fclk, the memory read bandwidth is 16*32 Byte/cycle in mclk, and the L1 cache read hit rate is 60%. Each Process Unit can handle at most 4 pixels per sclk cycle (this is the Process Unit peak rate and does not consider limits imposed by any other non-Process Unit hardware blocks), and 16 Byte of data must be read from cache or memory for each pixel handled in a Process Unit.
Question:
At most how many pixels per second can the GPU's 8 Process Units actually handle, considering the cache/memory bandwidth and the L1 cache hit rate? Please list the formulas you use in the calculation. (Memory efficiency and utilization need not be considered; assume both are 100%.)
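The raw peak rates quoted in the description follow from per-cycle throughput multiplied by clock frequency. A minimal sketch of that unit conversion is shown below, assuming 1 GHz = 10^9 Hz and 512 MHz = 512 * 10^6 Hz; the helper `peak_rate` is hypothetical and covers only this conversion, not the effect of the hit rate.

```c
#include <assert.h>
#include <stdint.h>

/* Peak rate per second = (units per cycle) * (clock frequency in Hz). */
static inline uint64_t peak_rate(uint64_t per_cycle, uint64_t freq_hz)
{
    /* Guard against 64-bit overflow of the product. */
    assert(freq_hz == 0 || per_cycle <= UINT64_MAX / freq_hz);
    return per_cycle * freq_hz;
}
```

With the figures given above, `peak_rate(16 * 64, 1000000000ULL)` is the L1 cache read bandwidth in Byte/s, `peak_rate(16 * 32, 512000000ULL)` is the memory read bandwidth in Byte/s, and `peak_rate(8 * 4, 2000000000ULL)` is the combined Process Unit peak in pixels/s.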
9. NOTE:
Each pipeline stage can serve only 1 instruction in any cycle (or moment); a stage cannot serve 2 instructions at the same time because of a structural hazard (limitation). If an instruction (inst A) finishes a stage, e.g. Decode, while the Execute stage is still busy serving another instruction, inst A goes at once into the buffer between the Decode and Execute stages (the buffers are always large enough, so their size can be ignored). The Execute stage then fetches that instruction from the buffer as soon as it becomes available.
For each instruction, the 1st operand is the dst operand and the 2nd and 3rd operands are src operands; examples are shown below:
Question:
The table above shows only the running status of instruction 0 to instruction 6. Please complete the table below to show the running status of instruction 7 to instruction 10.
Instruction    Fetch    Decode    Execute    Write Back
7
8
9
10
instruction name (op)    dst    src0    src1
add                      R3,    R1,     R2
load                     R6,    [R3]
read src0/1, and write dst: op(src0,src1) -> dst
[R3] means load the value from this address (base+R3)
[Figure for question 8: Process Unit x8 (2 GHz); L1 cache (1 GHz), 16*64 Byte/cycle for read; memory (512 MHz), 16*32 Byte/cycle for read]