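Computer Organization and Design: Homework Exercises (6)
----------------------Personal homework. If later students have the same assignment, feel free to use it as a reference and discuss it together, but please do not copy it directly.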
Problem 2. Cache Associativity (8 points)
For this question, you will simulate different configurations of an 8 block cache using the following
8-bit memory accesses:
load 100
store 102
store 88
load 120
load 90
a) Simple Questions
i) If the block size is 4 bytes, how large must the cache be?
----------32 bytes (8 blocks x 4 bytes per block) ;
ii) How many bits will be used for the offset?
----------2 bits will be used for the offset (log2(4) = 2) ;
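The sizes above follow from the usual split of an address into tag, index, and offset fields. Below is a minimal Python sketch of that arithmetic, assuming all sizes are powers of two; the function name `field_widths` and its parameters are hypothetical and not part of the assignment.

```python
def field_widths(addr_bits, num_blocks, block_size, ways):
    """Return (tag_bits, index_bits, offset_bits) for a cache.

    ways == 1 models a direct-mapped cache and ways == num_blocks a
    fully-associative one. All sizes are assumed to be powers of two.
    """
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    num_sets = num_blocks // ways
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# 8-bit addresses, 8 blocks, 4-byte blocks, as in this problem
print(8 * 4)                           # cache size in bytes
print(field_widths(8, 8, 4, ways=8))   # fully-associative
print(field_widths(8, 8, 4, ways=1))   # direct-mapped
print(field_widths(8, 8, 4, ways=2))   # 2-way set-associative
```

Only the associativity changes between the three calls, so the offset width stays fixed while bits move between the index and the tag.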
b) Fully-associative with a 4-byte block size
i) How many bits should be used for the tag?
----------6 bits should be used for the tag (8 address bits - 2 offset bits; a fully-associative cache has no index bits) ;
ii) Complete the table by simulating the above memory references
Show each entry, and simply cross out blocks as they are overwritten
Block# | Tag | Dirty |
0 | 25 | 0->1 |
1 | 22 | 1 |
2 | 30 | 0 |
3 | | |
4 | | |
5 | | |
6 | | |
7 | | |
c) Direct-mapped with a 4-byte block size
i) How many bits should be used for the tag?
------------3 bits should be used for the tag (8 address bits - 3 index bits - 2 offset bits) ;
ii) Complete the table by simulating the above memory references
Show each entry, and simply cross out blocks as they are overwritten
Block/Set# | Tag | Dirty |
0 | | |
1 | 3 | 0->1 |
2 | | |
3 | | |
4 | | |
5 | | |
6 | 2->3->2 | 1->0->0 |
7 | | |
d) 2-way Set-associative with a 4-byte block size
i) How many bits should be used for the tag?
------------4 bits should be used for the tag (8 address bits - 2 index bits - 2 offset bits) ;
ii) Complete the table by simulating the above memory references
Show each entry, and simply cross out blocks as they are overwritten
Set# | Block# | Tag | Dirty |
0 | 0 | | |
0 | 1 | | |
1 | 2 | 6 | 0->1 |
1 | 3 | | |
2 | 4 | 5 | 1 |
2 | 5 | 7 | 0 |
3 | 6 | | |
3 | 7 | | |
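The three tables in parts b) to d) can be cross-checked with a small simulator. The following is a minimal sketch, assuming a write-back, write-allocate cache with LRU replacement (the assignment does not state the replacement policy for this problem, and no eviction in a multi-way set occurs here). The names `simulate` and `ACCESSES` are hypothetical. Each set is kept as a list ordered from least to most recently used, so positions in the list do not correspond to the Block# column.

```python
ACCESSES = [("load", 100), ("store", 102), ("store", 88),
            ("load", 120), ("load", 90)]


def simulate(accesses, num_blocks, block_size, ways):
    """Return the final cache contents as a list of sets.

    Each set is a list of [tag, dirty] entries, least recently used first.
    Assumes write-back, write-allocate, and LRU replacement.
    """
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]
    for op, addr in accesses:
        block_addr = addr // block_size
        index = block_addr % num_sets
        tag = block_addr // num_sets
        lines = sets[index]
        for entry in lines:
            if entry[0] == tag:        # hit: take the entry out to re-append it
                lines.remove(entry)
                break
        else:                          # miss: evict the LRU entry if the set is full
            if len(lines) == ways:
                lines.pop(0)
            entry = [tag, 0]
        if op == "store":
            entry[1] = 1               # a store marks the block dirty
        lines.append(entry)            # most recently used goes last
    return sets


for name, ways in [("fully-associative", 8), ("direct-mapped", 1), ("2-way", 2)]:
    print(name, simulate(ACCESSES, num_blocks=8, block_size=4, ways=ways))
```

The simulator reports only the final state, so an entry such as `2->3->2` in the direct-mapped table appears as the last tag, 2, with dirty bit 0.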
e) Matching
Note: Some may have multiple answers and some answers may not be used.
__b, d__ fully-associative cache
__a, e__ direct-mapped cache
__f__ set-associative cache
a) Fewest connections to memory are needed
b) Provides optimal cache utilization
c) Allows a larger block size to be used
d) Largest Tag overhead
e) Smallest Tag overhead
f) May prevent conflict when two blocks have the same set (N/A to fully-associative caches)
Problem 3. Cache Comparison (8 Points)
EZ-Cache Company has hired you to design their next generation cache for the LC2k. They want to use a 32-byte direct-mapped cache with a 4-byte block size.
a) Using the following sequence of memory accesses, compute the number of cache hits:
load 200
store 204
store 236
load 201
load 208
store 234
load 204
load 239
store 201
Block/Set# | Tag | Dirty |
0 | | |
1 | | |
2 | 6->7->6 | |
3 | 6->7->6->7 | |
4 | 6 | |
5 | | |
6 | | |
7 | | |
Number of Hits: ________1________
b) EZ-Cache believes they can improve the hit rate by either increasing the block size or increasing the associativity
Simulate the above memory accesses for a 32-byte direct-mapped cache with an 8-byte block size
Block/Set# | Tag |
0 | |
1 | 6->7->6->7->6->7->6 |
2 | 6 |
3 | |
Number of Hits: _________1________
Simulate the above memory accesses for a 32-byte 2-way associative cache with a 4-byte block size
Set# | Block# | Tag |
0 | 0 | 13 |
0 | 1 | |
1 | 2 | |
1 | 3 | |
2 | 4 | 12 |
2 | 5 | 14 |
3 | 6 | 12 |
3 | 7 | 14 |
Number of Hits ________4__________
Which one of the configurations is best for the memory access sequence?
-----------The 2-way set-associative cache with a 4-byte block size, which gets 4 hits versus 1 hit for each direct-mapped configuration ;
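The three hit counts in this problem can be reproduced with a short hit counter. This is a minimal sketch assuming 8-bit byte addresses and LRU replacement within a set (the problem does not name the replacement policy; with this trace no 2-way set ever has to evict). The names `count_hits` and `ADDRESSES` are hypothetical.

```python
ADDRESSES = [200, 204, 236, 201, 208, 234, 204, 239, 201]


def count_hits(addresses, cache_bytes, block_size, ways):
    """Count hits for a set-associative cache with LRU replacement.

    ways == 1 gives a direct-mapped cache. Loads and stores are treated
    alike because only hit/miss behaviour matters here.
    """
    num_sets = cache_bytes // block_size // ways
    sets = [[] for _ in range(num_sets)]   # tags per set, LRU first
    hits = 0
    for addr in addresses:
        block_addr = addr // block_size
        index = block_addr % num_sets
        tag = block_addr // num_sets
        tags = sets[index]
        if tag in tags:
            hits += 1
            tags.remove(tag)               # re-appended below as most recent
        elif len(tags) == ways:
            tags.pop(0)                    # evict the least recently used tag
        tags.append(tag)
    return hits


print(count_hits(ADDRESSES, 32, 4, ways=1))   # direct-mapped, 4-byte blocks
print(count_hits(ADDRESSES, 32, 8, ways=1))   # direct-mapped, 8-byte blocks
print(count_hits(ADDRESSES, 32, 4, ways=2))   # 2-way, 4-byte blocks
```

In both direct-mapped configurations, tags 6 and 7 keep evicting each other from the same line, which is the conflict the second way removes.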
Problem 4 (12 points)
You have been given the following two caches, which are both byte-addressable and use 16-bit memory addresses.
Cache | Cache A | Cache B |
Total size (Bytes) | 16 | 16 |
Block size (Bytes) | 4 | 4 |
Organization | Fully Associative | Direct Mapped |
Replacement policy | LRU | - |
Write policy | Allocate on write | Allocate on write |
a) The following addresses are referenced in the given order; please put an H for each of the hits and an M for each of the misses for both the caches. Also calculate the hit rate for each cache. An extra column for an infinite size fully-associative cache (also of block size 4 bytes) is given to make the calculation for part (b) easy. [8 pts]
Address(hex) | Address(binary) | Infinite | Cache A | Cache B |
0x0000 | 0000 0000 0000 0000 | M | M | M |
0x0007 | 0000 0000 0000 0111 | M | M | M |
0x0003 | 0000 0000 0000 0011 | H | H | H |
0x0009 | 0000 0000 0000 1001 | M | M | M |
0x0016 | 0000 0000 0001 0110 | M | M | M |
0x0005 | 0000 0000 0000 0101 | H | H | M |
0x000D | 0000 0000 0000 1101 | M | M | M |
0x0001 | 0000 0000 0000 0001 | H | M | H |
Hit Rate | | 3/8 | 2/8 | 2/8 |
-----------------(2/8+2/8)*2*8=8 ;
b) For each reference in the previous sequence of references, classify them using one of the four possible labels HIT (if the access is a hit) or COMPULSORY / CAPACITY / CONFLICT (if it’s a miss depending on the type of miss) [4 pts]
Address(hex) | Cache A | Cache B |
0x0000 | COMPULSORY | COMPULSORY |
0x0007 | COMPULSORY | COMPULSORY |
0x0003 | HIT | HIT |
0x0009 | COMPULSORY | COMPULSORY |
0x0016 | COMPULSORY | COMPULSORY |
0x0005 | HIT | CONFLICT |
0x000D | COMPULSORY | COMPULSORY |
0x0001 | CAPACITY | HIT |
------------0.25*8*2=4 ;
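The classification in part (b) follows the usual rule: a miss is COMPULSORY if the block was never referenced before, CAPACITY if a fully-associative LRU cache of the same size also misses, and CONFLICT otherwise. Below is a minimal sketch of that rule applied to both caches. The function names (`lru_hits`, `direct_mapped_hits`, `classify`) are hypothetical.

```python
from collections import OrderedDict

ADDRESSES = [0x0000, 0x0007, 0x0003, 0x0009, 0x0016, 0x0005, 0x000D, 0x0001]
BLOCK_SIZE, NUM_BLOCKS = 4, 4          # 16-byte caches with 4-byte blocks
BLOCKS = [a // BLOCK_SIZE for a in ADDRESSES]


def lru_hits(blocks, capacity):
    """Hit (True) / miss (False) per access for a fully-associative LRU cache."""
    cache, result = OrderedDict(), []
    for b in blocks:
        hit = b in cache
        if hit:
            cache.move_to_end(b)           # mark as most recently used
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[b] = True
        result.append(hit)
    return result


def direct_mapped_hits(blocks, num_lines):
    """Hit (True) / miss (False) per access for a direct-mapped cache."""
    lines, result = {}, []
    for b in blocks:
        idx = b % num_lines
        result.append(lines.get(idx) == b)
        lines[idx] = b
    return result


def classify(blocks, cache_hits, fully_assoc_hits):
    """Label each access HIT, COMPULSORY, CAPACITY, or CONFLICT."""
    seen, labels = set(), []
    for b, hit, fa_hit in zip(blocks, cache_hits, fully_assoc_hits):
        if hit:
            labels.append("HIT")
        elif b not in seen:
            labels.append("COMPULSORY")    # first reference to this block
        elif not fa_hit:
            labels.append("CAPACITY")      # same-size fully-associative cache misses too
        else:
            labels.append("CONFLICT")      # only the mapping restriction causes the miss
        seen.add(b)
    return labels


hits_a = lru_hits(BLOCKS, NUM_BLOCKS)
hits_b = direct_mapped_hits(BLOCKS, NUM_BLOCKS)
print(classify(BLOCKS, hits_a, hits_a))    # Cache A
print(classify(BLOCKS, hits_b, hits_a))    # Cache B
```

Cache A is itself the fully-associative reference, so it can never report a CONFLICT miss under this rule.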
Problem 5 (10 points)
The picojoule microprocessor has a byte-addressable ISA and only 64 bytes of memory. It has a 16-byte, 2-way set-associative, write-back, write-allocate cache with a block size of 2 bytes. Each load / store instruction accesses a single byte. The OB0 and OB1 fields in the cache hold the 2 data bytes in a block (1 byte each). Given the following sequence of instructions, update the cache after each instruction. When both ways in a set are invalid and a block has to be allocated, the cache logic gives higher priority to way 0. Use decimal values for the OB0 and OB1 fields and binary for the rest. If a cache block is invalid, you don't have to fill in anything; if you don't know the value of a certain field for a valid block, put an X there. The initial empty state of the cache is given. The contents of the following memory locations are known:
M[4]=7
M[14]=11
M[15]=13
M[37]=17
M[45]=19
The instructions (LD is a load and ST is a store) follow:
1: LD R1 ← M[4]
2: LD R2 ← M[37]
3: ST R1 → M[36]
4: ST R2 → M[5]
5: LD R1 ← M[15]
6: LD R2 ← M[14]
7: LD R1 ← M[45]
8: ST R2 → M[44]
9: HALT
Part (a) [8 points]
Initial
Set | Way 0: V | D | lru | Tag | OB0 | OB1 | Way 1: V | D | lru | Tag | OB0 | OB1 |
Set 0 | 0 | | | | | | 0 | | | | | |
Set 1 | 0 | | | | | | 0 | | | | | |
Set 2 | 0 | | | | | | 0 | | | | | |
Set 3 | 0 | | | | | | 0 | | | | | |
After instruction 1: LD R1 ← M[4]
Set | Way 0: V | D | lru | Tag | OB0 | OB1 | Way 1: V | D | lru | Tag | OB0 | OB1 |
Set 0 | 0 | | | | | | 0 | | | | | |
Set 1 | 0 | | | | | | 0 | | | | | |
Set 2 | 1 | 0 | | 000 | 7 | X | 0 | | | | | |
Set 3 | 0 | | | | | | 0 | | | | | |
After instruction 2: LD R2 ← M[37]
Set | Way 0: V | D | lru | Tag | OB0 | OB1 | Way 1: V | D | lru | Tag | OB0 | OB1 |
Set 0 | 0 | | | | | | 0 | | | | | |
Set 1 | 0 | | | | | | 0 | | | | | |
Set 2 | 1 | 0 | LRU | 000 | 7 | X | 1 | 0 | | 100 | X | 17 |
Set 3 | 0 | | | | | | 0 | | | | | |
After instruction 3: ST R1 → M[36]
Set | Way 0: V | D | lru | Tag | OB0 | OB1 | Way 1: V | D | lru | Tag | OB0 | OB1 |
Set 0 | 0 | | | | | | 0 | | | | | |
Set 1 | 0 | | | | | | 0 | | | | | |
Set 2 | 1 | 0 | LRU | 000 | 7 | X | 1 | 1 | | 100 | 7 | 17 |
Set 3 | 0 | | | | | | 0 | | | | | |
After instruction 4: ST R2 → M[5]
Set | Way 0: V | D | lru | Tag | OB0 | OB1 | Way 1: V | D | lru | Tag | OB0 | OB1 |
Set 0 | 0 | | | | | | 0 | | | | | |
Set 1 | 0 | | | | | | 0 | | | | | |
Set 2 | 1 | 1 | | 000 | 7 | 17 | 1 | 1 | LRU | 100 | 7 | 17 |
Set 3 | 0 | | | | | | 0 | | | | | |
After instruction 5: LD R1 ← M[15]
Set | Way 0: V | D | lru | Tag | OB0 | OB1 | Way 1: V | D | lru | Tag | OB0 | OB1 |
Set 0 | 0 | | | | | | 0 | | | | | |
Set 1 | 0 | | | | | | 0 | | | | | |
Set 2 | 1 | 1 | | 000 | 7 | 17 | 1 | 1 | LRU | 100 | 7 | 17 |
Set 3 | 1 | 0 | | 001 | 11 | 13 | 0 | | | | | |
After instruction 6: LD R2 ← M[14]
Set | Way 0: V | D | lru | Tag | OB0 | OB1 | Way 1: V | D | lru | Tag | OB0 | OB1 |
Set 0 | 0 | | | | | | 0 | | | | | |
Set 1 | 0 | | | | | | 0 | | | | | |
Set 2 | 1 | 1 | | 000 | 7 | 17 | 1 | 1 | LRU | 100 | 7 | 17 |
Set 3 | 1 | 0 | | 001 | 11 | 13 | 0 | | | | | |
After instruction 7: LD R1 ← M[45]
Set | Way 0: V | D | lru | Tag | OB0 | OB1 | Way 1: V | D | lru | Tag | OB0 | OB1 |
Set 0 | 0 | | | | | | 0 | | | | | |
Set 1 | 0 | | | | | | 0 | | | | | |
Set 2 | 1 | 1 | LRU | 000 | 7 | 17 | 1 | 0 | | 101 | X | 19 |
Set 3 | 1 | 0 | | 001 | 11 | 13 | 0 | | | | | |
After instruction 8: ST R2 → M[44]
Set | Way 0: V | D | lru | Tag | OB0 | OB1 | Way 1: V | D | lru | Tag | OB0 | OB1 |
Set 0 | 0 | | | | | | 0 | | | | | |
Set 1 | 0 | | | | | | 0 | | | | | |
Set 2 | 1 | 1 | LRU | 000 | 7 | 17 | 1 | 1 | | 101 | 11 | 19 |
Set 3 | 1 | 0 | | 001 | 11 | 13 | 0 | | | | | |
Part (b):
In total how many bytes are written to memory for executing instruction 1 to 8 (including instruction 8) ? How many more bytes will have to be written to memory after HALT is executed? [2 points]
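------------In total, 2 bytes are written to memory while executing instructions 1 to 8: instruction 7 evicts the dirty block with tag 100 from Set 2, and its 2 bytes are written back ;
2x2=4 bytes will have to be written to memory after HALT is executed, because both blocks in Set 2 are still dirty.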
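The two byte counts in Part (b) can be checked by replaying the instruction trace through a write-back cache model that tracks only tags and dirty bits. This is a minimal sketch under the parameters stated in the problem (4 sets, 2 ways, 2-byte blocks, LRU). The names `writeback_bytes` and `TRACE` are hypothetical, and data values are not modelled.

```python
TRACE = [("LD", 4), ("LD", 37), ("ST", 36), ("ST", 5),
         ("LD", 15), ("LD", 14), ("LD", 45), ("ST", 44)]


def writeback_bytes(trace, num_sets=4, ways=2, block_size=2):
    """Return (bytes written back during the trace, dirty bytes left at the end).

    Models a write-back, write-allocate cache with LRU replacement.
    Each set is a list of [tag, dirty] entries, least recently used first.
    """
    sets = [[] for _ in range(num_sets)]
    written = 0
    for op, addr in trace:
        block_addr = addr // block_size
        index = block_addr % num_sets
        tag = block_addr // num_sets
        lines = sets[index]
        entry = next((e for e in lines if e[0] == tag), None)
        if entry is not None:              # hit: re-appended below as most recent
            lines.remove(entry)
        else:                              # miss: allocate, evicting LRU if needed
            if len(lines) == ways:
                victim = lines.pop(0)
                if victim[1]:
                    written += block_size  # a dirty victim is written back whole
            entry = [tag, False]
        if op == "ST":
            entry[1] = True
        lines.append(entry)
    pending = sum(block_size for lines in sets for e in lines if e[1])
    return written, pending


print(writeback_bytes(TRACE))
```

The first number counts write-backs caused by evictions during instructions 1 to 8; the second counts the dirty blocks that remain to be flushed after HALT.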
Problem 6 (8 points)
A certain workload having the following instruction mix is run on two processor designs with both having I-Cache and D-Cache.
ADD 10%
NAND 20%
BEQ 25%
SW 15%
LW 30%
Additionally, it is known that the I-Cache hit rate is 90%, the D-Cache hit rate is 98%, 45% of branches are not taken, and 25% of LW instructions are followed by a dependent instruction. The memory takes 75 nanoseconds to access.
a) Assuming the above code is run on a standard LC-2K 5-stage pipeline design processor with forwarding and with branches predicted not taken and clocked at 200MHz, what is the CPI? Show your work. [3 points]
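Clock period : 1 / 200 MHz = 5 ns
Cache miss penalty (memory access) : 75 ns / 5 ns = 15 cycles
CPI = 1 + 1*0.10*15 (I-Cache misses) + (0.3+0.15)*0.02*15 (D-Cache misses) + 0.3*0.25*1 (load-use stalls) + 0.25*0.55*3 (taken branches) = 3.1225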
b) This five stage pipeline is extended to a similar 15 stage pipeline with no additional hazards being introduced. The amount of stall cycles needed for a lw followed by a dependent instruction does not change. The new frequency is 400MHz. Now the same code is run on the 15 stage pipeline where branches are resolved in the 11th stage.
I. What is the new CPI? Show your work. [4 pts]
II. Does this new design result in better performance for this workload? [1 pt]
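I :
BEQ misprediction penalty : 11 - 1 = 10 cycles
Clock period : 1 / 400 MHz = 2.5 ns
Cache miss penalty (memory access) : 75 ns / 2.5 ns = 30 cycles
CPI = 1 + 1*0.10*30 (I-Cache misses) + (0.3+0.15)*0.02*30 (D-Cache misses) + 0.3*0.25*1 (load-use stalls) + 0.25*0.55*10 (taken branches) = 5.72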
II :
Average time per instruction = clock period * CPI
(a) : 5 ns * 3.1225 = 15.6125 ns ;
(b) : 2.5 ns * 5.72 = 14.3 ns ;
Yes, the new design results in better performance for this workload, because 14.3 ns < 15.6125 ns.
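The CPI arithmetic for both pipelines can be rechecked with a few lines of Python. This is a minimal sketch that hard-codes the workload figures from the problem statement and takes the clock frequency and the taken-branch penalty as parameters; the function name `cpi_and_cycle` is hypothetical.

```python
def cpi_and_cycle(clock_mhz, branch_penalty, mem_ns=75.0):
    """Return (CPI, clock period in ns) for the workload in this problem."""
    cycle_ns = 1000.0 / clock_mhz                     # clock period
    miss_penalty = mem_ns / cycle_ns                  # memory access in cycles
    icache = 1.0 * 0.10 * miss_penalty                # every instruction, 10% miss
    dcache = (0.30 + 0.15) * 0.02 * miss_penalty      # LW + SW, 2% miss
    load_use = 0.30 * 0.25 * 1                        # LW followed by a dependent instruction
    branch = 0.25 * 0.55 * branch_penalty             # BEQ taken 55% of the time
    return 1 + icache + dcache + load_use + branch, cycle_ns


for name, mhz, penalty in [("5-stage", 200, 3), ("15-stage", 400, 10)]:
    cpi, cycle_ns = cpi_and_cycle(mhz, penalty)
    print(name, round(cpi, 4), round(cpi * cycle_ns, 4), "ns per instruction")
```

Doubling the clock rate doubles the miss penalty in cycles, so the CPI rises sharply, but the time per instruction still drops because each cycle is half as long.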