Design Space Exploration
Solutions to project assignments are to be developed within your group, without collaboration with other groups. However, as the projects in this class require the use of software tools and frameworks that students may have uneven prior familiarity with, discussion and assistance among students in gaining expertise with these software tools constitutes acceptable behavior. Note that this assistance and discussion cannot include the sharing of access to any code produced in solution to the project assignments. In order to avoid potential ambiguity in what constitutes "code produced in solution to the project assignment," students wishing to aid their peers with auxiliary supporting scripts, mechanisms, or examples are directed to pass any such artifacts to the course staff for vetting and possible inclusion on project-specific FAQs rather than share them with their peers directly.

In this project, you are going to use SimpleScalar as the evaluation engine to perform a design space exploration, using the provided framework code, over an 18-dimensional processor pipeline and memory hierarchy design space (some of these dimensions are not independent). You will use a 5-benchmark suite as the workload.
- Project Goal
Your assignment is to explore the design space, within an evaluation count limit of 1000 design points, and select the best performing design under each of two optimization functions:
- The “best” performing overall design (in terms of the geometric mean of normalized execution time across all benchmarks)
- The most energy-efficient design (as measured by the lowest geometric mean of normalized energy-delay product across all benchmarks; the units of energy-delay product are joule-seconds)
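Both optimization functions above reduce to a geometric mean of per-benchmark values normalized to a baseline. The following is a minimal sketch of that calculation, not the framework's own code; the function names and the benchmark times are hypothetical and chosen only for illustration.

```python
import math

def geomean(values):
    """Geometric mean of positive values, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

def normalized_geomean(design, baseline):
    """Normalize each benchmark's metric to the baseline, then take the geomean."""
    return geomean([d / b for d, b in zip(design, baseline)])

# Hypothetical execution times (seconds) for a 5-benchmark suite.
baseline_times = [2.0, 4.0, 1.0, 8.0, 5.0]
design_times = [1.0, 2.0, 0.5, 4.0, 2.5]

# Every benchmark runs in half the baseline time, so the score is about 0.5.
score = normalized_geomean(design_times, baseline_times)
```

The same helper applies to energy-delay product by passing per-benchmark EDP values instead of execution times; lower is better in both cases.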
- Background
2.1. SimpleScalar
SimpleScalar is an architectural simulator that enables the study of how different processor and memory system parameters affect performance and energy efficiency. The simulator accepts a set of system design parameters and an executable (workload) to run on the described system. A wide range of system statistics are recorded by the simulator as the executable runs on the simulated system. Once the framework in this project is set up, interested readers can look at one of the log files in the rawProjectOutputData folder to view SimpleScalar output.
This project heavily uses SimpleScalar, but most of the interface is abstracted away by a simpler framework interface. Nevertheless, you can refer to this SimpleScalar guide for details about the parameters passed to SimpleScalar.
2.2. Design Space Exploration
Given a set of design parameters, Design Space Exploration (DSE) involves probing various design points to find the most suitable design to meet required goals. Follow this quick reading about DSE before moving ahead.
DSE can be performed for different design goals. For example, one DSE may want to find the best performing design whereas another DSE may be aimed at finding the most energy-efficient design. A more complex DSE may look for the best performing design given a fixed energy budget.
An exhaustive DSE simply tries out all possible combinations of parameter values to find the absolute best design. However, as the size of the design space increases, this approach quickly becomes infeasible. Consider a 10-dimensional design space with 5 possible values for each parameter and 2 minutes of simulation time to evaluate a given design point; an exhaustive search will take 5^10 * 2 min ≈ 37 years.
A more intelligent DSE employs heuristics to prune down the design space and to prioritize evaluation of more reasonable design points first. If the assumptions employed by the heuristics are correct, the DSE will still result in the best design. On the other hand, with a set of reasonably justified assumptions, a heuristic can result in a “good enough” design point.
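The cost of the exhaustive search in the example above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function name is hypothetical.

```python
def exhaustive_search_cost(values_per_dim, dims, minutes_per_point):
    """Return (number of design points, total simulation time in years)."""
    points = values_per_dim ** dims
    total_minutes = points * minutes_per_point
    years = total_minutes / (60 * 24 * 365)
    return points, years

# 10 dimensions, 5 values each, 2 minutes per simulated design point.
points, years = exhaustive_search_cost(5, 10, 2)
```

Here `points` is 9,765,625 and `years` comes out slightly above 37, which is why the 1000-evaluation budget in this project requires a heuristic.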
2.3. Energy-Delay Product
Energy-Delay Product (EDP) is a metric which consolidates both performance and energy efficiency.
EDP = total execution energy * execution time
Design A takes 100 pJ to process an image in 100 ms, so EDP = 10000 units. Design B takes 80 pJ to process an image in 2000 ms, so EDP = 160000 units. Design B clearly consumes less energy, but it performs poorly as it incurs far more execution time. EDP enables a more holistic design comparison.
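The comparison of Design A and Design B can be reproduced in joule-seconds, the unit used for EDP elsewhere in this document. This is a small sketch with a hypothetical helper name; the unit conversions are the only thing it adds to the worked example.

```python
def edp_joule_seconds(energy_pj, time_ms):
    """Energy-delay product in joule-seconds from picojoules and milliseconds."""
    energy_j = energy_pj * 1e-12
    time_s = time_ms * 1e-3
    return energy_j * time_s

edp_a = edp_joule_seconds(100, 100)    # Design A: 100 pJ in 100 ms
edp_b = edp_joule_seconds(80, 2000)    # Design B: 80 pJ in 2000 ms
ratio = edp_b / edp_a                  # how much worse B is than A
```

Design B saves 20% of the energy but its EDP is roughly 16 times that of Design A, so A wins under this metric.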
- Our Heuristic
We define OurHeuristic as follows:
1. Design space dimensions can be labelled as either explored or unexplored.
2. Initially, all dimensions are unexplored.
3. Choose an unexplored dimension; exit if all dimensions are explored.
3.1. Evaluate all possible design points by changing the value of this dimension only.
3.2. Fix the value of this dimension by selecting the best design so far (consider the DSE goal).
3.3. Mark this dimension as explored.
4. Go to step 3.
You should choose an unexplored dimension in step 3 based on the PSU ID numbers of the students in the group, as follows.
DSE dimensions can be categorized in four major classes as follows:
- Branch predictor (BP) configurations (i.e. branchsettings, ras, btb)
- Cache configurations (i.e. {l1, ul2}block, {dl1, il1, ul2}sets, {dl1, il1, ul2}assoc)
- Core configurations (i.e. width, scheduling)
- Floating Point Unit (FPU) configuration (i.e. fpwidth)
Based on your ID numbers, you should calculate
(
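The OurHeuristic steps described earlier in this section can be sketched as a one-dimension-at-a-time search. This is a minimal illustration under stated assumptions, not the provided framework code: the function name, the dictionary-based configuration, the toy cost function, and the two toy dimensions are all hypothetical, and the exploration order is simply passed in rather than derived from ID numbers.

```python
def one_dimension_at_a_time(dimensions, order, start, evaluate):
    """Sweep one dimension at a time, fixing each to its best value.

    dimensions: dict mapping dimension name -> list of candidate values
    order:      dimension names in the order they should be explored
    start:      dict mapping dimension name -> initial value
    evaluate:   callable(config) -> cost (lower is better) or None if invalid
    """
    best = dict(start)
    best_cost = evaluate(best)
    evaluations = 1
    for dim in order:                      # step 3: next unexplored dimension
        for value in dimensions[dim]:      # step 3.1: vary this dimension only
            if value == best[dim]:
                continue                   # already evaluated
            candidate = dict(best)
            candidate[dim] = value
            cost = evaluate(candidate)
            evaluations += 1
            if cost is not None and (best_cost is None or cost < best_cost):
                best, best_cost = candidate, cost   # step 3.2: keep the best
        # step 3.3: dim is now explored; the loop moves on (step 4)
    return best, best_cost, evaluations

# Toy design space and cost function, for illustration only.
toy_dimensions = {"width": [1, 2, 4, 8], "assoc": [1, 2, 4]}
toy_start = {"width": 1, "assoc": 1}

def toy_cost(config):
    return abs(config["width"] - 4) + abs(config["assoc"] - 2)

best, best_cost, evaluations = one_dimension_at_a_time(
    toy_dimensions, ["width", "assoc"], toy_start, toy_cost)
```

The number of evaluations grows with the sum of the dimension sizes rather than their product, which is what keeps the search inside a fixed evaluation budget. The result depends on the exploration order whenever dimensions interact.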
8.2.3. Cache and Memory
The following list comprises tuples of the format [cache size or memory, access energy (pJ), leakage/refresh power (mW)]:
- 8KB: 20pJ, 0.125mW
- 16KB: 28pJ, 0.25mW
- 32KB: 40pJ, 0.5mW
- 64KB: 56pJ, 1mW
- 128KB: 80pJ, 2mW
- 256KB: 112pJ, 4mW
- 512KB: 160pJ, 8mW
- 1024KB: 224pJ, 16mW
- 2048KB: 360pJ, 32mW
- Main Memory: 2nJ, 512mW
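The cache entries above can be held in a lookup table and combined with access counts and run time. The sketch below assumes that dynamic energy is the number of accesses times the access energy and that leakage energy is leakage power times execution time; how the framework actually combines these terms is not shown in this section, so treat the function and its name as hypothetical.

```python
# Cache size in KB -> (access energy in pJ, leakage/refresh power in mW).
CACHE_ENERGY = {
    8: (20, 0.125), 16: (28, 0.25), 32: (40, 0.5),
    64: (56, 1), 128: (80, 2), 256: (112, 4),
    512: (160, 8), 1024: (224, 16), 2048: (360, 32),
}

def cache_energy_joules(size_kb, accesses, seconds):
    """Assumed model: dynamic (per-access) energy plus leakage over the run."""
    access_pj, leak_mw = CACHE_ENERGY[size_kb]
    dynamic_j = accesses * access_pj * 1e-12   # pJ -> J
    leakage_j = leak_mw * 1e-3 * seconds       # mW * s -> J
    return dynamic_j + leakage_j

# Hypothetical run: 1,000,000 accesses to a 32KB cache over 10 ms.
energy = cache_energy_joules(32, 1_000_000, 0.01)
```

In this hypothetical run the dynamic term is 40 microjoules and the leakage term is 5 microjoules, which illustrates how leakage gains weight as execution time grows.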
8.2.4. Energy per Committed Instruction
- Dynamic, fetch width = 1: 10pJ
- In-order, fetch width = 1: 8pJ
- Dynamic, fetch width = 2: 12pJ
- In-order, fetch width = 2: 10pJ
- Dynamic, fetch width = 4: 18pJ
- In-order, fetch width = 4: 14pJ
- Dynamic, fetch width = 8: 27pJ
- In-order, fetch width = 8: 20pJ
8.3. Validation Constraints
You must implement these validation constraints in your code. Specifically, validateConfiguration and generateCacheLatencyParams must be implemented properly.
- The il1 (L1 instruction cache) block size must be at least the ifq (instruction fetch queue) size (e.g., for the baseline machine the ifqsize is set to 1 word (8B), so the il1 block size should be at least 8B). The dl1 (L1 data cache) should have the same block size as your il1.
- The ul2 (unified L2 cache) block size must be at least twice your il1 (and dl1) block size with a maximum block size of 128B. Your ul2 must be at least twice as large as il1+dl1 in order to be inclusive.
- il1 size and dl1 size: Minimum = 2 KB; Maximum = 64 KB
- ul2 size: Minimum = 32 KB; Maximum = 1 MB
- The il1 sizes and il1 latencies are linked as follows (the same linkages hold for the
dl1 size and dl1 latency):
(a) il1 = 2 KB means il1lat = 1
(b) il1 = 4 KB means il1lat = 2
(c) il1 = 8 KB means il1lat = 3
(d) il1 = 16 KB means il1lat = 4
(e) il1 = 32 KB means il1lat = 5
(f) il1 = 64 KB means il1lat = 6
(g) The above are for direct mapped caches. For 2-way set associative add 1
additional cycle of latency to each of the above; for 4-way add 2 additional
cycles; for 8-way add 3 additional cycles.
- The ul2 sizes and ul2 latencies are linked as follows:
(a) ul2 = 32 KB means ul2lat = 5
(b) ul2 = 64 KB means ul2lat = 6
(c) ul2 = 128 KB means ul2lat = 7
(d) ul2 = 256 KB means ul2lat = 8
(e) ul2 = 512 KB means ul2lat = 9
(f) ul2 = 1024 KB (1 MB) means ul2lat = 10
(g) The above are for direct mapped caches. For 2-way set associative add 1 additional cycle of latency to each of the above; for 4-way add 2 additional cycles; for 8-way add 3 additional cycles; for 16-way add 4 additional cycles.
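The size, block, and latency rules above can be expressed as table lookups and comparisons. The sketch below is written in Python for brevity and is not the framework's validateConfiguration or generateCacheLatencyParams; all function and parameter names are hypothetical, and cache size is assumed to be sets times block size times associativity.

```python
# Direct-mapped base latencies (cycles), keyed by cache size in KB.
L1_BASE_LATENCY = {2: 1, 4: 2, 8: 3, 16: 4, 32: 5, 64: 6}
UL2_BASE_LATENCY = {32: 5, 64: 6, 128: 7, 256: 8, 512: 9, 1024: 10}
# Extra cycles by associativity; 16-way is listed for the ul2 only.
EXTRA_CYCLES = {1: 0, 2: 1, 4: 2, 8: 3, 16: 4}

def cache_size_kb(sets, block_bytes, assoc):
    """Total cache capacity in KB (assumed: sets * block size * ways)."""
    return sets * block_bytes * assoc / 1024

def expected_latency(size_kb, assoc, is_l2):
    """Latency implied by the tables, or None if the combination is not listed."""
    base = (UL2_BASE_LATENCY if is_l2 else L1_BASE_LATENCY).get(size_kb)
    max_assoc = 16 if is_l2 else 8
    if base is None or assoc not in EXTRA_CYCLES or assoc > max_assoc:
        return None
    return base + EXTRA_CYCLES[assoc]

def satisfies_size_rules(ifq_bytes, il1_block, dl1_block, ul2_block,
                         il1_kb, dl1_kb, ul2_kb):
    """Check the block-size and capacity constraints (sizes in bytes and KB)."""
    return (il1_block >= ifq_bytes
            and dl1_block == il1_block
            and 2 * il1_block <= ul2_block <= 128
            and 2 <= il1_kb <= 64
            and 2 <= dl1_kb <= 64
            and 32 <= ul2_kb <= 1024
            and ul2_kb >= 2 * (il1_kb + dl1_kb))

# Example: 64 sets of 32B blocks, 4-way, gives an 8 KB L1.
l1_kb = cache_size_kb(64, 32, 4)
l1_lat = expected_latency(l1_kb, 4, is_l2=False)
```

A configuration would then be rejected when the latency it specifies differs from `expected_latency`, or when `satisfies_size_rules` returns False.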
8.4. Miscellaneous Constraints
These constraints have already been specified in the framework. Have a look at the SimpleScalar invocation command in runprojectsuite.sh for an exhaustive list of specified parameters. Moreover, any parameter not specified in runprojectsuite.sh will default to SimpleScalar default settings.
A.2. Plots
The report should include the following four plots:
1. Line plot of normalized geomean execution time (y axis) for each considered design point vs. number of designs considered (x axis)
2. Line plot of normalized geomean of energy-delay product (y axis) vs. number of designs considered (x axis)
3. Bar chart showing normalized per-benchmark execution time and geomean normalized execution time for the best performing design
4. Bar chart showing per-benchmark normalized energy-delay product and geomean normalized energy-delay product for the most energy-efficient design found
These four plots must be labelled in your report corresponding exactly to the numbering in the list above. Furthermore, the axes in the plots should be properly labelled.
A.3. Other Guidelines
For clarity in the written report, when listing the best design points, please do not represent
be assigned for following the guidelines and adhering to appropriate levels of clarity and style (and spelling, grammar, etc.) for a technical document.
B. Project FAQs
Q: What are the column headers for the .log file?
A: normalized EDP, normalized Execution time, absolute EDP, absolute Execution
time. The writes to both the .best and .log files are generated near the end of main.
Q: What are the column headers for the .best file?
A: Headers differ by line:
Line 1 headers: bestEDPconfig, normalized EDP of bestEDPconfig, normalized Execution time of bestEDPconfig, absolute EDP of bestEDPconfig, absolute Execution time of bestEDPconfig, absolute EDP of Bench 0 on bestEDPconfig, normalized EDP of Bench 0 on bestEDPconfig, absolute EDP of Bench 1 on bestEDPconfig, normalized EDP of Bench 1 on bestEDPconfig, absolute EDP of Bench 2 on bestEDPconfig, normalized EDP of Bench 2 on bestEDPconfig, absolute EDP of Bench 3 on bestEDPconfig, normalized EDP of Bench 3 on bestEDPconfig, absolute EDP of Bench 4 on bestEDPconfig, normalized EDP of Bench 4 on bestEDPconfig
Line 2 headers: bestTimeconfig, normalized EDP of bestTimeconfig, normalized Execution time of bestTimeconfig, absolute EDP of bestTimeconfig, absolute Execution time of bestTimeconfig, absolute Time of Bench 0 on bestTimeconfig, normalized Time of Bench 0 on bestTimeconfig, absolute Time of Bench 1 on bestTimeconfig, normalized Time of Bench 1 on bestTimeconfig, absolute Time of Bench 2 on bestTimeconfig, normalized Time of Bench 2 on bestTimeconfig, absolute Time of Bench 3 on bestTimeconfig, normalized Time of Bench 3 on bestTimeconfig, absolute Time of Bench 4 on bestTimeconfig, normalized Time of Bench 4 on bestTimeconfig
Q: Why are there only 18 configuration parameters when SimpleScalar (and the
project specification) list so many more?
A: There are 18 configuration variables, more settings derived from those 18 configuration variables, and still more settings that are fixed as constants (e.g. MPLAT). Given the block size (set independently), associativity (set independently), and number of sets (set independently), you can determine the total cache size for the L1 data and instruction caches and then validate whether the latency for that cache (set independently) is set correctly.
Q: What’s a quota error, why are half my output files empty, and why can’t I
make new files anymore?
A: It means you are out of disk space. Each run of this program produces a large number of intermediate output files for the evaluated design points. These are kept to speed up subsequent evaluations of the same design point in future runs as a means of reducing debugging/heuristic development time. Consider cleaning out your browser caches if you are low on disk quota before performing a project run.