A New Formulation of Neural Data Prefetching (TL;DR)
One-sentence summary: building a neural network in hardware for prefetching — a pretty wild idea, but some people have set out to prove it is feasible.
These are notes for my final presentation. At some point I will trim them down and translate them into Chinese.
A New Formulation of Neural Data Prefetching
Abstract
These are my scratch notes on the article.
Temporal data prefetchers:
potential:
- produce performance gains by prefetching irregular data streams.
Recent work: a neural model in hardware?
- a neural model for temporal prefetching > table-based temporal prefetchers.
Drawbacks:
- large storage
- latency costs
- unable to generalize to memory addresses outside of the training dataset, so not practical in hardware.
A key observation:
current temporal prefetchers typically assume that each address can be followed by any successor.
In practice, a given address is followed by only a few distinct successors.
Target:
a new abstraction of memory addresses.
Twilight:
- reduces latency: 988x (compared to Voyager)
- shrinks storage: 10.8x
- speedup: 4%
- predicts new temporal correlations not present in the training data
- outperforms STMS by 12.2% and Domino by 8.5%
T-LITE
a slimmed-down version that can prefetch across different program runs:
- 1421x faster (compared to Voyager)
- 142x smaller
- the same performance
- 5.9% faster (compared to Triage)
- 64KB of in-cache metadata (vs. Triage's 1MB)
- when run on unseen inputs, achieves 94% of the performance of a T-LITE model fine-tuned on those inputs.
Introduction
data prefetcher: used to hide the long latency of memory accesses.
most modern hardware: identifies strides or spatial footprints.
this paper focuses on temporal prefetching: a type of irregular data prefetching that identifies pairs of addresses that are temporally correlated (address X followed by address Y).
ML-based temporal prefetchers like Voyager provide significant headroom over table-based ones (+13% speedup), but:
- large model size/latency: they take addresses as input and produce them as output directly.
- inability to adapt or generalize: the model cannot be trained online (too much compute) or offline (virtual addresses change between runs).
so our choice is:
- NOT: operate on data addresses directly like Voyager.
- INSTEAD: rethink the prediction problem so that it is more amenable to a neural network.
Utilize layers of indirection (as in programming language -> machine code) to create an abstraction space.
Sparse Connectivity Hypothesis
each address is ever succeeded by only a few other addresses.
INSTEAD OF PREDICTING AN ADDRESS, predict an ordinal \(i\), meaning: prefetch the \(i^{th}\) most frequent candidate.
We call this frequency-based candidate selection:
closer to a practical prefetcher.
Advantages of two abstraction layers: staged deployment.
The neural model is trained offline; the mapping between addresses and the abstraction layers is trained online.
Related Work
usage of ML here is recent, and mostly on programs with good locality.
Voyager (not practical) is the only prior solution that supports temporal prefetching.
ML-Based Spatial Prefetching
mostly uses neural networks or reinforcement learning.
neural networks:
- clustering to share model weights
- compression by representing deltas in binary format
- spatial bitmask prediction
- graph specialization
RL: lower cost, lower performance; can be trained online to make up for the drawback.
Temporal Prefetching
focus on performance:
- spatio-temporal correlation
- PC-localization
- longer histories (what Voyager relies on)
or practicality:
- store the correlation table off-chip and optimize the memory bandwidth
- prefetch at a distance
- Global History Buffer
- ISB prefetcher: metadata cached on-chip
- MISB prefetcher: off-chip metadata can easily be prefetched to hide its off-chip access latency
- Triage prefetcher: stores the most relevant metadata on chip by repurposing a portion of the last-level cache
- Twilight: has an online component, so it adapts to unseen data addresses.
Neural Temporal Prefetching
The class explosion problem: \(2^{52}\) cache lines are too many output classes.
The number of unique pages in a program is 1~2 orders of magnitude smaller than the number of unique cache lines.
- Voyager: splits cache line addresses into (page, offset) pairs.
But the number of page classes (\(2^{46}\)) is still very large:
on SPEC 2006, that means ~2 months of training for <1% validation accuracy.
Profile-Driven Dynamic Prediction
profile-driven prefetchers: use profiling to insert SOFTWARE prefetches into the code, which are DIFFICULT TO TIME CORRECTLY; T-LITE instead uses HARDWARE prefetching.
T-LITE has two stages:
- offline stage: train the model on profiled memory traces.
- online stage: collect runtime information.
Other Hardware Prediction Tasks
ML in hardware is typically used for cache replacement, branch prediction, level prediction, etc.
Our Work
Problem Formulation
Sparse Connectivity Hypothesis
93.2% of cache lines are followed by one of their top 20 successors (98.5% for one of the top 100).
Frequency-based candidate selection
- select from among the top-N most likely successors;
- add an \((N + 1)^{th}\) option so that the prefetcher can learn when not to prefetch;
- the output is an ordinal value \(i\), representing prefetching the \(i^{th}\) most common successor, or not prefetching when \(i = N + 1\) (see the sketch below).
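A minimal sketch of this selection scheme in Python (my own illustration; `record_transition` and `resolve_prediction` are hypothetical names, not from the paper):

```python
from collections import Counter, defaultdict

N = 20  # Twilight tracks the top-20 successors per trigger (from the notes)

# online metadata: trigger page -> frequency counts of observed successor pages
successor_freq = defaultdict(Counter)

def record_transition(trigger, successor):
    """Update the frequency table as the access stream is observed."""
    successor_freq[trigger][successor] += 1

def resolve_prediction(trigger, ordinal):
    """Translate the model's ordinal output into a prefetch decision:
    ordinal in 1..N selects the ordinal-th most frequent successor,
    ordinal == N + 1 is the learned 'do not prefetch' option."""
    if ordinal == N + 1:
        return None
    ranked = successor_freq[trigger].most_common(N)
    if ordinal <= len(ranked):
        return ranked[ordinal - 1][0]
    return None  # fewer candidates observed so far than the requested rank
```

The point of the indirection: the model never outputs an address, so the same trained ordinal predictor keeps working even when the underlying addresses change between runs.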
Page Granularity
the number of unique cache lines grows faster than the number of unique pages: decomposing data addresses into pages and offsets makes the neural model's job much easier.
- the Sparse Connectivity Hypothesis extends to pages
- we only apply the new problem formulation to Twilight's page prediction (because there is only a small number of cache line offsets per page)
Behavioral Clustering
the neural model cannot deal with a huge number of inputs.
What we do: group together pages that have similar prefetching behavior, assuming pages in the same cluster will keep behaving similarly when fetching the next page.
Graph Traversal Example
\(A\) and \(B\) are considered similar if they make the same decision (same ordinal choice of successor) under similar conditions.
Why is this useful? Consider a program that runs DFS 30% of the time and BFS 70% of the time: pages visited by both traversals share the same mixed successor behavior, so they can share one cluster.
Empirical Validation
Core idea: pages with similar prefetching behavior produce similar predictions.
Page embeddings implicitly capture prefetching behavior, so we examine the trained page embeddings to verify this.
Offset transitions are used to represent the prefetch behavior of a page: entry \((i, j)\) counts the number of times offset \(i\) is followed by offset \(j\).
t-SNE is used to visualize a subset of Twilight's page embeddings. Pages with the same color and shape have similar offset transitions, and they cluster together, which means the embedding space correctly reflects the similarity among pages.
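To make the offset-transition representation concrete, here is a small numpy sketch (my own construction, not the paper's code); pages are compared by cosine similarity of their flattened transition matrices:

```python
import numpy as np

OFFSETS = 64  # cache-line offsets within a 4KB page

def offset_transition_matrix(offset_stream):
    """entry (i, j) counts how often offset i is immediately followed by offset j."""
    m = np.zeros((OFFSETS, OFFSETS))
    for i, j in zip(offset_stream, offset_stream[1:]):
        m[i, j] += 1
    return m

def behavior_similarity(stream_a, stream_b):
    """Cosine similarity between two pages' flattened transition matrices."""
    a = offset_transition_matrix(stream_a).ravel()
    b = offset_transition_matrix(stream_b).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

print(behavior_similarity([0, 1, 2, 3], [0, 1, 2, 3]))  # 1.0: identical behavior
print(behavior_similarity([0, 1, 2, 3], [3, 2, 1, 0]))  # 0.0: opposite traversal
```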
Clustering
compute a fixed number of behavioral clusters, which yields a fixed-size page embedding table and a fixed-size neural model.
Training is done in two steps (sketched below):
- as in Voyager, the entire neural model and all pages' embeddings are trained offline;
- pages in the same cluster are then forced to share the same embedding, and the weights are fine-tuned by retraining the model.
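A sketch of the clustering step under my reading of the text (the notes don't say which clustering algorithm is used; plain k-means here is an illustrative stand-in):

```python
import numpy as np

def cluster_page_embeddings(page_embeddings, num_clusters, iters=50):
    """Plain k-means: returns (page -> cluster id, fixed-size cluster embedding table)."""
    rng = np.random.default_rng(0)
    centroids = page_embeddings[
        rng.choice(len(page_embeddings), num_clusters, replace=False)]
    for _ in range(iters):
        # assign each page to its nearest centroid
        d = np.linalg.norm(page_embeddings[:, None] - centroids[None, :], axis=2)
        assign = d.argmin(axis=1)
        for c in range(num_clusters):
            members = page_embeddings[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return assign, centroids

# the per-page table (|pages| rows) collapses to num_clusters rows; the model is
# then fine-tuned with the shared cluster embeddings in place
emb = np.random.default_rng(1).standard_normal((1000, 32))
assign, table = cluster_page_embeddings(emb, num_clusters=64)
```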
Dynamic Metadata Collection
The new problem: maintaining the distribution of the \(N\) most frequent successor pages.
Each metadata entry corresponds to an input history and stores the frequencies of each of its successors.
(naively, keep a table for every possible history?)
!: Naively tracking data for every history has exponential cost!
If \(P\) is the number of pages, tracking histories of length \(H\) could require \(P^H\) distributions.
- decoupled positional frequency (DPF): let \(f(p, n)\) be the distribution of pages that appear \(n\) accesses after page \(p\). Then
\(f((X, Y, Z), 1) \approx f(X, 3) \times f(Y, 2) \times f(Z, 1)\),
which needs only \(\Theta(PH)\) distributions (toy implementation below).
- the weight of each access is not uniform: intuitively, the latest access is the most decisive for prediction.
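A toy DPF implementation as I read the formula (names are mine; real hardware would cap each counter table at the top-\(N\) entries):

```python
from collections import Counter, defaultdict

H = 3  # history length

# dpf[(page, n)] = Counter of pages observed n accesses after `page`: Theta(P*H) tables
dpf = defaultdict(Counter)

def record(page_stream):
    for t, page in enumerate(page_stream):
        for n in range(1, H + 1):
            if t + n < len(page_stream):
                dpf[(page, n)][page_stream[t + n]] += 1

def history_distribution(history, candidates):
    """f((X, Y, Z), 1) ~ f(X, 3) x f(Y, 2) x f(Z, 1), elementwise over candidates."""
    scores = []
    for cand in candidates:
        score = 1.0
        for age, page in enumerate(history):   # history[0] is the oldest access
            lookahead = len(history) - age     # older pages look further ahead
            score *= dpf[(page, lookahead)][cand]
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores] if total else scores

record(list("ABCABCABD"))
print(history_distribution(list("ABC"), ["A", "D"]))  # [1.0, 0.0]
```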
Model Overview
[Figure: T-LITE model overview]
T-LITE predicts the prefetch offset directly, plus an ordinal \(i\) that maps to a prefetch candidate page. That is, T-LITE predicts the page and the offset within the page independently.
It decomposes the input address history into
- a page history -> behavioral clusters
- an offset history
and also takes in the raw DPF distribution.
It generates context-aware offset embeddings:
page and PC information is combined into the offset embeddings.
Compared with Voyager, T-LITE further contextualizes the cache line embedding so that it carries information about the related access stream.
and it also has:
- a single fully connected layer for candidate prediction
- a single fully connected layer for offset prediction
- advantage: the two heads are parallelizable
- page/cluster embeddings smaller than Voyager's page embeddings
- load PC embeddings limited to the 4096 most frequent load PCs (the embedding table can't be too large)
The candidate-page mapping table is stored implicitly: the DPF metadata is augmented with a 12-bit integer representing each page's cluster. (A shape-level sketch of the forward pass follows.)
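A shape-level sketch of the forward pass as I read it — a minimal sketch, assuming an embedding width of 32 and an elementwise mixing op (both my guesses); only the 4096 PC limit and the 12-bit (4096-entry) cluster id come from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
H, N, E = 3, 4, 32                 # history length, candidates, embedding width (assumed)
NUM_CLUSTERS, NUM_PCS, OFFS = 4096, 4096, 64  # 12-bit cluster ids, top-4096 load PCs

cluster_emb = rng.standard_normal((NUM_CLUSTERS, E))
offset_emb = rng.standard_normal((OFFS, E))
pc_emb = rng.standard_normal((NUM_PCS, E))

in_dim = H * E + H * E + E + H * (N + 1)       # ctx offsets + clusters + PC + DPF vectors
W_page = rng.standard_normal((N + 1, in_dim))  # candidate head (incl. no-prefetch ordinal)
W_off = rng.standard_normal((OFFS, in_dim))    # offset head; the two run in parallel

def forward(cluster_hist, offset_hist, load_pc, dpf_vectors):
    c = cluster_emb[cluster_hist]          # (H, E) raw cluster embeddings
    o = offset_emb[offset_hist]            # (H, E)
    p = pc_emb[load_pc]                    # (E,)
    ctx_off = o * c + p                    # context-aware offset embedding
                                           # (elementwise mix is my guess)
    x = np.concatenate([ctx_off.ravel(), c.ravel(), p, dpf_vectors.ravel()])
    return (W_page @ x).argmax(), (W_off @ x).argmax()  # (ordinal, offset)

ordinal, offset = forward([5, 9, 9], [3, 4, 5], 17, rng.random((H, N + 1)))
```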
Model Inference
take the DFS/BFS traversal as the example
[Figure: inference walkthrough]
For the input history \((B, D, E)\), \(E\) determines the prefetch candidates \(\{F, C\}\), so we construct the candidate set \(\{F, C\}\).
After renormalizing, we get the DPF vectors.
Feed the DPF vectors, cluster sequence, offset sequence, and load PC to T-LITE.
T-LITE computes the context-aware offset embedding and concatenates it with the DPF vectors, the raw cluster embeddings, and the raw PC embedding.
We flatten the vector and feed it into the two output layers for prediction:
\((1 \rightarrow F,\ 2 \rightarrow C,\ 3 \rightarrow \emptyset)\)
T-LITE selects \(C\) because \(f(B, 3)\) shows that \(F\) never occurred 3 accesses ahead of \(B\) (replayed numerically below).
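Replaying the walkthrough with invented counts (the real model feeds these vectors to the network rather than multiplying them, but the product shows why \(C\) wins):

```python
# toy DPF counts consistent with the walkthrough (numbers invented)
f_E_1 = {"F": 7, "C": 3}   # f(E, 1): gives the candidate set {F, C}
f_D_2 = {"F": 5, "C": 5}   # f(D, 2)
f_B_3 = {"F": 0, "C": 6}   # f(B, 3): F never seen 3 accesses ahead of B

candidates = ["F", "C"]
scores = [f_E_1[c] * f_D_2[c] * f_B_3[c] for c in candidates]
probs = [s / sum(scores) for s in scores]     # renormalized over the candidate set
print(dict(zip(candidates, probs)))           # {'F': 0.0, 'C': 1.0} -> prefetch C
```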
METHODOLOGY
- x: Voyager simulated online training with epochs of 50M instructions: train on one epoch, apply to the next. But neither Voyager nor Twilight can feasibly train online.
- v: offline training, to faithfully model staged deployment
(train the model on one region of a SimPoint (a trace-sampling technique) and test it on another).
Testing:
simulate the baseline and neural prefetchers using the ChampSim simulator on the GAP benchmark suite (plus some Google server workloads).
Each SimPoint provides (200 + 25 + 175)M instructions: 200M for training, 25M for validation, and 175M for evaluation.
TWILIGHT EVALUATION
[A lot of evidence that Twilight > Voyager]
Twilight sometimes shows worse performance because it cannot prefetch successors beyond the top \(N = 20\).
Most of the time, though, Voyager performs worse, because the benchmarks' evaluation regions contain many accesses to pages not in the training data.
[A lot of evidence that Twilight > non-neural baselines]
Twilight sometimes shows worse performance because some access patterns and load PCs are not found in the training data. But the gap could be bridged by using LoRA to fine-tune the model.
LoRA = Low-Rank Adaptation:
separating the fine-tuning weight updates from the model and decomposing them into rank-2 matrices, enabling cheap computation and minimal storage (see the sketch below).
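A sketch of rank-2 LoRA in the standard formulation (generic technique, not code from the paper; layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 2        # rank-2 adaptation, as in the notes

W = rng.standard_normal((d_out, d_in))     # pretrained weight, kept frozen
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection; starts at zero so
                                           # the adapted layer equals the original at step 0

def adapted_forward(x):
    # fine-tuning updates only A and B: r * (d_in + d_out) values to compute
    # and store, instead of all d_out * d_in entries of W
    return W @ x + B @ (A @ x)
```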
Twilight dramatically improves generalization.
Twilight's accuracy stays constant at higher lookaheads, enabling it to achieve 1.5% more speedup thanks to better timeliness. That's because Twilight's online metadata component improves its adaptability.
Comparison with Delta Prefetchers
[A lot of evidence that Twilight > delta prefetchers, IPCP and Pythia]
Conclusion: hybridizing across types of prefetchers provides better performance than just increasing degree.
Conclusion: when the trigger address is close to the prefetch address, the traditional prefetchers do slightly better, but in the irregular case, Twilight does a lot better.
Multi-Core Results
The gap between the neural and non-neural prefetchers grows, thanks to the neural prefetchers' higher accuracy.
Google Results
Twilight's higher adaptability enables it to achieve significantly higher accuracy.
Frequency-Based Candidate Selection
better prefetch accuracy:
it improves performance for two reasons:
- Twilight can predict not to prefetch;
- mispredictions are more likely to be useful (a wrong candidate is still a frequent successor).
Generalizability
Predicted Candidate Distribution
The predicted distribution over successor candidates is closer to the ground truth than the naive approach (simply multiplying the DPF distributions together).
T-LITE EVALUATION
DPF Metadata Management
trade-offs to save space:
- reduce the number of tracked successors from N = 20 to N = 4;
- only use \(f(P, 1)\), the immediate page successors.
Neural Model
Behavioral Clustering
as described above; after fine-tuning, the performance drop is only 0.1%.
Weight Quantization
change the model weights from 32-bit floating-point numbers to 8-bit integers, WITHOUT PERFORMANCE LOSS (generic sketch below).
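A minimal sketch of symmetric 8-bit weight quantization (a generic scheme; the paper's exact method isn't recorded in my notes):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 weights -> int8 plus one scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((64, 128)).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())  # tiny vs. the weight range
```

The 4x shrink in weight storage comes for free here because the rounding error is small relative to the weight range.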
Speculative Prediction
with matrix ISA extensions such as Intel AMX, T-LITE's prediction latency is approximately 29 cycles.
Since its accuracy is high, T-LITE can iterate on its own prefetches to predict further ahead (sketched below).
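A sketch of the iterative idea (my own rendering; `predict` is a stand-in for one model inference):

```python
def speculative_prefetch(history, depth, predict):
    """Iteratively feed the model's own predictions back in as if they were the
    next accesses, prefetching up to `depth` steps ahead. Viable because accuracy
    is high; each predict() call is one inference (~29 cycles with AMX-style units)."""
    issued, hist = [], list(history)
    for _ in range(depth):
        nxt = predict(tuple(hist))
        if nxt is None:              # the learned no-prefetch option
            break
        issued.append(nxt)
        hist = hist[1:] + [nxt]      # speculate on our own prediction
    return issued
```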
Results
compared with Triage:
- 20.7% speedup, outperforming Triage (14.8%)
- 2.6% higher accuracy
- 7.8% higher coverage
- 64KB of metadata and 0.3~0.8MB of neural weights (compared with Triage's 1MB of metadata)
- the 29-cycle prediction latency is similar to the 20-cycle delay Triage incurs reading metadata from the cache; however, T-LITE's DPF metadata lives outside the cache.
Transfer Learning
Dynamic Page-to-Cluster Mapping
dynamically map pages to clusters at runtime by assigning each page the cluster with the most similar offset transition distribution.
Reasonable accuracy, with cheaper overhead than dynamically training a page embedding (a sketch follows).
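A sketch of the runtime assignment as I understand it (the function and the L1 distance choice are mine; the paper's similarity metric isn't recorded in these notes):

```python
import numpy as np

def assign_cluster(page_transitions, cluster_transitions):
    """page_transitions: (64, 64) offset-transition counts for a new page;
    cluster_transitions: (K, 64, 64) representative counts per cluster.
    Returns the cluster whose normalized transition distribution is closest (L1)."""
    def norm(t):
        t = t.ravel().astype(float)
        s = t.sum()
        return t / s if s else t
    p = norm(page_transitions)
    return int(np.argmin([np.abs(p - norm(c)).sum() for c in cluster_transitions]))
```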
Cross-Input Evaluation
train the model on graph type A/B, validate it on graph type B/A, test it on graph type C.
Though never trained on graph type C, T-LITE still performs much better than Triage (30.4% vs. 13.8% speedup). Moreover, it achieves 94% of the performance of a model fine-tuned on type-C graphs.
Though on the GAP dataset the dynamic mapping rarely picks the single best cluster, the clusters it does pick still yield high performance.
T-LITE's limitations:
inference costs more energy; the model is constrained to the program it was trained on and requires representative data and significant offline training.
Future Work
Lightweight Neural Temporal Prefetching
We can envision a lightweight neural model working alongside a temporal prefetcher.
Insights for Non-Neural Temporal Prefetching
The idea of decomposing addresses could likewise help future temporal prefetchers reduce their metadata by exploiting page-to-page temporal locality.
CONCLUSIONS
Voyager showed the possibility of using a neural network for prefetching. However, the costs in model size, latency, and training time are very large, and the direct use of addresses prevents the model from being trained offline.
Twilight: two novel layers of indirection that abstract a temporal prefetcher away from specific data addresses.
They dramatically reduce the size and cost of the model: 10.8x smaller and 988x faster (latency), while outperforming Voyager by 4%.
T-LITE: trades off a little performance to make the approach more viable in hardware, matching Voyager's performance with 142x less storage and 1421x faster prediction.
Its model size is 0.3~0.8MB, while its on-chip metadata is only 64KB, compared with Triage's 1MB of on-chip metadata.
This demonstrates that the amount of dynamic metadata can be far smaller than previously thought.
