Prefetcher Optimization

The memory subsystem is a critical component of a CPU, composed of structures such as caches and predictors. A cache reduces the latency between when a memory request is generated and when the data is available to the core. Data prefetchers further reduce latency by speculating about future accesses and increasing data coverage. However, aggressive prefetching may increase cache pollution, create a memory bandwidth bottleneck, and add latency to critical-path demand queues. Cache pollution occurs when the prefetcher brings unnecessary data into the cache that will never be used. A memory bandwidth bottleneck arises when the prefetcher generates so many memory requests that critical demand misses incur extra latency due to insufficient available bandwidth. Finally, because the prefetch and demand queues are shared, excessive prefetching may fill the core queues and add latency to demand requests.

Managing the aggressiveness of the prefetchers is necessary to mitigate these problems. State-of-the-art hardware prefetcher solutions manage aggressiveness by analyzing telemetry data such as prefetcher accuracy and memory bandwidth consumption. This is insufficient because telemetry data alone does not necessarily correlate with overall system performance. Furthermore, current solutions optimize each prefetcher individually, rather than allowing the prefetchers to work together to improve overall system performance. Appropriate management of prefetcher aggressiveness may therefore yield performance improvements. Preliminary investigations motivating our work analyzed enabling and disabling the prefetchers: we demonstrate that 60% of regions of interest from the SPEC CPU2017 suite show performance improvement when one of the prefetchers is disabled. We take this work a step further by managing aggressiveness at a finer granularity than purely on or off.
We propose the Aggressiveness Degree Manager (ADM), which employs Q-learning to find the optimal aggressiveness policy for multiple prefetchers at run time. The aggressiveness degree represents the number of cache lines a prefetcher may fetch in a single request; a degree of zero means the prefetcher is disabled, while highly aggressive prefetchers fetch up to ten cache lines per request. The ADM agent manages the aggressiveness degree by varying it from zero to a maximum threshold. The current model is designed for a single-core, single-process implementation, optimizing the mid-level cache (MLC) prefetchers. We evaluated the ADM agent using 10 prefetch-sensitive workloads from the SPEC CPU2017 suite. The ADM agent demonstrated a 4.2% higher speedup than the best static hardware configuration, with a 2.6% average variance from the optimal prefetcher aggressiveness degree.
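The core idea can be sketched as a tabular Q-learning loop whose actions are aggressiveness degrees. This is a minimal illustrative sketch only: the state encoding, reward signal, and hyperparameter values below are assumptions for illustration, not the report's actual design.

```python
import random

MAX_DEGREE = 10                    # abstract: up to ten cache lines per request
ACTIONS = list(range(MAX_DEGREE + 1))  # degree 0 (disabled) .. MAX_DEGREE
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # illustrative hyperparameters

# Q-table mapping (state, action) -> expected return. A state could be a
# coarse bin of telemetry (e.g. prefetch accuracy, bandwidth usage); the
# abstract does not specify the state encoding.
Q = {}

def q(state, action):
    return Q.get((state, action), 0.0)

def choose_degree(state):
    """Epsilon-greedy selection of the next aggressiveness degree."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q(state, a))

def update(state, action, reward, next_state):
    """Standard Q-learning update after observing one interval's reward
    (e.g. IPC improvement over the measurement interval)."""
    best_next = max(q(next_state, a) for a in ACTIONS)
    Q[(state, action)] = q(state, action) + ALPHA * (
        reward + GAMMA * best_next - q(state, action))
```

Each control interval, the agent would observe the current telemetry state, pick a degree with `choose_degree`, apply it to the prefetcher, and call `update` with the measured performance reward.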

  • This report represents the work of one or more WPI undergraduate students submitted to the faculty as evidence of completion of a degree requirement. WPI routinely publishes these reports on its website without editorial or peer review.
Identifier
  • 4711
  • E-project-120720-095550
Year
  • 2020
Date created
  • 2020-12-07

Permanent link to this page: https://digital.wpi.edu/show/q237hv91b