
TABLE III.
PERFORMANCE COUNTER METRICS AND PCPG CONTROL DECISIONS BASED ON WORKLOAD CLASSIFICATION DETAILS.

Metric     Definition
nipc       (InstRet / Cycles) · (1 / IPCmax)
iprc       InstRet / ClockRef
nnmipc     (1 / IPCmax) · (InstRet / Cycles − Mem / Cycles)
urr        Cycles / ClockRef

Metric range            Classification          Freq.   A7      A15
urr    ∈ [0, 0.11]      0: low-activity         min     PCPG    PCPG
nnmipc ∈ [0.35, 1]      1: CPU-intensive        max     PCPG    max
nnmipc ∈ [0.25, 0.35)   2: CPU + memory         min     max     max
nnmipc ∈ [0, 0.25)      3: memory-intensive     max     max     PCPG

voltage/frequency combination in compliance with the energy and performance requirements of the exercised workload.

Core allocations to threads (TM) are typically governed by a scheduler [28]. A Linux scheduler typically distributes the overall workload across all available cores to attain high utilization. However, given a certain performance requirement, different types of threads must be processed differently for energy and performance optimization. For example, there is typically no differentiation regarding the nature of the thread being allocated, such as whether it is memory- or CPU-intensive. Tackling energy efficiency in heterogeneous many-core systems running concurrent workloads requires a great deal of effort, because the state space is significantly large and each workload demands a different optimization. Therefore, the hardware state space of a many-core heterogeneous platform comprises all practicable DVFS combinations and thread-to-core allocations (TM).

For instance, the number of feasible big.LITTLE core allocations for a single application is 19, as illustrated in Fig. 2(a), considering that at most one thread per core is permitted, each application must have at least one thread, and one of the little cores is reserved for running the operating system. These possible scenarios are then multiplied by the potential range of DVFS combinations, as shown in Fig. 2(b), computed as MA15 · MA7, where MA15 is the number of possible DVFS combinations in the A15 block and MA7 is the number of possible DVFS combinations in the A7 block. Exercising two applications concurrently yields 111 possible core allocations, making energy efficiency optimization extremely challenging. Therefore, running concurrent applications makes energy efficiency optimization a non-trivial task, as the number of possible scenarios increases exponentially.
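As a concrete illustration of how quickly this state space grows, the short sketch below enumerates the feasible thread-to-core allocations under the constraints stated above (at most one thread per core, at least one thread per application, one little core reserved for the operating system). It assumes a 4-big/4-little platform, an assumption made here only to reproduce the counts quoted in the text; the per-block DVFS level counts M_A7 and M_A15 are hypothetical parameters.

from itertools import product

BIG_CORES = 4         # assumed number of A15 cores
LITTLE_CORES = 4 - 1  # assumed A7 cores, with one little core reserved for the OS

def allocations(num_apps):
    """Count thread-to-core allocations: each application receives (big, little)
    core counts, at most one thread per core, and at least one core per app."""
    per_app = [(b, l)
               for b in range(BIG_CORES + 1)
               for l in range(LITTLE_CORES + 1)
               if (b, l) != (0, 0)]
    count = 0
    for combo in product(per_app, repeat=num_apps):
        if (sum(b for b, _ in combo) <= BIG_CORES and
                sum(l for _, l in combo) <= LITTLE_CORES):
            count += 1
    return count

print(allocations(1))  # 19 feasible allocations for a single application
print(allocations(2))  # 111 feasible allocations for two concurrent applications

# Each allocation is further multiplied by the DVFS state space M_A15 * M_A7.
M_A7, M_A15 = 5, 9     # hypothetical per-block DVFS level counts
print(allocations(1) * M_A15 * M_A7)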
                    IV. WORKLOAD CLASSIFICATIONS

The categorization of workload classes defined in the present work distinguishes between memory-intensive and CPU-intensive workloads, with low or high activity. Precisely, workloads are categorized into the following four classes (a per-class control sketch follows the list):

     • Class 0: low-activity workloads;

     • Class 1: CPU-intensive workloads;

     • Class 2: memory- and CPU-intensive workloads; and

     • Class 3: memory-intensive workloads.
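The per-class control decisions in Table III can be captured directly in a small lookup table, reproduced below as a minimal sketch. The WorkloadClass names and the Decision structure are illustrative choices rather than identifiers from the actual implementation; the frequency and power-gating settings themselves follow the rightmost columns of Table III.

from dataclasses import dataclass
from enum import IntEnum

class WorkloadClass(IntEnum):
    LOW_ACTIVITY = 0
    CPU_INTENSIVE = 1
    CPU_AND_MEMORY = 2
    MEMORY_INTENSIVE = 3

@dataclass
class Decision:
    frequency: str  # target DVFS level: "min" or "max"
    a7: str         # A7 block: "PCPG" (power gated) or "max" (active)
    a15: str        # A15 block: "PCPG" (power gated) or "max" (active)

# PCPG control decisions per workload class, per Table III.
PCPG_DECISIONS = {
    WorkloadClass.LOW_ACTIVITY:     Decision("min", "PCPG", "PCPG"),
    WorkloadClass.CPU_INTENSIVE:    Decision("max", "PCPG", "max"),
    WorkloadClass.CPU_AND_MEMORY:   Decision("min", "max", "max"),
    WorkloadClass.MEMORY_INTENSIVE: Decision("max", "max", "PCPG"),
}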
Large-scale investigative experiments were carried out in our previously published work [11] to examine the rationality of these concepts. These investigations demonstrate that optimum energy efficiency can be achieved if fewer little cores are used for memory-intensive applications, while it is advantageous to exercise more big cores in parallel for CPU-intensive applications. The classification is achieved by computing a range of metrics from performance counter readings; the classes are then obtained depending on whether these metrics have crossed predetermined thresholds, as illustrated in Table III. The notable performance counter events are listed below:

     • InstRet: the number of retired (executed) instructions, which forms part of the widely investigated instructions-per-cycle (IPC) metric.

     • Cycles: the number of clock cycles per core allocation.

     • MEM ACCESS: a memory write or read operation that causes a cache access to at least the level of the data cache.

     • L1 CACHE: an instruction cache access at level 1.

These metrics are calculated based on the previous study published in [7]. For instance, the more nnmipc rises, the more CPU-intensive the workload becomes. Therefore, for a CPU-intensive workload all A7 cores must be powered off while the A15 cores can be activated at a higher DVFS level to maximize energy efficiency (IPS/Watt). Conversely, for a memory-intensive workload, all A15 cores must be power gated while all A7 cores can be activated at the maximum DVFS level. Power gating the big cores while exercising memory-intensive workloads can drastically improve the energy efficiency of heterogeneous many-core systems due to their high power consumption when they act as dark silicon.
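To make the classification step concrete, the sketch below computes the Table III metrics from raw counter readings and applies the thresholds from the table. The CounterSample structure, the function names, and the IPC_MAX value are illustrative assumptions; only the metric formulas and threshold ranges are taken from Table III.

from dataclasses import dataclass

IPC_MAX = 2.0  # hypothetical per-core peak IPC used for normalization

@dataclass
class CounterSample:
    inst_ret: int    # InstRet: retired instructions
    cycles: int      # Cycles: core clock cycles
    clock_ref: int   # ClockRef: reference clock cycles
    mem_access: int  # MEM ACCESS: data memory accesses

def metrics(s: CounterSample) -> dict:
    """Table III metrics derived from one performance-counter sample."""
    return {
        "nipc":   (s.inst_ret / s.cycles) * (1.0 / IPC_MAX),
        "iprc":   s.inst_ret / s.clock_ref,
        "nnmipc": (1.0 / IPC_MAX) * ((s.inst_ret - s.mem_access) / s.cycles),
        "urr":    s.cycles / s.clock_ref,
    }

def classify(m: dict) -> int:
    """Map the metrics to a workload class using the Table III thresholds."""
    if m["urr"] <= 0.11:
        return 0  # low activity
    if m["nnmipc"] >= 0.35:
        return 1  # CPU-intensive
    if m["nnmipc"] >= 0.25:
        return 2  # CPU and memory intensive
    return 3      # memory-intensive

Combined with the PCPG_DECISIONS lookup sketched earlier, the returned class selects the frequency and power-gating settings for the A7 and A15 blocks.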
                                                                    work (PSN) based CMOS transistors. This PSN is controlled