
TABLE III.
PERFORMANCE COUNTER METRICS AND PCPG CONTROL DECISIONS BASED ON WORKLOAD CLASSIFICATION DETAILS.

Metric     Definition
nipc       (InstRet / Cycles) · (1 / IPCmax)
iprc       InstRet / ClockRef
nnmipc     (1 / IPCmax) · (InstRet / Cycles − Mem / Cycles)
urr        Cycles / ClockRef

Metric range            Classification          Freq.   A7      A15
urr    ∈ [0, 0.11]      0: low-activity         min     PCPG    PCPG
nnmipc ∈ [0.35, 1]      1: CPU-intensive        max     PCPG    max
nnmipc ∈ [0.25, 0.35)   2: CPU + memory         min     max     max
nnmipc ∈ [0, 0.25)      3: memory-intensive     max     max     PCPG

voltage/frequency combination in compliance with the energy and performance requirements of the exercised workload.

Core allocations to threads (TM) are typically governed by a scheduler [28]. A Linux scheduler typically distributes the overall workload across all available cores to attain high utilization. However, given a certain performance requirement, different types of threads must be processed differently for energy and performance optimization. For example, there is typically no differentiation regarding the nature of the thread being allocated, such as whether it is memory- or CPU-intensive. Tackling energy efficiency in heterogeneous many-core systems running concurrent workloads requires a great deal of effort, because the state space is significantly large and each workload demands a different optimization. Therefore, the hardware state space of a many-core heterogeneous platform comprises all practicable DVFS combinations and thread-to-core allocations (TM).

For instance, the number of feasible big.LITTLE core allocations for a single application is 19, as illustrated in Fig. 2(a), considering that at most one thread per core is permitted, each application must have at least one thread, and one of the little cores is reserved for running the operating system. These possible scenarios are then multiplied by the potential range of DVFS combinations, as shown in Fig. 2(b), computed as MA15 · MA7, where MA15 is the number of possible DVFS combinations in the A15 block and MA7 is the number of possible DVFS combinations in the A7 block. Exercising two applications concurrently yields 111 possible core allocations, making energy efficiency optimization extremely challenging. Therefore, running concurrent applications makes energy efficiency optimization a non-trivial task, as the number of possible scenarios increases exponentially.
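As a concrete illustration of how quickly this state space grows, the short sketch below enumerates the feasible thread-to-core allocations under the constraints stated above (at most one thread per core, at least one thread per application, one little core reserved for the operating system). It assumes a 4-big/4-little platform, an assumption made here only to reproduce the counts quoted in the text; the per-block DVFS level counts M_A7 and M_A15 are hypothetical parameters.

from itertools import product

BIG_CORES = 4         # assumed number of A15 cores
LITTLE_CORES = 4 - 1  # assumed A7 cores, with one little core reserved for the OS

def allocations(num_apps):
    """Count thread-to-core allocations: each application receives (big, little)
    core counts, at most one thread per core, and at least one core per app."""
    per_app = [(b, l)
               for b in range(BIG_CORES + 1)
               for l in range(LITTLE_CORES + 1)
               if (b, l) != (0, 0)]
    count = 0
    for combo in product(per_app, repeat=num_apps):
        if (sum(b for b, _ in combo) <= BIG_CORES and
                sum(l for _, l in combo) <= LITTLE_CORES):
            count += 1
    return count

print(allocations(1))  # 19 feasible allocations for a single application
print(allocations(2))  # 111 feasible allocations for two concurrent applications

# Each allocation is further multiplied by the DVFS state space M_A15 * M_A7.
M_A7, M_A15 = 5, 9     # hypothetical per-block DVFS level counts
print(allocations(1) * M_A15 * M_A7)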
                    IV. WORKLOAD CLASSIFICATIONS

The categorization of workload classes defined in the present work distinguishes between memory-intensive and CPU-intensive workloads, with low or high activity. Precisely, workloads are categorized into the following four classes (a per-class control sketch follows the list):

     • Class 0: low-activity workloads;

     • Class 1: CPU-intensive workloads;

     • Class 2: memory- and CPU-intensive workloads; and

     • Class 3: memory-intensive workloads.
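The per-class control decisions in Table III can be captured directly in a small lookup table, reproduced below as a minimal sketch. The WorkloadClass names and the Decision structure are illustrative choices rather than identifiers from the actual implementation; the frequency and power-gating settings themselves follow the rightmost columns of Table III.

from dataclasses import dataclass
from enum import IntEnum

class WorkloadClass(IntEnum):
    LOW_ACTIVITY = 0
    CPU_INTENSIVE = 1
    CPU_AND_MEMORY = 2
    MEMORY_INTENSIVE = 3

@dataclass
class Decision:
    frequency: str  # target DVFS level: "min" or "max"
    a7: str         # A7 block: "PCPG" (power gated) or "max" (active)
    a15: str        # A15 block: "PCPG" (power gated) or "max" (active)

# PCPG control decisions per workload class, per Table III.
PCPG_DECISIONS = {
    WorkloadClass.LOW_ACTIVITY:     Decision("min", "PCPG", "PCPG"),
    WorkloadClass.CPU_INTENSIVE:    Decision("max", "PCPG", "max"),
    WorkloadClass.CPU_AND_MEMORY:   Decision("min", "max", "max"),
    WorkloadClass.MEMORY_INTENSIVE: Decision("max", "max", "PCPG"),
}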
Large-scale investigative experiments were carried out in our previously published work [11] to examine the rationality of these concepts. These investigations demonstrate that optimum energy efficiency can be achieved if fewer little cores are used for memory-intensive applications, while it is advantageous to exercise more big cores in parallel for CPU-intensive applications. The classification is achieved by computing a range of metrics from performance counter readings; the classes are then obtained depending on whether these metrics have crossed predetermined thresholds, as illustrated in Table III. The notable performance counter events are listed below:

     • InstRet: the number of retired (executed) instructions, which forms part of the widely investigated instructions-per-cycle (IPC) metric.

     • Cycles: the number of clock cycles per core allocation.

     • MEM ACCESS: a memory write or read operation that causes a cache access to at least the level of the data cache.

     • L1 CACHE: an instruction cache access at level 1.

These metrics are calculated based on the previous study published in [7]. For instance, the more nnmipc rises, the more CPU-intensive the workload becomes. Therefore, for a CPU-intensive workload all A7 cores must be powered off while the A15 cores can be activated at a higher DVFS level to maximize energy efficiency (IPS/Watt). Conversely, for a memory-intensive workload, all A15 cores must be power gated while all A7 cores can be activated at the maximum DVFS level. Power gating the big cores while exercising memory-intensive workloads can drastically improve the energy efficiency of heterogeneous many-core systems due to their high power consumption when they act as dark silicon.
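To make the classification step concrete, the sketch below computes the Table III metrics from raw counter readings and applies the thresholds from the table. The CounterSample structure, the function names, and the IPC_MAX value are illustrative assumptions; only the metric formulas and threshold ranges are taken from Table III.

from dataclasses import dataclass

IPC_MAX = 2.0  # hypothetical per-core peak IPC used for normalization

@dataclass
class CounterSample:
    inst_ret: int    # InstRet: retired instructions
    cycles: int      # Cycles: core clock cycles
    clock_ref: int   # ClockRef: reference clock cycles
    mem_access: int  # MEM ACCESS: data memory accesses

def metrics(s: CounterSample) -> dict:
    """Table III metrics derived from one performance-counter sample."""
    return {
        "nipc":   (s.inst_ret / s.cycles) * (1.0 / IPC_MAX),
        "iprc":   s.inst_ret / s.clock_ref,
        "nnmipc": (1.0 / IPC_MAX) * ((s.inst_ret - s.mem_access) / s.cycles),
        "urr":    s.cycles / s.clock_ref,
    }

def classify(m: dict) -> int:
    """Map the metrics to a workload class using the Table III thresholds."""
    if m["urr"] <= 0.11:
        return 0  # low activity
    if m["nnmipc"] >= 0.35:
        return 1  # CPU-intensive
    if m["nnmipc"] >= 0.25:
        return 2  # CPU and memory intensive
    return 3      # memory-intensive

Combined with the PCPG_DECISIONS lookup sketched earlier, the returned class selects the frequency and power-gating settings for the A7 and A15 blocks.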
                                                                    work (PSN) based CMOS transistors. This PSN is controlled