Page 284 - 2024-Vol20-Issue2

280 |                                                                                  Alrudainy, Marzook, Hussein & Shafik

by adopting a PCPG management routine. In the following sections we briefly describe our approach, emphasizing the Exynos 5422 platform and the PCPG based WLC interactions.

A. Performance Counter:
The Exynos 5422 big.LITTLE system on chip (SoC) board has been chosen to validate our proposed approach, as shown in Fig. 3(a). To enable monitoring of the power-performance counters, we prepared a custom system routine, compatible with ARM's technical specification document, capable of recording diverse performance counter readings at pre-determined periodic intervals. This routine, along with its libraries, is presently intended for open release. In this work, the performance counter is adopted to report system performance events including instructions retired, cache misses, and cycles. Further, it is used to monitor temperature, current, voltage, and power from the sensors on the Odroid-XU3 board by adopting the approach provided by Walker et al. [29]. In this work, the low activity class, which leads to significant dark silicon power consumption because the supply voltage and clock remain operational, has been captured for different PCPG scenarios and frequencies.

[Figure 3: block diagram. Panel (a), hardware description of the Exynos 5422 big.LITTLE: PCPG management and a power switch network using CMOS, a 512 KB L2 cache and a 2 MB L2 cache with ECC, a 128-bit AMBA ACE coherent bus interface, and LPDDR3 DRAM (933 MHz, 14.9 GB/s). Panel (b), PCPG based WLC metrics: Monitors (1- performance counter metrics, 2- power metrics), Compute (classification), Determine (1- application class, 2- PCPG and DVFS).]
Fig. 3. (a) Hardware description of Samsung Odroid-XU3 platform; (b) Proposed PCPG based WLC metrics.

Two observations can be drawn. First, as the number of dark silicon cores (big or LITTLE) increases, the power consumption increases drastically. For instance, four idle big cores at 2 GHz consume about 1.5 W, which decreases to roughly 0.6 W when only one big core acts as dark silicon area. Second, the dark silicon power consumption also depends on the clock frequency. For example, when the parallel threads of the canneal application are assigned to LITTLE cores only, the dark silicon power consumption of the four idle big cores increases from about 0.6 W at 1400 MHz to almost 1.5 W at 2 GHz.

B. Power Gating Management and PSN:
The proposed runtime PCPG based WLC, coupled with DVFS control, is performed according to Algorithm I. This algorithm specifies the type of workload, the number of PCPG, and the DVFS combination for any exercised application. This is achieved by comparing the observed performance counter readings to pre-set thresholds acquired from offline experiments at design time. Based on the extensive experiments in [11], [19], we categorize workloads by their processing (CPU) and communication (memory) demands. Workloads are therefore classified into three categories: memory-intensive (MI), CPU-intensive (CI), and a mix of CPU and memory-intensive (MIX).

In every time interval, the system software routine updates a number of flag registers according to the number of dark silicon cores. For instance, when two big cores (core 4 and core 5) act as dark silicon area, the corresponding flag bits are set to 1, indicating that power gating is advantageous for these cores. These bits can then be used to activate the PSN CMOS transistors, thereby shutting down the dark silicon cores. In this work, the PSN has been designed using the Cadence Virtuoso toolbox at the 45 nm technology node. In practice, the maximum current drawn when a CPU-intensive workload is exercised on an A15 core has been reported to be I_max = 1 A per core at f = 2000 MHz. Therefore, a number of switch transistors are connected in parallel at their maximum width to ensure that a maximum current of 1 A can be supplied to the active core [30]. The target impedance of the PSN, which must hold over a broad frequency band, can be calculated by assuming a 5% allowable ripple in the core supply voltage and a current of 50% of the peak drawn during the rise and fall time of the processor clock [31]:

$$z_{target} = \frac{0.1 \times V_{dd}}{I_{peak}} \tag{1}$$

The factor 0.1 is the ratio of the 5% voltage ripple to the 50% current fraction (0.05/0.5). Accordingly, the power consumption overhead caused by adopting the PSN has been computed using the Cadence tool for fair comparison.

VI. RESULTS

For verification purposes, five different sets of applications have been adopted in this work as a case study. As expected, our PCPG based WLC achieves the highest energy efficiency (IPS/Watt) improvement when the streamcluster application is exercised, compared to the ondemand governor and the previous work published in [11]. This is because streamcluster is a memory-intensive workload that is best exercised on LITTLE cores at a lower DVFS level. This indicating
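As a rough illustration of the threshold-based classification and flag-bit update described in Section V-B, the following Python sketch classifies one performance counter sample as MI, CI, or MIX and builds a bit mask of dark silicon cores. The two derived metrics (cache misses per instruction and instructions per cycle), the threshold values, and all names are assumptions made for illustration only; they are not taken from Algorithm I, whose thresholds are obtained from offline experiments.

```python
from dataclasses import dataclass

# Hypothetical thresholds (assumed values, not from the paper).
MISSES_PER_INSTR_THRESHOLD = 0.02  # above this, the sample looks memory bound
IPC_THRESHOLD = 1.0                # above this, the sample looks CPU bound


@dataclass
class CounterSample:
    """One periodic reading of the three events named in the text."""
    instructions_retired: int
    cache_misses: int
    cycles: int


def classify(sample: CounterSample) -> str:
    """Return 'MI', 'CI', or 'MIX' by comparing derived metrics to thresholds."""
    if sample.instructions_retired <= 0 or sample.cycles <= 0:
        raise ValueError("sample must contain positive instruction and cycle counts")
    misses_per_instr = sample.cache_misses / sample.instructions_retired
    ipc = sample.instructions_retired / sample.cycles
    memory_bound = misses_per_instr >= MISSES_PER_INSTR_THRESHOLD
    cpu_bound = ipc >= IPC_THRESHOLD
    if memory_bound and not cpu_bound:
        return "MI"
    if cpu_bound and not memory_bound:
        return "CI"
    return "MIX"


def dark_core_flags(dark_cores) -> int:
    """Set bit i to 1 for every core index i that is treated as dark silicon."""
    mask = 0
    for core in dark_cores:
        mask |= 1 << core
    return mask


if __name__ == "__main__":
    sample = CounterSample(instructions_retired=1_000_000,
                           cache_misses=50_000,
                           cycles=2_000_000)
    print(classify(sample))
    # Cores 4 and 5 idle, as in the example in the text.
    print(bin(dark_core_flags([4, 5])))
```

In a real routine the mask would be written to the flag registers that drive the power switch network; here it is only returned as an integer.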