Page 284 - 2024-Vol20-Issue2
280 | Alrudainy, Marzook, Hussein & Shafik
by adopting a PCPG management routine. In the following sections we briefly describe our approach, emphasizing the Exynos 5422 platform and PCPG based WLC interactions.

Fig. 3. (a) Hardware description of Samsung Odroid-XU3 platform; (b) Proposed PCPG based WLC metrics.
[Fig. 3 panels. (a) Exynos 5422 big.LITTLE: PCPG management and power switch network using CMOS; 512k L2-Cache; 2M L2-Cache with ECC; 128-bit AMBA ACE Coherent Bus Interface; DRAM LPDDR3 (933 MHz), 14.9 Gbytes/s. (b) PCPG based WLC: Monitors (1-Performance counter, 2-Power metrics); Compute (Classification metrics); Determine (1-Application class, 2-PCPG and DVFS).]

A. Performance Counter:
The Exynos 5422 big.LITTLE system on chip (SoC) board has been chosen to validate our proposed approach, as shown in Fig. 3(a). To enable monitoring of the power-performance counters, we prepared a custom system routine, compatible with ARM's technical specification document, capable of recording diverse performance counter readings at pre-determined periodic intervals. This routine, along with its libraries, is intended for open release. In this work, the performance counters are used to report system performance events, including instructions retired, cache misses, and cycles. Further, it is used to monitor temperature, current, voltage, and power from the sensors on the Odroid-XU3 board by adopting the approach provided by Walker et al. [29]. In this work, the low activity class, which leads to significant dark silicon
power consumption, as the supply voltage and clock remain operational, has been captured for different PCPG scenarios and frequencies.
Two observations can be summarized as follows. First, as the number of dark silicon related cores (big or LITTLE) increases, the power consumption increases drastically. For instance, the power consumption of four idle big cores at 2 GHz is about 1.5 Watt, which decreases to roughly 0.6 Watt when only one big core acts as dark silicon area. Second, the dark silicon related power consumption also depends on the clock frequency. As an example, when parallel threads of the canneal application are assigned to LITTLE cores only, the dark silicon related power consumption of the four idle big cores increases from about 0.6 Watt at 1400 MHz to almost 1.5 Watt at 2 GHz.

B. Power Gating Management and PSN:
The proposed runtime PCPG based WLC coupled with DVFS control is performed according to Algorithm I. This algorithm specifies the type of the workload, the number of PCPG, and the DVFS combinations for any exercised application. This is achieved by comparing observed readings from the performance counters to pre-set thresholds acquired from offline experiments at design time. Based on extensive experiments from [11] [19], we categorize workloads by their processing (CPU) and communication (memory) demands. Workloads are therefore classified into three categories: memory-intensive (MI), CPU-intensive (CI), and a mix of CPU- and memory-intensive (MIX).
A number of flag registers are modified on every time interval by the system software routine according to the number of dark silicon cores. For instance, when two big cores (core 4 and core 5) act as dark silicon area, the corresponding flag bits are set to 1, highlighting the advantages of power gating. These bits can then be used to activate the PSN based CMOS transistors, thereby shutting down those dark silicon related cores. In this work, the PSN has been designed using the Cadence Virtuoso toolbox at the 45 nm technology node. Practically, the maximum current drawn when a CPU-intensive workload is exercised on the A15 core has been reported to be Imax = 1 A per core at f = 2000 MHz. Therefore, a number of switch transistors are connected in parallel at their maximum width to ensure that a maximum current of 1 A can be supplied to the active core [30]. The target impedance of the PSN, which should hold over a broad frequency band, can be calculated by postulating a 5% allowable ripple in the core supply voltage and a 50% drawn current in the rise and fall time of the processor clock [31]:

$$ z_{target} = \frac{0.1 \times V_{dd}}{I_{peak}} \qquad (1) $$

As a result, the power consumption overhead caused by adopting the PSN has been computed using the Cadence tool for fair comparison.

VI. RESULTS

For verification purposes, five different sets of applications have been adopted in this work as a case study. As expected, our PCPG based WLC achieves the highest energy efficiency (IPS/Watt) improvement when the streamcluster application is exercised, compared to the ondemand governor and previous work published in [11]. This is because streamcluster is a memory-intensive workload, which prefers to be exercised on LITTLE cores at a lower DVFS level. This indicating