Revised: 13 September 2024

Early View | December 2024

#### Iraqi Journal for Electrical and Electronic Engineering Original Article



### Understanding Power Gating Mechanism Based on Workload Classification of Modern Heterogeneous Many-Core Mobile Platform in the Dark Silicon Era

Haider Alrudainy\*<sup>1</sup>, Ali K. Marzook<sup>2</sup>, Muaad Hussein<sup>1</sup>, Rishad Shafik<sup>3</sup>
 <sup>1</sup>Basra Technical Engineering College, Southern Technical University, Basra, Iraq
 <sup>2</sup>School of Electrical Engineering, Basra University, Basra, Iraq
 <sup>3</sup>School of EEE, University of Newcastle, Newcastle upon Tyne, UK

Correspondence: \*Haider Alrudainy, Electronic Department, Basra Technical Institute, Basra, Iraq. Email: h.m.a.alrudainy@stu.edu.iq

#### Abstract

The rapid progress in mobile computing necessitates energy-efficient solutions to support substantially diverse and complex workloads. Heterogeneous many-core platforms are increasingly being adopted in contemporary embedded implementations to deliver high performance at low power cost. These implementations experience diverse workloads that offer significant opportunities to improve energy efficiency. In this paper, we propose a novel per core power gating (PCPG) approach based on workload classification (WLC) for substantial energy cost minimization in the dark silicon era. The core of our paradigm is an integrated sleep mode management scheme based on workload classification derived from the performance counters. A number of real application benchmarks (PARSEC) are adopted as practical examples of diverse workloads, including memory- and CPU-intensive ones. These applications are exercised on the Samsung Exynos 5422 heterogeneous many-core system, showing energy-efficiency improvements of up to 37% and 110% compared with our most recently published work and the ondemand governor, respectively. Furthermore, we present a low-complexity, low-cost runtime per core power gating algorithm that consistently maximizes IPS/Watt across the entire state space.

#### Keywords

Dark Silicon, Energy-efficient, Multi-core Mobile System, Per Core Power Gating, Workload Classification.

#### **I. INTRODUCTION**

In recent times, the continuing demand for low energy cost at desirable throughput has led to the advent of heterogeneous many-core mobile systems. These platforms, characterized by an ever-rising number of cores on a single chip, provide significant computational capability. However, this increasing number of cores comes with a significant set of challenges, profoundly emphasized by the emergence of dark silicon [1]. In the same context, continued scaling of the technology node according to Moore's Law has reached a point at which a large portion of the chip has to be shut down to avoid excessive power consumption. It has been demonstrated that at the 22 nm technology node 21% of a chip must be powered off, while at the 8 nm node the dark silicon portion increases drastically to more than 50% [1] [2]. Other researchers show that 64% of a 64-core chip has been observed as dark silicon [3–5]. Thus, it is predicted that the power consumption of many-core platforms will increase by a factor of 10 over the next decade due to the dark silicon phenomenon [6].

Unlike homogeneous many-core systems, heterogeneous many-core platforms have recently been widely adopted in contemporary embedded mobile implementations. This is due to their superior energy efficiency at the desirable throughput compared with their homogeneous counterparts. To alleviate the trade-offs between energy consumption and throughput



This is an open-access article under the terms of the Creative Commons Attribution License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited. ©2024 The Authors.

Published by Iraqi Journal for Electrical and Electronic Engineering | College of Engineering, University of Basrah.



| Reference | Architecture  | Verification                 | Design abstraction | Platform         | Key novelty                           |
|-----------|---------------|------------------------------|--------------------|------------------|---------------------------------------|
| [7]       | Homogeneous   | Hardware                     | System             | Not specified    | Power gating, task mapping            |
| [8]       | Homogeneous   | Hardware                     | Micro-architecture | ARM Cortex-M0    | Sub-clock power gating                |
| [9]       | Homogeneous   | Hardware                     | System             | Intel Core i7    | Power gating (Turbo Boost)            |
| [10]      | Homogeneous   | Hardware                     | System             | AMD Opteron 6168 | Power gating, manually adjusted DVS   |
| [11]      | Heterogeneous | Hardware                     | System             | Odroid-XU3       | Low complexity, real-time + DVFS + TM |
| [12–16]   | Heterogeneous | Hardware                     | System             | Odroid-XU3       | Real-time + DVFS + TM                 |
| [17]      | Heterogeneous | Simulink (gem5)              | System             | ARM Cortex-A15   | Power modelling                       |
| Proposed  | Heterogeneous | Hardware + Simulink (Cadence) | System             | Odroid-XU3       | PCPG based WLC + DVFS + TM            |

 TABLE I.

 Limitations And Features of the Present Approaches

a common approach is to assign heterogeneous computing resources (cores) on these platforms. A contemporary platform such as the Samsung Exynos 5422 big.LITTLE octa-core system, which comprises four LITTLE (ARM Cortex-A7) cores and four big (ARM Cortex-A15) cores, is a reasonable choice of illustration in this work.

Over the past few years, substantial research has been conducted on energy cost reduction in heterogeneous embedded mobile platforms such as those from Arm and Intel [11–14] [18–20]. Such efforts normally manage dynamic voltage and frequency scaling (DVFS) decisions, combined with the allocation of cores to threads, to respond to workload variations. For instance, when a higher workload is experienced, more cores are assigned with an appropriately chosen DVFS combination. On the other hand, when a lower workload is encountered, fewer cores are allocated with a reduced DVFS combination.

Although DVFS coupled with core allocation (TM) plays a significant role in minimizing dynamic energy consumption in contemporary heterogeneous many-core systems, dark silicon contributes significant wasted power consumption, principally shortening the active battery operating time. To decrease dark silicon energy consumption, power gating has been adopted in many recently published studies [7–10]. The fundamental idea is to use a layer of sleep transistors to shut down inactive cores by disconnecting the power supply voltage. In this paper, we propose a novel per core power gating technique combined with DVFS and thread-to-core allocation in order to substantially decrease energy consumption. To the best of our knowledge, per core power gating (PCPG) based on workload classification (WLC) performed on the Samsung Exynos 5422 heterogeneous mobile system is a novel effort. In our proposed approach, the following major contributions have been made:

- propose a per core power gating (PCPG) approach for contemporary heterogeneous mobile platforms based on workload classification to effectively support various workloads,
- the core of the approach is an integrated power saving management scheme for the dark silicon area based on workload classification metrics, modeled using performance counter feedback,
- validate by means of various types of real application benchmarks to illustrate reasonable superiority and trade-offs.

The rest of the paper is structured as follows. Section II explains the limitations and features of the present approaches. The system architecture and applications are described in Section III. Workload classification metrics obtained from performance counters, and the PCPG control decisions based on workload classification, are presented in Section IV. Section V discusses the proposed approach, covering per core power gating management and the power switch network. Section VI discusses the results of the experiments and, lastly, Section VII concludes the paper.

#### **II. RELATED WORK**

Energy efficiency of many-core mobile platforms has been investigated extensively in recent years. Table I outlines the limitations and contributions of the most recent existing approaches.

Over recent years, significant research has been conducted on real-time energy reduction approaches. These techniques have considered single-metric optimization: mainly performance improvement within a particular power budget, or power reduction under a performance constraint [14]. For instance, a real-time dynamic voltage and frequency scaling (DVFS) control method for power reduction of many-core embedded platforms has been proposed in [15, 21–23]. This method uses performance and user experience constraints to obtain the minimum DVFS combinations by adopting reinforcement learning and transfer principles. Others illustrated another power reduction method that models real-time workload analysis to continuously maintain the core allocations and DVFS combination through predictive control using multinomial logistic regression [16]. A number of research papers have also presented analytical investigations adopting simulation frameworks, including McPAT and gem5. These studies have used task mapping, DVFS, and offline optimization methods to significantly reduce power dissipation under workload variations [17, 24–26]. The work in [11] presented a low-complexity runtime management approach based on workload classification for heterogeneous many-core platforms. This approach addresses most of the configuration space of the Odroid-XU3 platform, including core types, thread allocation, and optimum dynamic voltage and frequency scaling.


A hardware-based load balancing scheme for homogeneous many-core systems is assessed with respect to power consumption and thermal behavior in [7]. In this scheme, power minimization is achieved by powering off the dark silicon area. In [8], to minimize static power consumption during the sub-clock cycle, a power gating based sub-clock approach was implemented in an ARM Cortex-M0 processor. In the same context, Charles et al. [9] evaluated per core power gating (PCPG) in a contemporary homogeneous Intel Core i7 processor. It is illustrated that additional power headroom can be transferred to the active cores by power gating the dark silicon area (idle cores), boosting their frequency and voltage without overstepping the thermal and power envelope. Likewise, transferring energy savings from the dark silicon area to enabled cores was studied in [10] using a homogeneous many-core platform, the AMD Opteron 6168. The practical outcomes of this work rely on manual adjustment of the dynamic voltage scaling (DVS) combination integrated with a per core power gating approach.

#### III. SYSTEM ARCHITECTURE AND APPLICATIONS

The impetus for adopting heterogeneous architectures, comprising two or more types of CPUs, has recently been increasing. Although these platforms provide superior performance, it is essential to ensure optimum energy consumption while exercising various types of workloads. The Odroid-XU3 board supports techniques including affinity, DVFS, and manual core disabling, which are normally used to enhance system operation with respect to energy consumption and performance. The Odroid-XU3 board is a small heterogeneous eight-core computational platform that can run the Android 4.4 or Ubuntu 14.04 operating systems. The primary element of the board is the 28 nm Exynos 5422 application processor, whose architecture is depicted in Fig. 1. This multiprocessor system on chip (MPSoC) is built on the ARM big.LITTLE heterogeneous architecture and comprises a low-power Cortex-A7 quad-core block, a high-performance Cortex-A15 quad-core block, 2 GB of LPDDR3 DRAM, and a Mali-T628 GPU. Further, the board includes four real-time current sensors that make it possible to measure power consumption on four separate power blocks: little (A7) CPUs, big (A15) CPUs, DRAM, and GPU. In addition, there is one temperature sensor for the GPU and four temperature sensors, one for each of the A15 CPUs. The clock frequency and supply voltage (Vdd) of each power block can be adjusted using a range of pre-defined values. For example, the low-power Cortex-A7 quad-core block has a set of frequencies ranging between 200 MHz and 1400 MHz with a step size of 100 MHz, while the high-performance Cortex-A15 quad-core block features a set of frequencies ranging between 200 MHz and 2 GHz with a step size of 100 MHz.
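The three control knobs mentioned above (affinity, DVFS, and manual core disabling) can be illustrated with a minimal Python sketch. It assumes the standard Linux cpufreq and CPU hotplug sysfs files and that cpu0 to cpu3 are the A7 cores and cpu4 to cpu7 the A15 cores; the helper names and the `root` parameter are hypothetical and are not taken from the routine used in this work.

```python
import os
from pathlib import Path

# Assumption: standard Linux sysfs layout for per-CPU control files.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def set_governor(cpu, governor, root=SYSFS_CPU):
    """Select a cpufreq governor (e.g. 'ondemand', 'userspace') for one CPU."""
    path = Path(root) / f"cpu{cpu}" / "cpufreq" / "scaling_governor"
    path.write_text(governor + "\n")


def set_frequency_mhz(cpu, mhz, root=SYSFS_CPU):
    """Pin a CPU to a fixed frequency; cpufreq expects the value in kHz."""
    set_governor(cpu, "userspace", root)
    path = Path(root) / f"cpu{cpu}" / "cpufreq" / "scaling_setspeed"
    path.write_text(f"{mhz * 1000}\n")


def set_core_online(cpu, online, root=SYSFS_CPU):
    """Enable or disable a core through the CPU hotplug 'online' file."""
    path = Path(root) / f"cpu{cpu}" / "online"
    path.write_text("1\n" if online else "0\n")


def pin_to_cores(cores, pid=0):
    """Restrict a process (0 = the caller) to the given cores (Linux only)."""
    os.sched_setaffinity(pid, set(cores))
```

On a real board these writes need root privileges. Hotplugging a core off in this way is an operating-system level action and is not the same thing as the hardware power gating proposed in this paper.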

The PARSEC real application benchmark suite supports both emerging and current workloads for multiprocessing hardware [27]. It contains a varied set of workloads from diverse domains, including systems applications and interactive animation, that mimic large-scale commercial workloads. Therefore, in this paper, PARSEC applications have been adopted and exercised on the Odroid-XU3 system on chip (SoC), whose heterogeneity can be illustrative of various design choices that can significantly impact workloads. The PARSEC benchmark suite exhibits different data sharing patterns, workload partitions, and memory behaviors from most other benchmark suites in widespread use. Table II shows the characteristics of the PARSEC benchmarks adopted in our work. Three sets of applications (ferret/canneal, fluidanimate/streamcluster, and bodytrack) are chosen to illustrate CPU-intensive, memory-intensive, and mixed memory- and CPU-intensive workloads, respectively.

| Application   | Domain            | Type    |
|---------------|-------------------|---------|
| ferret        | Similarity Search | CPU     |
| canneal       | Engineering       | CPU     |
| bodytrack     | Computer Vision   | CPU+mem |
| streamcluster | Data Mining       | mem     |
| fluidanimate  | Animation         | mem     |

TABLE II.

Characteristics of the PARSEC Benchmarks [27]

Fig. 1. Odroid-XU3 board comprising Samsung Exynos 5422 heterogeneous MPSoC

Fig. 2. a) Number of potential core allocations when exercising a single application; b) experimental data of the canneal application demonstrating the number of DVFS combinations when exercised on four big (A15) and four little (A7) cores.

DVFS on the Odroid-XU3 platform is enabled by the power governors at the system software layer. For example, Linux incorporates various power governors that can be activated based on system demands. These comprise powersave for a low-performance, low-power mode; performance for a higher-performance mode; ondemand for a performance-sensitive DVFS level; and userspace for a user-customized DVFS combination. These governors aim to adjust the voltage/frequency combination appropriately in compliance with the energy and performance requirements of the exercised workload. Core allocations to threads (TM) are typically governed by a scheduler [28]. A Linux scheduler typically distributes the overall workload across all available cores to attain substantial


#### TABLE III.

PERFORMANCE COUNTER METRICS, AND PCPG CONTROL DECISION BASED ON WORKLOAD CLASSIFICATION DETAIL.

| Metrics | Definitions                                         | Metrics range       | Classification   | Freq. | A7   | A15  |
|---------|-----------------------------------------------------|---------------------|------------------|-------|------|------|
| nipc    | (InstRet/Cycles)(1/IPC<sub>max</sub>)               | urr [0, 0.11]       | 0: Low-activity  | min   | PCPG | PCPG |
| iprc    | InstRet/ClockRef                                    | nnmipc [0.35, 1]    | 1: CPU-intensive | max   | PCPG | max  |
| nnmipc  | (1/IPC<sub>max</sub>)(InstRet/Cycles - Mem/Cycles)  | nnmipc [0.25, 0.35) | 2: CPU+memory    | min   | max  | max  |
| urr     | Cycles/ClockRef                                     | nnmipc [0, 0.25)    | 3: mem-intensive | max   | max  | PCPG |

utilization. However, given a certain performance requirement, different types of threads must be processed differently for energy and performance optimization. For example, the scheduler typically makes no distinction regarding the nature of the thread being allocated, such as memory- or CPU-intensive. Tackling energy efficiency in heterogeneous many-core systems exercising concurrent workloads requires a great deal of effort, because the state space is very large and each workload demands different optimization. The hardware state space of a many-core heterogeneous platform comprises all practicable DVFS combinations and thread-to-core allocations (TM).

For instance, the number of feasible big.LITTLE core allocations when exercising a single application is 19, as illustrated in Fig. 2(a). This assumes that a maximum of one thread per core is permitted, each application must have at least one thread, and one of the little cores is reserved for running the operating system. These possible scenarios are then multiplied by the number of potential DVFS combinations shown in Fig. 2(b), which is computed as $M_{A15} \cdot M_{A7}$, where $M_{A15}$ is the number of DVFS combinations in the A15 block and $M_{A7}$ is the number of DVFS combinations in the A7 block. Exercising two applications concurrently yields 111 possible core allocations. Exercising concurrent applications therefore makes energy efficiency optimization a non-trivial task, as the number of possible scenarios increases exponentially.
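The counts of 19 and 111 can be reproduced by brute-force enumeration. The sketch below is illustrative only: it assumes an allocation is identified by how many little and big cores each application receives (not by which physical cores), and it derives the DVFS level counts from the 100 MHz frequency steps quoted in Section III. The function names are hypothetical.

```python
from itertools import product


def count_core_allocations(n_apps, free_little=3, free_big=4):
    """Count allocations of (little, big) core counts to n_apps applications.

    One of the four little cores is reserved for the operating system,
    every application gets at least one core, and cores are not shared.
    """
    per_app = list(product(range(free_little + 1), range(free_big + 1)))
    total = 0
    for alloc in product(per_app, repeat=n_apps):
        if any(little + big == 0 for little, big in alloc):
            continue  # an application without any core is not allowed
        if sum(little for little, _ in alloc) > free_little:
            continue  # more little cores requested than available
        if sum(big for _, big in alloc) > free_big:
            continue  # more big cores requested than available
        total += 1
    return total


def dvfs_levels(f_min_mhz, f_max_mhz, step_mhz=100):
    """Number of frequency points between f_min and f_max inclusive."""
    return len(range(f_min_mhz, f_max_mhz + 1, step_mhz))


M_A7 = dvfs_levels(200, 1400)    # A7 block: 200 MHz to 1400 MHz
M_A15 = dvfs_levels(200, 2000)   # A15 block: 200 MHz to 2 GHz

single_app_states = count_core_allocations(1) * M_A15 * M_A7
```

Under these assumptions a single application already has 19 × 19 × 13 = 4693 candidate operating points, which is why exhaustive search at runtime is impractical.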

#### **IV. WORKLOAD CLASSIFICATIONS**

The workload classes defined in the present work distinguish between memory-intensive and CPU-intensive workloads, with low or high activity. Specifically, workloads are categorized into the following four classes:

- Class 0: low workloads activity;
- Class 1: intensive CPU workloads;
- Class 2: intensive memory and CPU workloads; and
- Class 3: intensive memory workloads.

Large-scale investigative experiments were carried out in our previously published work [11] to examine the validity of these concepts. These investigations demonstrate that optimum energy efficiency can be achieved if fewer little cores are used for memory-intensive applications, while it is advantageous to run more big cores in parallel for CPU-intensive applications. The classification is achieved by computing a set of metrics from performance counter readings and then deriving the class depending on whether these metrics have crossed predetermined thresholds, as shown in Table III. The notable performance counter events are listed below:

- InstRet: the number of retired (executed) instructions, part of the widely studied instructions per cycle (IPC) metric.
- Cycles: the number of clock cycles per core allocation.
- MEM\_ACCESS: a memory read or write operation that causes a cache access to at least the level of data.
- L1\_CACHE: the instruction cache access at level 1.

These metrics are calculated based on the previous study published in [7]. For instance, the higher nnmipc rises, the more CPU-intensive the workload is. Therefore, for a CPU-intensive workload, all A7 cores must be powered off while the A15 cores can be activated at a higher DVFS level to maximize energy efficiency (IPS/Watt). Conversely, for a memory-intensive workload, all A15 cores must be power gated while all A7 cores can be activated at the maximum DVFS level. Power gating the big cores while exercising memory-intensive workloads can drastically improve the energy efficiency of heterogeneous many-core systems, owing to the high power these cores consume when they act as dark silicon.
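As an illustration of how the four metrics defined in Table III could be derived from one sampling interval, the following Python sketch applies those definitions directly. The `CounterSample` container, the function name, and the `ipc_max` argument are hypothetical: the value of IPC<sub>max</sub> is not stated here and has to be supplied by the user.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Performance counter deltas over one sampling interval."""
    inst_ret: int    # InstRet: retired instructions
    cycles: int      # Cycles: unhalted CPU cycles
    clock_ref: int   # ClockRef: reference clock cycles
    mem_access: int  # Mem: memory accesses


def wlc_metrics(sample, ipc_max):
    """Compute nipc, iprc, nnmipc and urr as defined in Table III."""
    if sample.cycles <= 0 or sample.clock_ref <= 0 or ipc_max <= 0:
        raise ValueError("cycles, clock_ref and ipc_max must be positive")
    ipc = sample.inst_ret / sample.cycles
    mem_per_cycle = sample.mem_access / sample.cycles
    return {
        "nipc": ipc / ipc_max,                      # normalized IPC
        "iprc": sample.inst_ret / sample.clock_ref,
        "nnmipc": (ipc - mem_per_cycle) / ipc_max,  # normalized non-memory IPC
        "urr": sample.cycles / sample.clock_ref,    # unhalted-to-reference ratio
    }
```

For example, a sample with 800 retired instructions, 1000 unhalted cycles, 2000 reference cycles, and 200 memory accesses, with an assumed `ipc_max` of 2.0, gives urr = 0.5 and nnmipc = 0.3.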

#### V. PROPOSED APPROACH

Our proposed paradigm interacts with the runtime performance counters to compute the classification metrics of the exercised application, from which the application class is determined. As a result, dark silicon cores can be power gated according to the workload scenario. Power gating of unused cores is achieved using a power switch network (PSN) based on CMOS transistors. This PSN is controlled by a PCPG management routine. In the following sections we briefly describe our approach, emphasizing the Exynos 5422 platform and the PCPG based WLC interactions.



Fig. 3. (a) Hardware description of Samsung Odroid-XU3 platform; (b) Proposed PCPG based WLC metrics.

#### A. Performance Counter:

The Exynos 5422 big.LITTLE system on chip (SoC) board has been chosen to validate our proposed approach, as shown in Fig. 3(a). To enable monitoring of the power and performance counters, we prepared a custom system routine, compatible with ARM's technical specification document, capable of recording diverse performance counter readings at pre-determined periodic intervals. This routine, along with its libraries, is intended for open release. In this work, the performance counters are used to report system performance events including instructions retired, cache misses, and cycles. Further, the routine is used to monitor temperature, current, voltage, and power from the sensors on the Odroid-XU3 board by adopting the approach provided by Walker et al. [29]. In this work, the low-activity class, which leads to significant dark silicon power consumption because the supply voltage and clock remain operational, has been captured for different PCPG scenarios and frequencies.

Two observations can be summarized as follows. First, as the number of dark silicon cores (big or LITTLE) increases, the power consumption increases drastically. For instance, the power consumption of four idle big cores at 2 GHz is about 1.5 W, which decreases to roughly 0.6 W when only one big core acts as dark silicon. Second, the dark silicon power consumption also depends on the clock frequency. As an example, when parallel threads of the canneal application are assigned to LITTLE cores only, the dark silicon power consumption of the four idle big cores increases from about 0.6 W at 1400 MHz to almost 1.5 W at 2 GHz.

#### Algorithm 1 PCPG and DVFS Based WLC Metrics.

**Input:** power and performance counter readings (unhalted CPU cycles, memory accesses, instructions retired);

**Input:** Parameters: $urr_H = 0.11$, $nnmipc_H = 0.35$, $nnmipc_L = 0.25$;

**Output:** WLC type, PCPG, and DVFS;

- Compute: $urr$ and $nnmipc$;
- 1: If: $urr \leq urr_H$;
- 2: WLC type = LA; $\mapsto$ (Class 0: low-activity)
- 3: Allocated\_cores = single LITTLE core;
- 4: DVFS $f_{A7}$ = Min.;
- 5: Else if: $nnmipc > nnmipc_H$;
- 6: WLC type = CI; $\mapsto$ (Class 1: CPU-intensive)
- 7: Allocated\_cores = A15 cores alone;
- 8: DVFS $f_{A15}$ = Max.;
- 9: Else if: $nnmipc_L \leq nnmipc \leq nnmipc_H$;
- 10: WLC type = Mixed; $\mapsto$ (Class 2: combination)
- 11: Allocated\_cores = big.LITTLE cores;
- 12: DVFS $f_{A15}$ = Max. & $f_{A7}$ = Max.;
- 13: Else if: $nnmipc < nnmipc_L$;
- 14: WLC type = MI; $\mapsto$ (Class 3: memory-intensive)
- 15: Allocated\_cores = A7 cores alone;
- 16: DVFS $f_{A7}$ = Max.;
- 17: end if;
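The decision logic of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative rendering, not the authors' runtime routine: the `Decision` record and its field names are hypothetical, a frequency of `None` is used here to mean that the cluster is power gated, and the boundary value nnmipc = 0.35 is assigned to Class 2 on the assumption that the strict inequality in step 5 is intended.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds given as parameters of Algorithm 1.
URR_H = 0.11
NNMIPC_H = 0.35
NNMIPC_L = 0.25


@dataclass(frozen=True)
class Decision:
    wlc_class: int        # 0..3
    label: str            # LA, CI, Mixed, MI
    cores: str            # which cores stay active
    f_a7: Optional[str]   # "min", "max", or None when the A7 block is gated
    f_a15: Optional[str]  # "min", "max", or None when the A15 block is gated


def classify(urr, nnmipc):
    """Map the urr and nnmipc metrics to a WLC class and PCPG/DVFS action."""
    if urr <= URR_H:
        return Decision(0, "LA", "single LITTLE core", "min", None)
    if nnmipc > NNMIPC_H:
        return Decision(1, "CI", "A15 cores alone", None, "max")
    if nnmipc >= NNMIPC_L:
        return Decision(2, "Mixed", "big.LITTLE cores", "max", "max")
    return Decision(3, "MI", "A7 cores alone", "max", None)
```

Because the low-activity test comes first, a nearly idle application is sent to a single LITTLE core regardless of its nnmipc value.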

#### **B.** Power Gating Management and PSN:

The proposed runtime PCPG based WLC coupled with DVFS control is performed according to Algorithm 1. This algorithm specifies the type of workload, the number of power-gated cores, and the DVFS combination for any exercised application. This is achieved by comparing the observed readings from the performance counters with pre-set thresholds acquired from offline experiments at design time. Based on the extensive experiments in [11] [19], we categorize workloads by their processing (CPU) and communication (memory) demands. Workloads are therefore classified into three categories: memory-intensive (MI), CPU-intensive (CI), and a mix of CPU- and memory-intensive (MIX).

A number of flag registers are updated by the system software routine at every time interval according to the number of dark silicon cores. For instance, when two big cores (core 4 and core 5) act as dark silicon, the corresponding flag bits are set to 1, indicating that power gating is advantageous. These bits are then used to activate the CMOS transistors of the PSN, thereby shutting down the dark silicon cores. In this work, the PSN has been designed using the Cadence Virtuoso tool at the 45 nm technology node. In practice, the maximum current drawn when a CPU-intensive workload is exercised on an A15 core has been reported to be $I_{max}$ = 1 A per core at f = 2000 MHz. Therefore, a number of switch transistors are connected in parallel at their maximum width to ensure that a maximum current of 1 A can be supplied to the active core [30]. The target impedance of the PSN that would



Fig. 4. IPS/Watt measured for the proposed PCPG based WLC, MLR+WLC, and the ondemand governor alone, exercised on the Odroid-XU3 platform.

be effective over a broad frequency band can be calculated by assuming a 5% allowable ripple in the core supply voltage and a 50% current draw during the rise and fall time of the processor clock [31]:

$$z_{target} = \frac{0.1 \times V_{dd}}{I_{peak}} \tag{1}$$
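The factor of 0.1 in Eq. (1) follows from the two stated assumptions. As a short worked step, writing the allowable ripple as 5% of $V_{dd}$ and the transient current as 50% of $I_{peak}$ gives:

$$z_{target} = \frac{0.05 \times V_{dd}}{0.5 \times I_{peak}} = \frac{0.1 \times V_{dd}}{I_{peak}}$$

Taking $I_{peak}$ as the reported $I_{max} = 1$ A per core, an illustrative supply of $V_{dd} = 1$ V (a hypothetical value chosen for the arithmetic, not a measured one) would give $z_{target} = 0.1\ \Omega$.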

The power consumption overhead caused by adopting the PSN has been computed using the Cadence tool for a fair comparison.
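The flag-register mechanism described in this subsection can be pictured as a per-core bitmask. The sketch below is a hypothetical illustration: it assumes one flag bit per core with bit i corresponding to core i, which is not a documented register layout of the Exynos 5422 or of the authors' routine.

```python
N_CORES = 8  # cores 0-3 are A7 (LITTLE), cores 4-7 are A15 (big)


def pcpg_flags(dark_cores):
    """Build a flag word with bit i set to 1 for every dark silicon core i."""
    flags = 0
    for core in dark_cores:
        if not 0 <= core < N_CORES:
            raise ValueError(f"core index out of range: {core}")
        flags |= 1 << core
    return flags


def gated_cores(flags):
    """Decode a flag word back into the list of cores to be power gated."""
    return [core for core in range(N_CORES) if (flags >> core) & 1]


# Example from the text: big cores 4 and 5 act as dark silicon.
flags = pcpg_flags([4, 5])
```

With this layout, the example in the text yields the flag word `0b00110000`, and each set bit would drive the sleep transistors of the corresponding core.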

#### VI. RESULTS

For verification purposes, five different sets of applications have been adopted in this work as a case study. As expected, our PCPG based WLC achieves the highest energy efficiency (IPS/Watt) improvement when the streamcluster application is exercised, compared with the ondemand governor and the previous work published in [11]. This is because streamcluster is a memory-intensive workload, which is best exercised on little cores at a lower DVFS level. This indicates the advantage of switching off the big cores, which cause the highest dark silicon power consumption. These results show an improvement in energy efficiency (IPS/Watt) of about 110% over the ondemand governor, and an improvement of 37% over the WLC+MLR approach published in [11].
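For readers who want to reproduce this kind of comparison, the arithmetic behind an IPS/Watt improvement figure can be sketched as follows. The function names are hypothetical, and the sketch assumes that the improvement is reported as the relative gain of the proposed approach over the baseline.

```python
def ips_per_watt(instructions, seconds, avg_power_w):
    """Energy efficiency: instructions per second divided by average power."""
    if seconds <= 0 or avg_power_w <= 0:
        raise ValueError("seconds and avg_power_w must be positive")
    return instructions / seconds / avg_power_w


def improvement_pct(proposed, baseline):
    """Relative improvement of the proposed efficiency over a baseline, in %."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (proposed / baseline - 1.0) * 100.0
```

Under this definition, an improvement of 110% means the proposed approach delivers 2.1 times the IPS/Watt of the baseline, and an improvement of 37% means 1.37 times.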

Exercising canneal, a CPU-intensive application, with our proposed approach achieves improvements of 10% and 100% over WLC+MLR and the ondemand governor, respectively. Mixed memory- and CPU-intensive concurrent applications show smaller energy efficiency improvements of 12% and 2% for canneal+bodytrack and canneal+streamcluster, respectively. This is attributed to the application behavior fluctuating from one class to another, which causes high energy consumption in the power switch network that outweighs its energy saving during some intervals of the application's execution. Therefore, our proposed approach can achieve significant energy savings on applications with minor fluctuation, which rarely move from one class to another.

#### **VII.** CONCLUSIONS

The emergence of heterogeneous many-core platforms offers a promising solution to the ever-increasing demand for energy-efficient mobile computing systems. Using power gating based on workload classification, coupled with core allocation and DVFS, these platforms can effectively minimize energy consumption and optimize resource utilization, thereby helping to mitigate the dark silicon effect that arises from Moore's Law technology scaling. In this work, IPS/Watt has been improved by 37% and 110% for a memory-intensive workload compared with the published work in [11] and the ondemand governor, respectively.

In conclusion, the combination of heterogeneous many-core platforms and workload classification plays a substantial role in energy-efficient computing, making it a fundamental driver for a more sustainable and powerful future in computing technology. For future work, PCPG based WLC can be implemented on the GPU to further reduce energy consumption.

#### **CONFLICT OF INTEREST**

The authors have no conflict of interest relevant to this article.

#### REFERENCES

- [1] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in *Proceedings of the 38th annual international symposium on Computer architecture*, pp. 365–376, 2011.
- [2] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," *IEEE micro*, vol. 32, no. 3, pp. 122–134, 2012.


- [3] A. Shafaei Bejestan, Y. Wang, S. Ramadurgam, Y. Xue, P. Bogdan, and M. Pedram, "Analyzing the dark silicon phenomenon in a many-core chip multi-processor under deeply-scaled process technologies," in *Proceedings of the 25th edition on Great Lakes Symposium on VLSI*, pp. 127–132, 2015.
- [4] J. Henkel, H. Bukhari, S. Garg, M. U. K. Khan, H. Khdr, F. Kriebel, U. Ogras, S. Parameswaran, and M. Shafique, "Dark silicon: From computation to communication," in *Proceedings of the 9th International Symposium on Networks-on-Chip*, pp. 1–8, 2015.
- [5] X. Wang, A. K. Singh, B. Li, Y. Yang, H. Li, and T. Mak, "Bubble budgeting: Throughput optimization for dynamic workloads by exploiting dark cores in many core systems," *IEEE Transactions on Computers*, vol. 67, no. 2, pp. 178–192, 2017.
- [6] X. Wang, B. Zhao, L. Wang, T. Mak, M. Yang, Y. Jiang, and M. Daneshtalab, "A pareto-optimal runtime power budgeting scheme for many-core systems," *Microprocessors and Microsystems*, vol. 46, pp. 136–148, 2016.
- [7] E. Musoll, "Hardware-based load balancing for massive multicore architectures implementing power gating," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 29, no. 3, pp. 493–497, 2010.
- [8] J. N. Mistry, B. M. Al-Hashimi, D. Flynn, and S. Hill, "Sub-clock power-gating technique for minimising leakage power during active mode," in 2011 Design, Automation & Test in Europe, pp. 1–6, IEEE, 2011.
- [9] J. Charles, P. Jassi, N. S. Ananth, A. Sadat, and A. Fedorova, "Evaluation of the intel® core<sup>™</sup> i7 turbo boost feature," in 2009 IEEE International Symposium on Workload Characterization (IISWC), pp. 188–197, IEEE, 2009.
- [10] K. Ma and X. Wang, "Pgcapping: Exploiting power gating for power capping and core lifetime balancing in cmps," in *Proceedings of the 21st international conference on Parallel architectures and compilation techniques*, pp. 13–22, 2012.
- [11] A. Aalsaud, F. Xia, A. Rafiev, R. Shafik, A. Romanovsky, and A. Yakovlev, "Low-complexity run-time management of concurrent workloads for energy-efficient multicore systems," *Journal of Low Power Electronics and Applications*, vol. 10, no. 3, p. 25, 2020.
- [12] S. Tzilis, P. Trancoso, and I. Sourdis, "Energy-efficient runtime management of heterogeneous multicores using online projection," *ACM Transactions on Architecture and Code Optimization (TACO)*, vol. 15, no. 4, pp. 1–26, 2019.

- [13] A. K. Singh, A. Prakash, K. R. Basireddy, G. V. Merrett, and B. M. Al-Hashimi, "Energy-efficient run-time mapping and thread partitioning of concurrent opencl applications on cpu-gpu mpsocs," *ACM Transactions on Embedded Computing Systems (TECS)*, vol. 16, no. 5s, pp. 1–22, 2017.
- [14] C. Hankendi and A. K. Coskun, "Adaptive power and resource management techniques for multi-threaded workloads," in 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, pp. 2302–2305, IEEE, 2013.
- [15] R. A. Shafik, S. Yang, A. Das, L. A. Maeda-Nunez, G. V. Merrett, and B. M. Al-Hashimi, "Learning transfer-based adaptive energy minimization in embedded systems," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 35, no. 6, pp. 877–890, 2015.
- [16] A. Das, A. Kumar, B. Veeravalli, R. Shafik, G. Merrett, and B. Al-Hashimi, "Workload uncertainty characterization and adaptive frequency scaling for energy minimization of embedded systems," in 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 43–48, IEEE, 2015.
- [17] B. K. Reddy, M. J. Walker, D. Balsamo, S. Diestelhorst, B. M. Al-Hashimi, and G. V. Merrett, "Empirical cpu power modelling and estimation in the gem5 simulator," in 2017 27th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 1–8, IEEE, 2017.
- [18] A. Aalsaud, A. Rafiev, F. Xia, R. Shafik, and A. Yakovlev, "Model-free runtime management of concurrent workloads for energy-efficient many-core heterogeneous systems," in 2018 28th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 206–213, IEEE, 2018.
- [19] A. Aalsaud, R. Shafik, A. Rafiev, F. Xia, S. Yang, and A. Yakovlev, "Power-aware performance adaptation of concurrent applications in heterogeneous many-core systems," in *Proceedings of the 2016 International Symposium on Low Power Electronics and Design*, pp. 368–373, 2016.
- [20] S. K. Mandal, G. Bhat, J. R. Doppa, P. P. Pande, and U. Y. Ogras, "An energy-aware online learning framework for resource management in heterogeneous platforms," *ACM Transactions on Design Automation of Electronic Systems (TODAES)*, vol. 25, no. 3, pp. 1–26, 2020.

- [21] H. M. Alrudainy, A. Mokhov, F. Xia, and A. Yakovlev, "Ultra-low energy data driven computing using asynchronous micropipelines and nano-electro-mechanical relays," in 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 158–163, 2017.
- [22] H. A. Leftah and M. H. Al-Ali, "Index modulated spread spectrum ofdm with c-transform," *IEEE Communications Letters*, vol. 25, no. 9, pp. 3119–3123, 2021.
- [23] M. Al-Momin, I. A. Abed, and H. A. Leftah, "A new approach for enhancing lsb steganography using bidirectional coding scheme," *International Journal of Electrical and Computer Engineering (IJECE)*, vol. 9, no. 6, pp. 5286–5294, 2019.
- [24] H. Alrudainy, A. Mokhov, and A. Yakovlev, "A scalable physical model for nano-electro-mechanical relays," in 2014 24th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 1–7, 2014.
- [25] H. Alrudainy, A. Mokhov, N. S. Dahir, and A. Yakovlev, "Mems-based power delivery control for bursty applications," in 2016 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 790–793, 2016.
- [26] H. Alrudainy, R. Shafik, A. Mokhov, and A. Yakovlev, "Lifetime reliability characterization of n/mems used in power gating of digital integrated circuits," in 2017 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pp. 1–6, 2017.
- [27] C. Bienia and K. Li, "Parsec 2.0: A new benchmark suite for chip-multiprocessors," in *Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation*, vol. 2011, p. 37, 2009.
- [28] A. Torrey, J. Coleman, and P. Miller, "Comparing interactive scheduling in linux," *Software: Practice and Experience*, vol. 34, no. 4, pp. 347–364, 2007.
- [29] M. J. Walker, S. Diestelhorst, A. Hansson, A. K. Das, S. Yang, B. M. Al-Hashimi, and G. V. Merrett, "Accurate and stable run-time power modeling for mobile and embedded cpus," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 36, no. 1, pp. 106–119, 2016.

- [30] B. Amelifard and M. Pedram, "Optimal selection of voltage regulator modules in a power delivery network," in *Proceedings of the 44th annual Design Automation Conference*, pp. 168–173, 2007.
- [31] A. Aalsaud, H. Alrudainy, R. Shafik, F. Xia, and A. Yakovlev, "Mems-based runtime idle energy minimization for bursty workloads in heterogeneous manycore systems," in 2018 28th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 198–205, 2018.