Spreading the heat: Multi-cloud controller for failover and cross-site offloading [AINA 2020]

Simon Kollberg, Ewnetu Bayuh Lakew, Petter Svärd, Erik Elmroth, Johan Tordsson

Despite the ubiquitous adoption of cloud computing and a very rich set of services offered by cloud providers, current systems lack efficient and flexible mechanisms for collaboration among multiple cloud sites. In order to guarantee resource availability during peaks in demand and to fulfill service level objectives, cloud service providers cap resource allocations and, as a consequence, face severe underutilization during non-peak periods. In addition, application owners are forced to make independent contracts to deploy their applications at different sites. To illustrate how these shortcomings can be overcome, we present a lightweight cross-site offloader for OpenStack. Our controller utilizes templates and site weights to enable offloading of virtual machines between geographically dispersed sites. We present and implement the proposed architecture and demonstrate its feasibility in both a typical cross-site offloading scenario and a failover scenario.
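The site-weight mechanism described above can be illustrated with a minimal sketch. The function name, the site dictionary layout, and the capacity field are hypothetical, chosen only to show the idea of picking an offload target by weight among sites with spare capacity; the paper's actual controller logic is not reproduced here.

```python
# Hypothetical sketch: choosing an offload target from weighted sites.
def pick_offload_target(sites, local_site):
    """Return the remote site with the highest weight that has spare capacity."""
    candidates = [s for s in sites
                  if s["name"] != local_site and s["free_vcpus"] >= 1]
    if not candidates:
        return None  # no remote site can absorb the VM; keep it local
    return max(candidates, key=lambda s: s["weight"])["name"]

sites = [
    {"name": "site-a", "weight": 0.7, "free_vcpus": 16},
    {"name": "site-b", "weight": 0.9, "free_vcpus": 4},
    {"name": "site-c", "weight": 0.5, "free_vcpus": 0},
]
print(pick_offload_target(sites, "site-a"))  # -> site-b
```

In a failover scenario, the failed site would simply be excluded from the candidate list, and the same weight-based selection applies.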

Published at:
34th International Conference on Advanced Information Networking and Applications
(AINA 2020)

NUMAscope: Capturing and Visualizing Hardware Metrics on Large ccNUMA Systems [Technical Report]

Daniel Blueman, Foivos Zakkak, Christos Kotselidis

Cache-coherent non-uniform memory access (ccNUMA) systems enable parallel applications to scale up to thousands of cores and many terabytes of main memory. However, since remote accesses come at an increased cost, extra measures are necessary to scale applications to high core counts and process far greater amounts of data than a typical server can hold. In a similar manner to how applications are optimized to improve cache utilization, applications also need to be optimized to improve data locality on ccNUMA systems to use larger topologies effectively. The first step to optimizing an application is to understand what slows it down. Consequently, profiling tools, or manual instrumentation, are necessary to achieve this. When optimizing applications on large ccNUMA systems, however, there are limited mechanisms to capture and present actionable telemetry. This is partially driven by the proprietary nature of such interconnects, but also by the lack of development of a common and accessible (read open-source) framework that developers or vendors can leverage. In this paper, we present an open-source, extensible framework that captures high-rate on-chip events with low overhead (<10% single-core utilization). The presented framework can be run in either live or record mode, allowing both live monitoring and post-mortem analysis of the measurements. The visualization of the measurements can be done either through an interactive graphical interface or through a convenient textual interface for quick-look analysis.

Published at:
Technical Report

ACTiManager: An end-to-end interference-aware cloud resource manager [Demo] [Middleware 2019]

Stratos Psomadakis, Stefanos Gerangelos, Dimitrios Siakavaras, Ioannis Papadakis, Marina Vemmou, Aspa Skalidi, Vasileios Karakostas, Konstantinos Nikas, Nectarios Koziris, Georgios Goumas

Cloud service providers (CSPs) rely mostly on simplistic and conservative policies regarding resource management, to minimize interference on shared resources between multiple VMs and to provide acceptable performance. However, such approaches may lead to suboptimal allocation and resource underutilization. In this demonstration we present ACTiManager, an end-to-end interference-aware manager for cloud resources. Our preliminary results compared to vanilla OpenStack are promising in terms of CSPs' profit, while also keeping average user satisfaction among the top priorities.

Published at:
20th ACM/IFIP International Middleware Conference
(Middleware 2019)

An Analysis of Call-Site Patching without Strong Hardware Support for Self-Modifying-Code [MPLR 2019]

Tim Hartley, Foivos S. Zakkak, Christos Kotselidis, Mikel Luján

With micro-services continuously gaining popularity and low-power processors making their way into data centers, efficient execution of managed runtime systems on low-power architectures is also gaining interest. Apart from the inherent performance differences between high- and low-power processors, porting a managed runtime system to a low-power architecture may result in spuriously introducing additional overheads and design trade-offs. In this work we investigate how the lack of strong hardware support for Self-Modifying Code (SMC) in low-power architectures influences Just-In-Time (JIT) compilation and execution in modern virtual machines. In particular, we examine how low-power architectures, with no or limited hardware support for SMC, impose restrictions on call-site implementations, when the latter need to be patchable by the runtime system. We present four different memory-safe implementations for call-site generation and discuss their advantages and disadvantages in the absence of strong hardware support for SMC. Finally, we evaluate each technique on different workloads using micro-benchmarks, and we evaluate the best two techniques on the DaCapo benchmark suite, showcasing performance differences of up to 15%.

Published at:
Managed Programming Languages and Runtimes
(MPLR 2019)

Simulating Wear-out Effects of Asymmetric Multicores at the Architecture Level [DFT 2019]

Nikos Foutris, Christos Kotselidis, Mikel Luján

As the silicon industry moves into deep nanoscale technologies, preserving Mean Time to Failure (MTTF) at acceptable levels becomes a first-order challenge. Operational stress, along with inefficient power dissipation and unsustainable thermal thresholds, increases wear-induced failures. As a result, faster wear-out leads to earlier performance degradation with eventual device breakdown. Furthermore, the proliferation of asymmetric multicores is tightly coupled with an increasing susceptibility to variable wear-out rates within the components of processors. This paper investigates the reliability boundaries of asymmetric multicores, which span from embedded systems to high performance computing domains, by performing a continuous-operation reliability assessment. As our experimental analysis illustrates, the variation between the least and the most aged hardware resource equals 2.6 years. Motivated by this finding, we show that an MTTF-aware, asymmetric configuration prolongs its lifetime by 21%.

Published at:
32nd IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems
(DFT 2019)

DICER: Diligent Cache Partitioning for Efficient Workload Consolidation [ICPP 2019]

Konstantinos Nikas, Nikela Papadopoulou, Dimitra Giantsidi, Vasileios Karakostas, Georgios Goumas, Nectarios Koziris

Workload consolidation has been shown to achieve improved resource utilisation in modern datacentres. In this paper we focus on the extended problem of allocating resources when co-locating High-Priority (HP) and Best-Effort (BE) applications. Current approaches either neglect this prioritisation and focus on maximising the utilisation of the server, or favour HP execution, resulting in severe performance degradation for BEs. We propose DICER, a novel, practical, dynamic cache partitioning scheme that adapts the LLC allocation to the needs of the HP and assigns spare cache resources to the BEs. Our evaluation reveals that DICER successfully increases the system's utilisation, while at the same time minimising the impact of co-location on HP's performance.
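The core allocation idea, giving the HP application the last-level-cache (LLC) ways it needs and handing the remainder to BEs, can be sketched as follows. This is a simplified illustration, not DICER's actual algorithm; the function name and the minimum-BE-ways guarantee are assumptions made for the example.

```python
def partition_llc(total_ways, hp_demand_ways, min_be_ways=1):
    """Grant the HP application the cache ways it demands (capped so BEs
    keep at least min_be_ways), and assign all spare ways to the BEs."""
    hp_ways = min(hp_demand_ways, total_ways - min_be_ways)
    be_ways = total_ways - hp_ways
    return hp_ways, be_ways

# A 20-way LLC: the HP needs 12 ways, so 8 are spare for BEs.
print(partition_llc(20, 12))  # -> (12, 8)
# If the HP demands more than available, BEs still keep a minimum share.
print(partition_llc(20, 25))  # -> (19, 1)
```

A dynamic scheme such as DICER would re-evaluate `hp_demand_ways` periodically from monitored performance, shrinking or growing the HP partition as its needs change.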

Published at:
48th International Conference on Parallel Processing
(ICPP 2019)

Profiling and Tracing Support for Java Applications [ICPE 2019]

Andrew Nisbet, Nuno Miguel Nobre, Graham Riley, Mikel Luján

We demonstrate the feasibility of undertaking performance evaluations for JVMs using: (1) a hybrid JVM/OS tool, such as async-profiler, (2) OS-centric profiling and tracing tools based on Linux perf, and (3) the extended Berkeley Packet Filter (eBPF) tracing framework, where we demonstrate the rationale behind the standard offwaketime tool for analysing the causes of blocking latencies, and our own eBPF-based tool bcc-java, which relates changes in microarchitecture performance counter values to the execution of individual JVM and application threads at low overhead. The relative execution time overheads of the performance tools are illustrated for the DaCapo-bach-9.12 benchmarks with OpenJDK9 on an Intel Xeon E5-2690, running Ubuntu 16.04. Whereas sampling-based tools can incur up to 25% slowdown at a 4 kHz sampling frequency, our tool bcc-java has a geometric mean overhead of less than 5%. Only for the avrora benchmark does bcc-java have a significant overhead (37%), due to an unusually high number of futex system calls. Finally, we provide a discussion of the recommended approaches to solve specific performance use-case scenarios.

Published at:
10th ACM/SPEC International Conference on Performance Engineering
(ICPE 2019)

SQALPEL: A database performance platform [CIDR 2019]

M.L. Kersten, P. Koutsourakis, S. Manegold, Y. Zhang

Despite their popularity, database benchmarks only highlight a small fraction of the capabilities of any given DBMS. They often do not highlight problematic components encountered in real-life database applications or provide hints for further research and engineering. To alleviate this problem we introduce discriminative performance benchmarking, which aids in exploring a larger query search space to find performance outliers and their underlying cause. The approach is based on deriving a domain-specific language from a sample complex query to identify and execute a query workload. The demo illustrates sqalpel, a complete platform to collect, manage, and selectively disseminate performance facts, which enables repeatability studies and economy of scale through shared performance experiences.

Published at:
9th biennial Conference on Innovative Data Systems Research
(CIDR 2019)

Database Resource Allocation Based on Resilient Intermediates [XtremeCLOUD 2018]

Martin Kersten, Ying Zhang, Pavlos Katsogridakis, Panagiotis Koutsourakis, Joeri van Ruth

Scale-out of big data analytics applications often does not pay off, due to poor response times and an increasing bill caused by longer execution times on resource-limited machines. To enable a stable DBMS workload environment, it helps to maintain several virtual machines with different resource configurations (CPU, memory, disk, etc.), each hosting part of the database, so that users can send their tasks to the machines with the best price/performance characteristics. This, however, requires a method to decide which VM should be used for a given query. When choosing the VM, the memory usage of a query is a particularly important factor, especially for the main-memory (optimised) DBMSs that are generally used for analytical queries today. In this paper, we introduce MALCOM, a memory footprint predictor for queries based on resilient intermediates in MonetDB. Unlike traditional cost-based approaches, MALCOM uses an empirical approach (i.e., the memory usage information of queries executed in the past) to incrementally update its model and improve its predictions. Our preliminary experimental results show that this approach is robust against varying data distributions.
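The empirical, incrementally updated approach can be sketched with a toy predictor: record the peak memory of past executions per query template and predict with a safety headroom. The class name, the running-maximum update rule, and the headroom factor are assumptions for illustration, not MALCOM's actual model.

```python
class FootprintPredictor:
    """Toy empirical memory-footprint predictor: remembers the peak memory
    observed for each query template and predicts peak * headroom."""

    def __init__(self, headroom=1.2):
        self.history = {}        # query template -> observed peak bytes
        self.headroom = headroom

    def predict(self, template, default=2**30):
        peak = self.history.get(template)
        # Unseen queries fall back to a conservative default (1 GiB here).
        return default if peak is None else int(peak * self.headroom)

    def observe(self, template, peak_bytes):
        # Incremental update: keep a running maximum of observed peaks.
        prev = self.history.get(template, 0)
        self.history[template] = max(prev, peak_bytes)

p = FootprintPredictor()
p.observe("SELECT ... GROUP BY ...", 100)
print(p.predict("SELECT ... GROUP BY ..."))  # -> 120
```

The prediction would then steer the query to the cheapest VM whose memory exceeds the predicted footprint.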

Published at:
1st International Workshop on Next Generation Clouds for Extreme Data
(XtremeCLOUD 2018)

Performance Prediction of NUMA Placement: A Machine-Learning Approach [XtremeCLOUD 2018]

Fanourios Arapidis, Vasileios Karakostas, Nikela Papadopoulou, Konstantinos Nikas, Georgios Goumas, Nectarios Koziris

In this paper we present a machine-learning approach to predict the impact on performance of core and memory placement in non-uniform memory access (NUMA) systems. The impact on performance depends on the architecture and the application's characteristics. We focus our study on features that can be easily extracted with hardware performance counters found in commodity off-the-shelf systems. We run various single-threaded benchmarks from SPEC2006 and PARSEC under different placement scenarios, and we use this benchmarking data to train multiple regression models that could serve as performance predictors. Our experimental results show notable accuracy in predicting the impact on performance with relatively simple prediction models.

Published at:
1st International Workshop on Next Generation Clouds for Extreme Data
(XtremeCLOUD 2018)

Utility-based Allocation of Industrial IoT Applications in Mobile Edge Clouds [IPCCC2018]

Amardeep Mehta, Ewnetu Bayuh Lakew, Johan Tordsson, Erik Elmroth

Mobile Edge Clouds (MECs) create new opportunities and challenges in terms of scheduling and running applications that have a wide range of latency requirements, such as intelligent transportation systems, process automation, and smart grids. We propose a two-tier scheduler for allocating runtime resources to Industrial Internet of Things (IIoT) applications in MECs. The higher-level scheduler runs periodically, monitoring system state and application performance, and decides whether to admit new applications and whether to migrate existing ones. In contrast, the lower-level scheduler decides which application gets the runtime resource next. We use performance-based metrics that tell the extent to which the runtimes are meeting the Service Level Objectives (SLOs) of the hosted applications. The Application Happiness metric is based on a single application's performance and SLOs. The Runtime Happiness metric is based on the Application Happiness of the applications the runtime is hosting. The scheduler may use these metrics for decision-making, rather than, for example, runtime utilization. We evaluate four scheduling policies for the high-level scheduler and five for the low-level scheduler. The objective for the schedulers is to minimize cost while meeting the SLO of each application. The policies are evaluated with respect to the number of runtimes, the impact on the performance of applications, and the utilization of the runtimes. The results of our evaluation show that the high-level policy based on Runtime Happiness combined with the low-level policy based on Application Happiness outperforms the other policies for the schedulers, including the bin-packing and random strategies. In particular, our combined policy requires up to 30% fewer runtimes than the simple bin-packing strategy and increases runtime utilization by up to 40% for the Edge Data Center (DC) in the scenarios we evaluated.
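The two happiness metrics can be sketched numerically. The exact formulas are not given in the abstract, so the definitions below are assumptions: an application is fully "happy" (1.0) when its latency meets the SLO, decaying toward 0 as latency exceeds it, and a runtime's happiness is the mean over its hosted applications.

```python
def application_happiness(latency_ms, slo_ms):
    """Assumed form: 1.0 when the app meets its SLO, slo/latency otherwise."""
    return min(1.0, slo_ms / latency_ms)

def runtime_happiness(apps):
    """Assumed form: mean Application Happiness over the hosted apps."""
    return sum(application_happiness(a["lat"], a["slo"]) for a in apps) / len(apps)

apps = [
    {"lat": 50, "slo": 100},    # meets SLO -> happiness 1.0
    {"lat": 200, "slo": 100},   # 2x over SLO -> happiness 0.5
]
print(runtime_happiness(apps))  # -> 0.75
```

A scheduler driven by these metrics would, for example, admit a new application only onto runtimes whose happiness stays above a threshold, instead of consulting raw utilization.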

Published at:
37th IEEE International Performance Computing and Communications Conference
(IPCCC 2018)

SmallTail: Scaling Cores and Probabilistic Cloning Requests for Web Systems [ICAC 2018]

E. B. Lakew, R. Birke, J. F. Perez, E. Elmroth, L. Y. Chen

Users' quality of experience on web systems is largely determined by the tail latency, e.g., the 95th percentile. Scaling resources along, e.g., the number of virtual cores per VM has been shown to be effective in meeting the average latency, but falls short in taming the latency tail in the cloud, where performance variability is higher. The prior art shows the prominence of increasing request redundancy to curtail latency, either in the off-line setting or without scaling-in the cores of virtual machines. In this paper, we propose an opportunistic scaler, termed SmallTail, which aims to achieve stringent tail-latency targets while provisioning a minimum amount of resources and keeping them well utilized. Against dynamic workloads, SmallTail simultaneously adjusts the core provisioning per VM and probabilistically replicates requests so as to achieve the tail-latency target. The core of SmallTail is a two-level controller, where the outer loop controls the core provisioning across the distributed VMs and the inner loop controls the cloning at a finer granularity. We also provide a theoretical analysis of the steady-state latency for a given probabilistic replication that clones one out of N arriving requests. We extensively evaluate SmallTail on three different web systems, namely web commerce, web searching, and a web bulletin board. Our testbed results show that SmallTail can keep the 95th percentile latency below 1000 ms using up to 53% fewer cores compared to the strategy of constant cloning, whereas the core-scaling-only solution exceeds the latency target by up to 70%.
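The "clone one out of N arriving requests" policy mentioned above has a direct probabilistic reading: each request is independently cloned with probability 1/N. The sketch below illustrates that reading; the function name and dispatch shape are hypothetical, not SmallTail's implementation.

```python
import random

def maybe_clone(request, n, rng=random):
    """Independently clone each arriving request with probability 1/N,
    so roughly one out of every N requests is dispatched twice."""
    if rng.random() < 1.0 / n:
        return [request, request]   # original plus its clone
    return [request]

# Over many requests, the clone rate converges to 1/N:
rng = random.Random(42)
dispatched = sum(len(maybe_clone("req", 10, rng)) for _ in range(10_000))
clone_rate = (dispatched - 10_000) / 10_000
print(round(clone_rate, 3))  # close to 0.1
```

The inner control loop would then tune N online, cloning more aggressively when the measured 95th percentile drifts above target, and less when the system is loaded.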

Published at:
2018 IEEE International Conference on Autonomic Computing
(ICAC 2018)

Efficient Resource Management for Data Centers: The ACTiCLOUD Approach [SAMOS XVIII]

Vasileios Karakostas, Georgios Goumas, Ewnetu Bayuh Lakew, Erik Elmroth, Stefanos Gerangelos, Simon Kolberg, Konstantinos Nikas, Stratos Psomadakis, Dimitrios Siakavaras, Petter Svärd, Nectarios Koziris

Despite their proliferation as a dominant computing paradigm, cloud computing systems lack effective mechanisms to manage their vast resources efficiently. Resources are stranded and fragmented, limiting cloud applicability only to classes of applications that pose moderate resource demands. In addition, the need for reduced cost through consolidation introduces performance interference, as multiple VMs are co-located on the same nodes. To avoid such issues, current providers follow a rather conservative approach regarding resource management that leads to significant underutilization. ACTiCLOUD is a three-year Horizon 2020 project that aims at creating a novel cloud architecture that breaks existing scale-up and share-nothing barriers and enables the holistic management of physical resources, at both local and distributed cloud site levels. This extended abstract provides a brief overview of the resource management part of ACTiCLOUD, focusing on the design principles and the components.

Published at:
IEEE International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation
(SAMOS XVIII)

Finding the Pitfalls in Query Performance [DBTest'18]

M.L. Kersten, P. Koutsourakis, Y. Zhang

Despite their popularity, database benchmarks only highlight a small part of the capabilities of any given system. They do not necessarily highlight problematic components encountered in real life or provide hints for further research and engineering. In this paper we introduce discriminative performance benchmarking, which aids in exploring a larger search space to find performance outliers and their underlying cause. The approach is based on deriving a domain-specific language from a sample query to identify a query workload. SQLscalpel subsequently explores the space using query morphing and simulated annealing to find performance outliers and the query components responsible. To speed up the exploration of often time-consuming experiments, SQLscalpel has been designed to run asynchronously on a large cluster of machines.

Published at:
Workshop on Testing Database Systems
(DBTest'18)

On the future of research VMs: a hardware/software perspective [Programming'18]

Foivos S. Zakkak, Andy Nisbet, John Mawer, Tim Hartley, Nikos Foutris, Orion Papadakis, Andreas Andronikakis, Iain Apreotesei, Christos Kotselidis

In recent years, we have witnessed an explosion in the usage of Virtual Machines (VMs), which are currently found in desktops, smartphones, and cloud deployments. These recent developments create new research opportunities in the VM domain, extending from performance to energy efficiency and scalability studies. Research in these directions necessitates research frameworks for VMs that provide full coverage of the execution domains and hardware platforms. Unfortunately, the state of the art in research VMs does not live up to such expectations and lags behind industrial-strength software, making it hard for the research community to provide valuable insights. This paper presents our work in attempting to tackle those shortcomings by introducing Beehive, our vision towards a modular and seamlessly extensible ecosystem for research on virtual machines. Beehive unifies a number of existing state-of-the-art tools and components with novel ones, providing a complete platform for hardware/software co-design of Virtual Machines.

Published at:
Conference Companion of the 2nd International Conference on Art, Science, and Engineering of Programming
(Programming'18)

Type Information Elimination from Objects on Architectures with Tagged Pointers Support [IEEE Transactions on Computers, Vol. 67, Issue 1, Jan. 2018]

Andrey Rodchenko, Christos Kotselidis, Andy Nisbet, Antoniu Pop, Mikel Luján

Implementations of object-oriented programming languages associate type information with each object to perform various runtime tasks such as dynamic dispatch, type introspection, and reflection. A common means of storing such a relation is by inserting a pointer to the associated type information into every object. Such an approach, however, introduces memory and performance overheads when compared with non-object-oriented languages. Recent 64-bit computer architectures have added support for tagged pointers by ignoring a number of bits (the tag) of memory addresses during memory access operations and utilizing them for other purposes, mainly security. This paper presents the first investigation into how this hardware support can be exploited by a Java Virtual Machine to remove type information from objects. Moreover, we propose novel hardware extensions to the address generation and load-store units to achieve low-overhead type information retrieval and tagged object pointer compression-decompression. The evaluation has been conducted by integrating the Maxine VM with the ZSim microarchitectural simulator. The results, across the whole DaCapo benchmark suite, pseudo-SPECjbb2005, SLAMBench, and GraphChi-PR executed to completion, show up to 26 and 10 percent geometric mean heap space savings, up to 50 and 12 percent geometric mean dynamic DRAM energy reduction, and up to 49 and 3 percent geometric mean execution time reduction, with no significant performance regressions.
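The basic tagged-pointer idea can be shown with integer bit arithmetic: a type identifier is packed into the address bits that the hardware ignores, so the object itself no longer needs a separate type-information pointer. The 16-bit tag width below is an assumption for illustration (real architectures vary), and the functions are hypothetical, not the paper's hardware extensions.

```python
TAG_BITS = 16                      # assumed: top 16 bits of a 64-bit address are ignored
TAG_SHIFT = 64 - TAG_BITS
ADDR_MASK = (1 << TAG_SHIFT) - 1   # low 48 bits carry the actual address

def tag_pointer(addr, type_id):
    """Pack a type identifier into the ignored high bits of an object pointer."""
    assert 0 <= type_id < (1 << TAG_BITS), "type id must fit in the tag"
    return (type_id << TAG_SHIFT) | (addr & ADDR_MASK)

def untag(ptr):
    """Recover (type_id, address) from a tagged pointer."""
    return ptr >> TAG_SHIFT, ptr & ADDR_MASK

ptr = tag_pointer(0xDEADBEEF, 42)
print(untag(ptr))  # -> (42, 3735928559)
```

Dereferencing the tagged pointer works because the hardware masks the tag bits away; the runtime reads the tag only when it needs the object's type, e.g., for dynamic dispatch.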

Published at:
IEEE Transactions on Computers, Vol. 67, Issue 1, Jan. 2018

Cross-ISA debugging in meta-circular VMs [VMIL'17]

Christos Kotselidis, Andy Nisbet, Foivos S. Zakkak, Nikos Foutris

Extending current Virtual Machine implementations to new Instruction Set Architectures entails a significant programming and debugging effort. Meta-circular VMs add another level of complexity towards this aim, since they have to compile themselves with the same compiler that is being extended. Therefore, having low-level debugging tools is of vital importance in reducing development time and the number of bugs introduced. In this paper we describe our experiences in extending Maxine VM to the ARMv7 architecture. During that process, we developed a QEMU-based toolchain which enables us to debug a wide range of VM features in an automated way. The presented toolchain has been integrated with the JUnit testing framework of Maxine VM and is capable of executing anything from simple assembly instructions to fully JIT-compiled code. Furthermore, it is fully open-sourced and can be adapted to other VMs seamlessly. Finally, we describe a compiler-assisted methodology that helps us identify, at runtime, faulty methods that generate no stack traces, in an automatic and fast manner.

Published at:
9th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages
(VMIL'17)

Experiences with Building Domain-Specific Compilation Plugins in Graal [ManLang'17]

Colin Barrett, Christos Kotselidis, Foivos S. Zakkak, Nikos Foutris, Mikel Luján

In this paper, we describe our experiences in co-designing a domain-specific compilation stack. Our motivation stems from the missed optimization opportunities we observed while implementing a computer vision library in Java. To tackle the performance shortcomings, we developed Indigo, a computer vision API co-designed with a compilation plugin for optimizing computer vision applications. Indigo exploits the extensible nature of the Graal compiler, which provides invocation plugins that replace methods with dedicated nodes, and generates machine code compatible with both the Java Virtual Machine (JVM) and the SIMD hardware unit. Our approach improves performance by up to 66.75× compared to pure Java implementations and by up to 2.75× compared to the original C++ implementation. These performance improvements are the result of low-level concurrency, idiomatic implementation of algorithms, and keeping temporary objects in the wide vector-unit registers.

Published at:
14th International Conference on Managed Languages and Runtimes
(ManLang'17)

RCU-HTM: Combining RCU with HTM to Implement Highly Efficient Concurrent Binary Search Trees [PACT'17]

Dimitrios Siakavaras, Konstantinos Nikas, Georgios Goumas, Nectarios Koziris

In this paper we introduce RCU-HTM, a technique that combines Read-Copy-Update (RCU) with Hardware Transactional Memory (HTM) to implement highly efficient concurrent Binary Search Trees (BSTs). Similarly to RCU-based algorithms, we perform the modifications of the tree structure in private copies of the affected parts of the tree rather than in-place. This allows threads that traverse the tree to proceed without any synchronization and without being affected by concurrent modifications. The novelty of RCU-HTM lies in leveraging HTM to permit multiple updating threads to execute concurrently. After appropriately modifying the private copy, we execute an HTM transaction, which atomically validates that all the affected parts of the tree have remained unchanged since they were read and, only if this validation is successful, installs the copy in the tree structure. We apply RCU-HTM on AVL and Red-Black balanced BSTs and compare their performance to state-of-the-art lock-based, non-blocking, RCU-based, and HTM-based BSTs. Our experimental evaluation reveals that BSTs implemented with RCU-HTM achieve high performance, not only for read-only operations but also for update operations. More specifically, our evaluation includes a diverse range of tree sizes and operation workloads and reveals that BSTs based on RCU-HTM outperform the other alternatives by more than 18%, on average, on a multi-core server with 44 hardware threads.
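The copy-modify-validate-install cycle described above can be sketched without HTM hardware by standing in a lock plus a version counter for the transaction; readers still proceed with no synchronization. This is a minimal illustration of the update path on a single shared value rather than a tree, and the class and its validation scheme are assumptions for the example, not the paper's algorithm.

```python
import threading

class CopyValidateInstall:
    """Sketch of RCU-HTM's update path: modify a private copy, then
    atomically validate that the read state is unchanged and install it.
    (A lock + version counter stands in for the HTM transaction.)"""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()

    def read(self):
        return self.value            # readers need no synchronization

    def update(self, fn):
        while True:
            snap_version = self.version
            copy = fn(self.value)        # modify a private copy
            with self._lock:             # "transaction" begins
                if self.version == snap_version:   # validate: unchanged?
                    self.value = copy              # install the copy
                    self.version += 1
                    return copy
            # validation failed: a concurrent updater won; retry

cell = CopyValidateInstall([1, 2])
cell.update(lambda v: v + [3])
print(cell.read())  # -> [1, 2, 3]
```

In the real scheme the validated state is the set of tree nodes read while building the copy, and a conflicting write aborts the HTM transaction automatically instead of failing an explicit version check.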

Published at:
26th International Conference on Parallel Architectures and Compilation Techniques
(PACT'17)

ACTiCLOUD: Enabling the Next Generation of Cloud Applications [ICDCS'17]

Georgios I. Goumas, Konstantinos Nikas, Ewnetu Bayuh Lakew, Christos Kotselidis, Andrew Attwood, Erik Elmroth, Michail Flouris, Nikos Foutris, John Goodacre, Davide Grohmann, Vasileios Karakostas, Panagiotis Koutsourakis, Martin L. Kersten, Mikel Luján, Einar Rustad, John Thomson, Luis Tomás, Atle Vesterkjaer, Jim Webber, Ying Zhang, Nectarios Koziris

Despite their proliferation as a dominant computing paradigm, cloud computing systems lack effective mechanisms to manage their vast amounts of resources efficiently. Resources are stranded and fragmented, ultimately limiting cloud systems' applicability to large classes of critical applications that pose more than moderate resource demands. Eliminating current technological barriers to actual fluidity and scalability of cloud resources is essential to strengthen cloud computing's role as a critical cornerstone for the digital economy. ACTiCLOUD proposes a novel cloud architecture that breaks the existing scale-up and share-nothing barriers and enables the holistic management of physical resources both at the local cloud site and at distributed levels. Specifically, it makes advancements in the cloud resource management stacks by extending state-of-the-art hypervisor technology beyond the physical server boundary and the localized cloud management system, to provide holistic resource management within a rack, within a site, and across distributed cloud sites. On top of this, ACTiCLOUD will adapt and optimize system libraries and runtimes (e.g., the JVM) as well as ACTiCLOUD-native applications, i.e., extremely demanding and critical classes of applications that currently face severe difficulties in matching their resource requirements to state-of-the-art cloud offerings.

Published at:
37th IEEE International Conference on Distributed Computing Systems
(ICDCS'17)