Aarati Kakaraparthy | publications

2023

SIGMOD Record

[Under Revision] Fine Grained Hardware Profiling – Are You Using the Right Tools?

Aarati Kakaraparthy and Jignesh M. Patel

SIGMOD Record 2023

Abs

We consider the problem of fine-grained hardware profiling, i.e., profiling the hardware while the desired section of the program is executing. Although this requirement is frequently encountered in practice, its importance has not been emphasized in literature so far. In this work, we compare and validate three tools for performing fine-grained profiling on Linux platforms – perf, PAPI, and a homegrown tool PMU-metrics. perf has been used in the past for fine-grained profiling in an erroneous manner, producing inaccurate metrics as a result. On the other hand, PAPI and PMU-metrics produce accurate metrics for profiling at the ms-scale, while PMU-metrics enables profiling even at the μs-scale. Thus, we hope that our analysis will help systems practitioners choose the right tool for performing fine-grained profiling at different time scales.

2022

VLDB

Tenant Placement in Over-Subscribed Database-as-a-Service Clusters

Arnd Christian König, Yi Shan, Tobias Ziegler, Aarati Kakaraparthy, Willis Lang, Justin Moeller, Ajay Kalhan, and Vivek Narasayya

Proc. VLDB Endow. 2022

Abs

Relational cloud Database-as-a-Service offerings run on multi-tenant infrastructure consisting of clusters of nodes, with each node hosting multiple tenant databases. Such clusters may be over-subscribed to increase resource utilization and improve operational efficiency. When resources are over-subscribed, it is possible that a node has insufficient resources to satisfy the resource demands of all databases on it, making it necessary to move databases to other nodes. Such moves can significantly impact database performance and availability. Therefore, it is important to reduce the likelihood of such resource shortages through judicious placement of databases in the cluster. We propose a novel tenant placement approach that leverages historical traces of tenant resource demands to estimate the probability of resource shortages and leverages these estimates in placement. We have prototyped our techniques in the Service Fabric cluster manager. Experiments using production resource traces from Azure SQL DB and an evaluation on a real cluster deployment show significant improvements over the state-of-the-art.

VLDB

VIP Hashing: Adapting to Skew in Popularity of Data on the Fly

Aarati Kakaraparthy, Jignesh M. Patel, Brian P. Kroth, and Kwanghyun Park

Proc. VLDB Endow. 2022

Abs

All data is not equally popular. Often, some portion of data is more frequently accessed than the rest, which causes a skew in popularity of the data items. Adapting to this skew can improve performance, and this topic has been studied extensively in the past for disk-based settings. In this work, we consider an in-memory data structure, namely hash table, and show how one can leverage the skew in popularity for higher performance. Hashing is a low-latency operation, sensitive to the effects of caching and code complexity, among other factors. These factors make learning in-the-loop challenging as the overhead of performing additional operations can have significant impact on performance. In this paper, we propose VIP hashing, a hash table method that uses lightweight mechanisms for learning the skew in popularity and adapting the hash table layout on the fly. These mechanisms are non-blocking, i.e., the hash table is operational at all times. The overhead is controlled by sensing changes in the popularity distribution to dynamically switch-on/off the mechanisms as needed. We ran extensive tests against a host of workloads generated by Wiscer, a homegrown benchmarking tool, and we find that VIP hashing improves performance in the presence of skew (22% increase in fetch operation throughput for a hash table with 1M keys under low skew) while adapting to insert and delete operations, and changing popularity distribution of keys on the fly. Our experiments on DuckDB show that VIP hashing reduces the end-to-end execution time of TPC-H query 9 by 20% under low skew.
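As a rough illustration of why bucket layout matters under skew, the sketch below shows a chained hash table that counts hits per entry and nudges a fetched entry one slot toward the head of its chain whenever it has more hits than its predecessor. This is not the VIP hashing algorithm from the paper: the class name `SkewAwareTable`, the per-entry counters, and the swap-with-predecessor heuristic are all assumptions chosen for brevity, and the sketch does not model the paper's non-blocking behavior or its switching of the mechanisms on and off.

```python
class SkewAwareTable:
    """Chained hash table that drifts frequently fetched keys toward the
    head of their bucket. Hypothetical sketch, not the paper's method."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.probes = 0  # total entries inspected by get(), for measurement

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for entry in bucket:
            if entry[0] == key:
                entry[1] = value  # overwrite an existing key
                return
        bucket.append([key, value, 0])  # [key, value, hit count]

    def get(self, key):
        bucket = self._bucket(key)
        for i, entry in enumerate(bucket):
            self.probes += 1
            if entry[0] == key:
                entry[2] += 1
                # Promote by one slot if this entry is now more popular
                # than the entry directly in front of it.
                if i > 0 and bucket[i - 1][2] < entry[2]:
                    bucket[i - 1], bucket[i] = bucket[i], bucket[i - 1]
                return entry[1]
        raise KeyError(key)


if __name__ == "__main__":
    table = SkewAwareTable(num_buckets=1)  # one chain, to make the effect visible
    for k in range(10):
        table.put(k, k * k)
    for _ in range(20):
        table.get(9)  # a single "popular" key
    print("head of chain:", table.buckets[0][0][0], "probes:", table.probes)
```

With a single bucket, the repeatedly fetched key starts at the tail of the chain and moves one slot forward per fetch until it reaches the head, after which each further fetch inspects only one entry. Every fetch pays for a counter update, which is the kind of in-the-loop overhead the abstract describes as needing to be controlled.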

2021

ICDE

FPGA for Aggregate Processing: The Good, The Bad, and The Ugly

M. Eryilmaz, Aarati Kakaraparthy, Jignesh M. Patel, Rathijit Sen, and Kwanghyun Park

In International Conference on Data Engineering (ICDE) 2021

Abs

In this paper, we focus on current CPU-FPGA architectures and study their usability for database management systems. To focus our scope, we choose aggregation as the query processing primitive for this investigation. We implement a fully pipelined stall-free module that performs aggregation on the FPGA, and also describe a performance model that predicts the runtime of this module with 99% accuracy. We study the performance of this module on two different CPU-FPGA architectures, namely remote-main-memory and bump-in-the-wire. Compared to an implementation of aggregation on CPU, we find that the former is 1.7× slower whereas the latter is 2.2× faster. This significant performance gap suggests two important architectural considerations when designing CPU-FPGA systems, namely the bandwidth ceiling and the resource ceiling, while also highlighting issues of switching times and programmer efficiency. We consider broader hardware trends to study the suitability of the two FPGA architectures for accelerating the aggregation operation, and find that the performance gap is likely to stay in the coming future. Based on these observations, we discuss some challenges and opportunities for CPU-FPGA architectures.

2019

VLDB

Optimizing Databases by Learning Hidden Parameters of Solid State Drives

Aarati Kakaraparthy, Jignesh M. Patel, Kwanghyun Park, and Brian P. Kroth

Proc. VLDB Endow. 2019

Abs

Solid State Drives (SSDs) are complex devices with varying internal implementations, resulting in subtle differences in behavior between devices. In this paper, we demonstrate how a database engine can be optimized for a particular device by learning its hidden parameters. This can not only improve an application’s performance, but also potentially increase the lifetime of the SSD. Our approach for optimizing a database for a given SSD consists of three steps: learning the hidden parameters of the device, proposing rules to analyze the I/O behavior of the database, and optimizing the database by eliminating violations of these rules. We obtain two different characteristics of an SSD, namely the request size profile and the location profile, from which we learn multiple internal parameters. Based on these parameters, we propose rules to analyze the I/O behavior of a database engine. Using these rules, we uncover sub-optimal I/O patterns in SQLite3 and MariaDB when running on our experimental SSDs. Finally, we present three techniques to optimize these database engines: (1) use-hot-locations on SSD-S, which improves the SELECT operation throughput of SQLite3 and MariaDB by 29% and 27% respectively; it also improves the performance of YCSB on MariaDB by 1%-22% depending on the workload mix, (2) write-aligned-stripes on SSD-T, which reduces the wear-out caused by SQLite3 write-ahead log (WAL) file by 3.1%, and (3) contain-write-in-flash-page on SSD-T, which reduces the wear-out caused by the MariaDB binary log file by 6.7%.

HotCloud

The Case for Unifying Data Loading in Machine Learning Clusters

Aarati Kakaraparthy, Abhay Venkatesh, Amar Phanishayee, and Shivaram Venkataraman

In Proceedings of the 11th USENIX Conference on Hot Topics in Cloud Computing 2019

Abs

Training machine learning models involves iteratively fetching and pre-processing batches of data. Conventionally, popular ML frameworks implement data loading within a job and focus on improving the performance of a single job. However, such an approach is inefficient in shared clusters where multiple training jobs are likely to be accessing the same data and duplicating operations. To illustrate this, we present a case study which reveals that for hyper-parameter tuning experiments, we can reduce up to 89% I/O and 97% pre-processing redundancy. Based on this observation, we make the case for unifying data loading in machine learning clusters by bringing the isolated data loading systems together into a single system. Such a system architecture can remove the aforementioned redundancies that arise due to the isolation of data loading in each job. We introduce OneAccess, a unified data access layer and present a prototype implementation that shows a 47.3% improvement in I/O cost when sharing data across jobs. Finally, we discuss open research challenges in designing and developing a unified data loading layer that can run across frameworks on shared multi-tenant clusters, including how to handle distributed data access, support diverse sampling schemes, and exploit new storage media.
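To make the redundancy argument concrete, the following minimal sketch shows several jobs requesting batches from one shared loader, so that each batch is read and pre-processed only once. It is not the OneAccess design: the class `SharedLoader`, its `get_batch` method, and the unbounded in-memory cache are assumptions made for illustration, and the sketch ignores the distributed access and per-job sampling challenges that the abstract lists as open problems.

```python
class SharedLoader:
    """Serves pre-processed batches to many jobs, doing the I/O and the
    pre-processing for each batch at most once. Hypothetical sketch."""

    def __init__(self, read_fn, preprocess_fn):
        self._read = read_fn          # stands in for storage I/O
        self._prep = preprocess_fn    # stands in for decoding/augmentation
        self._cache = {}              # batch_id -> pre-processed batch
        self.reads = 0
        self.preprocess_calls = 0

    def get_batch(self, batch_id):
        if batch_id not in self._cache:
            raw = self._read(batch_id)
            self.reads += 1
            self._cache[batch_id] = self._prep(raw)
            self.preprocess_calls += 1
        return self._cache[batch_id]


if __name__ == "__main__":
    loader = SharedLoader(
        read_fn=lambda i: list(range(i * 4, i * 4 + 4)),
        preprocess_fn=lambda raw: [x * 2 for x in raw],
    )
    num_jobs, num_batches = 3, 5
    for _job in range(num_jobs):            # e.g. hyper-parameter tuning runs
        for batch_id in range(num_batches):
            loader.get_batch(batch_id)
    print("requests:", num_jobs * num_batches, "reads:", loader.reads)
```

In this toy setup, the number of reads and pre-processing calls equals the number of distinct batches rather than the number of requests. A per-job loader would instead repeat both steps for every job.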