A High-Performance Storage Infrastructure for Activity and Log Workloads
(NSF CSSI 2104013)
Modern application domains in science and engineering, from astrophysics to web services and financial computations, generate massive amounts of data at unprecedented rates (reaching up to 7 TB/s).
The promise of future data utility and the low cost of data storage enabled by recent hardware innovations (less than $0.02/GB) are driving this data explosion, resulting in widespread data hoarding by researchers and engineers.
This project will design and implement ChronoLog, a distributed and tiered shared log storage ecosystem. ChronoLog uses physical time to distribute log entries while providing total log ordering. It also utilizes multiple storage tiers to elastically scale the log capacity (i.e., auto-tiering). ChronoLog will serve as a foundation for developing scalable new plugins, including a SQL-like query engine for log data, a streaming processor leveraging the time-based data distribution, a log-based key-value store, and a log-based TensorFlow module.
Modern applications, spanning from Edge to High Performance Computing (HPC) systems, produce and process log data and create a plethora of workload characteristics that rely on a common storage model: the distributed shared log.
Combining the append-only nature of the log abstraction with the natural strict ordering of a global truth, such as physical time, makes it possible to build a distributed shared log store that avoids expensive synchronization.
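As a minimal sketch of this idea (the routing policy and all names below are illustrative assumptions, not ChronoLog's actual implementation), a physical timestamp alone can determine both an entry's global order and its placement, so the append path consults no lock or sequencer:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

struct Event {
    uint64_t timestamp_ns;  // physical time: the entry's global order key
    std::string payload;
};

// Map a timestamp to a server. Because placement is a pure function of
// time, every client computes it independently: no centralized
// sequencer or metadata lock sits on the append path.
std::size_t place(uint64_t timestamp_ns, std::size_t num_servers) {
    constexpr uint64_t interval_ns = 1'000'000;  // example: 1 ms intervals
    return (timestamp_ns / interval_ns) % num_servers;
}

int main() {
    auto now = std::chrono::duration_cast<std::chrono::nanoseconds>(
                   std::chrono::system_clock::now().time_since_epoch())
                   .count();
    Event e{static_cast<uint64_t>(now), "sensor reading"};
    std::cout << "event routed to server " << place(e.timestamp_ns, 8) << "\n";
}
```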
An efficient mapping of log entries to the tiers of the storage hierarchy (i.e., auto-tiering a log) provides improved log capacity, tunable access parallelism, and I/O isolation between tail and historical log operations.
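A toy illustration of such a mapping, with made-up tiers and age thresholds, might look as follows; the point is that an entry's position in time directly implies its tier:

```cpp
#include <cstdint>
#include <iostream>

enum class Tier { DRAM, NVMe, SSD, HDD };  // fastest/smallest to slowest/largest

// Hypothetical age thresholds; a real system would tune these to tier
// capacities and observed access patterns.
Tier tier_for(uint64_t age_seconds) {
    if (age_seconds < 60)        return Tier::DRAM;  // hot tail
    if (age_seconds < 3600)      return Tier::NVMe;
    if (age_seconds < 24 * 3600) return Tier::SSD;
    return Tier::HDD;                                // cold history
}

int main() {
    // A two-minute-old entry lands on the NVMe tier (enum value 1).
    std::cout << static_cast<int>(tier_for(120)) << "\n";
}
```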
ChronoLog offers a powerful, versatile, and scalable primitive that can connect two or more stages of a data processing pipeline (i.e., a producer/consumer model) without explicit control of the data flow while maintaining data durability. This project will demonstrate:
How physical time can be used to distribute and order log data without explicit synchronization or centralized sequencing, yielding a high-performance, scalable shared log store that supports efficient range data retrieval.
How multi-tiered storage can be used to scale the capacity of a log and offer tunable data access parallelism while maintaining I/O isolation and a highly concurrent multiple-writer-multiple-reader (MWMR) access model.
How elastic storage semantics and dynamic resource provisioning can be used to efficiently match the I/O production and consumption rates of conflicting workloads under a single system.
The project will deliver the following components:
The ChronoLog Core Library
An in-memory lock-free distributed journal with the ability to persist data on flash storage to enable a fast distributed caching layer
A new high-performance data streaming service, designed specifically for (but not limited to) HPC systems, that uses MPI for job distribution and RPC over RDMA (or over RoCE) for communication
A set of high-level interfaces for I/O
Log data should be distributed at a log-entry (rather than log-partition) granularity, resulting in a highly parallel distribution model. Further, log data should be distributed both horizontally (i.e., across multiple nodes) and vertically (i.e., across multiple storage tiers).
Finding the tail of the log should not require expensive synchronization operations (e.g., metadata locking) or a centralized sequencer to enforce order. Additionally, the system should guarantee ordering of entries across the entire log, not only within a log partition.
Interacting with the log should follow a highly concurrent access model providing multiple-writer-multiple-reader (MWMR) semantics. Further, tail and historical log operations should be handled separately via I/O isolation. The log should not favor one type of operation over the other, offering high performance for both.
It should be possible to process the log partially via range retrieval mechanisms (i.e., partial get), moving away from the restrictive sequential access model imposed by log iterators (see the sketch after this list).
The log should be able to exploit the advantages of different storage tiers without explicit user intervention. The system should map the natural ordering of log entries to the performance spectrum of each storage tier (e.g., recent entries in faster tiers, older entries in slower ones) to efficiently expand log capacity.
Log data should be persisted via a tunable parallel I/O model to match the rate of log data production. In other words, the storage infrastructure must be elastic and adaptive via storage auto-scaling, leading to better performance and resource utilization.
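The following sketch illustrates the range-retrieval (partial get) requirement from the list above; the Chronicle class and replay() method are hypothetical stand-ins, not ChronoLog's published API. Because entries are keyed by physical time, a range query becomes an ordered scan rather than a full iteration from the head:

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

class Chronicle {
public:
    void append(uint64_t ts_ns, std::string payload) {
        entries_.emplace(ts_ns, std::move(payload));
    }
    // Partial get: visit only entries whose timestamps fall in
    // [from_ns, to_ns); the time-ordered index makes this a range scan.
    void replay(uint64_t from_ns, uint64_t to_ns,
                const std::function<void(uint64_t, const std::string&)>& fn) const {
        for (auto it = entries_.lower_bound(from_ns);
             it != entries_.end() && it->first < to_ns; ++it)
            fn(it->first, it->second);
    }
private:
    std::map<uint64_t, std::string> entries_;  // ordered by physical time
};

int main() {
    Chronicle log;
    log.append(100, "a"); log.append(200, "b"); log.append(300, "c");
    log.replay(150, 301, [](uint64_t ts, const std::string& p) {
        std::cout << ts << ": " << p << "\n";
    });
}
```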
ChronoLog is a new class of distributed shared log store that will leverage physical time to achieve total ordering of log entries and will utilize multiple storage tiers to distribute a log both horizontally and vertically (a model we call 3D data distribution).
To provide chronicle capacity scaling, ChronoLog automatically moves data down to the next larger but slower tier. To achieve this, the ChronoGrapher offers a fast, distributed data-flushing service that can match the event production rate.
ChronoGrapher runs a data streaming job with three major steps represented as a DAG: event collection, story building, and story writing. Each node in the DAG is dynamic and elastic based on the incoming traffic, estimated from the number of events and the total size of the data.
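A simplified, single-process sketch of this pipeline is shown below; the stage names come from the text, while the elastic_workers() heuristic and its thresholds are illustrative assumptions, not ChronoGrapher's actual policy:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Event { uint64_t ts_ns; std::string payload; };
struct Story { std::vector<Event> events; };  // a time-contiguous batch

// Stage 1: collect raw events from producers (stubbed here).
std::vector<Event> collect_events() {
    return {{1, "e1"}, {2, "e2"}, {3, "e3"}};
}

// Stage 2: build a story by grouping events into a time-ordered batch.
Story build_story(std::vector<Event> events) { return Story{std::move(events)}; }

// Stage 3: write a story to the next storage tier (stubbed).
void write_story(const Story& s) {
    std::cout << "flushed " << s.events.size() << " events\n";
}

// Elasticity: scale a stage's parallelism with the incoming load,
// estimated from event count and total bytes (thresholds are made up).
std::size_t elastic_workers(std::size_t num_events, std::size_t total_bytes) {
    std::size_t by_count = num_events / 10000 + 1;
    std::size_t by_size  = total_bytes / (64u << 20) + 1;  // 64 MB/worker
    return std::max(by_count, by_size);
}

int main() {
    auto events = collect_events();
    std::cout << elastic_workers(events.size(), 12ull << 20)
              << " writer(s) provisioned\n";
    write_story(build_story(std::move(events)));
}
```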
The ChronoPlayer is responsible for executing historical read operations.
In this test, we evaluate the operation throughput of a key-value store implemented on top of a log. We use the native KVS implementations of comparable log stores: Bookkeeper Table Service and CorfuDB. For ChronoLog, we implemented a prototype key-value store that simply maps a key to a chronicle event. We run two workloads: a) each client pushes 32K put() calls of 4KB each and then gets all keys sequentially; and b) each client puts and then immediately gets keys of 4KB, 32K times. For ChronoLog, we tested two configurations, with and without event caching using a backlog. ChronoLog outperforms both competing solutions by a significant margin, depending on the test case: 14x faster for put() and 2x-12x faster for get().
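To make the key-to-event mapping concrete, the following is a single-process sketch of the idea (class and member names are illustrative; this is not the benchmark code). put() appends an event carrying the value, and an in-memory index maps each key to its event's position for get():

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <unordered_map>

class LogKVS {
public:
    void put(const std::string& key, std::string value) {
        uint64_t ts = next_ts_++;            // stand-in for physical time
        log_.emplace(ts, std::move(value));  // append as a chronicle event
        index_[key] = ts;                    // key -> event position
    }
    std::optional<std::string> get(const std::string& key) const {
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;
        return log_.at(it->second);
    }
private:
    uint64_t next_ts_ = 0;
    std::map<uint64_t, std::string> log_;             // the shared log
    std::unordered_map<std::string, uint64_t> index_; // key lookup
};

int main() {
    LogKVS kvs;
    kvs.put("k", "v");
    std::cout << kvs.get("k").value_or("miss") << "\n";
}
```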
In this test, we investigate the ability of all log stores to effectively serve as a fast store for state machine replication (SMR). In this application, each client appends a command set of 4KB into the log and then reads all events containing the command sets from all other processes. The log offers the total ordering required to reach consensus on which command to execute next. As the number of replicas increases, more and more data is pushed to the log, and creating and retrieving the command sets eventually saturates the store. ChronoLog performs 5x better than Corfu and 10x better than Bookkeeper, sustaining a larger number of replicas before saturating.
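The SMR access pattern can be sketched as follows; the SharedLog type below is a single-process toy, not any of the evaluated systems, but it shows why a totally ordered log makes replicas converge:

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct SharedLog {
    std::vector<std::string> entries;  // total order = append order
    void append(std::string cmd) { entries.push_back(std::move(cmd)); }
};

struct Replica {
    std::size_t applied = 0;  // position of the next entry to apply
    int state = 0;            // trivial state machine: a counter
    void catch_up(const SharedLog& log) {
        // Apply every unseen entry in log order: since all replicas see
        // the same total order, they deterministically reach the same state.
        for (; applied < log.entries.size(); ++applied)
            state += 1;  // "execute" the command
    }
};

int main() {
    SharedLog log;
    Replica r1, r2;
    log.append("cmd-from-r1");
    log.append("cmd-from-r2");
    r1.catch_up(log);
    r2.catch_up(log);
    std::cout << (r1.state == r2.state ? "replicas agree\n" : "diverged\n");
}
```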
In this test, we compare ChronoLog with TimescaleDB by running the widely used Time Series Benchmark Suite (TSBS). The application inserts, finds, and queries data in 4MB ranges of 4KB events, calculating Min, Max, and Average values. Both systems are configured to use the same resources (i.e., number of processes and storage devices). Since ChronoLog is designed to leverage a hierarchical storage environment, it performs up to 25% faster than TimescaleDB: the chronicle is already indexed by physical time and distributed across multiple tiers.
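For illustration, the aggregate-over-range query pattern looks roughly like this on a time-indexed store (the data and layout below are made up; only the access pattern reflects the benchmark):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>

int main() {
    std::map<uint64_t, double> chronicle;  // timestamp -> metric value
    for (uint64_t ts = 0; ts < 1000; ++ts)
        chronicle[ts] = static_cast<double>(ts % 97);

    // Aggregate over [250, 750): a range scan on the time index, so no
    // full-log iteration or external index build is required.
    double min = 1e300, max = -1e300, sum = 0;
    std::size_t n = 0;
    for (auto it = chronicle.lower_bound(250);
         it != chronicle.end() && it->first < 750; ++it, ++n) {
        min = std::min(min, it->second);
        max = std::max(max, it->second);
        sum += it->second;
    }
    std::cout << "min=" << min << " max=" << max
              << " avg=" << (n ? sum / n : 0) << "\n";
}
```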
More results can be found in our most recent paper.