Real-Time Analytics: RisingWave

DE ZoomCamp course: Workshop

Mar 17, 2024

This post will discuss RisingWave, an open-source real-time analytics database designed for high-concurrency, large-scale, and low-latency data processing. It features a cloud-native, distributed architecture, ensuring scalability and high availability for real-time analytics and online transaction processing. RisingWave efficiently handles both stateless and stateful SQL computations, making it adept at managing streaming data. The hands-on project offers insights into setting up the environment, data ingestion, stream processing, and integrating with other services for a comprehensive understanding of its capabilities in real-time data analysis.

Sources:

Contents:

RisingWave is an open-source, real-time analytics database designed to handle large-scale, high-concurrency, and low-latency data processing tasks. It's built to efficiently manage streaming data, making it suitable for applications such as real-time analytics, online transaction processing (OLTP), and more. RisingWave adopts a cloud-native, distributed architecture to offer high availability and scalability.

Architecture

Frontend

The Frontend serves as the interface to the outside world. It processes SQL queries from clients, plans the execution of these queries, and coordinates the query execution across the ComputeNodes and other components. It's essentially the brain of the operation, translating user requests into actionable tasks.

ComputeNode

ComputeNodes are responsible for the actual data processing. They execute the tasks assigned by the Frontend, such as filtering, aggregation, and joining of data streams. These nodes can scale horizontally to increase processing capacity as data volume or query complexity grows.

Compactor

The Compactor optimizes data storage. It consolidates smaller data files into larger ones, improving storage efficiency and query performance. This process helps manage data at scale, ensuring that the storage layer remains efficient and performant over time.

MetaServer

The MetaServer manages metadata and cluster coordination. It keeps track of the schema, table information, and the state of various components within the system. It also helps in managing the cluster, including tasks like distributing workloads and balancing resources across ComputeNodes.

Is RisingWave the Right Streaming Database for Your Stream ...

Properties

SQL computation

Stateless Computations

Stateless computations involve operations where each execution is independent of previous ones. Examples include:

Simple Transformations:
```
SELECT UPPER(columnName) FROM table
```
for converting text to uppercase, which doesn't rely on previous rows' data.
Filtering Based on Static Conditions:
```
SELECT * FROM table WHERE columnValue < 100 
```
for selecting rows based on a static condition without needing historical data.
Aggregate Functions on Complete Data Sets:
```
SELECT COUNT(columnName) FROM table 
```
calculates the total number of entries in a column without segmenting the data over time or by sessions.

These examples reflect the efficiency of RisingWave in distributing stateless computations across nodes, enhancing scalability and fault tolerance.

Stateful Computations

Stateful computations are those that require knowledge of past computations or the state of a dataset over time. Examples include:

Windowed Aggregations:

SELECT AVG(columnName) 
OVER (PARTITION BY anotherColumn 
  ORDER BY timeColumn 
  RANGE BETWEEN INTERVAL 1 HOUR PRECEDING AND CURRENT ROW) 
FROM table

calculates the average over a moving time window, requiring the system to maintain state about past values.

Join Operations Requiring State: A stateful join example in streaming data might involve joining two streams based on timestamps, where the system needs to wait for matching timestamps to arrive in both streams.

Stateful computations in RisingWave are supported by sophisticated mechanisms for state management and consistency, ensuring that as data evolves, the system can accurately compute results that reflect the most current state of the world.

Hands-on project

Access the project's code by visiting the GitHub repository linked here. There, you will find a README and workshop files containing detailed instructions for the project.

Below is an outline of the project's structure:

Setup and Data Ingestion

Environment Setup: Begin by setting up your environment according to the README, ensuring all necessary tools and dependencies are installed.
Commands: Utilize source commands.sh in every new terminal session to load environment variables and commands essential for the workshop.
Docker Compose: Use a modified docker-compose.yml file to include RisingWave and other services like Clickhouse, Redpanda, Grafana, Prometheus, MinIO, and Etcd for a comprehensive setup.

Ingesting Data into RisingWave

Kafka Ingestion: Utilize the seed_kafka.py file to ingest taxi_zone data directly and trip_data via Kafka, simulating real-time data by adjusting timestamps to be near the current time.
Data Verification: Ensure data is correctly ingested into RisingWave by checking the tables with SQL queries in a psql session.

Stream Processing with Materialized Views

Real-Time Data Processing: Start processing the real-time data stream ingested into RisingWave by validating the ingested data and then moving on to more complex analyses.
Materialized Views (MV): Create various MVs for analyzing data, such as total airport pickups, longest trips, busiest zones, and average fare amount vs. number of rides. These MVs leverage SQL features like joins, window functions, and temporal filters to generate real-time insights from the streamed data.
Visualization and Analysis: Use Grafana for visualizing data and Clickhouse for further analysis of the data processed by RisingWave.

Data Sinking to ClickHouse

ClickHouse Integration: After processing the data in RisingWave, sink it to ClickHouse for extended analysis. This involves creating ClickHouse tables and sinks in RisingWave to transfer the data seamlessly.

Key Takeaways

RisingWave's integration with Kafka for data ingestion and Clickhouse for data sinking showcases its versatility in stream processing ecosystems.
The use of materialized views for real-time data analysis in RisingWave demonstrates powerful SQL-based stream processing capabilities.
Adjusting timestamps for simulating real-time data and the creative use of SQL for stream analytics highlight the practical aspects of dealing with streaming data.

Ramazan’s Substack

Discussion about this post