This post will discuss RisingWave, an open-source real-time analytics database designed for high-concurrency, large-scale, and low-latency data processing. It features a cloud-native, distributed architecture, ensuring scalability and high availability for real-time analytics and online transaction processing. RisingWave efficiently handles both stateless and stateful SQL computations, making it adept at managing streaming data. The hands-on project offers insights into setting up the environment, data ingestion, stream processing, and integrating with other services for a comprehensive understanding of its capabilities in real-time data analysis.
Sources:
Contents:
RisingWave is an open-source, real-time analytics database designed to handle large-scale, high-concurrency, and low-latency data processing tasks. It's built to efficiently manage streaming data, making it suitable for applications such as real-time analytics, online transaction processing (OLTP), and more. RisingWave adopts a cloud-native, distributed architecture to offer high availability and scalability.
Architecture
Frontend
The Frontend serves as the interface to the outside world. It processes SQL queries from clients, plans the execution of these queries, and coordinates the query execution across the ComputeNodes and other components. It's essentially the brain of the operation, translating user requests into actionable tasks.
ComputeNode
ComputeNodes are responsible for the actual data processing. They execute the tasks assigned by the Frontend, such as filtering, aggregation, and joining of data streams. These nodes can scale horizontally to increase processing capacity as data volume or query complexity grows.
Compactor
The Compactor optimizes data storage. It consolidates smaller data files into larger ones, improving storage efficiency and query performance. This process helps manage data at scale, ensuring that the storage layer remains efficient and performant over time.
MetaServer
The MetaServer manages metadata and cluster coordination. It keeps track of the schema, table information, and the state of various components within the system. It also helps in managing the cluster, including tasks like distributing workloads and balancing resources across ComputeNodes.
Properties
SQL computation
Stateless Computations
Stateless computations involve operations where each execution is independent of previous ones. Examples include:
Simple Transformations:
SELECT UPPER(columnName) FROM table
for converting text to uppercase, which doesn't rely on previous rows' data.
Filtering Based on Static Conditions:
SELECT * FROM table WHERE columnValue < 100
for selecting rows based on a static condition without needing historical data.
Aggregate Functions on Complete Data Sets:
SELECT COUNT(columnName) FROM table
calculates the total number of entries in a column without segmenting the data over time or by sessions.
These examples reflect the efficiency of RisingWave in distributing stateless computations across nodes, enhancing scalability and fault tolerance.
Stateful Computations
Stateful computations are those that require knowledge of past computations or the state of a dataset over time. Examples include:
Windowed Aggregations:
SELECT AVG(columnName) OVER (PARTITION BY anotherColumn ORDER BY timeColumn RANGE BETWEEN INTERVAL 1 HOUR PRECEDING AND CURRENT ROW) FROM table
calculates the average over a moving time window, requiring the system to maintain state about past values.
Join Operations Requiring State: A stateful join example in streaming data might involve joining two streams based on timestamps, where the system needs to wait for matching timestamps to arrive in both streams.
Stateful computations in RisingWave are supported by sophisticated mechanisms for state management and consistency, ensuring that as data evolves, the system can accurately compute results that reflect the most current state of the world.
Hands-on project
Access the project's code by visiting the GitHub repository linked here. There, you will find a README and workshop files containing detailed instructions for the project.
Below is an outline of the project's structure:
Setup and Data Ingestion
Environment Setup: Begin by setting up your environment according to the README, ensuring all necessary tools and dependencies are installed.
Commands: Utilize
source commands.sh
in every new terminal session to load environment variables and commands essential for the workshop.Docker Compose: Use a modified
docker-compose.yml
file to include RisingWave and other services like Clickhouse, Redpanda, Grafana, Prometheus, MinIO, and Etcd for a comprehensive setup.
Ingesting Data into RisingWave
Kafka Ingestion: Utilize the
seed_kafka.py
file to ingesttaxi_zone
data directly andtrip_data
via Kafka, simulating real-time data by adjusting timestamps to be near the current time.Data Verification: Ensure data is correctly ingested into RisingWave by checking the tables with SQL queries in a
psql
session.
Stream Processing with Materialized Views
Real-Time Data Processing: Start processing the real-time data stream ingested into RisingWave by validating the ingested data and then moving on to more complex analyses.
Materialized Views (MV): Create various MVs for analyzing data, such as total airport pickups, longest trips, busiest zones, and average fare amount vs. number of rides. These MVs leverage SQL features like joins, window functions, and temporal filters to generate real-time insights from the streamed data.
Visualization and Analysis: Use Grafana for visualizing data and Clickhouse for further analysis of the data processed by RisingWave.
Data Sinking to ClickHouse
ClickHouse Integration: After processing the data in RisingWave, sink it to ClickHouse for extended analysis. This involves creating ClickHouse tables and sinks in RisingWave to transfer the data seamlessly.
Key Takeaways
RisingWave's integration with Kafka for data ingestion and Clickhouse for data sinking showcases its versatility in stream processing ecosystems.
The use of materialized views for real-time data analysis in RisingWave demonstrates powerful SQL-based stream processing capabilities.
Adjusting timestamps for simulating real-time data and the creative use of SQL for stream analytics highlight the practical aspects of dealing with streaming data.