Real-Time Wikipedia Monitoring Using RisingWave, Instaclustr Cloud, and Apache Superset

Stream processing tools like RisingWave and Kafka empower data engineers to extract real-time insights from live data. This capability enhances decision-making and improves user experiences in a wide range of domains, including recommender systems, finance, logistics, automotive, IIoT devices in manufacturing, and retail.

In this blog, we will delve into the seamless integration between RisingWave, Kafka deployed in the Instaclustr Cloud, and Apache Superset. We will specifically focus on the scenario of monitoring Wikipedia edits and how these tools work together to achieve that goal.

RisingWave is a PostgreSQL-compatible streaming database that offers cost efficiency, scalability, and a true cloud-native architecture. It allows users to derive real-time insights from streaming data using SQL.

Instaclustr offers a fully managed and integrated platform for popular open-source tools such as Kafka, PostgreSQL, Cassandra, and Redis. It facilitates easy Kafka Connect integration and includes a dedicated ZooKeeper, delivering a seamless Kafka experience with a 100% open-source solution.

The technical stack

We will retrieve real-time data from the Wikipedia API, specifically capturing edits on Wikipedia articles and contributor information, and publish it to a Kafka topic.

This data will then be ingested into RisingWave, where we will create materialized views to perform operations like aggregations, time window operations, and data transformations to extract valuable insights.

Finally, we will transfer the data from RisingWave to Apache Superset for visualization and monitoring of real-time edits made by various contributors to Wikipedia.

Real-time Wikipedia monitoring technical stack using Instaclustr's Kafka, RisingWave, and Apache Superset.

Deploy Kafka on the Instaclustr Cloud

In order to start producing events, we need to set up a Kafka cluster. For this demonstration, we will create a Kafka cluster on the Instaclustr Cloud.

Set up Kafka on the Instaclustr Cloud

To begin, please sign up for a free Instaclustr account that will provide you with access to Kafka services. You can create your account by visiting the Instaclustr Cloud platform.

Instaclustr Cloud Account Registration: Create Your Account for Instaclustr Cloud Services.

Please refer to the Getting Started with Apache Kafka guide to create a Kafka cluster on the Instaclustr Cloud.

Once you have successfully created the Kafka cluster, add your computer's IP address to the cluster's firewall rules so that you can produce and consume data.

Configuring Firewall Rules for Instaclustr Cloud: Adding Your Computer's IP Address to the Cluster for Data Producing and Consuming.

Stream Wikipedia edits to Kafka

We will utilize the Python Wikipedia API to fetch information such as recent changes, user details, and user contributions.

Subsequently, we will transmit this data to the Kafka cluster on Instaclustr Cloud for ingestion into RisingWave; a minimal producer sketch is shown after the sample message below.

The JSON message follows the schema outlined below:

"contributor": Name of the Wikipedia contributor.
"title": Title of the Wikipedia article that was edited.
"edit_timestamp": Timestamp of the edit.
"registration": Registration date of the Wikipedia user.
"gender": Gender of the Wikipedia user.
"edit_count": Number of edits made by the Wikipedia user.

The sample message sent to the Kafka topic appears as follows:

{
  "contributor": "Teatreez",
  "title": "Supreme Court of South Africa",
  "edit_timestamp": "2023-12-03 18:23:02",
  "registration": "2006-12-30 18:42:21",
  "gender": "unknown",
  "edit_count": "10752"
}
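
For reference, here is a minimal producer sketch using the kafka-python client and the MediaWiki HTTP API. It is illustrative rather than the exact script used for this demo: the bootstrap server, topic name, and SASL credentials are placeholders that mirror the source definition shown later, and a real pipeline would add error handling and deduplication.

import json
import time

import requests
from kafka import KafkaProducer  # pip install kafka-python

API_URL = "https://en.wikipedia.org/w/api.php"

# Placeholder connection values -- replace with your Instaclustr
# cluster's address and the credentials shown in the console.
producer = KafkaProducer(
    bootstrap_servers="x.x.x.x:9092",
    security_protocol="SASL_PLAINTEXT",
    sasl_mechanism="SCRAM-SHA-256",
    sasl_plain_username="ickafka",
    sasl_plain_password="xxxxxx",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def fetch_recent_changes(limit=10):
    # List the most recent edits with the contributing user and timestamp.
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|user|timestamp",
        "rclimit": limit,
        "format": "json",
    }
    return requests.get(API_URL, params=params).json()["query"]["recentchanges"]

def fetch_user(username):
    # Look up the contributor's registration date, gender, and edit count.
    params = {
        "action": "query",
        "list": "users",
        "ususers": username,
        "usprop": "registration|gender|editcount",
        "format": "json",
    }
    return requests.get(API_URL, params=params).json()["query"]["users"][0]

# Naive polling loop; anonymous users have no registration date, which
# yields nulls that are filtered out downstream in RisingWave.
while True:
    for change in fetch_recent_changes():
        user = fetch_user(change["user"])
        producer.send("Insta-topic", {
            "contributor": change["user"],
            "title": change["title"],
            "edit_timestamp": change["timestamp"],
            "registration": user.get("registration"),
            "gender": user.get("gender", "unknown"),
            "edit_count": str(user.get("editcount", 0)),
        })
    producer.flush()
    time.sleep(5)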

Connect RisingWave to the Kafka topic

To get started with RisingWave, create a RisingWave cluster in RisingWave Cloud using the free plan. See the documentation of RisingWave Cloud for instructions.

RisingWave Cloud: Account Registration and Sign-In Process.

Note: To establish a connection between RisingWave and Instaclustr, it is important to add the NAT gateway IP addresses of your RisingWave Cloud cluster to the firewall rules of the Kafka cluster in Instaclustr Cloud. This step ensures smooth connectivity and helps avoid any potential connectivity errors.

Once you have successfully created the RisingWave cluster, proceed to create a source in RisingWave to ingest data from the Kafka topic in Instaclustr Cloud.

Use the following query to create a source that connects to the Kafka topic in Instaclustr Cloud. Make sure to fill in the authentication parameters accordingly.

CREATE SOURCE wiki_source (
  contributor VARCHAR,
  title VARCHAR,
  edit_timestamp TIMESTAMPTZ,
  registration TIMESTAMPTZ,
  gender VARCHAR,
  edit_count VARCHAR
) WITH (
  connector = 'kafka',
  topic = 'Insta-topic',
  properties.bootstrap.server = 'x.x.x.x:9092',
  scan.startup.mode = 'earliest',
  properties.sasl.mechanism = 'SCRAM-SHA-256',
  properties.security.protocol = 'SASL_PLAINTEXT',
  properties.sasl.username = 'ickafka',
  properties.sasl.password = 'xxxxxx'
) FORMAT PLAIN ENCODE JSON;

Next, we create a materialized view named wiki_mv based on the source wiki_source. Note that rows with null values are filtered out.

CREATE MATERIALIZED VIEW wiki_mv AS
SELECT
  contributor,
  title,
  CAST(edit_timestamp AS TIMESTAMP) AS edit_timestamp,
  CAST(registration AS TIMESTAMP) AS registration,
  gender,
  CAST(edit_count AS INT) AS edit_count
FROM wiki_source
WHERE edit_timestamp IS NOT NULL
  AND registration IS NOT NULL
  AND edit_count IS NOT NULL;

You can query the materialized view to see the latest data ingested from the source.

SELECT * FROM wiki_mv LIMIT 5;

The result of the above query is similar to the following:

contributor  | title                        | edit_timestamp            | registration              | gender  | edit_count
-------------+------------------------------+---------------------------+---------------------------+---------+-----------
Omnipaedista | Template:Good and evil       | 2023-12-03 10:22:02+00:00 | 2008-12-14 06:02:32+00:00 | male    | 222563
PepeBonus    | Moshi mo Inochi ga Egaketara | 2023-12-03 10:22:16+00:00 | 2012-06-02 13:39:53+00:00 | unknown | 20731
Koulog       | Ionikos F.C.                 | 2023-12-03 10:23:00+00:00 | 2023-10-28 05:52:35+00:00 | unknown | 691
Fau Tzy      | 2023 Liga 3 Maluku           | 2023-12-03 10:23:17+00:00 | 2022-07-23 09:53:11+00:00 | unknown | 4697
Cavarrone    | Cheers (season 8)            | 2023-12-03 10:23:40+00:00 | 2008-08-23 11:13:14+00:00 | male    | 83643

(5 rows)
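
Because RisingWave is PostgreSQL-compatible, the same query can also be issued programmatically with any Postgres driver. Below is a minimal sketch using psycopg2; the host, port, user, and password are placeholders to be replaced with the connection details from your RisingWave Cloud console.

import psycopg2  # pip install psycopg2-binary

# Placeholder connection details -- copy the host, user, and password
# from your RisingWave Cloud cluster's connection page.
conn = psycopg2.connect(
    host="<your-risingwave-host>",
    port=4566,
    user="<user>",
    password="<password>",
    dbname="dev",
)
with conn.cursor() as cur:
    cur.execute("SELECT * FROM wiki_mv LIMIT 5;")
    for row in cur.fetchall():
        print(row)
conn.close()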

This query creates a materialized view, gender_mv, by aggregating contributions from the wiki_mv materialized view into one-minute intervals. The materialized view provides counts for various aspects, including total contributions, contributions made by contributors with unknown gender, and contributions made by contributors with specified male or female genders within each time window. This allows for easier analysis and monitoring of contribution patterns based on gender.

CREATE MATERIALIZED VIEW gender_mv AS
SELECT
    COUNT(*) AS total_contributions,
    COUNT(CASE WHEN gender = 'unknown' THEN 1 END) AS contributions_by_unknown,
    COUNT(CASE WHEN gender != 'unknown' THEN 1 END) AS contributions_by_male_or_female,
    window_start, window_end
FROM TUMBLE(wiki_mv, edit_timestamp, INTERVAL '1 MINUTE')
GROUP BY window_start, window_end;

Similarly, this query establishes a materialized view named registration_mv, summarizing contributions from the wiki_mv materialized view into one-minute intervals. It includes counts for total contributions, contributions by accounts registered before January 1, 2020, and contributions by those registered after that date, within specified time windows.

CREATE MATERIALIZED VIEW registration_mv AS
SELECT
    COUNT(*) AS total_contributions,
    COUNT(CASE WHEN registration < '2020-01-01 01:00:00'::timestamp THEN 1 END) AS contributions_by_someone_registered_before_2020,
    COUNT(CASE WHEN registration > '2020-01-01 01:00:00'::timestamp THEN 1 END) AS contributions_by_someone_registered_after_2020,
    window_start, window_end
FROM TUMBLE(wiki_mv, edit_timestamp, INTERVAL '1 MINUTE')
GROUP BY window_start, window_end;

This query demonstrates the creation of a materialized view count_mv that summarizes contributions from the wiki_mv materialized view into one-minute intervals. The materialized view includes counts for different aspects, such as the total number of contributions, contributions made by contributors with an edit count of less than 1000, and contributions made by contributors with 1000 or more edits, within specific time windows.

CREATE MATERIALIZED VIEW count_mv AS
SELECT
    COUNT(*) AS total_contributions,
    COUNT(CASE WHEN edit_count < 1000 THEN 1 END) AS contributions_less_than_1000,
    COUNT(CASE WHEN edit_count >= 1000 THEN 1 END) AS contributions_1000_or_more,
    window_start, window_end
FROM TUMBLE(wiki_mv, edit_timestamp, INTERVAL '1 MINUTE')
GROUP BY window_start, window_end;

Send the data from RisingWave to Apache Superset for visualization

Apache Superset is an open-source tool for data exploration and visualization. We'll configure it to read data from RisingWave and build dashboards on top of the materialized views.

Connect RisingWave to Superset

You can use RisingWave as a data source in Apache Superset for creating visualizations and dashboards using the tables and materialized views in RisingWave. To understand the process, follow the instructions in Configure Superset to read data from RisingWave.
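
As part of that setup, RisingWave is registered as a database in Superset through a SQLAlchemy URI. With the sqlalchemy-risingwave driver installed, the URI generally takes a form like the one below, where the user, password, host, and port are placeholders for your cluster's connection details (refer to the linked guide for the exact format):

risingwave://<user>:<password>@<your-risingwave-host>:4566/dev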

After successfully connecting RisingWave to Apache Superset, we add the materialized views in RisingWave as datasets to create tables, various charts, and a unified dashboard.

Visualizing data with Apache Superset: table, charts, and dashboard

This table is generated using the wiki_mv dataset, showing information such as the Wikipedia contributor's name, registration date, gender, edit counts, and the Wikipedia article to which the contributor made edits.

Wikipedia Contributor Table: Featuring Contributor Name, Registration Date, Gender, Edit Counts, and Associated Wikipedia Article.

This area chart is created using the count_mv dataset, illustrating counts for total contributions, contributions with an edit count of less than 1000, and contributions with an edit count of 1000 or more, within specified time windows.

Wikipedia Contributions Chart: Illustrating Contribution Distribution by Contributor Edit Counts.

This line chart is based on the gender_mv dataset, depicting counts for total contributions, contributions from unknown gender, and contributions from specified genders during each time window.

Wikipedia Contributions Chart: Analyzing Contribution Patterns by Contributor Gender.

The bar chart displayed here is created using the registration_mv dataset. It visualizes counts for various types of contributions within 1-minute time windows. The chart provides information on the total number of contributions, contributions made by individuals registered before January 1, 2020, and contributions made by individuals registered after that date.

Wikipedia Contributions Chart: Visualizing Contributions Over Time by Contributor Registration Date.

This unified dashboard presents a collection of charts that allow for comprehensive real-time monitoring of Wikipedia edits performed by contributors. The dashboard provides valuable insights into contributor information and the articles that have been edited. By combining multiple charts, it offers a holistic view of the editing activities, contributing to a better understanding of the contributors involved and the impact of their edits on specific articles.

Dashboard for real-time monitoring of Wikipedia edits based on contributor information.

Conclusion

In this blog post, we present a stream processing solution for real-time monitoring of Wikipedia edits made by diverse contributors across multiple articles. We extract data from the Wikipedia API and transmit it to Kafka, deployed in the Instaclustr Cloud. The data is then ingested into RisingWave as a source, where materialized views perform the processing and analysis. Finally, we use Superset to build visualizations and a unified dashboard that showcase the derived insights, providing a comprehensive and dynamic view of ongoing Wikipedia edits.
