[ChatGPT] Consuming Kafka Messages with PySpark on Windows (Ubuntu via WSL)

Since you will be running a series of commands on Windows, I recommend installing a Windows terminal tool (such as Windows Terminal) so you can manage multiple command windows more easily.

Install Ubuntu in Windows

Open a CMD terminal window and run wsl --list --online. If Ubuntu-20.04 appears in the list, run wsl --install -d Ubuntu-20.04 to install it.

By default, it will be installed on the C drive. If you want to move it to another disk, follow the steps below. Let's assume we want to move it to the D drive.

Export the distribution

Create a .tar file of the distribution you want to move using wsl.exe --export:

wsl --export Ubuntu-20.04 d:\temp\Ubuntu-20.04.tar

Import the distribution into the target folder

Then, you can import the exported distribution into another folder.

wsl --import UbuntuCustom d:\ubuntu d:\temp\Ubuntu-20.04.tar

Delete old installation (optional)

wsl --unregister Ubuntu-20.04

Set up the default user

A distribution imported with wsl --import logs in as root by default. To restore your regular user, edit /etc/wsl.conf inside the distribution and add:

[user]
default = <your username>

Run the new distribution

wsl -d UbuntuCustom

Install Kafka in Ubuntu

Update the System

Ensure that your system is up to date by running the following commands:

sudo apt update
sudo apt upgrade

Install Java Development Kit (JDK)

Kafka requires Java to run. You can install OpenJDK using the following command:

sudo apt install default-jdk

Download and Extract Kafka

Go to the Apache Kafka website and check for the latest stable version of Kafka. Then, use the wget command to download the Kafka binaries. For example:

wget https://downloads.apache.org/kafka/3.4.1/kafka_2.13-3.4.1.tgz

Once the download is complete, extract the downloaded archive:

tar -xzf kafka_2.13-3.4.1.tgz

Configure Kafka

Kafka has traditionally used ZooKeeper for coordination. Starting with Kafka 2.8.0, a built-in, ZooKeeper-free mode (KRaft) is available, but the default configuration files in the download still use ZooKeeper, so this guide starts ZooKeeper alongside the broker.

To configure Kafka, navigate to the extracted Kafka directory:

cd kafka_2.13-3.4.1

Next, you'll need to edit the server.properties file. Open it using a text editor:

nano config/server.properties

Find and modify the following properties. Note that bootstrap.servers is a client-side setting and does not belong in server.properties; on the broker side, the relevant settings are listeners and advertised.listeners:

# The address the broker binds to on startup.
listeners=PLAINTEXT://localhost:9092

# The address the broker advertises to clients; clients use this to connect.
advertised.listeners=PLAINTEXT://localhost:9092

Save and close the file.

Start Kafka

To start Kafka in the default ZooKeeper mode, you'll need to start both the ZooKeeper and Kafka servers. Open two separate terminal windows and navigate to the Kafka directory in each window.

In the first terminal, start ZooKeeper:

bin/zookeeper-server-start.sh config/zookeeper.properties

In the second terminal, start the Kafka server:

bin/kafka-server-start.sh config/server.properties
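
Once the broker reports that it has started, you can quickly confirm from Python that it is listening on the default port. This is just a minimal sketch using only the standard library; the host and port match the defaults used in this guide:

import socket

# Try to open a TCP connection to the Kafka broker's default listener
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    result = s.connect_ex(("localhost", 9092))

print("Kafka broker reachable" if result == 0 else f"Connection failed (errno {result})")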

Test Kafka

To test Kafka, you can create a topic and produce/consume messages.

In a new terminal, navigate to the Kafka directory, and run the following command to create a topic named "test_topic":

bin/kafka-topics.sh --create --topic test_topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

Next, open two more terminals. In one terminal, run the following command to start a producer that will send messages to the "test_topic":

bin/kafka-console-producer.sh --topic test_topic --bootstrap-server localhost:9092

In the other terminal, start a consumer to receive messages from the "test_topic":

bin/kafka-console-consumer.sh --topic test_topic --bootstrap-server localhost:9092 --from-beginning

You can now start typing messages in the producer terminal, and they will be displayed in the consumer terminal.

Install Spark in Ubuntu

Prerequisites

Before installing Spark, ensure that the Java Development Kit (JDK) is installed. If you followed the Kafka section above, it already is; otherwise, install OpenJDK with:

sudo apt install default-jdk

Download and Extract Spark

Go to the Apache Spark website and check for the latest stable version of Spark. Then, use the wget command to download the Spark binaries. For example:

wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz

Once the download is complete, extract the downloaded archive:

tar -xzf spark-3.4.1-bin-hadoop3.tgz

Configure Spark Environment

Navigate to the extracted Spark directory:

cd spark-3.4.1-bin-hadoop3

Copy the template configuration file:

cp conf/spark-env.sh.template conf/spark-env.sh

Edit the conf/spark-env.sh file:

nano conf/spark-env.sh

Set JAVA_HOME to your Java installation path (add the line if the template does not already contain one). For example:

export JAVA_HOME=/usr/lib/jvm/default-java

Save and close the file.

Set Spark as an Environment Variable

To use Spark globally on your system, you can set it as an environment variable. Open the .bashrc file using a text editor:

nano ~/.bashrc

Add the following lines at the end of the file, adjusting SPARK_HOME to the directory where you extracted Spark (in this guide it was extracted under /root):

export SPARK_HOME=/root/spark-3.4.1-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin

Save and close the file. Then, apply the changes by running:

source ~/.bashrc

Test Spark

To test the Spark installation, you can run a sample application.

In a terminal, start the Spark shell:

spark-shell

This will launch the Spark shell, and you should see the Spark logo and the Scala prompt.

Enter the following command in the Spark shell to test Spark's functionality:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.reduce((a, b) => a + b)

If Spark is installed correctly, it will perform the reduce operation and display the result.
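
If you prefer to check from Python instead, the equivalent test can be run in the PySpark shell (start it with the pyspark command; the shell pre-creates sc for you):

# Run inside the PySpark shell, where `sc` (the SparkContext) is already defined
data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data)
print(dist_data.reduce(lambda a, b: a + b))  # should print 15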

Consume Kafka Messages with PySpark

Install pip

Install pip by running the following command:

sudo apt install python3-pip

Install PySpark

PySpark is included in the Spark distribution, so if you have followed the previous instructions to install Spark, you should already have PySpark installed. If not, you can install it using the following command:

pip install pyspark
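
To verify that the pyspark module is importable, a quick check from a Python prompt:

import pyspark

# Should print the PySpark version, e.g. 3.4.1 if it matches the Spark build above
print(pyspark.__version__)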

Set up Kafka Dependencies

Spark's own Kafka integration comes from the spark-sql-kafka-0-10 connector, which is supplied at submit time with the --packages option (shown in the last step), so no extra pip package is strictly required. If you also want a plain Python Kafka client for quick tests outside of Spark, you can optionally install kafka-python:

pip install kafka-python
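
kafka-python is not required by Spark itself, but it is handy for scripted smoke tests against the broker. A minimal sketch, assuming the localhost broker and the test_topic created earlier:

from kafka import KafkaProducer, KafkaConsumer

# Send a single test message to the topic created earlier
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test_topic", b"hello from kafka-python")
producer.flush()

# Read messages back from the beginning of the topic
consumer = KafkaConsumer(
    "test_topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 s of inactivity
)
for message in consumer:
    print(message.value.decode("utf-8"))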

Configure Spark to Use Kafka

To consume data from Kafka, you pass the necessary Kafka options to Spark's DataFrame reader. You can provide these options when reading from the Kafka source, as shown below; the Kafka connector itself is supplied with the --packages option when you submit the job.

Here's an example of setting the Kafka properties in your Spark application:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KafkaConsumerApp") \
    .getOrCreate()

# Set Kafka source options (the connector jar is supplied via --packages at submit time)
kafkaParams = {
    "kafka.bootstrap.servers": "localhost:9092",
    "subscribe": "test_topic",
    "startingOffsets": "earliest"
}

# Read the topic as a batch DataFrame
df = spark \
    .read \
    .format("kafka") \
    .options(**kafkaParams) \
    .load()

# Process the Kafka data
# (Kafka keys and values are binary, so cast them to strings before displaying)
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "topic", "partition", "offset") \
    .show(truncate=False)

# Continue processing or analysis on the DataFrame...
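
The batch read above consumes whatever is already in the topic and then finishes. If you want the application to keep consuming new messages as they arrive, the same Kafka source works with Structured Streaming via readStream. A minimal sketch under the same assumptions (localhost broker, test_topic), writing each micro-batch to the console:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStreamingApp").getOrCreate()

# Continuously read new records from the topic
stream_df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test_topic") \
    .option("startingOffsets", "latest") \
    .load()

# Cast the binary key/value columns to strings and print each micro-batch
query = stream_df \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .outputMode("append") \
    .start()

query.awaitTermination()

Submit it the same way as the batch script (with the --packages option shown in the next step), and stop it with Ctrl+C.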

Submit and Run the PySpark Application

Save the above code to a Python script (e.g., kafka_consumer.py), and submit it to Spark for execution:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1 kafka_consumer.py

Spark will start the application, connect to Kafka, consume messages from the specified topic, and process them according to your logic.
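
A typical next step is to parse the value column. As an illustrative sketch, continuing from the df defined in kafka_consumer.py and assuming the messages are JSON strings with a hypothetical id/name schema (adjust the schema to your actual data):

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Hypothetical schema for JSON-encoded message values; adjust to your data
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

parsed_df = df \
    .selectExpr("CAST(value AS STRING) AS json_value") \
    .select(from_json(col("json_value"), schema).alias("data")) \
    .select("data.*")

parsed_df.show(truncate=False)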

posted @ 2023-07-10 13:32  Destiny130