[ChatGPT] Consume Kafka Messages with PySpark on Ubuntu in Windows (WSL)
Since you will be running commands in several terminal windows throughout this guide, I recommend installing a terminal tool such as Windows Terminal to manage them more easily.
Install Ubuntu in Windows
Open a CMD terminal window and run wsl --list --online. If you can see Ubuntu-20.04 listed, run wsl --install -d Ubuntu-20.04 to install it.
By default, it will be installed on the C drive. If you want to move it to another disk, follow the steps below. Let's assume we want to move it to the D drive.
Export the distribution
Create a .tar file of the distribution you want to move using wsl.exe --export:
wsl --export Ubuntu-20.04 d:\temp\Ubuntu-20.04.tar
Import the distribution into the target folder
Then, you can import the exported distribution into another folder.
wsl --import UbuntuCustom d:\ubuntu d:\temp\Ubuntu-20.04.tar
Delete old installation (optional)
wsl --unregister Ubuntu-20.04
Set up the default user
Inside the imported distribution, edit /etc/wsl.conf and add:
[user]
default = <your username>
Run new distribution
wsl -d UbuntuCustom
Install Kafka in Ubuntu
Update the System
Ensure that your system is up to date by running the following commands:
sudo apt update
sudo apt upgrade
Install Java Development Kit (JDK)
Kafka requires Java to run. You can install OpenJDK using the following command:
sudo apt install default-jdk
Download and Extract Kafka
Go to the Apache Kafka website and check for the latest stable version of Kafka. Then, use the wget command to download the Kafka binaries. For example:
wget https://downloads.apache.org/kafka/3.4.1/kafka_2.13-3.4.1.tgz
Once the download is complete, extract the downloaded archive:
tar -xzf kafka_2.13-3.4.1.tgz
Configure Kafka
By default, Kafka uses ZooKeeper for coordination. Starting with Kafka 2.8.0, Kafka also offers KRaft, a built-in consensus mechanism that removes the ZooKeeper dependency (production-ready since 3.3). The default config/server.properties shipped with the binaries still uses ZooKeeper, so this guide starts ZooKeeper as well; KRaft mode uses the separate configuration files under config/kraft/.
To configure Kafka, navigate to the extracted Kafka directory:
cd kafka_2.13-3.4.1
Next, you'll need to edit the server.properties file. Open it using a text editor:
nano config/server.properties
Find and modify the following property so that clients on your machine can reach the broker:
# The host/port the broker advertises to clients; clients use this address to connect.
advertised.listeners=PLAINTEXT://localhost:9092
Note that bootstrap.servers is not a broker setting; it is the host/port list that clients (the console tools and the PySpark application below) pass when connecting, e.g. localhost:9092.
Save and close the file.
Start Kafka
To start Kafka with the default ZooKeeper-based configuration, you'll need to start both the ZooKeeper and Kafka servers. Open two separate terminal windows and navigate to the Kafka directory in each window.
In the first terminal, start ZooKeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
In the second terminal, start the Kafka server:
bin/kafka-server-start.sh config/server.properties
Test Kafka
To test Kafka, you can create a topic and produce/consume messages.
In a new terminal, navigate to the Kafka directory, and run the following command to create a topic named "test_topic":
bin/kafka-topics.sh --create --topic test_topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Next, open two more terminals. In one terminal, run the following command to start a producer that will send messages to the "test_topic":
bin/kafka-console-producer.sh --topic test_topic --bootstrap-server localhost:9092
In the other terminal, start a consumer to receive messages from the "test_topic":
bin/kafka-console-consumer.sh --topic test_topic --bootstrap-server localhost:9092 --from-beginning
You can now start typing messages in the producer terminal, and they will be displayed in the consumer terminal.
Install Spark in Ubuntu
Prerequisites
Before installing Spark, ensure that you have Java Development Kit (JDK) installed on your system. You can install OpenJDK using the following command:
sudo apt install default-jdk
Download and Extract Spark
Go to the Apache Spark website and check for the latest stable version of Spark. Then, use the wget command to download the Spark binaries. For example:
wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
Once the download is complete, extract the downloaded archive:
tar -xzf spark-3.4.1-bin-hadoop3.tgz
Configure Spark Environment
Navigate to the extracted Spark directory:
cd spark-3.4.1-bin-hadoop3
Copy the template configuration file:
cp conf/spark-env.sh.template conf/spark-env.sh
Edit the conf/spark-env.sh file:
nano conf/spark-env.sh
Uncomment the JAVA_HOME line and set it to your Java installation path. For example:
export JAVA_HOME=/usr/lib/jvm/default-java
Save and close the file.
Set Spark Environment Variables
To use Spark from any directory, add it to your environment variables. Open the .bashrc file using a text editor:
nano ~/.bashrc
Add the following lines at the end of the file, adjusting the path to wherever you extracted Spark (this guide assumes /root):
export SPARK_HOME=/root/spark-3.4.1-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin
Save and close the file. Then, apply the changes by running:
source ~/.bashrc
Test Spark
To test the Spark installation, you can run a sample application.
In a terminal, start the Spark shell:
spark-shell
This will launch the Spark shell, and you should see the Spark logo and the Scala prompt.
Enter the following command in the Spark shell to test Spark's functionality:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.reduce((a, b) => a + b)
If Spark is installed correctly, it will perform the reduce operation and display the result.
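Since the rest of this guide uses PySpark, you can run the same sanity check from the PySpark shell instead; a minimal equivalent, assuming python3 is available on the system:
pyspark
Then, at the >>> prompt:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
distData.reduce(lambda a, b: a + b)  # should print 15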
Consume Kafka message with Pyspark
Install pip
Install pip by running the following command:
sudo apt install python3-pip
Install PySpark
PySpark is included in the Spark distribution, so if you have followed the previous instructions to install Spark, you should already have PySpark installed. If not, you can install it using the following command:
pip install pyspark
Set up Kafka Dependencies
Spark's Kafka source is provided by the spark-sql-kafka-0-10 connector, which is pulled in with the --packages flag when submitting the job (see the spark-submit command below), so no extra pip package is strictly required. If you also want a standalone Python Kafka client, for example to produce test messages from a script, you can optionally install kafka-python:
pip install kafka-python
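As an optional illustration, here is a small sketch that uses kafka-python to send a few test messages to the test_topic created earlier (it assumes the broker from the Kafka section is still running on localhost:9092):
from kafka import KafkaProducer

# Connect to the local broker started earlier in this guide.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for i in range(5):
    # Kafka messages are bytes, so encode the string payload.
    producer.send("test_topic", value=f"message {i}".encode("utf-8"))

producer.flush()   # ensure all messages are delivered before exiting
producer.close()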
Configure Spark to Use Kafka
To configure Spark to consume data from Kafka, you need to set the necessary Kafka properties. You can provide these properties when creating a SparkSession or by modifying the Spark configuration.
Here's an example of setting the Kafka properties in your Spark application:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("KafkaConsumerApp") \
    .config("spark.executor.extraClassPath", "/root/kafka_2.13-3.4.1/libs/kafka-clients-3.4.1.jar") \
    .config("spark.executor.extraLibraryPath", "/root/kafka_2.13-3.4.1/libs") \
    .getOrCreate()

# Set Kafka properties. Kafka consumer properties must be prefixed with "kafka.";
# "subscribe" is a Spark option naming the topic(s) to read.
kafkaParams = {
    "kafka.bootstrap.servers": "localhost:9092",
    "subscribe": "test_topic",
    "kafka.group.id": "your_consumer_group_id"
}

# Read from the Kafka topic as a batch DataFrame
# (columns: key, value, topic, partition, offset, timestamp, ...)
df = spark \
    .read \
    .format("kafka") \
    .options(**kafkaParams) \
    .load()

# Process the Kafka data
# (e.g., display the consumed messages; key and value are binary columns)
df.show()
# df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
#     .show(truncate=False)

# Continue processing or analysis on the DataFrame...
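If the message values are JSON, one way to continue is to parse the value column with from_json. A minimal sketch, assuming a hypothetical payload with id and name fields:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema for the JSON payload carried in the Kafka value column.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType())
])

parsed = df.select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select("data.*")
parsed.show(truncate=False)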
Submit and Run the PySpark Application
Save the above code to a Python script (e.g., kafka_consumer.py), and submit it to Spark for execution:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1 kafka_consumer.py
Spark will start the application, connect to Kafka, consume messages from the specified topic, and process them according to your logic.
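The example above performs a one-time batch read. If you would rather consume messages continuously, a minimal Structured Streaming sketch (reusing the same SparkSession; the checkpoint path is a placeholder you would choose yourself) looks like this:
# Continuously read new messages from the topic.
stream_df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test_topic") \
    .option("startingOffsets", "latest") \
    .load()

# Print each micro-batch to the console; the query runs until stopped.
query = stream_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .option("checkpointLocation", "/tmp/kafka_consumer_checkpoint") \
    .start()

query.awaitTermination()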
