I have installed Apache Flink and run start-cluster.sh. How do I use it with PyFlink?
 
ChatGPT said:

Perfect! Since you already have Apache Flink installed and the cluster is running, you can use PyFlink to submit jobs to that cluster. Here’s a clear step-by-step guide:


1. Install PyFlink

Even if you have Flink installed, you still need the Python API:

pip install apache-flink

This gives you access to the DataStream API and the Table API (including TableEnvironment) in Python.
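
As a quick sanity check, here is a minimal local smoke test (this runs an embedded mini-cluster in-process, independent of the cluster you started with start-cluster.sh):

from pyflink.table import TableEnvironment, EnvironmentSettings

# Runs entirely in-process; no connection to the standalone cluster.
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
t_env.execute_sql("SELECT 'PyFlink is installed' AS message").print()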


2. Connect PyFlink to Your Flink Cluster

You have two main ways to run Python jobs on a Flink cluster:

Option A: Submit via Flink CLI

  1. Create a Python job, e.g., my_job.py:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Build a tiny bounded stream and print each element to stdout.
env.from_collection([(1, 'Alice'), (2, 'Bob')]).print()
# The job name passed here is what appears in the Web UI.
env.execute("PyFlink Cluster Job")
  2. Use the Flink CLI to submit your Python job to the running cluster:

./bin/flink run -py my_job.py

  • -py tells Flink it’s a Python job.

  • This submits the job to the running cluster you started with start-cluster.sh.

  • You can monitor the job on http://localhost:8081 in the Web UI.


Option B: Use PyFlink TableEnvironment

You can also write Table API / SQL jobs in Python. Note that creating a TableEnvironment like this does not by itself connect to a remote cluster: run the script with plain python and it executes on a local embedded mini-cluster; submit it with ./bin/flink run -py (as in Option A) and it runs on the cluster you started with start-cluster.sh:

from pyflink.table import TableEnvironment, EnvironmentSettings

# Create a streaming TableEnvironment (local mini-cluster unless submitted via flink run)
settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(settings)

# Example Table API job
t_env.execute_sql("""
    CREATE TABLE MySource (
        id INT,
        name STRING
    ) WITH (
        'connector' = 'datagen'
    )
""")

t_env.execute_sql("SELECT * FROM MySource").print()
  • This approach is more advanced and useful for Table API jobs and SQL queries.

  • For most streaming or batch jobs, submitting via the CLI is simpler.
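
As a minimal sketch of that bounded variant (same local setup assumed), the datagen connector's 'number-of-rows' option makes the source finite, so the query finishes on its own:

from pyflink.table import TableEnvironment, EnvironmentSettings

settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(settings)

# 'number-of-rows' bounds the otherwise unbounded datagen source.
t_env.execute_sql("""
    CREATE TABLE BoundedSource (
        id INT,
        name STRING
    ) WITH (
        'connector' = 'datagen',
        'number-of-rows' = '10'
    )
""")

t_env.execute_sql("SELECT * FROM BoundedSource").print()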


3. Monitoring Jobs

  • Open Flink Web UI: http://localhost:8081

  • You can see running jobs, task managers, and metrics.

  • Logs for each job are available in the log/ directory of your Flink installation.
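
If you prefer checking from Python, here is a small sketch against Flink's REST API (the same API that backs the Web UI; the default port 8081 is assumed):

import json
from urllib.request import urlopen

# GET /jobs returns an overview of all jobs known to the JobManager.
with urlopen("http://localhost:8081/jobs") as resp:
    overview = json.load(resp)

for job in overview["jobs"]:
    print(job["id"], job["status"])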


4. Notes

  • PyFlink jobs require Java, so ensure your JAVA_HOME is set.

  • If you want to use Kafka, filesystem connectors, or other sources/sinks, make sure the appropriate Flink connector JARs are in lib/ (or attach them from the job itself, as sketched below).
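
A minimal sketch of the per-job alternative (the JAR path below is a placeholder; point it at the connector you actually downloaded):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Ship the connector JAR with this job instead of copying it to lib/.
# Placeholder path; adjust to your actual JAR location.
env.add_jars("file:///path/to/flink-sql-connector-kafka.jar")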


Summary

# 1. Install PyFlink
pip install apache-flink

# 2. Run cluster (already done)
./bin/start-cluster.sh

# 3. Submit a Python job
./bin/flink run -py my_job.py

# 4. Monitor
http://localhost:8081

 


If you want, I can write a complete PyFlink streaming example that reads from Kafka and writes to console using your running Flink cluster—this is a very realistic next step.

Do you want me to do that?

 

You said:

Yes, I do.

 
ChatGPT said:

Perfect! Since your Flink cluster is running, I’ll write a full PyFlink streaming example that:

  • Reads messages from a Kafka topic

  • Processes them (optional transformation)

  • Writes results to the console

This example assumes:

  • Kafka is running at localhost:9092

  • Topic is called input_topic (a creation sketch follows after this list)

  • Flink cluster is accessible locally
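
If the topic doesn't exist yet, here is a hedged helper using the kafka-python package (a separate client library, installed with pip install kafka-python; any Kafka admin tool works just as well):

from kafka.admin import KafkaAdminClient, NewTopic

# Assumes a broker at localhost:9092; adjust to your setup.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="input_topic",
                              num_partitions=1,
                              replication_factor=1)])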


1. Install Required Dependencies

Make sure you have:

pip install apache-flink

Also, make sure your Flink cluster has the Kafka connector JAR in lib/. The Kafka connector is released separately from Flink (its JAR carries its own version plus the targeted Flink version, with no Scala suffix since Flink 1.15), so download the release matching your cluster from Flink’s downloads page. Example for Flink 1.20:

  • flink-sql-connector-kafka-3.3.0-1.20.jar (the fat JAR, which bundles the Kafka client) → place in flink/lib/


2. PyFlink Streaming Job (kafka_to_console.py)

from pyflink.common import Types, WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer

# 1. Create StreamExecutionEnvironment
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)  # Optional: keep 1 for testing

# 2. Define Kafka Source
kafka_source = KafkaSource.builder() \
    .set_bootstrap_servers("localhost:9092") \
    .set_topics("input_topic") \
    .set_group_id("pyflink_consumer_group") \
    .set_starting_offsets(KafkaOffsetsInitializer.earliest()) \
    .set_value_only_deserializer(SimpleStringSchema()) \
    .build()

# 3. Create DataStream from Kafka (no event-time logic here, so no watermarks)
ds = env.from_source(
    source=kafka_source,
    watermark_strategy=WatermarkStrategy.no_watermarks(),
    source_name="KafkaSource"
)

# 4. Optional: simple transformation
ds = ds.map(lambda x: f"Processed: {x}", output_type=Types.STRING())

# 5. Print to console
ds.print()

# 6. Execute the streaming job
env.execute("Kafka to Console Streaming Job")

 


3. Run the PyFlink Job on Your Cluster

From your Flink installation folder:

./bin/flink run -py /path/to/kafka_to_console.py

  • This submits the job to your running Flink cluster.

  • You can monitor it in the Flink Web UI: http://localhost:8081


4. Notes

  • Kafka connector: Make sure the Kafka connector JAR is in your Flink lib/ folder.

  • Parallelism: Adjust based on your cluster.

  • Kafka topic: Make sure the topic exists and has messages (a small test producer is sketched below).
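
To put some test messages on the topic, here is a small producer sketch, again assuming the kafka-python package and a broker at localhost:9092 (any Kafka producer works just as well):

from kafka import KafkaProducer

# Publish five test messages for the PyFlink job to consume.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(5):
    producer.send("input_topic", f"message-{i}".encode("utf-8"))
producer.flush()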


This is a fully working PyFlink streaming example that connects to Kafka, processes the stream, and prints the results to the console.

 

posted on 2025-12-01 22:08 by ZhangZhihuiAAA