I have installed Apache Flink and run start-cluster.sh. How do I use it with PyFlink?
 
ChatGPT said:

Perfect! Since you already have Apache Flink installed and the cluster is running, you can use PyFlink to submit jobs to that cluster. Here’s a clear step-by-step guide:


1. Install PyFlink

Even if you have Flink installed, you still need the Python API:

pip install apache-flink

This gives you access to the DataStream API and the Table API (including TableEnvironment) in Python.
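
As a quick sanity check, here is a minimal local smoke test (this runs an embedded mini-cluster in-process, independent of the cluster you started with start-cluster.sh):

from pyflink.table import TableEnvironment, EnvironmentSettings

# Runs entirely in-process; no connection to the standalone cluster.
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
t_env.execute_sql("SELECT 'PyFlink is installed' AS message").print()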


2. Connect PyFlink to Your Flink Cluster

You have two main ways to run Python jobs on a Flink cluster:

Option A: Submit via Flink CLI

  1. Create a Python job, e.g., my_job.py:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Build a tiny bounded stream and print each element to stdout.
env.from_collection([(1, 'Alice'), (2, 'Bob')]).print()
# The job name passed here is what appears in the Web UI.
env.execute("PyFlink Cluster Job")
  2. Use the Flink CLI to submit your Python job to the running cluster:

./bin/flink run -py my_job.py

  • -py tells Flink it’s a Python job.

  • This submits the job to the running cluster you started with start-cluster.sh.

  • You can monitor the job on http://localhost:8081 in the Web UI.


Option B: Use PyFlink TableEnvironment

You can also write Table API / SQL jobs in Python. Note that creating a TableEnvironment like this does not by itself connect to a remote cluster: run the script with plain python and it executes on a local embedded mini-cluster; submit it with ./bin/flink run -py (as in Option A) and it runs on the cluster you started with start-cluster.sh:

from pyflink.table import TableEnvironment, EnvironmentSettings

# Create a streaming TableEnvironment (local mini-cluster unless submitted via flink run)
settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(settings)

# Example Table API job
t_env.execute_sql("""
    CREATE TABLE MySource (
        id INT,
        name STRING
    ) WITH (
        'connector' = 'datagen'
    )
""")

t_env.execute_sql("SELECT * FROM MySource").print()
  • This approach is more advanced and useful for Table API jobs and SQL queries.

  • For most streaming or batch jobs, submitting via the CLI is simpler.
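
As a minimal sketch of that bounded variant (same local setup assumed), the datagen connector's 'number-of-rows' option makes the source finite, so the query finishes on its own:

from pyflink.table import TableEnvironment, EnvironmentSettings

settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(settings)

# 'number-of-rows' bounds the otherwise unbounded datagen source.
t_env.execute_sql("""
    CREATE TABLE BoundedSource (
        id INT,
        name STRING
    ) WITH (
        'connector' = 'datagen',
        'number-of-rows' = '10'
    )
""")

t_env.execute_sql("SELECT * FROM BoundedSource").print()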


3. Monitoring Jobs

  • Open Flink Web UI: http://localhost:8081

  • You can see running jobs, task managers, and metrics.

  • Logs for each job are available in the log/ directory of your Flink installation.
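
If you prefer checking from Python, here is a small sketch against Flink's REST API (the same API that backs the Web UI; the default port 8081 is assumed):

import json
from urllib.request import urlopen

# GET /jobs returns an overview of all jobs known to the JobManager.
with urlopen("http://localhost:8081/jobs") as resp:
    overview = json.load(resp)

for job in overview["jobs"]:
    print(job["id"], job["status"])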


4. Notes

  • PyFlink jobs require Java, so ensure your JAVA_HOME is set.

  • If you want to use Kafka, filesystem connectors, or other sources/sinks, make sure the appropriate Flink connector JARs are in lib/ (or attach them from the job itself, as sketched below).
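
A minimal sketch of the per-job alternative (the JAR path below is a placeholder; point it at the connector you actually downloaded):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Ship the connector JAR with this job instead of copying it to lib/.
# Placeholder path; adjust to your actual JAR location.
env.add_jars("file:///path/to/flink-sql-connector-kafka.jar")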


Summary

# 1. Install PyFlink
pip install apache-flink

# 2. Run cluster (already done)
./bin/start-cluster.sh

# 3. Submit a Python job
./bin/flink run -py my_job.py

# 4. Monitor
http://localhost:8081

 


If you want, I can write a complete PyFlink streaming example that reads from Kafka and writes to console using your running Flink cluster—this is a very realistic next step.

Do you want me to do that?

 

You said:

Yes, I do.

 
ChatGPT said:

Perfect! Since your Flink cluster is running, I’ll write a full PyFlink streaming example that:

  • Reads messages from a Kafka topic

  • Processes them (optional transformation)

  • Writes results to the console

This example assumes:

  • Kafka is running at localhost:9092

  • Topic is called input_topic (a creation sketch follows after this list)

  • Flink cluster is accessible locally
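
If the topic doesn't exist yet, here is a hedged helper using the kafka-python package (a separate client library, installed with pip install kafka-python; any Kafka admin tool works just as well):

from kafka.admin import KafkaAdminClient, NewTopic

# Assumes a broker at localhost:9092; adjust to your setup.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="input_topic",
                              num_partitions=1,
                              replication_factor=1)])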


1. Install Required Dependencies

Make sure you have:

pip install apache-flink

Also, make sure your Flink cluster has the Kafka connector JAR in lib/. The Kafka connector is released separately from Flink (its JAR carries its own version plus the targeted Flink version, with no Scala suffix since Flink 1.15), so download the release matching your cluster from Flink’s downloads page. Example for Flink 1.20:

  • flink-sql-connector-kafka-3.3.0-1.20.jar (the fat JAR, which bundles the Kafka client) → place in flink/lib/


2. PyFlink Streaming Job (kafka_to_console.py)

from pyflink.common import Types, WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer

# 1. Create StreamExecutionEnvironment
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)  # Optional: keep 1 for testing

# 2. Define Kafka Source
kafka_source = KafkaSource.builder() \
    .set_bootstrap_servers("localhost:9092") \
    .set_topics("input_topic") \
    .set_group_id("pyflink_consumer_group") \
    .set_starting_offsets(KafkaOffsetsInitializer.earliest()) \
    .set_value_only_deserializer(SimpleStringSchema()) \
    .build()

# 3. Create DataStream from Kafka (no event-time logic here, so no watermarks)
ds = env.from_source(
    source=kafka_source,
    watermark_strategy=WatermarkStrategy.no_watermarks(),
    source_name="KafkaSource"
)

# 4. Optional: simple transformation
ds = ds.map(lambda x: f"Processed: {x}", output_type=Types.STRING())

# 5. Print to console
ds.print()

# 6. Execute the streaming job
env.execute("Kafka to Console Streaming Job")

 


3. Run the PyFlink Job on Your Cluster

From your Flink installation folder:

./bin/flink run -py /path/to/kafka_to_console.py

  • This submits the job to your running Flink cluster.

  • You can monitor it in the Flink Web UI: http://localhost:8081


4. Notes

  • Kafka connector: Make sure the Kafka connector JAR is in your Flink lib/ folder.

  • Parallelism: Adjust based on your cluster.

  • Kafka topic: Make sure the topic exists and has messages (a small test producer is sketched below).
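
To put some test messages on the topic, here is a small producer sketch, again assuming the kafka-python package and a broker at localhost:9092 (any Kafka producer works just as well):

from kafka import KafkaProducer

# Publish five test messages for the PyFlink job to consume.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(5):
    producer.send("input_topic", f"message-{i}".encode("utf-8"))
producer.flush()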


This is a fully working PyFlink streaming example that connects to Kafka, processes the stream, and prints the results to the console.

 

posted on 2025-12-01 22:08 by ZhangZhihuiAAA