Converting data between Spark and pandas
Source: wangqiaoshi, 简书 (Jianshu), https://www.jianshu.com/p/fd528b78d17e
Traditional machine learning work is built on sklearn and xgboost, which offer a much richer set of algorithms than Spark MLlib, so MLlib alone cannot cover every need. A common split is to use Spark for data preprocessing and feature engineering and sklearn/xgboost for training, which requires converting data between Spark DataFrames and pandas DataFrames.

pandas DataFrame to Spark DataFrame:

import pandas as pd
from pyspark.sql import SparkSession

# Read the CSV with pandas into a pandas DataFrame
userDF = pd.read_csv("src/main/resources/upload.csv")

# Start a Spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .enableHiveSupport() \
    .getOrCreate()

# Build a Spark DataFrame from the pandas DataFrame
sparkDF = spark.createDataFrame(userDF)
sparkDF.show()

Spark DataFrame to pandas DataFrame (download.py):

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS user (userid int, name string)")
spark.sql("LOAD DATA LOCAL INPATH 'src/main/resources/user.txt' INTO TABLE user")

userSparkDF = spark.sql("select * from user")

# Collect the Spark DataFrame to the driver as a pandas DataFrame
userPandasDF = userSparkDF.toPandas()
print(userPandasDF)

spark.stop()
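To connect the two steps end to end, here is a minimal sketch (not from the original article) of the workflow the introduction describes: Spark does the feature engineering, toPandas() collects the result on the driver, and scikit-learn trains on it. The table name prepared_features and the columns feature_a, feature_b, label are hypothetical placeholders.

from pyspark.sql import SparkSession
from sklearn.linear_model import LogisticRegression

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Assume Spark-side preprocessing has already produced this table (hypothetical name/schema).
prepared = spark.sql("select feature_a, feature_b, label from prepared_features")

# Collect to the driver as a pandas DataFrame; only safe when the data fits in driver memory.
pdf = prepared.toPandas()

# Train a scikit-learn model on the collected data.
model = LogisticRegression()
model.fit(pdf[["feature_a", "feature_b"]], pdf["label"])

spark.stop()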
