I am trying to use pyspark.pandas to read an Excel file, and then I need to convert the pandas DataFrame to a PySpark DataFrame.
df = pandas.read_excel(filepath, sheet_name="A", skiprows=12, usecols="B:AM", parse_dates=True)
pyspark_df = spark.createDataFrame(df)
When I do this, I get the following error:
TypeError: Can not infer schema for type:
Even though I tried to specify the dtype for read_excel and define the schema, I still get the error:
df = pandas.read_excel(filepath, sheet_name="A", skiprows=12, usecols="B:AM", parse_dates=True, dtype=dtypetest)
pyspark_df = spark.createDataFrame(df, schema)
Would you tell me how to solve it?
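For reference, here is a minimal sketch of the explicit-schema route I mean (toy data and column names, not my real ones; spark is an existing SparkSession):

import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Toy stand-in for the frame that read_excel produces
df = pd.DataFrame({"name": ["a", "b", None], "amount": [1.5, None, 3.0]})

# Replace NaN with None so Spark sees proper nulls instead of numpy NaN in object columns
df = df.where(pd.notnull(df), None)

schema = StructType([
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])

pyspark_df = spark.createDataFrame(df, schema)
pyspark_df.show()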
I am having issues converting a pandas DataFrame into a Spark DataFrame in Azure. I have done this in similar ways before and it worked, but I am not sure why it is not working now. FYI, this was a pivot table which I converted into a regular pandas DataFrame using reset_index, but it still throws the error. The code I used:
# Convert pandas dataframe to spark data frame
spark_Forecast_report122 = spark.createDataFrame(df_top_six_dup1.astype(str))
# write the data into table
spark_Forecast_report122.write.mode("overwrite").saveAsTable("default.spark_Forecast_report122")
sdff_Forecast_report122 = spark.read.table("spark_Forecast_report122")
Forecast_Price_df122 = sdff_Forecast_report122.toPandas()
display(Forecast_Price_df122)
I am attaching the error as an image: Image_1
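To make the pivot / reset_index step concrete, a toy version of the flow looks roughly like this (the data and column names are made up for illustration; spark is the existing session):

import pandas as pd

raw = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "month": ["jan", "feb", "jan", "feb"],
    "forecast": [10.0, 12.5, 8.0, 9.5],
})

# Pivot, then flatten back to a plain DataFrame with reset_index
pivot = raw.pivot_table(index="region", columns="month", values="forecast")
df_top_six_dup1 = pivot.reset_index()

# Make sure the column labels are plain strings after the pivot
df_top_six_dup1.columns = [str(c) for c in df_top_six_dup1.columns]

spark_Forecast_report122 = spark.createDataFrame(df_top_six_dup1.astype(str))
spark_Forecast_report122.show()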
I read data from Snowflake into AWS Glue using Spark, which results in a Spark DataFrame called df. After that I added the following to convert it to a pandas DataFrame:
df2 = df.toPandas()
However, this is causing an error in AWS Glue.
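For reference, a minimal sketch of the same conversion with Arrow enabled (assuming Spark 3.x, where the config key is spark.sql.execution.arrow.pyspark.enabled; on 2.x it is spark.sql.execution.arrow.enabled):

# Enable Arrow so toPandas() uses columnar transfer instead of row-by-row pickling
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Pull a small sample first to check that the conversion itself works
df2 = df.limit(1000).toPandas()
print(df2.dtypes)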
I am trying to extract the first element of the probability column (vector data type) using a UDF in PySpark. I was able to get a new dataframe with the extracted values in the probability column, and I checked that the column's data type changed from vector to float. But I am not able to write the dataframe into a Hive table; I get a "numpy module not found" error. Deploy mode is client. Is there a workaround other than installing numpy on all the worker nodes?
Code -
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from pyspark.ml import PipelineModel
from pyspark_llap import HiveWarehouseSession  # Hive Warehouse Connector package

spark = (SparkSession
         .builder
         .appName("Model_Scoring")
         .master('yarn')
         .enableHiveSupport()
         .getOrCreate())

hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("hc360_models")

final_ads = spark.read.parquet("hdfs://DATAHUB/datahube/feature_engineering/final_ads.parquet")
model = PipelineModel.load("/tmp/fitted_model_new/")

# UDF that pulls the first element out of the probability vector
first_element = udf(lambda v: float(v[0]), FloatType())

out = model.transform(final_ads)
out = out.withColumn("probability", first_element("probability")).drop('features').drop('rawPrediction')
out.show(10)

out.write.mode("append").format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR).option("table", "test_hypertension_table_final3").save()

spark.stop()
Error -
ImportError: ('No module named numpy', <function _parse_datatype_json_string at 0x7f83370922a8>, (u'{"type":"struct","fields":[{"name":"_0","type":{"type":"udt","class":"org.apache.spark.ml.linalg.VectorUDT","pyClass":"pyspark.ml.linalg.VectorUDT","sqlType":{"type":"struct","fields":[{"name":"type","type":"byte","nullable":false,"metadata":{}},{"name":"size","type":"integer","nullable":true,"metadata":{}},{"name":"indices","type":{"type":"array","elementType":"integer","containsNull":false},"nullable":true,"metadata":{}},{"name":"values","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}}]}},"nullable":true,"metadata":{}},{"name":"_1","type":"integer","nullable":true,"metadata":{}}]}',))
Sample data -
Schema -
StructType(List(StructField(patient_id,IntegerType,true),
StructField(carrier_operational_id,IntegerType,false),
StructField(gender_cde,StringType,true),
StructField(pre_fixed_mpr_qty,DecimalType(38,8),true),
StructField(idx_days_in_gap,DecimalType(11,1),true),
StructField(age,DecimalType(6,1),true),
StructField(post_fixed_mpr_adh_ind,DecimalType(2,1),true),
StructField(probability,FloatType,true),
StructField(prediction,DoubleType,false),
StructField(run_date,TimestampType,false)))
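For reference, a possible UDF-free sketch of the same extraction step, assuming Spark 3.0+ where pyspark.ml.functions.vector_to_array is available (so no numpy is needed inside a Python UDF on the executors):

from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

out = model.transform(final_ads)

# vector_to_array turns the ML vector into array<double>; take element 0 of it
out = (out
       .withColumn("probability", vector_to_array(col("probability")).getItem(0))
       .drop("features")
       .drop("rawPrediction"))

out.show(10)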
I'm trying to convert the Pandas DataFrames sitting on each worker node (an RDD where each element is a Pandas DataFrame) into a single Spark DataFrame distributed across all worker nodes.
Example:
import pandas as pd

def read_file_and_process_with_pandas(filename):
    data = pd.read_csv(filename)
    """
    some additional operations using pandas functionality
    here the data is a pandas dataframe, and I am using some datetime
    indexing which isn't available for spark dataframes
    """
    return data

filelist = ['file1.csv', 'file2.csv', 'file3.csv']
rdd = sc.parallelize(filelist)
rdd = rdd.map(read_file_and_process_with_pandas)
The operations above work, so I end up with an RDD of Pandas DataFrames. How can I convert this into a Spark DataFrame once I'm done with the Pandas processing?
I tried doing rdd = rdd.map(spark.createDataFrame), but when I do something like rdd.take(5), I get the following error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o103.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is there a way to convert Pandas DataFrames in each worker node into a distributed DataFrame?
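A baseline sketch of one way to do this: flatten each pandas DataFrame into Row objects on the workers and let Spark infer the schema (not necessarily the fastest for large data, and it assumes all frames share the same columns):

from pyspark.sql import Row

def to_rows(pdf):
    # one Row per pandas row; Spark infers the schema from the Row fields
    return [Row(**rec) for rec in pdf.to_dict("records")]

rows_rdd = rdd.flatMap(to_rows)
spark_df = spark.createDataFrame(rows_rdd)
spark_df.show(5)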
See this question: https://stackoverflow.com/a/51231046/7964197
I've had to deal with the same problem, which seems quite common (reading many files with pandas, e.g. Excel/pickle/any other non-Spark format, and converting the resulting RDD into a Spark DataFrame).
The supplied code adds a new method on the SparkSession that uses pyarrow to convert the pd.DataFrame objects into Arrow record batches, which are then converted directly into a pyspark.DataFrame object:
spark_df = spark.createFromPandasDataframesRDD(prdd) # prdd is an RDD of pd.DataFrame objects
For large amounts of data, this is orders of magnitude faster than converting to an RDD of Row() objects.
Pandas DataFrames cannot be converted directly to an RDD.
You can, however, create a Spark DataFrame from a Pandas DataFrame:
spark_df = context.createDataFrame(pandas_df)
Reference: Introducing DataFrames in Apache Spark for Large Scale Data Science
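A fuller, self-contained sketch of that conversion, using a SparkSession and a made-up pandas DataFrame just for illustration:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas_to_spark").getOrCreate()

pandas_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

spark_df = spark.createDataFrame(pandas_df)
spark_df.show()

# Going back the other way is just as direct
round_trip = spark_df.toPandas()
print(round_trip)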