Cannot write dataframe into hive table after using UDF in Pyspark - numpy

I am trying to extract the first element of the probability column (vector data type) using a UDF in PySpark. I was able to get a new dataframe with the extracted values in the probability column, and I checked that the data type of the probability column has changed from vector to float. But I am not able to write the dataframe into a Hive table; I get a "numpy module not found" error. Deploy mode is client. Is there a workaround other than installing numpy on all the worker nodes?
Code -
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from pyspark.ml import PipelineModel
from pyspark_llap import HiveWarehouseSession

spark = (SparkSession
         .builder
         .appName("Model_Scoring")
         .master("yarn")
         .enableHiveSupport()
         .getOrCreate())
hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("hc360_models")
final_ads = spark.read.parquet("hdfs://DATAHUB/datahube/feature_engineering/final_ads.parquet")
model = PipelineModel.load("/tmp/fitted_model_new/")

# UDF that pulls the first element out of the ML probability vector
first_element = udf(lambda v: float(v[0]), FloatType())

out = model.transform(final_ads)
out = out.withColumn("probability", first_element("probability")).drop("features").drop("rawPrediction")
out.show(10)
out.write.mode("append") \
    .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR) \
    .option("table", "test_hypertension_table_final3") \
    .save()
spark.stop()
Error -
ImportError: ('No module named numpy', <function _parse_datatype_json_string at 0x7f83370922a8>, (u'{"type":"struct","fields":[{"name":"_0","type":{"type":"udt","class":"org.apache.spark.ml.linalg.VectorUDT","pyClass":"pyspark.ml.linalg.VectorUDT","sqlType":{"type":"struct","fields":[{"name":"type","type":"byte","nullable":false,"metadata":{}},{"name":"size","type":"integer","nullable":true,"metadata":{}},{"name":"indices","type":{"type":"array","elementType":"integer","containsNull":false},"nullable":true,"metadata":{}},{"name":"values","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}}]}},"nullable":true,"metadata":{}},{"name":"_1","type":"integer","nullable":true,"metadata":{}}]}',))
Sample data -
Schema -
StructType(List(StructField(patient_id,IntegerType,true),
StructField(carrier_operational_id,IntegerType,false),
StructField(gender_cde,StringType,true),
StructField(pre_fixed_mpr_qty,DecimalType(38,8),true),
StructField(idx_days_in_gap,DecimalType(11,1),true),
StructField(age,DecimalType(6,1),true),
StructField(post_fixed_mpr_adh_ind,DecimalType(2,1),true),
StructField(probability,FloatType,true),
StructField(prediction,DoubleType,false),
StructField(run_date,TimestampType,false)))
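A possible workaround (not from the original post): the numpy import happens when the Python worker deserializes the ML vector inside the UDF, so avoiding the Python UDF avoids needing numpy on the executors. A minimal sketch, assuming Spark 3.0+ where pyspark.ml.functions.vector_to_array is available; column and variable names are the ones from the question:
# UDF-free extraction of the first probability; the conversion runs on the JVM side,
# so numpy is not required on the worker nodes (assumes Spark 3.0+).
from pyspark.ml.functions import vector_to_array

out = model.transform(final_ads)
out = (out
       .withColumn("probability", vector_to_array("probability").getItem(0).cast("float"))
       .drop("features")
       .drop("rawPrediction"))
If upgrading Spark is not an option, the usual alternative to installing numpy on every node is to ship a packed virtualenv/conda environment to the executors (for example via spark.yarn.dist.archives).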

Related

Error in converting spark dataframe to pandas dataframe

I am using an external jar on my Spark cluster to load a DataFrame, and I am getting an error when attempting to convert this DataFrame to a pandas DataFrame.
The code is as follows:
from pyspark.sql import DataFrame

jvm = spark._jvm
gateway = jvm.com.<external.package.classname>  # placeholder for the external JVM class
data_w = gateway.loadData()
df = DataFrame(data_w, spark)  # wrap the JVM DataFrame in a PySpark DataFrame
pandas_df = df.toPandas()
The Spark DataFrame df has valid data. However, I am getting an error in the pandas conversion:
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py", line 67, in toPandas
'RuntimeConfig' object has no attribute 'sessionLocalTimeZone'
The Spark DataFrame has 2 date columns. Are there any jconf settings I need to add to my Spark context for the conversion?
I can see at https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/pandas/conversion.html that sessionLocalTimeZone is used there. Do we need to explicitly set this on the Spark jconf? I am not setting it locally and it works fine.
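A hedged note (not part of the original question): in the Spark releases that contain pyspark/sql/pandas/conversion.py, toPandas reads timezone settings from the SQLContext's internal conf, so the error can appear when a SparkSession is passed where a SQLContext is expected. A sketch of one commonly suggested workaround, assuming a PySpark version that still exposes the internal SparkSession._wrapped attribute (removed in newer releases), with data_w being the JVM DataFrame from the question:
from pyspark.sql import DataFrame

# Assumption: spark._wrapped is the backing SQLContext; DataFrame(jdf, sql_ctx)
# then finds the SQLConf (and sessionLocalTimeZone) that toPandas expects.
df = DataFrame(data_w, spark._wrapped)
pandas_df = df.toPandas()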

Getting an error while trying to convert a pandas DataFrame into a Spark DataFrame in Azure

I am having issues while trying to convert a pandas DataFrame into a Spark DataFrame in Azure. I have done it in similar ways before and it worked, but I am not sure why it's not working now. FYI, this was a pivot table which I converted into a pandas DataFrame using reset_index, but it is still showing an error. The code I used:
# Convert pandas dataframe to spark data frame
spark_Forecast_report122 = spark.createDataFrame(df_top_six_dup1.astype(str))
# write the data into table
spark_Forecast_report122.write.mode("overwrite").saveAsTable("default.spark_Forecast_report122")
sdff_Forecast_report122 = spark.read.table("spark_Forecast_report122")
Forecast_Price_df122 = sdff_Forecast_report122.toPandas()
display(Forecast_Price_df122)
I am attaching the error as an image (Image_1).
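Since the error itself is only visible in the screenshot, this is only a guess (not from the original post): pivot tables often carry a MultiIndex in their columns, and flattening the column names to plain strings before createDataFrame removes one frequent cause of this kind of failure. A sketch using the df_top_six_dup1 name from the question:
# Flatten any MultiIndex column labels to plain strings before handing the frame to Spark.
df_clean = df_top_six_dup1.reset_index()
df_clean.columns = ["_".join(map(str, c)) if isinstance(c, tuple) else str(c)
                    for c in df_clean.columns]

spark_Forecast_report122 = spark.createDataFrame(df_clean.astype(str))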

How to override type inference for partition columns in Hive partitioned dataset using PyArrow?

I have a partition column in my Hive-style partitioned Parquet dataset (written by PyArrow from a pandas DataFrame) with an entry like "TYPE=3860877578". When trying to read from this dataset, I get an error:
ArrowInvalid: error parsing '3760212050' as scalar of type int32
This is the first partition key that won't fit into an int32 (i.e. there are smaller integer values in other partitions - I think the inference must be done on the first one encountered). It looks like it should be possible to override the inferred type (to int64 or even string) at the dataset level, but I can't figure out how to get there from here :-) So far I have been using the Pandas.read_parquet() interface, and passing down filters, columns, etc. to PyArrow. I think I will need to use the PyArrow APIs directly, but don't know where to start.
How can I tell PyArrow to treat this column as an int64 or string type instead of trying to infer the type?
Example of dataset partition values that causes this problem:
/mydataset/TYPE=12345/*.parquet
/mydataset/TYPE=3760212050/*.parquet
Code that reproduces the problem with Pandas 1.1.1 and PyArrow 1.0.1:
import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset")
The issue can't be avoided by skipping the problematic value with filtering, because the partition values all appear to be parsed prior to filtering, i.e.:
import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset", filters=[[('TYPE','=','12345')]])
Since my original post, I have figured out that I can do what I want with the PyArrow API directly, like this:
from pyarrow.dataset import HivePartitioning
from pyarrow.parquet import ParquetDataset
import pyarrow as pa

# Declare the partition column type explicitly instead of letting it be inferred as int32
partitioning = HivePartitioning(pa.schema([("TYPE", pa.int64())]))

df = ParquetDataset("mydataset",
                    filters=filters,  # same filters as passed to read_parquet above
                    partitioning=partitioning,
                    use_legacy_dataset=False).read_pandas().to_pandas()
I'd like to be able to pass that info down through the Pandas read_parquet() interface but it doesn't appear possible at this time.
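For reference (not from the original post), the newer pyarrow.dataset API expresses the same idea; a minimal sketch, assuming PyArrow 1.0+ where pyarrow.dataset.dataset() and hive-flavored partitioning are available:
import pyarrow as pa
import pyarrow.dataset as ds

# Declare TYPE as int64 up front so no int32 inference is attempted on the partition values.
part = ds.partitioning(pa.schema([("TYPE", pa.int64())]), flavor="hive")
dataset = ds.dataset("mydataset", format="parquet", partitioning=part)

# Filtering happens after the partition schema is fixed, so large TYPE values are handled.
df = dataset.to_table(filter=ds.field("TYPE") == 12345).to_pandas()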

Can I extract or construct as a Pandas dataframe the table with coefficient values etc. provided by the summary() method in statsmodels?

I have run an OLS model in statsmodels and I would like to have the table in the summary as a Pandas dataframe.
This is what I mean:
I would like the table within the red frame to be constructed / extracted and become a Pandas DataFrame.
My code up to that point was straightforward:
from statsmodels.regression.linear_model import OLS

mod = OLS(endog=coded_design_poly_select.response.values,
          exog=coded_design_poly_select.iloc[:, :-1].values)
fitted_model = mod.fit()
fitted_model.summary()
What would you suggest?
The fitted_model is in fact a RegressionResults object that stores all the regression results, and you can access them via the corresponding methods/attributes.
For what you asked for, I believe the following code would work:
import pandas as pd

# Because endog/exog were passed as NumPy arrays, the results attributes are arrays and
# conf_int() returns an (n_params, 2) array: column 0 is the lower bound, column 1 the upper.
ci = fitted_model.conf_int()
data = {'coef': fitted_model.params,
        'std err': fitted_model.bse,
        't': fitted_model.tvalues,
        'P>|t|': fitted_model.pvalues,
        '[0.025': ci[:, 0],
        '0.975]': ci[:, 1]}
pd.DataFrame(data).round(3)
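As an aside (not in the original answer), statsmodels can also hand the coefficient table back directly; a small sketch, assuming a statsmodels version where RegressionResults.summary2() is available:
# summary2() builds its tables as pandas objects; tables[1] is the coefficient table.
coef_table = fitted_model.summary2().tables[1]

# Alternatively, parse the coefficient table out of the plain summary() via its HTML form
# (pd.read_html needs lxml or beautifulsoup installed).
import pandas as pd
coef_table = pd.read_html(fitted_model.summary().tables[1].as_html(),
                          header=0, index_col=0)[0]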

Create Spark DataFrame from Pandas DataFrames inside RDD

I'm trying to convert a Pandas DataFrame on each worker node (an RDD where each element is a Pandas DataFrame) into a Spark DataFrame across all worker nodes.
Example:
import pandas as pd

def read_file_and_process_with_pandas(filename):
    data = pd.read_csv(filename)
    """
    some additional operations using pandas functionality
    here the data is a pandas dataframe, and I am using some datetime
    indexing which isn't available for spark dataframes
    """
    return data

filelist = ['file1.csv', 'file2.csv', 'file3.csv']
rdd = sc.parallelize(filelist)
rdd = rdd.map(read_file_and_process_with_pandas)
The previous operations work, so I have an RDD of pandas DataFrames. How can I then convert this into a Spark DataFrame once I'm done with the pandas processing?
I tried doing rdd = rdd.map(spark.createDataFrame), but when I do something like rdd.take(5), I get the following error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o103.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is there a way to convert Pandas DataFrames in each worker node into a distributed DataFrame?
See this question: https://stackoverflow.com/a/51231046/7964197
I've had to deal with the same problem, which seems quite common (reading many files using pandas, e.g. Excel/pickle/any other non-Spark format, and converting the resulting RDD into a Spark DataFrame).
The supplied code adds a new method on the SparkSession that uses pyarrow to convert the pd.DataFrame objects into Arrow record batches, which are then directly converted to a pyspark.DataFrame object:
spark_df = spark.createFromPandasDataframesRDD(prdd) # prdd is an RDD of pd.DataFrame objects
For large amounts of data, this is orders of magnitude faster than converting to an RDD of Row() objects.
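The pyarrow-based helper itself lives in the linked answer and is not reproduced here; as a simpler (and much slower) sketch of the same idea, each pandas DataFrame can be flattened to plain rows on the workers and handed to createDataFrame. The helper name below is illustrative:
# Row-based fallback: flatten each pandas DataFrame into dicts, then let Spark
# build one distributed DataFrame from the combined RDD.
def pandas_rdd_to_spark_df(spark, prdd):
    row_rdd = prdd.flatMap(lambda pdf: pdf.to_dict("records"))
    return spark.createDataFrame(row_rdd)

spark_df = pandas_rdd_to_spark_df(spark, rdd)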
Pandas DataFrames cannot be directly converted to an RDD.
You can create a Spark DataFrame from a pandas DataFrame:
spark_df = context.createDataFrame(pandas_df)
Reference: Introducing DataFrames in Apache Spark for Large Scale Data Science