I cannot take any input. I have a schema:
schm = StructType([
StructField("ID", IntegerType(), True),
StructField("fixed_credit_limit", IntegerType(), True),
StructField("credit_limit", IntegerType(), True),
StructField("due_amount", IntegerType(), True),
StructField("created_date", StringType(), True),
StructField("updated_date", StringType(), True),
StructField("agent_name_id",IntegerType(), True),
StructField("rfid_id", StringType(), True),
])
input = [(13158, 100, 100, 0, '05/29/2021 11:01:31', '05/29/2021 11:01:31', 5, '862b4497-577f-47f9-8725-dd6c397ce408')]
df1 = spark.createDataFrame(input, schm)
I want to take user input for agent_name_id, but it gives the error 'list' object is not callable.
How can I take the user input for agent_name_id?
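A likely cause, assuming the builtin input() function is called somewhere after this snippet runs: the name input has been rebound to the list of rows above, so input(...) now tries to call a list. A minimal sketch of a fix is to rename that list (it reuses schm from above; the variable name rows and the filter at the end are only illustrations):
# Hypothetical sketch: avoid shadowing the builtin input() by renaming the data list.
rows = [(13158, 100, 100, 0, '05/29/2021 11:01:31', '05/29/2021 11:01:31', 5, '862b4497-577f-47f9-8725-dd6c397ce408')]
df1 = spark.createDataFrame(rows, schm)

# The builtin input() is available again, so the value can be read from the user.
agent_name_id = int(input("Enter agent_name_id: "))
df1.filter(df1.agent_name_id == agent_name_id).show()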
I am new to Apache Spark, so forgive me if this is a noob question. I am trying to define a particular schema before reading in the dataset in order to speed up processing. There are a few data types that I am not sure how to define (ArrayType and StructType).
Here is a screenshot of the schema I am working with:
Here is what I have so far:
jsonSchema = StructType([StructField("attribution", ArrayType(), True),
StructField("averagingPeriod", StructType(), True),
StructField("city", StringType(), True),
StructField("coordinates", StructType(), True),
StructField("country", StringType(), True),
StructField("date", StructType(), True),
StructField("location", StringType(), True),
StructField("mobile", BooleanType(), True),
StructField("parameter", StringType(), True),
StructField("sourceName", StringType(), True),
StructField("sourceType", StringType(), True),
StructField("unit", StringType(), True),
StructField("value", DoubleType(), True)
])
My question is: How do I account for the name and url under the attribution column, the unit and value under the averagingPeriod column, etc.?
For reference, here is the dataset I am using: https://registry.opendata.aws/openaq/.
Here's an example for the ArrayType and StructType columns. It should be straightforward to do the same for all the other columns.
from pyspark.sql.types import *

jsonSchema = StructType([
    StructField("attribution", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("url", StringType())
    ])), True),
    StructField("averagingPeriod", StructType([
        StructField("unit", StringType()),
        StructField("value", DoubleType())
    ]), True),
    # ... etc.
])
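To apply the schema when loading the data (the path below is only a placeholder; adjust it to wherever your OpenAQ files live, and note that spark.read.json expects line-delimited JSON by default), pass it to the reader so Spark skips schema inference, which is where the speed-up comes from:
# Hypothetical usage: supply the schema up front so Spark does not scan the files to infer it.
df = spark.read.schema(jsonSchema).json("s3://your-bucket/openaq/*.json")
df.printSchema()
df.select("city", "parameter", "value", "averagingPeriod.unit").show(5, truncate=False)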
I am trying to create a PySpark DataFrame from a pandas DataFrame.
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, LongType
a_dict = {0: [(0, 9.821), (1, 82.185)]}
a_pd = pd.DataFrame.from_dict(a_dict.items())
a_pd.columns = ["row_num", "val"]
a_str = StructType([StructField("id", IntegerType(), True), StructField("prob", DoubleType(), True)])
my_schema = StructType([ StructField("row_num", LongType(), True),StructField("val", list(a_str), True)]) # error
a_df = spark.createDataFrame(a_pd, schema=my_schema)
error:
AssertionError: dataType [StructField(id,IntegerType,true), StructField(prob,DoubleType,true)] should be an instance of <class 'pyspark.sql.types.DataType'>
How do I define a valid schema for a list of tuples of (int, double) so that PySpark can understand it?
Thanks.
For a list of values, you have to use ArrayType. Below is your code, reproduced with a working example.
import pandas as pd
from pyspark.sql.types import (StructType, StructField, IntegerType, DoubleType,
                               LongType, ArrayType)

# Each dictionary value is a list of (id, prob) tuples.
a_dict = {0: [(0, 9.821), (1, 82.185)],
          1: [(0, 9.821), (1, 8.10), (3, 2.385)],
          2: [(0, 9.821), (1, 1.4485), (4, 5.15), (5, 6.104)]}
a_pd = pd.DataFrame.from_dict(a_dict.items())
a_pd.columns = ["row_num", "val"]
print(a_pd.head())

# The array element type is a struct of (id, prob).
a_str = StructType([StructField("id", IntegerType(), True), StructField("prob", DoubleType(), True)])
my_schema = StructType([StructField("row_num", LongType(), True), StructField("val", ArrayType(a_str), True)])
a_df = sqlContext.createDataFrame(a_pd, schema=my_schema)
a_df.show(truncate=False)
a_df.printSchema()
Output:
+-------+------------------------------------------------+
|row_num|val |
+-------+------------------------------------------------+
|0 |[[0, 9.821], [1, 82.185]] |
|1 |[[0, 9.821], [1, 8.1], [3, 2.385]] |
|2 |[[0, 9.821], [1, 1.4485], [4, 5.15], [5, 6.104]]|
+-------+------------------------------------------------+
root
|-- row_num: long (nullable = true)
|-- val: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: integer (nullable = true)
| | |-- prob: double (nullable = true)
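As a follow-up illustration (not part of the original answer), once the column is an array of structs, the nested fields can be pulled back out with explode:
# Hypothetical follow-up: one row per (id, prob) pair.
from pyspark.sql import functions as F

a_df.select("row_num", F.explode("val").alias("pair")) \
    .select("row_num", "pair.id", "pair.prob") \
    .show(truncate=False)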
I am trying to use Facebook Prophet in Spark in a Zeppelin environment, and I have tried to follow the exact steps from https://github.com/facebook/prophet/issues/517. However, I get errors like the one below, and I am simply not sure what I need to correct here or how to debug this.
My data contains a datetime feature called ds, the volume I want to predict called y, and a segment column; I am trying to build a model for each segment.
File"/hadoop14/yarn/nm/usercache/khasbab/appcache/application_1588090646020_2412/container_e168_1588090646020_2412_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o3737.showString.
%livycd.pyspark
from pyspark.sql.types import StructType,StructField,StringType,TimestampType,ArrayType,DoubleType
from pyspark.sql.functions import current_date
from pyspark.sql.functions import pandas_udf, PandasUDFType
from fbprophet import Prophet
from datetime import datetime
import pandas as pd
result_schema = StructType([
StructField('segment', StringType(), True),
StructField('ds', TimestampType(), True),
StructField('trend', ArrayType(DoubleType()), True),
StructField('trend_upper', ArrayType(DoubleType()), True),
StructField('trend_lower', ArrayType(DoubleType()), True),
StructField('yearly', ArrayType(DoubleType()), True),
StructField('yearly_upper', ArrayType(DoubleType()), True),
StructField('yearly_lower', ArrayType(DoubleType()), True),
StructField('yhat', ArrayType(DoubleType()), True),
StructField('yhat_upper', ArrayType(DoubleType()), True),
StructField('yhat_lower', ArrayType(DoubleType()), True),
StructField('multiplicative_terms', ArrayType(DoubleType()), True),
StructField('multiplicative_terms_upper', ArrayType(DoubleType()), True),
StructField('multiplicative_terms_lower', ArrayType(DoubleType()), True),
StructField('additive_terms', ArrayType(DoubleType()), True),
StructField('additive_terms_upper', ArrayType(DoubleType()), True),
StructField('additive_terms_lower', ArrayType(DoubleType()), True),
])
@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def forecast_loans(history_pd):
    # instantiate the model, configure the parameters
    model = Prophet(
        interval_width=0.95,
        growth='linear',
        daily_seasonality=False,
        weekly_seasonality=False,
        yearly_seasonality=True,
        seasonality_mode='multiplicative'
    )

    # history_pd['ds'] = pd.to_datetime(history_pd['ds'], errors='coerce', format='%Y-%m-%d')
    # .apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))

    # fit the model
    model.fit(history_pd.loc[:, ['ds', 'y']])

    # configure predictions
    future_pd = model.make_future_dataframe(
        periods=20,
        freq='W')

    # make predictions
    results_pd = model.predict(future_pd)

    # return predictions
    return pd.DataFrame({
        'segment': history_pd['segment'].values[0],
        'ds': [results_pd.loc[:, 'ds'].values.tolist()],
        'trend': [results_pd.loc[:, 'ds'].values.tolist()],
        'trend_upper': [results_pd.loc[:, 'trend_upper'].values.tolist()],
        'trend_lower': [results_pd.loc[:, 'trend_lower'].values.tolist()],
        'yearly': [results_pd.loc[:, 'yearly'].values.tolist()],
        'yearly_upper': [results_pd.loc[:, 'yearly_upper'].values.tolist()],
        'yearly_lower': [results_pd.loc[:, 'yearly_lower'].values.tolist()],
        'yhat': [results_pd.loc[:, 'yhat'].values.tolist()],
        'yhat_upper': [results_pd.loc[:, 'yhat_upper'].values.tolist()],
        'yhat_lower': [results_pd.loc[:, 'yhat_lower'].values.tolist()],
        'multiplicative_terms': [results_pd.loc[:, 'multiplicative_terms'].values.tolist()],
        'multiplicative_terms_upper': [results_pd.loc[:, 'multiplicative_terms_upper'].values.tolist()],
        'multiplicative_terms_lower': [results_pd.loc[:, 'multiplicative_terms_lower'].values.tolist()],
        'additive_terms': [results_pd.loc[:, 'additive_terms'].values.tolist()],
        'additive_terms_upper': [results_pd.loc[:, 'additive_terms_upper'].values.tolist()],
        'additive_terms_lower': [results_pd.loc[:, 'additive_terms_lower'].values.tolist()]
    })
    # return pd.concat([pd.DataFrame(results_pd), pd.DataFrame(history_pd[['segment']].values[0])], axis=1)

results = df3.groupBy('segment').apply(forecast_loans)
results.show()
I tweaked my code to the following and downgraded to pyarrow 0.14, as suggested in "Pandas scalar UDF failing, IllegalArgumentException", and it all worked! I believe downgrading pyarrow to 0.14 was the key for Spark 2.x versions, as commented on Stack Overflow.
The comment says the following: "The issue is not with pyarrow's new release, it is spark which has to upgrade and become compatible with pyarrow. (I am afraid we have to wait for Spark 3.0 to use the latest pyarrow.)"
%livycd.pyspark
from pyspark.sql.types import StructType,StructField,StringType,TimestampType,ArrayType,DoubleType
from pyspark.sql.functions import current_date
from pyspark.sql.functions import pandas_udf, PandasUDFType
from fbprophet import Prophet
from datetime import datetime
import pandas as pd
result_schema = StructType([
StructField('segment', StringType(), True),
StructField('ds', TimestampType(), True),
StructField('trend', DoubleType(), True),
StructField('trend_upper', DoubleType(), True),
StructField('trend_lower', DoubleType(), True),
StructField('yearly', DoubleType(), True),
StructField('yearly_upper', DoubleType(), True),
StructField('yearly_lower', DoubleType(), True),
StructField('yhat', DoubleType(), True),
StructField('yhat_upper', DoubleType(), True),
StructField('yhat_lower', DoubleType(), True),
StructField('multiplicative_terms', DoubleType(), True),
StructField('multiplicative_terms_upper', DoubleType(), True),
StructField('multiplicative_terms_lower', DoubleType(), True),
StructField('additive_terms', DoubleType(), True),
StructField('additive_terms_upper', DoubleType(), True),
StructField('additive_terms_lower', DoubleType(), True),
])
@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def forecast_loans(df):

    def prophet_model(df, test_start_date):
        df['ds'] = pd.to_datetime(df['ds'])

        # train
        ts_train = (df
                    .query('ds < @test_start_date')
                    .sort_values('ds')
                    )
        # test
        ts_test = (df
                   .query('ds >= @test_start_date')
                   .sort_values('ds')
                   .drop('y', axis=1)
                   )
        print(ts_test.columns)

        # instantiate the model, configure the parameters
        model = Prophet(
            interval_width=0.95,
            growth='linear',
            daily_seasonality=False,
            weekly_seasonality=False,
            yearly_seasonality=True,
            seasonality_mode='multiplicative'
        )

        # fit the model
        model.fit(ts_train.loc[:, ['ds', 'y']])

        # configure predictions
        future_pd = model.make_future_dataframe(
            periods=len(ts_test),
            freq='W')

        # make predictions
        results_pd = model.predict(future_pd)
        results_pd = pd.concat([results_pd, df['segment']], axis=1)

        return pd.DataFrame(results_pd, columns=result_schema.fieldNames())

    # return predictions
    return prophet_model(df, test_start_date='2019-03-31')

results = df3.groupBy('segment').apply(forecast_loans)
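As a quick sanity check (an illustration, not part of the original answer), you can confirm which pyarrow version the driver environment actually picks up:
# Hypothetical sanity check: confirm the pyarrow version visible to the driver.
import pyarrow
print(pyarrow.__version__)  # should report 0.14.x here, per the downgrade above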
Assuming you are using Spark 2.3.x or 2.4.x and PyArrow >= 0.15.0, there is a known compatibility issue between PyArrow and Spark.
The simplest solution is to set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1. The Spark documentation recommends setting it in conf/spark-env.sh, but you can also set it in your Linux shell or directly in your Python script before creating your spark_session.
import os
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"  # must be set before the SparkSession is created
spark_session = ...
Alternatively, you can downgrade PyArrow if that is an option for you, as noted in the other answer.
I have this RDD in PySpark and I want to make the schema.
Example of 1 row of RDD collected:
(('16/12/2006', '17:24:00', 4.216, 0.418, 234.84, 18.4, 0.0, 1.0, 17.0), 0)
customSchema = StructType([
StructField("Date", StringType(), True),
StructField("Hour", StringType(), True),
StructField("ActivePower", FloatType(), True),
StructField("ReactivePower", FloatType(), True),
StructField("Voltage", FloatType(), True),
StructField("Instensity", FloatType(), True),
StructField("Sub1", FloatType(), True),
StructField("Sub2", FloatType(), True),
StructField("Sub3", FloatType(), True),
StructField("ID", IntegerType(), True)])
The problem is that the index (the trailing zero) sits outside the tuple of data, and I don't know how to define the schema correctly.
Thank you in advance.
You're almost there. You just need another nested StructField:
from pyspark.sql.types import StructType, StructField, StringType, FloatType, IntegerType

data = [
    (('16/12/2006', '17:24:00', 4.216, 0.418, 234.84, 18.4, 0.0, 1.0, 17.0), 0)
]

# The inner tuple maps to a nested struct column; the trailing index becomes a top-level field.
schema = StructType([
    StructField("values", StructType([
        StructField("Date", StringType(), True),
        StructField("Hour", StringType(), True),
        StructField("ActivePower", FloatType(), True),
        StructField("ReactivePower", FloatType(), True),
        StructField("Voltage", FloatType(), True),
        StructField("Instensity", FloatType(), True),
        StructField("Sub1", FloatType(), True),
        StructField("Sub2", FloatType(), True),
        StructField("Sub3", FloatType(), True),
    ])),
    StructField("ID", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)
df.printSchema()
root
|-- values: struct (nullable = true)
| |-- Date: string (nullable = true)
| |-- Hour: string (nullable = true)
| |-- ActivePower: float (nullable = true)
| |-- ReactivePower: float (nullable = true)
| |-- Voltage: float (nullable = true)
| |-- Instensity: float (nullable = true)
| |-- Sub1: float (nullable = true)
| |-- Sub2: float (nullable = true)
| |-- Sub3: float (nullable = true)
|-- ID: integer (nullable = true)
df.show(1, False)
+----------------------------------------------------------+---+
|values |ID |
+----------------------------------------------------------+---+
|[16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0]|0 |
+----------------------------------------------------------+---+
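If you would rather work with a flat DataFrame (an illustration, not part of the original answer), the nested struct can be expanded with a star select:
# Hypothetical follow-up: expand the nested struct into top-level columns.
flat_df = df.select("values.*", "ID")
flat_df.printSchema()
flat_df.show(1, False)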