I am trying to deploy a simple if-else function specifically using pandas_udf.
Here is the code:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pandas as pd
#pandas_udf("string", PandasUDFType.SCALAR )
def seq_sum1(col1,col2):
if col1 + col2 <= 6:
v = "low"
elif ((col1 + col2 > 6) & (col1 + col2 <=10)) :
v = "medium"
else:
v = "High"
return (v)
# Deploy
df.select("*",seq_sum1('c1','c2').alias('new_col')).show(10)
this results in an error:
PythonException: An exception was thrown from a UDF: 'ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', from <command-1220380192863042>, line 13. Full traceback below:
If I deploy the same code but use @udf instead of @pandas_udf, it produces the results as expected.
However, pandas_udf doesn't seem to work.
I know that this kind of functionality can be achieved through other means in Spark (case when, etc.), so the point here is that I want to understand how pandas_udf works when dealing with such logic.
Thanks
The UDF should take a pandas Series and return a pandas Series, not take and return plain strings.
import pandas as pd
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
@F.pandas_udf("string", F.PandasUDFType.SCALAR)
def seq_sum1(col1, col2):
    return pd.Series(
        np.where(
            col1 + col2 <= 6, "low",
            np.where(
                (col1 + col2 > 6) & (col1 + col2 <= 10), "medium",
                "high"
            )
        )
    )

df.select("*", seq_sum1('c1', 'c2').alias('new_col')).show()
+---+---+-------+
| c1| c2|new_col|
+---+---+-------+
| 1| 2| low|
| 3| 4| medium|
| 5| 6| high|
+---+---+-------+
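A note on why this works where the per-row if/else did not: np.where evaluates the condition over the whole pandas Series in one vectorized call, so no Python if statement ever sees a Series, which is exactly what raised the "truth value of a Series is ambiguous" error in the question.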
@mck provided the insight, and I ended up using the map function to solve it.
#pandas_udf("string", PandasUDFType.SCALAR)
def seq_sum(col1):
# actual function/calculation goes here
def main(x):
if x < 6:
v = "low"
else:
v = "high"
return(v)
# now apply map function, returning a panda series
result = pd.Series(map(main,col1))
return (result)
df.select("*",seq_sum('column_name').alias('new_col')).show(10)
I have the data below and need to create a line chart with x = Date and y = Count.
I created the dataframe below from another dataframe with this code:
df7=df7.select("*",
concat(col("Month"),lit("/"),col("Year")).alias("Date"))
df7.show()
I've imported matplotlib.pyplot as plt and am still getting errors.
I tried the plotting code below in several variations:
df.plot(x = 'Date', y = 'Count')
df.plot(kind = 'line')
I keep getting this error though:
AttributeError: 'DataFrame' object has no attribute 'plt'/'plot'
Please note that using df_pd = df.toPandas() is sometimes expensive, and if you deal with a high number of records (on the scale of millions), you might face an OOM error in Databricks, or your session could crash due to a lack of RAM on the driver. Long story short, by using toPandas() you are no longer using Spark-based, distributed computation resources at all! So, alternatively, you can follow the approach below:
So let's start with a simple example:
import time
import datetime as dt
from pyspark.sql import functions as F
from pyspark.sql.functions import *
from pyspark.sql.functions import dayofmonth, dayofweek
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, TimestampType, DateType
dict2 = [("2021-08-11 04:05:06", 10),
("2021-08-12 04:15:06", 17),
("2021-08-13 09:15:26", 25),
("2021-08-14 11:04:06", 68),
("2021-08-15 14:55:16", 50),
("2021-08-16 04:12:11", 2),
]
schema = StructType([
StructField("timestamp", StringType(), True), \
StructField("count", IntegerType(), True), \
])
#create a Spark dataframe
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(data=dict2,schema=schema)
sdf.printSchema()
sdf.show(truncate=False)
#Generate date and timestamp
new_df = sdf.withColumn('timestamp', F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss").cast(TimestampType())) \
.withColumn('date', F.to_date("timestamp", "yyyy-MM-dd").cast(DateType())) \
.select('timestamp', 'date', 'count')
new_df.show(truncate = False)
#root
# |-- timestamp: string (nullable = true)
# |-- count: integer (nullable = true)
#+-------------------+-----+
#|timestamp |count|
#+-------------------+-----+
#|2021-08-11 04:05:06|10 |
#|2021-08-12 04:15:06|17 |
#|2021-08-13 09:15:26|25 |
#|2021-08-14 11:04:06|68 |
#|2021-08-15 14:55:16|50 |
#|2021-08-16 04:12:11|2 |
#+-------------------+-----+
#+-------------------+----------+-----+
#|timestamp |date |count|
#+-------------------+----------+-----+
#|2021-08-11 04:05:06|2021-08-11|10 |
#|2021-08-12 04:15:06|2021-08-12|17 |
#|2021-08-13 09:15:26|2021-08-13|25 |
#|2021-08-14 11:04:06|2021-08-14|68 |
#|2021-08-15 14:55:16|2021-08-15|50 |
#|2021-08-16 04:12:11|2021-08-16|2 |
#+-------------------+----------+-----+
Now, in the absence of Pandas, you need to collect() the values of the columns you want to plot; of course, this is expensive and takes a long time with a large number of records, but it works. You can apply one of the following ways:
#for a big/high # of records
xlabels = new_df.select("timestamp").rdd.flatMap(list).collect()
ylabels = new_df.select("count").rdd.flatMap(list).collect()
#for limited # of records
xlabels = [val.timestamp for val in new_df.select('timestamp').collect()]
ylabels = [val["count"] for val in new_df.select('count').collect()]  # index by name: Row.count is a built-in tuple method, so val.count would not return the value
To plot:
import matplotlib.pyplot as plt
import matplotlib.dates as md
fig, ax = plt.subplots(figsize=(10,6))
plt.plot(xlabels, ylabels, color='blue', label="event's count") #, marker="o"
plt.scatter(xlabels, ylabels, color='cyan', marker='d', s=70)
plt.xticks(rotation=45)
plt.ylabel('Event counts \n# of records', fontsize=15)
plt.xlabel('timestamp', fontsize=15)
plt.title('Events over time', fontsize=15, color='darkred', weight='bold')
plt.legend(['# of records'], loc='upper right')
plt.show()
Based on the comments, I assume that with lots of records the timestamps printed under the x-axis become unreadable because the labels overlap.
To resolve this, use the following approach to arrange the x-axis ticks properly, so that they do not end up plotted on top of each other or squeezed side by side:
import pandas as pd
import matplotlib.pyplot as plt
x=xlabels
y=ylabels
#Note 1: if you use Pandas dataFrame after .toPandas()
#x=df['timestamp']
#y=df['count']
##Note 2: if you use Pandas dataFrame after .toPandas()
# convert the datetime column to a datetime type and assign it back to the column
#df.timestamp = pd.to_datetime(df.timestamp)
#verify timestamp column type is datetime64[ns] using print(df.info())
fig, ax = plt.subplots( figsize=(12,8))
plt.plot(x, y)
ax.legend(['# of records'])
ax.set_xlabel('Timestamp')
ax.set_ylabel('Event counts \n# of records')
# beautify the x-labels
import matplotlib.dates as md
plt.gcf().autofmt_xdate()
myFmt = md.DateFormatter('%Y-%m-%d %H:%M:%S.%f')
plt.gca().xaxis.set_major_formatter(myFmt)
plt.show()
plt.close()
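If collecting every raw record is still too heavy, one option (my own sketch, not part of the original answer, assuming the new_df built above) is to aggregate in Spark first and only collect the much smaller per-day result:

# Sketch: aggregate in Spark, then collect only one point per day
daily = (new_df
         .groupBy("date")
         .agg(F.sum("count").alias("count"))
         .orderBy("date"))

rows = daily.collect()
xlabels = [r["date"] for r in rows]
ylabels = [r["count"] for r in rows]   # index by name to avoid the Row.count method clash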
So I have a large dataset (about 1 TB+) on which I have to do many operations, and I have thought of using PySpark for faster processing. Here are my imports:
import numpy as np
import pandas as pd
try:
    import pyspark
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession, SQLContext
except ImportError as e:
    raise ImportError('PySpark is not Configured')
print(f"PySpark Version : {pyspark.__version__}")
# Creating a Spark-Context
sc = SparkContext.getOrCreate(SparkConf().setMaster('local[*]'))
# Spark Builder
spark = SparkSession.builder \
.appName('MBLSRProcessor') \
.config('spark.executor.memory', '10GB') \
.getOrCreate()
# SQL Context - for SQL Query Executions
sqlContext = SQLContext(sc)
>> PySpark Version : 2.4.7
Now, I want to apply log10 function on two columns - For demonstrations, please consider this data:
data = spark.createDataFrame(pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [4, 3, 6, 1, 8]
}))
data.head(5)
>> [Row(A=1, B=4), Row(A=2, B=3), Row(A=3, B=6), Row(A=4, B=1), Row(A=5, B=8)]
This is what I require: log10(A + B), e.g. log10(6 + 4) = 1, for which I have made a function like this:
def add(a, b):
    # this is for demonstration
    return np.sum([a, b])
data = data.withColumn("ADD", add(data.A, data.B))
data.head(5)
>> [Row(A=1, B=4, ADD=5), Row(A=2, B=3, ADD=5), Row(A=3, B=6, ADD=9), Row(A=4, B=1, ADD=5), Row(A=5, B=8, ADD=13)]
But, I can't do the same for np.log10:
def np_log(a, b):
    # actual function
    return np.log10(np.sum([a, b]))
data = data.withColumn("LOG", np_log(data.A, data.B))
data.head(5)
TypeError Traceback (most recent call last)
<ipython-input-13-a5726b6c7dc2> in <module>
----> 1 data = data.withColumn("LOG", np_log(data.A, data.B))
2 data.head(5)
<ipython-input-12-0e020707faae> in np_log(a, b)
1 def np_log(a, b):
----> 2 return np.log10(np.sum([a, b]))
TypeError: loop of ufunc does not support argument 0 of type Column which has no callable log10 method
The best way to do this is to use native Spark functions:
import pyspark.sql.functions as F
import pandas as pd
data = spark.createDataFrame(pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [4, 3, 6, 1, 8]
}))
data = data.withColumn("LOG", F.log10(F.col('A') + F.col('B')))
But if you want, you can also use a UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
import numpy as np
import pandas as pd
data = spark.createDataFrame(pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [4, 3, 6, 1, 8]
}))

def udf_np_log(a, b):
    # actual function
    return float(np.log10(np.sum([a, b])))
np_log = F.udf(udf_np_log, FloatType())
data = data.withColumn("LOG", np_log(data.A, data.B))
+---+---+---------+
| A| B| LOG|
+---+---+---------+
| 1| 4| 0.69897|
| 2| 3| 0.69897|
| 3| 6|0.9542425|
| 4| 1| 0.69897|
| 5| 8|1.1139433|
+---+---+---------+
Interestingly, it works for np.sum without a UDF, because I guess np.sum just calls the + operator, which is valid for Spark dataframe columns.
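That guess is easy to check (a quick snippet of my own, not from the original answer): np.sum over a list of Column objects folds them with +, so the result is still a Spark Column expression rather than a computed number.

# np.sum on a list of Columns reduces them with +, yielding a Column expression
expr = np.sum([data.A, data.B])
print(type(expr))   # <class 'pyspark.sql.column.Column'>
data.withColumn("ADD", expr).show()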
How can a function be applied to a pandas groupby when it requires parameters from multiple columns of the grouped dataframe and returns two scalar values?
Below is a reproducible example. The last line gets the f_value.
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
import plotly.express as px
n=100
df = pd.DataFrame({
    'c': np.random.choice(['CATS', 'DOGS'], n),
    'x': np.random.choice(list('ABCDE'), n),
    'y': np.random.normal(5, 1, n)
})
signal = np.where(df['c'].eq('CATS') & df['x'].eq('A'), 1.1, 0)
df['y'] = df['y'] + signal
def get_ols_fp(df, x, y):
    formula = y + '~' + x
    model = ols(formula, df).fit()
    f_value = model.fvalue
    p_value = model.f_pvalue
    return (f_value, p_value)
# getting f_value and p_value works with a single series.
get_ols_fp(df[df['c'].eq('CATS')], 'x', 'y')
The code above works and fetches the f_value and the p_value. However, the following does not work.
# how could we run the get_ols with a groupby().agg()
df.groupby('c').agg(get_ols_fp('x', 'y'))
The desired output would be a dataframe with one row per level of the 'c' variable ('CATS' and 'DOGS' in this case), one column for the p_value, and another for the f_value.
This works, using apply instead of agg so that the function receives each group's whole sub-DataFrame:
def get_ols_fp(df, x=None, y=None):
    formula = y + '~' + x
    model = ols(formula, df).fit()
    f_value = model.fvalue
    p_value = model.f_pvalue
    return pd.Series([f_value, p_value], index=['f_value', 'p_value'])

df.groupby('c').apply(get_ols_fp, x='x', y='y')
I'd do it a little differently.
I don't know if it's the easiest way, but it works.
Example:
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
n=100
df = pd.DataFrame({
    'c': np.random.choice(['CATS', 'DOGS'], n),
    'x': np.random.choice(list('ABCDE'), n),
    'y': np.random.normal(5, 1, n)
})
signal = np.where(df['c'].eq('CATS') & df['x'].eq('A'), 1.1, 0)
df['y'] = df['y'] + signal
def get_ols_fp(df, x, y):
    formula = y + '~' + x
    model = ols(formula, df).fit()
    f_value = model.fvalue
    p_value = model.f_pvalue
    return (f_value, p_value)
# getting f_value and p_value works with a single series.
# get_ols_fp(df[df['c'].eq('CATS')], 'x', 'y')
df_result = pd.DataFrame([], columns = ["c", "f_value", "p_value"])
for c, dd in df.groupby(['c']):
    v = get_ols_fp(dd, 'x', 'y')
    df_result.loc[len(df_result)] = [c, *v]
df_result
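As a small side note (my own variation, not part of the answer above): appending plain dicts to a list and building the DataFrame once at the end avoids growing the frame row by row with .loc.

# Sketch: collect one record per group, then build the result in one go
rows = []
for c, dd in df.groupby('c'):
    f_value, p_value = get_ols_fp(dd, 'x', 'y')
    rows.append({"c": c, "f_value": f_value, "p_value": p_value})
df_result = pd.DataFrame(rows)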
I have a PySpark dataframe df with columns 'latitude' and 'longitude':
+---------+---------+
| latitude|longitude|
+---------+---------+
|51.822872| 4.905615|
|51.819645| 4.961687|
| 51.81964| 4.961713|
| 51.82256| 4.911187|
|51.819263| 4.904488|
+---------+---------+
I want to get the UTM coordinates ('x' and 'y') from the dataframe columns. To do this, I need to feed the values of 'longitude' and 'latitude' to the following function from pyproj. The resulting 'x' and 'y' should then be appended to the original dataframe df. This is how I did it in Pandas:
from pyproj import Proj
pp = Proj(proj='utm',zone=31,ellps='WGS84', preserve_units=False)
xx, yy = pp(df["longitude"].values, df["latitude"].values)
df["X"] = xx
df["Y"] = yy
How would I do this in Pyspark?
Use pandas_udf: feed the function an array and return an array as well. See below:
from pyspark.sql.functions import array, pandas_udf, PandasUDFType
from pyproj import Proj
from pandas import Series
@pandas_udf('array<double>', PandasUDFType.SCALAR)
def get_utm(x):
    pp = Proj(proj='utm', zone=31, ellps='WGS84', preserve_units=False)
    return Series([pp(e[0], e[1]) for e in x])
df.withColumn('utm', get_utm(array('longitude','latitude'))) \
.selectExpr("*", "utm[0] as X", "utm[1] as Y") \
.show()
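A possible variation (my own sketch, assuming pyproj is installed on the workers): a scalar pandas_udf can also take the two columns as separate pandas Series, so Proj vectorizes over the whole batch instead of being called once per row.

import pandas as pd

@pandas_udf('array<double>', PandasUDFType.SCALAR)
def get_utm_v2(lon, lat):
    # Proj accepts arrays, so project the whole batch at once
    pp = Proj(proj='utm', zone=31, ellps='WGS84', preserve_units=False)
    xx, yy = pp(lon.values, lat.values)
    return pd.Series([[x, y] for x, y in zip(xx, yy)])

df.withColumn('utm', get_utm_v2('longitude', 'latitude')) \
  .selectExpr("*", "utm[0] as X", "utm[1] as Y") \
  .show()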
I'm new to Python and working with APIs.
My code is below:
import pandas as pd
import json
from pandas.io.json import json_normalize
import datetime
threedaysago = datetime.date.fromordinal(datetime.date.today().toordinal()-3).strftime("%F")
import http.client
conn = http.client.HTTPSConnection("api.sendgrid.com")
payload = "{}"
keys = {
    # "CF" : "SG.UdhzjmjYR**.-",
}
df = []  # collect the response rows here (converted to a DataFrame later)
for name, value in keys.items():
    headers = {'authorization': "Bearer " + value}
    conn.request("GET", "/v3/categories/stats/sums?aggregated_by=&start_date={d}&end_date={d}".format(d=threedaysago), payload, headers)
    res = conn.getresponse()
    data = res.read()
    print(data.decode("utf-8"))
    d = json.loads(data.decode("utf-8"))
    c = d['stats']
    # row = d['stats'][0]['name']
    # Add Brand to data row here with 'name'
    df.append(c)  # Load data row into df
df = pd.DataFrame(df[0])
df_new = df[['name']]
df_new.rename(columns={'name':'Category'}, inplace=True)
df_metric =pd.DataFrame(list(df['metrics'].values))
sendgrid = pd.concat([df_new, df_metric], axis=1, sort=False)
sendgrid.set_index('Category', inplace = True)
sendgrid.insert(0, 'Date', threedaysago)
sendgrid.insert(1,'BrandId',99)
sendgrid.rename(columns={
        'blocks': 'Blocks',
        'bounce_drops': 'BounceDrops',
        'bounces': 'Bounces',
        'clicks': 'Clicks',
        'deferred': 'Deferred',
        'delivered': 'Delivered',
        'invalid_emails': 'InvalidEmails',
        'opens': 'Opens',
        'processed': 'Processed',
        'requests': 'Requests',
        'spam_report_drops': 'SpamReportDrops',
        'spam_reports': 'SpamReports',
        'unique_clicks': 'UniqueClicks',
        'unique_opens': 'UniqueOpens',
        'unsubscribe_drops': 'UnsubscribeDrops',
        'unsubscribes': 'Unsubscribes'
    },
    inplace=True)
When I run this however, I receive an error:
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
I know the reason this happens is because there are no stats available for three days ago:
{"date":"2020-02-16","stats":[]}
But how do I handle these exceptions in my code? This is going to run as a daily report, and it will break if this error is not handled.
Sorry for the late answer.
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]" means there is no column called name in your dataframe.
But you believe that the error occurred because of "stats": []. That is also not quite true: if any of the lists were empty, the error would instead be ValueError: arrays must all be same length.
I have recreated the problem below to give you an idea of how to overcome it.
Recreating KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c']}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
KeyError: "None of [Index(['D'], dtype='object')] are in the [columns]"
Solution: you can see there is no column called 'D' in the data frame, so recheck your columns.
Now add 'D' and see what happens:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D': []}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
ValueError: arrays must all be same length
Solution: column 'D' needs to contain the same number of items as 'A', 'B', and 'C'.
To overcome both problems:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D':[]}]
df = pd.DataFrame.from_dict(df[0], orient='index')
df.transpose()  # note: transpose() returns a new frame and is not assigned, so df keeps the row-oriented shape
print(df)
Output:
0 1 2
A 1 4 5
B 4 5 6
C a b c
D None None None
You can see the columns are now represented as rows. You can use loc to select each row by its label.
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D':[]}]
df = pd.DataFrame.from_dict(df[0], orient='index')
df.transpose()  # again not assigned, so df keeps the row-oriented shape
df = df.loc[['A']]  # uses loc to select the 'A' row
print(df)
Output:
0 1 2
A 1 4 5
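Coming back to the script in the question: the practical guard for the daily report is to check whether the stats list is empty before building the DataFrame, and skip (or just log) that day. A minimal sketch, reusing the d, df and threedaysago names from the question:

# Sketch: skip the run when the API returns no stats for the day
c = d.get('stats', [])
if not c:
    print("No stats returned for {}; skipping.".format(threedaysago))
else:
    df.append(c)
    # ... continue with pd.DataFrame(df[0]), the renaming, etc. as before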