Unable to apply log function to a pyspark dataframe - numpy

So, I have a large dataset (about 1 TB+) on which I have to do many operations, for which I have thought of using PySpark for faster processing. Here are my imports:
import numpy as np
import pandas as pd

try:
    import pyspark
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession, SQLContext
except ImportError as e:
    raise ImportError('PySpark is not Configured')

print(f"PySpark Version : {pyspark.__version__}")

# Creating a Spark Context
sc = SparkContext.getOrCreate(SparkConf().setMaster('local[*]'))

# Spark Builder
spark = SparkSession.builder \
    .appName('MBLSRProcessor') \
    .config('spark.executor.memory', '10GB') \
    .getOrCreate()

# SQL Context - for SQL Query Executions
sqlContext = SQLContext(sc)
>> PySpark Version : 2.4.7
Now, I want to apply the log10 function to two columns. For demonstration, please consider this data:
data = spark.createDataFrame(pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [4, 3, 6, 1, 8]
}))
data.head(5)
>> [Row(A=1, B=4), Row(A=2, B=3), Row(A=3, B=6), Row(A=4, B=1), Row(A=5, B=8)]
This is what I require: log10(A + B), e.g. log10(6 + 4) = 1, for which I have made a function like this:
def add(a, b):
    # this is for demonstration
    return np.sum([a, b])

data = data.withColumn("ADD", add(data.A, data.B))
data.head(5)
>> [Row(A=1, B=4, ADD=5), Row(A=2, B=3, ADD=5), Row(A=3, B=6, ADD=9), Row(A=4, B=1, ADD=5), Row(A=5, B=8, ADD=13)]
But, I can't do the same for np.log10:
def np_log(a, b):
    # actual function
    return np.log10(np.sum([a, b]))

data = data.withColumn("LOG", np_log(data.A, data.B))
data.head(5)
TypeError Traceback (most recent call last)
<ipython-input-13-a5726b6c7dc2> in <module>
----> 1 data = data.withColumn("LOG", np_log(data.A, data.B))
2 data.head(5)
<ipython-input-12-0e020707faae> in np_log(a, b)
1 def np_log(a, b):
----> 2 return np.log10(np.sum([a, b]))
TypeError: loop of ufunc does not support argument 0 of type Column which has no callable log10 method

The best way to do this is to use native Spark functions:
import pyspark.sql.functions as F
import pandas as pd

data = spark.createDataFrame(pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [4, 3, 6, 1, 8]
}))
data = data.withColumn("LOG", F.log10(F.col('A') + F.col('B')))
But if you want, you can also use a UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
import numpy as np
import pandas as pd

data = spark.createDataFrame(pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [4, 3, 6, 1, 8]
}))

def udf_np_log(a, b):
    # actual function
    return float(np.log10(np.sum([a, b])))

np_log = F.udf(udf_np_log, FloatType())
data = data.withColumn("LOG", np_log(data.A, data.B))
data.show()
+---+---+---------+
| A| B| LOG|
+---+---+---------+
| 1| 4| 0.69897|
| 2| 3| 0.69897|
| 3| 6|0.9542425|
| 4| 1| 0.69897|
| 5| 8|1.1139433|
+---+---+---------+
Interestingly, it works for np.sum without a UDF because, I guess, np.sum just falls back to the + operator, which is valid for Spark DataFrame columns.
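You can check this directly: the list of Columns becomes an object-dtype array, so NumPy's reduction falls back to Python's + and simply hands back a Column expression. A minimal check, reusing the data frame from above (the printed repr may differ slightly between Spark versions):

import numpy as np

expr = np.sum([data.A, data.B])   # object-dtype reduction: Column + Column
print(type(expr))                 # <class 'pyspark.sql.column.Column'>
print(expr)                       # repr looks something like Column<(A + B)>

np.log10 has no such fallback: on an object array it looks for a .log10() method on each element, which Column does not provide, hence the TypeError above. That is why the native F.log10 (or a UDF) is needed for the logarithm.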

Related

Optimized way to find values with top 20 frequencies in spark dataframe

We have a spark dataframe. We are trying to find the values with top 20 frequencies in a column.
Ex) [1, 1, 1, 2, 2, 4]
In the above list,
1 occurs 3 times
2 occurs 2 times
4 occurs 1 time
We are trying to find this using pandas, and then creating a UDF in Spark and applying it there.
This works for smaller datasets, but when the datasets are too tall (20M rows) we sometimes run into memory issues.
from pyspark.sql import functions as F
import pandas as pd
from pyspark.sql.types import *
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unit_testing_spark").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

def find_freq_values(col_list):
    if len(col_list) == 0:
        return []
    df = pd.DataFrame(col_list, columns=["value"])
    df = df[['value']].groupby(['value'])['value'] \
        .count() \
        .reset_index(name='count') \
        .sort_values(['count'], ascending=False) \
        .head(20)
    res = df.to_dict(orient='records')
    for curr_data in res:
        curr_value = curr_data["value"]
        ldt = str(type(curr_value)).lower()
        if "time" in ldt or "date" in ldt:
            curr_data["value"] = str(curr_value)
    return res

s = find_freq_values([1, 1, 1, 2, 2, 4])
print(s)
# Output: [{'value': 1, 'count': 3}, {'value': 2, 'count': 2}, {'value': 4, 'count': 1}]

column_data = ["col_1"]
column_header = tuple(column_data)
data = [[1], [1], [1], [2], [2], [4]]
df = spark.createDataFrame(data, column_header)

find_freq_udf = F.udf(find_freq_values, ArrayType(MapType(StringType(), StringType(), True)))
freq_res_df = df.select(*[find_freq_udf(F.collect_list(c)).alias(c) for c in df.columns])
freq_res = freq_res_df.collect()[0].asDict()
print(freq_res)
# Output: {'col_1': [{'count': '3', 'value': '1'}, {'count': '2', 'value': '2'}, {'count': '1', 'value': '4'}]}
Error message:
"An error occurred while calling o514.collectToPython.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 66.0 failed 10 times, most recent failure: Lost task 0.9 in stage 66.0 (TID 379) (w382f6d7a114442a8bd741d53661a2c3b-srpz3ldim3sr2-w-1.c.projectid.internal executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Container from a bad node: container_1658234225716_0001_01_000011 on host: w382f6d7a114442a8bd741d53661a2c3b-srpz3ldim3sr2-w-1.c.projectid.internal. Exit status: 143.
To process multiple columns in parallel we use the df.select call shown in the code above (one find_freq_udf(F.collect_list(c)) per column).
How can this be optimized to avoid the memory issue?
One way would be to use only Spark for computing the frequencies. Here is one way to do it:
from pyspark.sql.functions import count, col

x = [1, 1, 1, 2, 2, 4]  # input data
df = sc.parallelize([(v,) for v in x]).toDF(['ID'])  # wrap values in tuples so the schema can be inferred
df = df.groupBy('ID')
df = df.agg(count('ID').alias('id_count'))
df = df.orderBy(col('id_count').desc())  # descending, so the most frequent values come first
df = df.limit(20)
df.show()
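If you need the top-20 values for several columns, one option along the same lines (a sketch, not part of the original answer; top_frequencies is a hypothetical helper name) is to run the same groupBy/count/limit per column and collect only the 20 aggregated rows each time:

from pyspark.sql import functions as F

# same toy data as in the question
df = spark.createDataFrame([[1], [1], [1], [2], [2], [4]], ("col_1",))

def top_frequencies(df, n=20):
    # For each column, aggregate inside Spark and bring back only n rows.
    result = {}
    for c in df.columns:
        rows = (df.groupBy(c)
                  .count()
                  .orderBy(F.col("count").desc())
                  .limit(n)
                  .collect())
        result[c] = [{"value": r[c], "count": r["count"]} for r in rows]
    return result

print(top_frequencies(df))
# e.g. {'col_1': [{'value': 1, 'count': 3}, {'value': 2, 'count': 2}, {'value': 4, 'count': 1}]}

The per-column jobs run one after another, but each aggregation stays distributed, so the driver never has to hold a 20M-row column in memory the way F.collect_list does.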

Numpy equivalent of pandas replace (dictionary mapping)

I know working on a numpy array can be quicker than on a pandas DataFrame.
I am wondering if there is an equivalent (and quicker) way to do pandas.replace on a numpy array.
In the example below, I have created a dataframe and a dictionary. The dictionary contains the names of the columns and their corresponding mappings. I wonder if there is any function which would allow me to feed a dictionary to a numpy array to do the mapping and get a quicker processing time?
import pandas as pd
import numpy as np
# Dataframe
d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data=d)
# dictionary I want to map
d_mapping = {'col1' : {1:2 , 2:1} , 'col2' : {4:1}}
# result using pandas replace
print(df.replace(d_mapping))
# Instead of a pandas dataframe, I want to perform the same operation on a numpy array
df_np = df.to_records(index=False)
You can try np.select(). I believe whether it pays off depends on the number of unique elements to replace.
def replace_values(df, d_mapping):
    def replace_col(col):
        # extract numpy array and column name from pd.Series
        col, name = col.values, col.name
        # generate condlist and choicelist:
        # for every key in the mapping create a boolean mask
        condlist = [col == x for x in d_mapping[name].keys()]
        choicelist = d_mapping[name].values()
        # np.select keeps the existing value (the default) wherever no condition matches
        return np.select(condlist, choicelist, col)
    return df.apply(replace_col)
usage:
replace_values(df, d_mapping)
I also believe that you can speed up the code above if you use lists/arrays in the mapping instead of dicts, replacing the keys() and values() calls (dict lookups are also expensive) with index lookups:

d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
...
m = d_mapping[name]
condlist = [col == x for x in m[0]]
choicelist = m[1]
...
np.isin(col, m[0]),
Upd:
Here is the benchmark
import pandas as pd
import numpy as np

# Dataframe
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

# dictionary I want to map
d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
d_mapping_2 = {
    col: dict(zip(*replacement)) for col, replacement in d_mapping.items()
}

def replace_values(df, mapping):
    def replace_col(col):
        col, (m0, m1) = col.values, mapping[col.name]
        return np.select([col == x for x in m0], m1, col)
    return df.apply(replace_col)

from timeit import timeit

print("np.select:  ", timeit(lambda: replace_values(df, d_mapping), number=5000))
print("df.replace: ", timeit(lambda: df.replace(d_mapping_2), number=5000))
On my 6-year-old laptop it prints:
np.select: 3.6562702230003197
df.replace: 4.714512745998945
np.select is ~20% faster

Spark exception error using pandas_udf with logical statement

I am trying to deploy a simple if-else function specifically using pandas_udf.
Here is the code:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pandas as pd

@pandas_udf("string", PandasUDFType.SCALAR)
def seq_sum1(col1, col2):
    if col1 + col2 <= 6:
        v = "low"
    elif (col1 + col2 > 6) & (col1 + col2 <= 10):
        v = "medium"
    else:
        v = "High"
    return v

# Deploy
df.select("*", seq_sum1('c1', 'c2').alias('new_col')).show(10)
This results in an error:
PythonException: An exception was thrown from a UDF: 'ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', from <command-1220380192863042>, line 13. Full traceback below:
If I deploy the same code but using @udf instead of @pandas_udf, it produces the results as expected.
However, pandas_udf doesn't seem to work.
I know that this kind of functionality can be achieved through other means in Spark (case when, etc.), so the point here is that I want to understand how pandas_udf works when dealing with such logic.
Thanks
The UDF should take pandas Series and return a pandas Series, not take and return individual strings.
import pandas as pd
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T

@F.pandas_udf("string", F.PandasUDFType.SCALAR)
def seq_sum1(col1, col2):
    return pd.Series(
        np.where(
            col1 + col2 <= 6, "low",
            np.where(
                (col1 + col2 > 6) & (col1 + col2 <= 10), "medium",
                "high"
            )
        )
    )

df.select("*", seq_sum1('c1', 'c2').alias('new_col')).show()
+---+---+-------+
| c1| c2|new_col|
+---+---+-------+
| 1| 2| low|
| 3| 4| medium|
| 5| 6| high|
+---+---+-------+
@mck provided the insight, and I ended up using the map function to solve it.
@pandas_udf("string", PandasUDFType.SCALAR)
def seq_sum(col1):
    # actual function/calculation goes here
    def main(x):
        if x < 6:
            v = "low"
        else:
            v = "high"
        return v
    # now apply the map function, returning a pandas Series
    result = pd.Series(map(main, col1))
    return result

df.select("*", seq_sum('column_name').alias('new_col')).show(10)
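For reference, the ambiguity the error message complains about can be reproduced outside Spark: with a scalar pandas_udf the arguments arrive as whole pandas Series, so the comparison in the original if produces a boolean Series rather than a single True/False. A minimal standalone illustration (not from the original thread):

import pandas as pd

col1 = pd.Series([1, 3, 5])
col2 = pd.Series([2, 4, 6])

mask = (col1 + col2) <= 6   # element-wise, exactly what the pandas_udf sees
print(mask.tolist())        # [True, False, False] - a Series, not one boolean

try:
    if mask:                # what the original seq_sum1 effectively did
        pass
except ValueError as e:
    print(e)                # "The truth value of a Series is ambiguous. ..."

That is why the working versions above either build the result vectorially (np.where) or map over the Series element by element.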

How to write to a csv within a pandas UDF in pyspark?

This is the code that I have tried for writing the CSV file.
The Spark data frame is written into a CSV file using pandas:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd  # needed for pd.DataFrame inside the UDF
import os
import csv

df3 = spark.createDataFrame(
    [("a", 1, 0), ("a", -1, 42), ("b", 3, -1), ("b", 10, -2)],
    ("key", "value1", "value2")
)

schema = StructType([
    StructField("key", StringType()),
    StructField("avg_value1", DoubleType()),
    StructField("avg_value2", DoubleType()),
    StructField("sum_avg", DoubleType()),
    StructField("sub_avg", DoubleType()),
    StructField("result", StringType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def g(df):
    gr = df['key'].iloc[0]
    x = df.value1.mean()
    y = df.value2.mean()
    w = df.value1.mean() + df.value2.mean()
    z = df.value1.mean() - df.value2.mean()
    fileName = '/mnt/test' + gr + '.csv'
    df.to_csv(fileName, sep='\t')
    a = "Saved"
    return pd.DataFrame([[gr] + [x] + [y] + [w] + [z] + [a]])

df3.groupby("key").apply(g).show()
Output:
+---+----------+----------+-------+-------+------+
|key|avg_value1|avg_value2|sum_avg|sub_avg|result|
+---+----------+----------+-------+-------+------+
| a| 0.0| 21.0| 21.0| -21.0| Saved|
| b| 6.5| -1.5| 5.0| 8.0| Saved|
+---+----------+----------+-------+-------+------+
But the CSV files are not getting created.
Any suggestions would be appreciated.
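One likely explanation (an educated guess about the environment, not something stated in the question): the grouped-map UDF runs on the executors, so df.to_csv writes to each executor's local filesystem, not to a location the driver or your notebook can see. If /mnt/test is a Databricks-style DBFS mount (an assumption), writing through the local FUSE path /dbfs/mnt/... would land the files on shared storage; otherwise use any path that the executors see as shared storage. A sketch of that single change, reusing schema and the imports from the snippet above:

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def g(df):
    gr = df['key'].iloc[0]
    x = df.value1.mean()
    y = df.value2.mean()
    w = x + y
    z = x - y
    # Assumption: the cluster exposes DBFS at /dbfs, so this path is shared
    # storage that survives after the executor task finishes. Adjust to
    # whatever shared path your executors can actually reach.
    fileName = '/dbfs/mnt/test' + gr + '.csv'
    df.to_csv(fileName, sep='\t')
    return pd.DataFrame([[gr, x, y, w, z, "Saved"]])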

How do I block a KeyError in Python from reoccurring, or create an exception to handle it?

I'm new to Python and to working with APIs.
My code is below:
import pandas as pd
import json
from pandas.io.json import json_normalize
import datetime

threedaysago = datetime.date.fromordinal(datetime.date.today().toordinal() - 3).strftime("%F")

import http.client

conn = http.client.HTTPSConnection("api.sendgrid.com")
payload = "{}"
keys = {
    # "CF" : "SG.UdhzjmjYR**.-",
}

df = []  # Create new Dataframe
for name, value in keys.items():
    headers = {'authorization': "Bearer " + value}
    conn.request("GET", "/v3/categories/stats/sums?aggregated_by=&start_date={d}&end_date={d}".format(d=threedaysago), payload, headers)
    res = conn.getresponse()
    data = res.read()
    print(data.decode("utf-8"))
    d = json.loads(data.decode("utf-8"))
    c = d['stats']
    # row = d['stats'][0]['name']
    # Add Brand to data row here with 'name'
    df.append(c)  # Load data row into df

#1
df = pd.DataFrame(df[0])
df_new = df[['name']]
df_new.rename(columns={'name': 'Category'}, inplace=True)
df_metric = pd.DataFrame(list(df['metrics'].values))
sendgrid = pd.concat([df_new, df_metric], axis=1, sort=False)
sendgrid.set_index('Category', inplace=True)
sendgrid.insert(0, 'Date', threedaysago)
sendgrid.insert(1, 'BrandId', 99)
sendgrid.rename(columns={
    'blocks': 'Blocks',
    'bounce_drops': 'BounceDrops',
    'bounces': 'Bounces',
    'clicks': 'Clicks',
    'deferred': 'Deferred',
    'delivered': 'Delivered',
    'invalid_emails': 'InvalidEmails',
    'opens': 'Opens',
    'processed': 'Processed',
    'requests': 'Requests',
    'spam_report_drops': 'SpamReportDrops',
    'spam_reports': 'SpamReports',
    'unique_clicks': 'UniqueClicks',
    'unique_opens': 'UniqueOpens',
    'unsubscribe_drops': 'UnsubscribeDrops',
    'unsubscribes': 'Unsubscribes'
}, inplace=True)
When I run this, however, I receive an error:
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
I know this happens because there are no stats available for three days ago:
{"date":"2020-02-16","stats":[]}
But how do I handle these exceptions in my code? This is going to run as a daily report, and it will break if this error is not handled.
Sorry for the late answer.
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]" means there is no column called name in your dataframe.
But you believe that the error occurred because of "stats": []. That is also not true: if any of the lists were empty, the error raised would be ValueError: arrays must all be same length.
I have recreated this problem and will walk through it so you can get an idea of how to overcome it.
Recreating KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
import pandas as pd

df = [{'A': [1, 4, 5], 'B': [4, 5, 6], 'C': ['a', 'b', 'c']}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
KeyError: "None of [Index(['D'], dtype='object')] are in the [columns]"
Solution: You can see there is no column called 'D' in the data frame, so recheck your columns.
Now add 'D' and see what happens:
import pandas as pd

df = [{'A': [1, 4, 5], 'B': [4, 5, 6], 'C': ['a', 'b', 'c'], 'D': []}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
ValueError: arrays must all be same length
Solution: column 'D' needs to contain the same number of items as 'A', 'B', and 'C'.
Overcoming both problems:
import pandas as pd

df = [{'A': [1, 4, 5], 'B': [4, 5, 6], 'C': ['a', 'b', 'c'], 'D': []}]
df = pd.DataFrame.from_dict(df[0], orient='index')
df.transpose()  # note: the result is not assigned, so df keeps the orient='index' layout shown below
print(df)
Output:
0 1 2
A 1 4 5
B 4 5 6
C a b c
D None None None
You can see the columns are now represented as rows. You can use loc to select each row:
import pandas as pd

df = [{'A': [1, 4, 5], 'B': [4, 5, 6], 'C': ['a', 'b', 'c'], 'D': []}]
df = pd.DataFrame.from_dict(df[0], orient='index')
df.transpose()
df = df.loc[['A']]  # uses loc
print(df)
Output:
0 1 2
A 1 4 5
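Coming back to the original question of handling the empty-stats days: one way (a sketch reusing the variable names from the question, not from the answer above) is to test for the empty "stats" list before building the DataFrame, so the daily run skips quiet days instead of raising:

import json

def stats_or_none(raw_bytes):
    # Return the stats list, or None when the API has no data for that day,
    # e.g. {"date": "2020-02-16", "stats": []}
    d = json.loads(raw_bytes.decode("utf-8"))
    return d.get('stats') or None

# inside the loop over keys.items() from the question:
#     c = stats_or_none(data)
#     if c is None:
#         print(f"No stats for {name} on {threedaysago}; skipping")
#         continue
#     df.append(c)

Alternatively, wrap the df[['name']] selection in a try/except KeyError and exit with a clear message when nothing was collected for that day.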