Optimized way to find values with top 20 frequencies in spark dataframe - pandas

We have a Spark dataframe. We are trying to find the values with the top 20 frequencies in a column.
For example, given the list [1, 1, 1, 2, 2, 4]:
1 occurs 3 times
2 occurs 2 times
4 occurs 1 time
We are trying to find this using pandas, by creating a UDF in Spark and using it there.
This works for smaller datasets, but when the datasets are too tall (20M rows) we sometimes face memory issues.
from pyspark.sql import functions as F
import pandas as pd
from pyspark.sql.types import *
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unit_testing_spark").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

def find_freq_values(col_list):
    if len(col_list) == 0:
        return []
    df = pd.DataFrame(col_list, columns=["value"])
    df = df[['value']].groupby(['value'])['value'] \
        .count() \
        .reset_index(name='count') \
        .sort_values(['count'], ascending=False) \
        .head(20)
    res = df.to_dict(orient='records')
    for curr_data in res:
        curr_value = curr_data["value"]
        ldt = str(type(curr_value)).lower()
        if "time" in ldt or "date" in ldt:
            curr_data["value"] = str(curr_value)
    return res

s = find_freq_values([1, 1, 1, 2, 2, 4])
print(s)
# Output: [{'value': 1, 'count': 3}, {'value': 2, 'count': 2}, {'value': 4, 'count': 1}]

column_data = ["col_1"]
column_header = tuple(column_data)
data = [[1], [1], [1], [2], [2], [4]]
df = spark.createDataFrame(data, column_header)

find_freq_udf = F.udf(find_freq_values, ArrayType(MapType(StringType(), StringType(), True)))
freq_res_df = df.select(*[find_freq_udf(F.collect_list(c)).alias(c) for c in df.columns])
freq_res = freq_res_df.collect()[0].asDict()
print(freq_res)
# Output: {'col_1': [{'count': '3', 'value': '1'}, {'count': '2', 'value': '2'}, {'count': '1', 'value': '4'}]}
Error message:
"An error occurred while calling o514.collectToPython.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 66.0 failed 10 times, most recent failure: Lost task 0.9 in stage 66.0 (TID 379) (w382f6d7a114442a8bd741d53661a2c3b-srpz3ldim3sr2-w-1.c.projectid.internal executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Container from a bad node: container_1658234225716_0001_01_000011 on host: w382f6d7a114442a8bd741d53661a2c3b-srpz3ldim3sr2-w-1.c.projectid.internal. Exit status: 143.
To process multiple columns in parallel we use the df.select statement shown above.
How can this be optimized to avoid the memory issue?

One way would be to use only Spark for computing the frequencies. Here is one way to do it.
from pyspark.sql.functions import count, col

x = [1, 1, 1, 2, 2, 4]  # input data
df = spark.createDataFrame([(v,) for v in x], ['ID'])
df = df.groupBy('ID')
df = df.agg(count('ID').alias('id_count'))
df = df.orderBy(col('id_count').desc())  # descending, so the most frequent values come first
df = df.limit(20)
df.show()
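If you need the top 20 values for every column of a wide DataFrame, as in the question, one way (a sketch, not the only option) is to run one such aggregation per column and collect only the 20 resulting rows each time, so no full column is ever pulled into a single Python task. Here df is assumed to be the wide DataFrame from the question:
from pyspark.sql import functions as F

# One small aggregation per column; only the top-20 rows per column are collected to the driver.
top_values = {}
for c in df.columns:
    counts = (df.groupBy(c)
                .agg(F.count('*').alias('count'))
                .orderBy(F.col('count').desc())
                .limit(20))
    top_values[c] = [row.asDict() for row in counts.collect()]

print(top_values)
# e.g. {'col_1': [{'col_1': 1, 'count': 3}, {'col_1': 2, 'count': 2}, {'col_1': 4, 'count': 1}]}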

Related

How to count No. of rows with special characters in all columns of a PySpark DataFrame?

Assume that I have a PySpark DataFrame. Some of the cells contain only special characters.
Sample dataset:
import pandas as pd

data = {'ID': [1, 2, 3, 4, 5, 6],
        'col_1': ['A', '?', '<', ' ?', None, 'A?'],
        'col_2': ['B', ' ', '', '?>', 'B', '\B']
        }
pdf = pd.DataFrame(data)
df = spark.createDataFrame(pdf)
I want to count the number of rows which contain only special characters (except blank cells). Values like 'A?' and '\B' and blank cells are not counted.
The expected output will be:
{'ID': 0, 'col_1': 3, 'col_2': 1}
Is there any way to do that?
Taking your sample dataset, this should do:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

data = {'ID': [1, 2, 3, 4, 5, 6],
        'col_1': ['A', '?', '<', ' ?', None, 'A?'],
        'col_2': ['B', ' ', '', '?>', 'B', '\B']
        }
pdf = pd.DataFrame(data)
df = spark.createDataFrame(pdf)

res = {}
for col_name in df.columns:
    df = df.withColumn('matched', when((col(col_name).rlike(r'[^A-Za-z0-9\s]')) & ~(col(col_name).rlike('[A-Za-z0-9]')), True).otherwise(False))
    res[col_name] = df.select('ID').where(df.matched).count()
print(res)
The trick is to combine two regular-expression conditions: the cell must contain at least one character that is neither alphanumeric nor whitespace, and must not contain any alphanumeric character at all.
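If you would rather not launch one Spark job per column, a sketch of the same counts done in a single aggregation pass (assuming df is the Spark DataFrame built from the sample data above) could look like this:
from pyspark.sql import functions as F

# For each column, count the cells that contain a special character but no
# alphanumeric character, in one pass over the frame.
def special_only(c):
    return F.col(c).rlike(r'[^A-Za-z0-9\s]') & ~F.col(c).rlike('[A-Za-z0-9]')

counts = df.agg(*[F.sum(F.when(special_only(c), 1).otherwise(0)).alias(c)
                  for c in df.columns]).collect()[0].asDict()
print(counts)  # expected: {'ID': 0, 'col_1': 3, 'col_2': 1}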

Can we unwind list in pandas column

I have my pandas df like below. It has lists in some of the columns; can they be unwound as below?
import pandas as pd
L1 = [['ID1', 0, [0, 1, 1], [0, 1]],
      ['ID2', 2, [1, 2, 3], [0, 1]]
      ]
df1 = pd.DataFrame(L1, columns=['ID', 't', 'Key', 'Value'])
Can this be unwound like below?
import pandas as pd
L1 = [['ID1', 0, 0, 1, 1, 0, 1],
      ['ID2', 2, 1, 2, 3, 0, 1]
      ]
df1 = pd.DataFrame(L1, columns=['ID', 't', 'Key_0', 'Key_1', 'Key_2', 'Value_0', 'Value_1'])
By turning each Series of lists into a list of lists, you can call the DataFrame constructor to explode it into multiple columns. Using pop in a list comprehension within concat removes the original columns from your DataFrame, so you just join back the exploded versions.
This will work regardless of the number of elements in each list, and even if the lists have varying numbers of elements across rows.
df2 = pd.concat([df1] + [pd.DataFrame(df1.pop(col).tolist(), index=df1.index).add_prefix(f'{col}_')
                         for col in ['Key', 'Value']],
                axis=1)
print(df2)

    ID  t  Key_0  Key_1  Key_2  Value_0  Value_1
0  ID1  0      0      1      1        0        1
1  ID2  2      1      2      3        0        1
You can flatten L1 before constructing the data frame:
L2 = [ row[0:2] + row[2] + row[3] for row in L1 ]
df2 = pd.DataFrame(L2,columns=['ID', 't', 'Key_0','Key_1','Key_2','Value_0', 'Value_1'])
You can also explode the dataframe column-wise:
df3 = df1.apply(pd.Series.explode, axis=1)
df3.columns = ['ID', 't', 'Key_0','Key_1','Key_2','Value_0', 'Value_1']

Pandas dataframe rolling, custom calculation, df.iloc ValueError

How should I utilize the RangeIndex provided by pandas.DataFrame.rolling in custom_function?
Current implementation gives a ValueError.
At first x.index = RangeIndex(start=0, stop=2, step=1), and tmp_df correctly selects the first and second row in df (index 0 and 1). For the last x.index = RangeIndex(start=6, stop=8, step=1) it seems like iloc tries to select index 8 in df which is out of range (df has index 0 to 7).
Basically, what I want to do is to have the custom function count consecutive numbers in the window. Given the positive values 1, 0, 1, 1, 1, 0 in a window, the custom function should return 3, as there is a maximum of 3 consecutive 1s.
import numpy as np
import pandas as pd

df = pd.DataFrame({'open': [7, 5, 10, 11, 6, 13, 17, 12],
                   'close': [6, 6, 11, 10, 7, 15, 18, 10],
                   'positive': [0, 1, 1, 0, 1, 1, 1, 0]},
                  )

def custom_function(x, df):
    print("index:", x.index)
    tmp_df = df.iloc[x.index]  # raises "ValueError: cannot set using a slice indexer with a different length than the value" when x.index = RangeIndex(start=6, stop=8, step=1) as df index goes from 0 to 7 only
    # do calculations on any column in tmp_df, get result
    result = 1  # dummy result
    return result

intervals = range(2, 10)
for i in intervals:
    df['result_' + str(i)] = np.nan
    res = df.rolling(i).apply(custom_function, args=(df,), raw=False)
    df['result_' + str(i)][1:] = res
print(df)
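A sketch of one way to sidestep the iloc problem: with raw=True the window arrives as a plain numpy array, which is all this particular calculation needs, so there is no need to index back into df at all (the helper name max_consecutive_ones is made up for the example):
import pandas as pd

df = pd.DataFrame({'open': [7, 5, 10, 11, 6, 13, 17, 12],
                   'close': [6, 6, 11, 10, 7, 15, 18, 10],
                   'positive': [0, 1, 1, 0, 1, 1, 1, 0]})

def max_consecutive_ones(window):
    # window is a plain numpy array when raw=True
    best = run = 0
    for v in window:
        run = run + 1 if v == 1 else 0
        best = max(best, run)
    return best

for i in range(2, 10):
    df[f'result_{i}'] = df['positive'].rolling(i).apply(max_consecutive_ones, raw=True)
print(df)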

How do I block a Keyerror in Python from reoccurring or create exception to handle it?

I'm new to Python and working with an API. My code is below:
import pandas as pd
import json
from pandas.io.json import json_normalize
import datetime

threedaysago = datetime.date.fromordinal(datetime.date.today().toordinal()-3).strftime("%F")

import http.client

conn = http.client.HTTPSConnection("api.sendgrid.com")
payload = "{}"
keys = {
    # "CF" : "SG.UdhzjmjYR**.-",
}
df = []  # Create new Dataframe

for name, value in keys.items():
    headers = { 'authorization': "Bearer " + value }
    conn.request("GET", "/v3/categories/stats/sums?aggregated_by=&start_date={d}&end_date={d}".format(d=threedaysago), payload, headers)
    res = conn.getresponse()
    data = res.read()
    print(data.decode("utf-8"))
    d = json.loads(data.decode("utf-8"))
    c = d['stats']
    # row = d['stats'][0]['name']
    # Add Brand to data row here with 'name'
    df.append(c)  # Load data row into df

#1
df = pd.DataFrame(df[0])
df_new = df[['name']]
df_new.rename(columns={'name': 'Category'}, inplace=True)
df_metric = pd.DataFrame(list(df['metrics'].values))
sendgrid = pd.concat([df_new, df_metric], axis=1, sort=False)
sendgrid.set_index('Category', inplace=True)
sendgrid.insert(0, 'Date', threedaysago)
sendgrid.insert(1, 'BrandId', 99)
sendgrid.rename(columns={
    'blocks': 'Blocks',
    'bounce_drops': 'BounceDrops',
    'bounces': 'Bounces',
    'clicks': 'Clicks',
    'deferred': 'Deferred',
    'delivered': 'Delivered',
    'invalid_emails': 'InvalidEmails',
    'opens': 'Opens',
    'processed': 'Processed',
    'requests': 'Requests',
    'spam_report_drops': 'SpamReportDrops',
    'spam_reports': 'SpamReports',
    'unique_clicks': 'UniqueClicks',
    'unique_opens': 'UniqueOpens',
    'unsubscribe_drops': 'UnsubscribeDrops',
    'unsubscribes': 'Unsubscribes'
    },
    inplace=True)
When I run this however, I receive an error:
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
I know this happens because there were no stats available for three days ago:
{"date":"2020-02-16","stats":[]}
But how do I handle these exceptions in my code? This is going to run as a daily report, and it will break if this error is not handled.
Sorry for the late answer.
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]" means there is no column called name in your dataframe.
But, you believe that error occurred because of "stats" : []. It is also not true. If any of the indexes is empty the error should occur as ValueError: arrays must all be same length
I have recreated this problem and I will show you to get an idea to overcome this problem.
Recreating KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c']}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
KeyError: "None of [Index(['D'], dtype='object')] are in the [columns]"
Solution: you can see there is no column called 'D' in the data frame, so recheck your columns.
Now add 'D' and see what happens:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D': []}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
ValueError: arrays must all be same length
Solution: column 'D' needs to contain the same number of values as 'A', 'B', and 'C'.
To overcome both problems:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D':[]}]
df = pd.DataFrame.from_dict(df[0], orient='index')
print(df)
Output:
      0     1     2
A     1     4     5
B     4     5     6
C     a     b     c
D  None  None  None
You can see the columns are now represented as rows. You can then use loc to select each row by its label.
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D':[]}]
df = pd.DataFrame.from_dict(df[0], orient='index')
df = df.loc[['A']]  # uses loc
print(df)
Output:
   0  1  2
A  1  4  5
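Coming back to the original failure mode (the {"date": "2020-02-16", "stats": []} response), a minimal guard for the daily report could look like the sketch below, assuming you simply want an empty but well-formed frame for days with no stats; stats_to_frame is a hypothetical helper, not part of the original code:
import pandas as pd

def stats_to_frame(payload):
    # Return a DataFrame of category stats, or an empty frame with the
    # expected columns when the API returns no stats for the day, so that
    # df[['name']] and the later renames still work.
    stats = payload.get('stats', [])
    if not stats:
        return pd.DataFrame(columns=['name', 'metrics'])
    return pd.DataFrame(stats)

# Empty response from the question:
print(stats_to_frame({"date": "2020-02-16", "stats": []}))
# Hypothetical non-empty response, just to show the shape:
print(stats_to_frame({"date": "2020-02-16",
                      "stats": [{"name": "cat_a", "metrics": {"opens": 3}}]}))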

Invalid Syntax error pandas series

I am starting out with pandas in a Jupyter notebook. In the error message there is a ^ below the = operator, but I cannot see the problem. What's missing? Thanks!
import pandas as pd
data2 = ([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
s = pd.Series(data2)
print(s.shape)
This is the error:
File "<ipython-input-30-57c99bd7e494>", line 4
data2 = ([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
^
SyntaxError: invalid syntax
The original line fails because index=... is keyword-argument syntax, which is only valid inside a function call, not in a plain parenthesized tuple. The proper way to do this is to use separate variables for data and index:
import pandas as pd
data2 = [1,2,3,4]
index = ['a','b','c','d']
s = pd.Series(data2,index)
print(s.shape)
Or, as ayhan points out, you could unpack a dictionary with **:
data2 = dict(data=[1,2,3,4], index=['a','b','c','d'])
s = pd.Series(**data2)
print(s.shape)