conditional aggregation in PySpark groupby - apache-spark-sql

Easy question from a newbie in pySpark:
I have a df and I would like to make a conditional aggregation, returning the aggregation result if the denominator is different from 0, otherwise 0.
My attempt produces an error:
groupBy=["K"]
exprs=[(sum("A")+(sum("B"))/sum("C") if sum("C")!=0 else 0 ]
grouped_df=new_df.groupby(*groupBy).agg(*exprs)
Any hint?
Thank you

You have to use when/otherwise for if/else:
import pyspark.sql.functions as psf
new_df.groupby("K").agg(
    psf.when(psf.sum("C") == 0, psf.lit(0))
       .otherwise((psf.sum("A") + psf.sum("B")) / psf.sum("C"))
       .alias("sum")
)
But you can also do it this way:
import pyspark.sql.functions as psf
new_df.groupby("K").agg(
    ((psf.sum("A") + psf.sum("B")) / psf.sum("C")).alias("sum")
).na.fill({"sum": 0})
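For example, with a small made-up DataFrame (the column names K, A, B and C come from the question; the sample rows are invented), a minimal sketch of the first approach:
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
spark = SparkSession.builder.getOrCreate()
# made-up sample data: group "y" has sum("C") == 0, so the aggregation should return 0 for it
new_df = spark.createDataFrame(
    [("x", 1, 2, 3), ("x", 4, 5, 6), ("y", 1, 1, 0)],
    ["K", "A", "B", "C"],
)
new_df.groupby("K").agg(
    psf.when(psf.sum("C") == 0, psf.lit(0))
       .otherwise((psf.sum("A") + psf.sum("B")) / psf.sum("C"))
       .alias("sum")
).show()
The second version works because, under Spark's default (non-ANSI) SQL mode, dividing by zero returns null, which na.fill then replaces with 0.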

Related

Pandas series pad function not working with apply in pandas

I am trying to write code to pad columns of my pandas dataframe with different characters. I tried using the apply function to pad with '0' via zfill, and it works:
print(df["Date"].apply(lambda x: x.zfill(10)))
But when I try to use the pad function with the apply method on my dataframe, I get an error:
AttributeError: 'str' object has no attribute 'pad'
The code I am trying is:
print(df["Date"].apply(lambda x: x.pad(10, side="left", fillchar="0")))
Both zfill and pad are part of pandas.Series.str. I am confused why pad is not working while zfill works. How can I achieve this functionality?
Full code:
import pandas as pd
from io import StringIO
StringData = StringIO(
"""Date,Time
パンダ,パンダ
パンダサンDA12-3,パンダーサンDA12-3
パンダサンDA12-3,パンダサンDA12-3
"""
)
df = pd.read_csv(StringData, sep=",")
print(df["Date"].apply(lambda x: x.zfill(10)))  # works
print(df["Date"].apply(lambda x: x.pad(10, side="left", fillchar="0")))  # doesn't work
I am using pandas 1.5.1.
You should just not use apply here: inside apply each value is a plain Python str, so you only get the pure Python str methods (str has zfill but no pad, hence the error) instead of the pandas Series string methods. Use the .str accessor directly:
print(df["Date"].str.zfill(10))
print(df["Date"].str.pad(10, side="left", fillchar="0"))
output:
0 0000000パンダ
1 パンダサンDA12-3
2 パンダサンDA12-3
Name: Date, dtype: object
0 0000000パンダ
1 パンダサンDA12-3
2 パンダサンDA12-3
Name: Date, dtype: object
Multiple columns:
For several columns at once you do need apply, but this is DataFrame.apply, not Series.apply:
df[['col1', 'col2', 'col3']].apply(lambda s: s.str.pad(10, side="left", fillchar="0"))
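A minimal sketch of the multi-column case (col1, col2 and col3 are hypothetical column names, as in the line above); note the result has to be assigned back:
import pandas as pd
# hypothetical string columns, padded to width 10 with "0" on the left
df = pd.DataFrame({"col1": ["a", "bb"], "col2": ["ccc", "d"], "col3": ["ee", "f"]})
cols = ["col1", "col2", "col3"]
df[cols] = df[cols].apply(lambda s: s.str.pad(10, side="left", fillchar="0"))
print(df)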

Converting Python code to pyspark environment

How can I have the same functions as shift() and cumsum() from pandas in pyspark?
import pandas as pd
temp = pd.DataFrame(data=[['a',0],['a',0],['a',0],['b',0],['b',1],['b',1],['c',1],['c',0],['c',0]], columns=['ID','X'])
temp['transformed'] = temp.groupby('ID').apply(lambda x: (x["X"].shift() != x["X"]).cumsum()).reset_index()['X']
print(temp)
My question is how to achieve this in pyspark.
PySpark handles these types of queries with Window functions.
You can read their documentation here.
Your pyspark code would be something like this:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
# replace 'time' with whatever column defines the row order within each ID
window = W.partitionBy('ID').orderBy('time')
new_df = (
    df
    .withColumn('shifted', F.lag('X').over(window))
    .withColumn('isNotEqualToPrev', (F.col('shifted') != F.col('X')).cast('int'))
    .withColumn('cumsum', F.sum('isNotEqualToPrev').over(window))
)
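To try it end to end, here is a sketch that builds a Spark DataFrame from the sample data in the question; since a Spark DataFrame has no inherent row order, an explicit ordering column 'ord' is added (an assumption, as the original data has no time column):
from pyspark.sql import SparkSession, functions as F, Window as W
spark = SparkSession.builder.getOrCreate()
data = [('a', 0), ('a', 0), ('a', 0), ('b', 0), ('b', 1), ('b', 1), ('c', 1), ('c', 0), ('c', 0)]
# keep the original row order in an explicit 'ord' column
df = spark.createDataFrame([(i, ID, X) for i, (ID, X) in enumerate(data)], ['ord', 'ID', 'X'])
window = W.partitionBy('ID').orderBy('ord')
df.withColumn('shifted', F.lag('X').over(window)) \
  .withColumn('isNotEqualToPrev', (F.col('shifted') != F.col('X')).cast('int')) \
  .withColumn('cumsum', F.sum('isNotEqualToPrev').over(window)) \
  .show()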

compare after lower and trim df column in pyspark

I want to compare dataframe columns after trimming them and converting them to lower case in pyspark.
Is the code below wrong?
if f.trim(Loc_Country_df.LOC_NAME.lower) == f.trim(sdf.location_name.lower):
print('y')
else:
print('N')
No, you can't do it like this, because DataFrame columns are not plain variables; they are column expressions over a whole collection of values, so they cannot be compared in a Python if statement.
The best way is to perform a join:
from pyspark.sql import functions as f
# trim and lower-case both sides before joining
Loc_Country_df = Loc_Country_df.withColumn("LOC_NAME", f.lower(f.trim(f.col("LOC_NAME"))))
sdf = sdf.withColumn("location_name", f.lower(f.trim(f.col("location_name"))))
join_df = Loc_Country_df.join(sdf, Loc_Country_df.LOC_NAME == sdf.location_name, "left")
# rows with no match on the right side end up with a null location_name
join_df.withColumn('Result', f.when(f.col('location_name').isNull(), "N").otherwise("Y")).show()
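If the two columns happen to live in the same DataFrame, no join is needed at all; a minimal sketch (df, LOC_NAME and location_name are hypothetical here):
from pyspark.sql import functions as f
# row-wise comparison of two columns after trim + lower on each side
df = df.withColumn(
    "Result",
    f.when(f.lower(f.trim(f.col("LOC_NAME"))) == f.lower(f.trim(f.col("location_name"))), "Y").otherwise("N"),
)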

Is there an equivalent of 'REGEXP_SUBSTR' of SnowFlake in PySpark?

Is there an equivalent of Snowflake's REGEXP_SUBSTR in PySpark/spark-sql?
REGEXP_EXTRACT exists, but that doesn't support as many parameters as are supported by REGEXP_SUBSTR.
Here is a link to REGEXP_SUBSTR.
Here is a link to REGEXP_EXTRACT.
More specifically, I'm looking for alternatives for position, occurrence and regex parameters which are supported by Snowflake's REGEXP_SUBSTR.
position: Number of characters from the beginning of the string where the function starts searching for matches.
occurrence: Specifies which occurrence of the pattern to match. The function skips the first (occurrence - 1) matches.
regex_parameters: I'm looking specifically for the parameter 'e', which does the following:
extract sub-matches.
So the query is something like:
REGEXP_SUBSTR(string, pattern, 1, 2, 'e', 2).
Sample Input: It was the best of times, it was the worst in times.
Expected output: worst
Assuming string1 = It was the best of times, it was the worst in times.
Equivalent SF query:
SELECT regexp_substr(string1, 'the(\\W+)(\\w+)', 1, 2, 'e', 2)
One of the best things about Spark is that you don't have to rely on a vendor to create a library of functions for you. You can create a User Defined Function in Python and use it in a Spark SQL statement. E.g., starting with:
import re
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType

def regexp_substr(subject: str, pattern: str, position: int, occurrence: int, group: int) -> str:
    # position is 1-based, as in Snowflake; occurrence picks the n-th match, group the capture group
    matches = list(re.finditer(pattern, subject[position - 1:]))
    if len(matches) >= occurrence:
        return matches[occurrence - 1].group(group)
    return None

# bench-testing the python function
string1 = 'It was the best of times, it was the worst in times.'
pattern = r'the(\W+)(\w+)'
rv = regexp_substr(string1, pattern, 1, 2, 2)
print(rv)  # worst
# register for use in python
regexp_substr_udf = udf(regexp_substr, StringType())
# register for use in Spark SQL
spark.udf.register("REGEXP_SUBSTR", regexp_substr, StringType())
# create a spark DataFrame
df = spark.range(100).withColumn("s", lit(string1))
df.createOrReplaceTempView("df")
then you can run Spark SQL queries like
%%sql
select *, REGEXP_SUBSTR(s, 'the(\\W+)(\\w+)', 1, 2, 2) as ex from df
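If you are on Spark 3.1 or later, a built-in alternative is the SQL function regexp_extract_all combined with element_at; a minimal sketch, reusing the df and pattern from above:
from pyspark.sql import functions as F
# regexp_extract_all returns an array with capture group 2 of every match;
# element_at(..., 2) then picks the 2nd occurrence ("worst" for the sample string)
df.withColumn(
    "ex",
    F.element_at(F.expr(r"regexp_extract_all(s, 'the(\\W+)(\\w+)', 2)"), 2),
).show(truncate=False)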

What's the equivalent of Panda's value_counts() in PySpark?

I have the following python/pandas command:
df.groupby('Column_Name').agg(lambda x: x.value_counts().max())
where I am getting the value counts for ALL columns in a DataFrameGroupBy object.
How do I do this action in PySpark?
It's more or less the same:
spark_df.groupBy('column_name').count().orderBy('count')
In the groupBy you can pass multiple columns separated by commas, for example groupBy('column_1', 'column_2').
try this when you want to control the order:
data.groupBy('col_name').count().orderBy('count', ascending=False).show()
Try this:
spark_df.groupBy('column_name').count().show()
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, desc
spark = SparkSession.builder.appName('whatever_name').getOrCreate()
spark_sc = spark.read.option('header', True).csv(your_file)
value_counts=spark_sc.select('Column_Name').groupBy('Column_Name').agg(count('Column_Name').alias('counts')).orderBy(desc('counts'))
value_counts.show()
But note that Spark is much slower than pandas' value_counts() on a single machine.
df.groupBy('column_name').count().orderBy('count').show()
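The pandas snippet in the question runs over ALL columns, while the answers above handle one column at a time; a simple sketch that loops over every column (spark_df as in the answers above):
from pyspark.sql import functions as F
# one groupBy/count per column, mimicking value_counts() column by column
for c in spark_df.columns:
    spark_df.groupBy(c).count().orderBy(F.desc('count')).show()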