AttributeError: 'DataFrame' object has no attribute 'pivot' - dataframe

I have a PySpark dataframe:
user_id  item_id  last_watch_dt  total_dur  watched_pct
1        1        2021-05-11     4250       72
1        2        2021-05-11     80         99
2        3        2021-05-11     1000       80
2        4        2021-05-11     5000       40
I used this code:
df_new = df.pivot(index='user_id', columns='item_id', values='watched_pct')
To get this:
user_id    1    2    3    4
1         72   99    0    0
2          0    0   80   40
But I got an error:
AttributeError: 'DataFrame' object has no attribute 'pivot'
What did I do wrong?

You can only call .pivot on objects that have a pivot attribute (method or property). You tried df.pivot, so it would only work if df had such an attribute. df is an object of the pyspark.sql.DataFrame class; if you inspect its attributes in the PySpark API documentation, you will see many of them, but none is called pivot. That's why you get an AttributeError.
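As a quick sanity check (a minimal sketch, assuming df is the dataframe from the question), you can confirm this in a PySpark session:
print(hasattr(df, "pivot"))                     # False - DataFrame has no pivot attribute
print(hasattr(df.groupBy("user_id"), "pivot"))  # True  - GroupedData does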
pivot is a method of the pyspark.sql.GroupedData class. This means that, in order to use it, you must first create a pyspark.sql.GroupedData object from your pyspark.sql.DataFrame object. In your case, that is done with .groupBy():
df.groupBy("user_id").pivot("item_id")
This creates yet another pyspark.sql.GroupedData object. To turn it back into a dataframe you need one of the methods of the GroupedData class; agg is the one you want. Inside it, you provide a Spark aggregation function to apply to all the grouped elements (e.g. sum, first, etc.).
df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
Full example:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 1, '2021-05-11', 4250, 72),
     (1, 2, '2021-05-11', 80, 99),
     (2, 3, '2021-05-11', 1000, 80),
     (2, 4, '2021-05-11', 5000, 40)],
    ['user_id', 'item_id', 'last_watch_dt', 'total_dur', 'watched_pct'])

df = df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
df.show()
# +-------+----+----+----+----+
# |user_id| 1| 2| 3| 4|
# +-------+----+----+----+----+
# | 1| 72| 99|null|null|
# | 2|null|null| 80| 40|
# +-------+----+----+----+----+
If you want to replace the nulls with 0, use the fillna method of the pyspark.sql.DataFrame class.
df = df.fillna(0)
df.show()
# +-------+---+---+---+---+
# |user_id| 1| 2| 3| 4|
# +-------+---+---+---+---+
# | 1| 72| 99| 0| 0|
# | 2| 0| 0| 80| 40|
# +-------+---+---+---+---+

Related

How to Pivot multiple columns in pyspark similar to pandas

I want to perform an operation in PySpark similar to what is possible with pandas.
My dataframe is (one sample row, shown transposed for readability):
Year             2021
win_loss_date    2021-03-08 00:00:00
Deal             1-2JZONGU
L2 GFCID Name    TEST GFCID CREATION
L2 GFCID         P-1-P1DO
GFCID            P-1-P5O
GFCID Name       TEST GFCID CREATION
Client Priority  None
Location         UNITED STATES
Deal Location    UNITED STATES
Revenue          4567.0000000
Deal Conclusion  Won
New/Rebid        New
In pandas, the code to pivot is:
df = pd.pivot_table(deal_df_pandas,
                    index=['GFCID', 'GFCID Name', 'Client Priority'],
                    columns=['New/Rebid', 'Year', 'Deal Conclusion'],
                    aggfunc={'Deal': 'count',
                             'Revenue': 'sum',
                             'Location': lambda x: set(x),
                             'Deal Location': lambda x: set(x)}).reset_index()
columns=['New/Rebid', 'Year', 'Deal Conclusion'] are the columns being pivoted.
Output I get (and expect) is a pivot with a three-level column header (New/Rebid, Year, Deal Conclusion) under each of Deal and Revenue. For example, for GFCID 0000000752 (ARAMARK SERVICES INC, Client Priority Bronze) the non-null cells are:
Deal    / New / 2020 / Won : 1.0
Deal    / New / 2021 / Lost: 1.0
Deal    / New / 2021 / Won : 2.0
Revenue / New / 2020 / Won : 1600000.0
Revenue / New / 2021 / Lost: 20.0
Revenue / New / 2021 / Won : 20000.0
What I want is to convert the above code to PySpark. What I am trying is not working:
from pyspark.sql import functions as F
df_pivot2 = (df_d1
             .groupby('GFCID', 'GFCID Name', 'Client Priority')
             .pivot('New/Rebid')
             .agg(F.first('Year'), F.first('Deal Conclusion'), F.count('Deal'), F.sum('Revenue')))
because this operation is not possible in PySpark (pivot does not accept multiple pivot columns):
(df_d1
 .groupby('GFCID', 'GFCID Name', 'Client Priority')
 .pivot('New/Rebid', 'Year', 'Deal Conclusion'))  # --error
You can concatenate the multiple columns into a single column, which can then be used within pivot.
Consider the following example:
data_sdf.show()
# +---+-----+--------+--------+
# | id|state| time|expected|
# +---+-----+--------+--------+
# | 1| A|20220722| 1|
# | 1| A|20220723| 1|
# | 1| B|20220724| 2|
# | 2| B|20220722| 1|
# | 2| C|20220723| 2|
# | 2| B|20220724| 3|
# +---+-----+--------+--------+
from pyspark.sql import functions as func

data_sdf. \
    withColumn('pivot_col', func.concat_ws('_', 'state', 'time')). \
    groupBy('id'). \
    pivot('pivot_col'). \
    agg(func.sum('expected')). \
    fillna(0). \
    show()
# +---+----------+----------+----------+----------+----------+
# | id|A_20220722|A_20220723|B_20220722|B_20220724|C_20220723|
# +---+----------+----------+----------+----------+----------+
# | 1| 1| 1| 0| 2| 0|
# | 2| 0| 0| 1| 3| 2|
# +---+----------+----------+----------+----------+----------+
The input dataframe had two fields, state and time, that were to be pivoted. They were concatenated with a '_' delimiter and used within pivot. After that, you can use multiple aggregations within the agg, per your requirements, as sketched below.
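For instance, a minimal sketch (the column aliases here are just illustrative) that applies two aggregations to each pivoted value:
from pyspark.sql import functions as func

# data_sdf is the example dataframe shown above
data_sdf. \
    withColumn('pivot_col', func.concat_ws('_', 'state', 'time')). \
    groupBy('id'). \
    pivot('pivot_col'). \
    agg(func.sum('expected').alias('sum_exp'), func.count('expected').alias('cnt_exp')). \
    fillna(0). \
    show()
# each pivoted value now yields two columns, e.g. A_20220722_sum_exp and A_20220722_cnt_exp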

Modify key column to match the join condition

I am working on datasets (having 20k distinct records) to join two data frames based on an identifier column id_text:
df1.join(df2,df1.id_text== df2.id_text,"inner").select(df1['*'], df2['Name'].alias('DName'))
df1 has the following sample values in the identifier column id_text:
X North
Y South
Z West
Whereas df2 has the following sample values from identifier column id_text:
North X
South Y
West Z
Logically, the different values for id_text are correct. Hardcoding those values for 10k records is not a feasible solution. Is there any way id_text can be modified for df2 to be the same as in df1?
You can use a column expression directly inside the join (it will not create an additional column). In this example, I used regexp_replace to swap the two parts of the string.
from pyspark.sql import functions as F
df1 = spark.createDataFrame([('X North', 1), ('Y South', 1), ('Z West', 1)], ['id_text', 'val1'])
df2 = spark.createDataFrame([('North X', 2), ('South Y', 2), ('West Z', 2)], ['id_text', 'Name'])
# df1 df2
# +-------+----+ +-------+----+
# |id_text|val1| |id_text|Name|
# +-------+----+ +-------+----+
# |X North| 1| |North X| 2|
# |Y South| 1| |South Y| 2|
# | Z West| 1| | West Z| 2|
# +-------+----+ +-------+----+
df = (df1
      .join(df2, df1.id_text == F.regexp_replace(df2.id_text, r'(.+) (.+)', '$2 $1'), 'inner')
      .select(df1['*'], df2.Name))
df.show()
# +-------+----+----+
# |id_text|val1|Name|
# +-------+----+----+
# |X North| 1| 2|
# |Y South| 1| 2|
# | Z West| 1| 2|
# +-------+----+----+
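If the values are not always exactly two space-separated tokens, an alternative sketch (not part of the original answer; assumes Spark 2.4+ for array_sort) is to normalize both keys by splitting and sorting the tokens before joining:
from pyspark.sql import functions as F

# normalize "X North" and "North X" to the same key by sorting the tokens
def norm(c):
    return F.concat_ws(' ', F.array_sort(F.split(c, ' ')))

df = (df1
      .join(df2, norm(df1.id_text) == norm(df2.id_text), 'inner')
      .select(df1['*'], df2.Name))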

How do I create a new column that has the count of all the row values greater than 0 in PySpark?

Suppose I have a PySpark data frame like this:
col1 col2 col3
1 2 -3
2 null 5
4 4 8
1 0 9
I want to add a column called check where it counts the number of values that are greater than 0.
The final output will be:
col1 col2 col3 check
1 2 -3 2
2 null 5 2
4 4 8 3
1 0 9 2
I was trying this, but it didn't help and errors out as below:
df= df.withColumn("check", sum((df[col] > 0) for col in df.columns))
Invalid argument, not a string or column: <generator object
at 0x7f0a866ae580> of type <class 'generator'>. For column literals,
use 'lit', 'array', 'struct' or 'create_map' function.
I don't know if there is a simpler SQL-based solution or not, but it's pretty straightforward with a udf.
from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType
count_udf = udf(lambda arr: sum([1 for a in arr if a > 0]), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()
Not sure if it'll handle nulls. Add a null check (if a and a > 0) in the udf if needed.
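For example, a null-safe sketch of the same udf (PySpark passes SQL nulls into the udf as Python None; imports as above):
# skip None before comparing, so null values don't raise a TypeError
count_udf = udf(lambda arr: sum(1 for a in arr if a is not None and a > 0), IntegerType())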
Idea: https://stackoverflow.com/a/42540401/496289
Your code uses sum; if what you actually need is the sum of the positive values rather than their count, then use:
count_udf = udf(lambda arr: sum([a for a in arr if a > 0]), IntegerType())
Create a new array column, filter it, and finally count the elements remaining in it.
Example:
from pyspark.sql.functions import expr

df.show(10,False)
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1   |2   |-3  |
#|2   |null|5   |
#+----+----+----+

df.withColumn("check", expr("size(filter(array(col1,col2,col3), x -> x > 0))")).show(10,False)
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|1   |2   |-3  |2    |
#|2   |null|5   |2    |
#+----+----+----+-----+
You can use functools.reduce to add up a 1 for each column value greater than 0 across df.columns, like this:
from pyspark.sql import functions as F
from operator import add
from functools import reduce
df = spark.createDataFrame([
(1, 2, -3), (2, None, 5), (4, 4, 8), (1, 0, 9)
], ["col1", "col2", "col3"])
df = df.withColumn(
"check",
reduce(add, [F.when(F.col(c) > 0, 1).otherwise(0) for c in df.columns])
)
df.show()
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#| 1| 2| -3| 2|
#| 2|null| 5| 2|
#| 4| 4| 8| 3|
#| 1| 0| 9| 2|
#+----+----+----+-----+

How to validate the date format of a column in Pyspark?

I am really new to PySpark. I want to check whether the column has the correct date format or not. How do I do it? I tried, but I am getting an error. Can anyone help me with this?
My code:
df =
Date name
0 12/12/2020 a
1 24/01/2019 b
2 08/09/2018 c
3 12/24/2020 d
4 Nan e
df_out= df.withColumn('output', F.when(F.to_date("Date","dd/mm/yyyy").isNotNull, Y).otherwise(No))
df_out.show()
gives me:
TypeError: condition should be a Column
You can filter out the invalid rows after converting the column to date type.
Example:
df.show()
#+----------+----+
#| Date|name|
#+----------+----+
#|12/12/2020| a|
#|24/01/2019| b|
#|12/24/2020| d|
#| nan| e|
#+----------+----+
from pyspark.sql.functions import *
df.withColumn("output",to_date(col('Date'),'dd/MM/yyyy')).\
filter(col("output").isNotNull()).\
show()
#+----------+----+----------+
#| Date|name| output|
#+----------+----+----------+
#|12/12/2020| a|2020-12-12|
#|24/01/2019| b|2019-01-24|
#+----------+----+----------+
#without adding new column
df.filter(to_date(col('Date'),'dd/MM/yyyy').isNotNull()).show()
#+----------+----+
#| Date|name|
#+----------+----+
#|12/12/2020| a|
#|24/01/2019| b|
#+----------+----+
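If you instead want a Y/N flag column like the question's code was attempting, here is a minimal sketch (note that isNotNull must be called as a method, the month pattern is MM since mm means minutes, and the flag values must be literals):
from pyspark.sql import functions as F

df_out = df.withColumn(
    'output',
    F.when(F.to_date('Date', 'dd/MM/yyyy').isNotNull(), F.lit('Y')).otherwise(F.lit('N'))
)
df_out.show()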

Finding largest number of location IDs per hour from each zone

I am using Scala with Spark and having a hard time understanding how to calculate the maximum count of pickups from a location corresponding to each hour. Currently I have a df with three columns (Location, hour, Zone), where Location is an integer, hour is an integer 0-23 signifying the hour of the day, and Zone is a string. Something like this below:
Location hour Zone
97 0 A
49 5 B
97 0 A
10 6 D
25 5 B
97 0 A
97 3 A
What I need to do is find out, for each hour of the day 0-23, which zone has the largest number of pickups from a particular location.
So the answer should look something like this:
hour Zone max_count
0 A 3
1 B 4
2 A 6
3 D 1
. . .
. . .
23 D 8
What I first tried was to use an intermediate step to figure out the counts per zone and hour
val df_temp = df.select("Location","hour","Zone")
.groupBy("hour","Zone").agg(count($"Location").alias("count"))
This gives me a dataframe that looks like this:
hour Zone count
3 A 5
8 B 9
3 B 2
23 F 8
23 A 1
23 C 4
3 D 12
. . .
. . .
I then tried doing the following:
val df_final = df_temp.select("hours","Zone","count")
.groupBy("hours","Zone").agg(max($"count").alias("max_count")).orderBy($"hours")
This doesn't do anything except group by hours and zone; I still have 1000s of rows. I also tried:
val df_final = df_temp.select("hours","Zone","count")
.groupBy("hours").agg(max($"count").alias("max_count")).orderBy($"hours")
The above gives me the max count and 24 rows from 0-23 but there is no Zone column there. So the answer looks like this:
hour max_count
0 12
1 15
. .
. .
23 8
I would like the Zone column included so I know which zone had the max count for each of those hours. I was also looking into the window function to do rank but I wasn't sure how to use it.
After generating the dataframe with per-hour/zone "count", you could generate another dataframe with per-hour "max_count" and join the two dataframes on "hour" and "max_count":
val df = Seq(
(97, 0, "A"),
(49, 5, "B"),
(97, 0, "A"),
(10, 6, "D"),
(25, 5, "B"),
(97, 0, "A"),
(97, 3, "A"),
(10, 0, "C"),
(20, 5, "C")
).toDF("location", "hour", "zone")
val dfC = df.groupBy($"hour", $"zone").agg(count($"location").as("count"))
val dfM = dfC.groupBy($"hour".as("m_hour")).agg(max($"count").as("max_count"))
dfC.
join(dfM, dfC("hour") === dfM("m_hour") && dfC("count") === dfM("max_count")).
drop("m_hour", "count").
orderBy("hour").
show
// +----+----+---------+
// |hour|zone|max_count|
// +----+----+---------+
// | 0| A| 3|
// | 3| A| 1|
// | 5| B| 2|
// | 6| D| 1|
// +----+----+---------+
Alternatively, you could perform the per-hour/zone groupBy followed by a Window partitioning by "hour" to compute "max_count" for the where condition, as shown below:
import org.apache.spark.sql.expressions.Window
df.
groupBy($"hour", $"zone").agg(count($"location").as("count")).
withColumn("max_count", max($"count").over(Window.partitionBy("hour"))).
where($"count" === $"max_count").
drop("count").
orderBy("hour")
You can use Spark window functions for this task.
First, group the data to get a count of rows per hour and zone.
val df = read_df.groupBy("hour", "zone").agg(count("*").as("count_order"))
Then create a window that partitions the data by hour and orders it by the total count, and calculate the rank over this partition of data.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

val byZoneName = Window.partitionBy($"hour").orderBy($"count_order".desc)
val rankZone = rank().over(byZoneName)
This will perform the operation and list out the rank of all the zones grouped by hour.
val result_df = df.select($"*", rankZone as "rank")
The output will be something like this:
+----+----+-----------+----+
|hour|zone|count_order|rank|
+----+----+-----------+----+
| 0| A| 3| 1|
| 0| C| 2| 2|
| 0| B| 1| 3|
| 3| A| 1| 1|
| 5| B| 2| 1|
| 6| D| 1| 1|
+----+----+-----------+----+
You can then filter the data to keep only the rows with rank 1.
result_df.filter($"rank" === 1).orderBy("hour").show()
You can check my code here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5114666914683617/1792645088721850/4927717998130263/latest.html