How do I create a new column with the count of all the row values that are greater than 0 in a PySpark dataframe?

Suppose I have a PySpark dataframe like this:
col1 col2 col3
1 2 -3
2 null 5
4 4 8
1 0 9
I want to add a column called check that counts the number of values in each row that are greater than 0.
The final output will be:
col1 col2 col3 check
1 2 -3 2
2 null 5 2
4 4 8 3
1 0 9 2
I was trying this, but it didn't help and errors out as below:
df= df.withColumn("check", sum((df[col] > 0) for col in df.columns))
Invalid argument, not a string or column: <generator object
at 0x7f0a866ae580> of type <class 'generator'>. For column literals,
use 'lit', 'array', 'struct' or 'create_map' function.

Don't know if there is a simpler SQL-based solution or not, but it's pretty straightforward with a udf.
from pyspark.sql.functions import array, udf
from pyspark.sql.types import IntegerType

count_udf = udf(lambda arr: sum([1 for a in arr if a > 0]), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()
Not sure if it will handle nulls; add a null check (if a and a > 0) in the udf if needed.
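For example, a null-safe version of the udf might look like this (just a sketch, not tested against your exact data):
from pyspark.sql.functions import array, udf
from pyspark.sql.types import IntegerType

# None values fail the check and are simply not counted
count_udf = udf(lambda arr: sum(1 for a in arr if a is not None and a > 0), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()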
Idea: https://stackoverflow.com/a/42540401/496289
Your code uses sum, so in case you actually need the sum of the positive values rather than their count:
sum_udf = udf(lambda arr: sum([a for a in arr if a > 0]), IntegerType())

Create an array column, filter it to keep only the values greater than 0, and finally count the elements left in the array.
Example:
df.show(10,False)
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1 |2 |-3 |
#|2 |null|5 |
#+----+----+----+
df.withColumn("check",expr("size(filter(array(col1,col2), x -> x > 0))")).show(10,False)
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|1 |2 |-3 |2 |
#|2 |null|5 |1 |
#+----+----+----+-----+
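If you prefer the Python Column API over the SQL expression string, the same higher-order filter is exposed as pyspark.sql.functions.filter in Spark 3.1+ (a sketch that assumes that version):
from pyspark.sql import functions as F

df.withColumn(
    "check",
    F.size(F.filter(F.array("col1", "col2", "col3"), lambda x: x > 0))
).show(10, False)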

You can use functools.reduce to add up, over all the columns in df.columns, an indicator that is 1 when the value is greater than 0 and 0 otherwise, like this:
from pyspark.sql import functions as F
from operator import add
from functools import reduce
df = spark.createDataFrame([
    (1, 2, -3), (2, None, 5), (4, 4, 8), (1, 0, 9)
], ["col1", "col2", "col3"])
df = df.withColumn(
    "check",
    reduce(add, [F.when(F.col(c) > 0, 1).otherwise(0) for c in df.columns])
)
df.show()
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#| 1| 2| -3| 2|
#| 2|null| 5| 2|
#| 4| 4| 8| 3|
#| 1| 0| 9| 2|
#+----+----+----+-----+
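As an aside, the error in the original attempt most likely comes from sum resolving to pyspark.sql.functions.sum (for example after a from pyspark.sql.functions import *), which expects a single column, not a Python generator. Python's built-in sum also works here, because adding Column expressions simply builds up another Column (a sketch, assuming the built-in is not shadowed):
from pyspark.sql import functions as F

# the built-in sum starts at 0; 0 + Column still yields a Column expression
df.withColumn(
    "check",
    sum(F.when(F.col(c) > 0, 1).otherwise(0) for c in ["col1", "col2", "col3"])
).show()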


AttributeError: 'DataFrame' object has no attribute 'pivot'

I have a PySpark dataframe:
user_id item_id last_watch_dt total_dur watched_pct
1 1 2021-05-11 4250 72
1 2 2021-05-11 80 99
2 3 2021-05-11 1000 80
2 4 2021-05-11 5000 40
I used this code:
df_new = df.pivot(index='user_id', columns='item_id', values='watched_pct')
To get this:
user_id 1 2 3 4
1 72 99 0 0
2 0 0 80 40
But I got an error:
AttributeError: 'DataFrame' object has no attribute 'pivot'
What did I do wrong?
You can only call .pivot on objects that have a pivot attribute (method or property). You tried df.pivot, so it would only work if df had such an attribute. You can inspect all the attributes of df (an object of the pyspark.sql.DataFrame class) in the pyspark.sql.DataFrame API documentation. There are many attributes there, but none of them is called pivot. That's why you get an AttributeError.
pivot is a method of the pyspark.sql.GroupedData class. This means that, in order to use it, you must first create a pyspark.sql.GroupedData object from your pyspark.sql.DataFrame. In your case, that is done with .groupBy():
df.groupBy("user_id").pivot("item_id")
This creates yet another pyspark.sql.GroupedData object. To turn it back into a dataframe, you use one of the GroupedData methods; agg is the one you need here. Inside it, you provide the Spark aggregation function to apply to the grouped elements (e.g. sum, first, etc.).
df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 1, '2021-05-11', 4250, 72),
     (1, 2, '2021-05-11', 80, 99),
     (2, 3, '2021-05-11', 1000, 80),
     (2, 4, '2021-05-11', 5000, 40)],
    ['user_id', 'item_id', 'last_watch_dt', 'total_dur', 'watched_pct'])
df = df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
df.show()
# +-------+----+----+----+----+
# |user_id| 1| 2| 3| 4|
# +-------+----+----+----+----+
# | 1| 72| 99|null|null|
# | 2|null|null| 80| 40|
# +-------+----+----+----+----+
If you want to replace nulls with 0, use the fillna method of the pyspark.sql.DataFrame class.
df = df.fillna(0)
df.show()
# +-------+---+---+---+---+
# |user_id| 1| 2| 3| 4|
# +-------+---+---+---+---+
# | 1| 72| 99| 0| 0|
# | 2| 0| 0| 80| 40|
# +-------+---+---+---+---+
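If the item_id values are known in advance, you can also pass them to pivot explicitly, which fixes the output column order and saves Spark an extra pass over the data to discover the distinct values (a sketch, starting again from the original un-pivoted dataframe and assuming the ids really are 1-4):
from pyspark.sql import functions as F

# df here is the original (un-pivoted) dataframe
pivoted = (df.groupBy("user_id")
             .pivot("item_id", [1, 2, 3, 4])   # explicit pivot values
             .agg(F.sum("watched_pct"))
             .fillna(0))
pivoted.show()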

pyspark extra column where dates are transformed to 1, 2, 3

I have a dataframe with dates in the format YYYYMM.
These start from 201801.
I now want to add a column where 201801 = 1, 201802 = 2, and so on, up to the most recent month (which is updated every month).
Kind regards,
wokter
months_between can be used:
from pyspark.sql import functions as F
from pyspark.sql import types as T
#some testdata
data = [
    [201801],
    [201802],
    [201804],
    [201812],
    [202001],
    [202010]
]
df = spark.createDataFrame(data, schema=["yyyymm"])
df.withColumn("months", F.months_between(
    F.to_date(F.col("yyyymm").cast(T.StringType()), "yyyyMM"), F.lit("2017-12-01")
).cast(T.IntegerType())).show()
Output:
+------+------+
|yyyymm|months|
+------+------+
|201801| 1|
|201802| 2|
|201804| 4|
|201812| 12|
|202001| 25|
|202010| 34|
+------+------+
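Alternatively, since the values are plain YYYYMM integers, the same index can be computed with simple arithmetic instead of date parsing (a sketch that assumes every value is 201801 or later):
from pyspark.sql import functions as F

# (year - 2018) * 12 + month maps 201801 -> 1, 201802 -> 2, ..., 202001 -> 25
df.withColumn(
    "months",
    ((F.floor(F.col("yyyymm") / 100) - 2018) * 12 + F.col("yyyymm") % 100).cast("int")
).show()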

Finding largest number of location IDs per hour from each zone

I am using Scala with Spark and having a hard time understanding how to calculate the maximum count of pickups from a location for each hour. Currently I have a df with three columns (Location, hour, Zone), where Location is an integer, hour is an integer 0-23 signifying the hour of the day, and Zone is a string. Something like this:
Location hour Zone
97 0 A
49 5 B
97 0 A
10 6 D
25 5 B
97 0 A
97 3 A
What I need to do is find out, for each hour of the day 0-23, which zone has the largest number of pickups from a particular location.
So the answer should look something like this:
hour Zone max_count
0 A 3
1 B 4
2 A 6
3 D 1
. . .
. . .
23 D 8
What I first tried was an intermediate step to figure out the counts per zone and hour:
val df_temp = df.select("Location","hour","Zone")
.groupBy("hour","Zone").agg(count($"Location").alias("count"))
This gives me a dataframe that looks like this:
hour Zone count
3 A 5
8 B 9
3 B 2
23 F 8
23 A 1
23 C 4
3 D 12
. . .
. . .
I then tried doing the following:
val df_final = df_temp.select("hours","Zone","count")
.groupBy("hours","Zone").agg(max($"count").alias("max_count")).orderBy($"hours")
This doesn't do anything except group by hours and Zone; I still have thousands of rows. I also tried:
val df_final = df_temp.select("hours","Zone","count")
.groupBy("hours").agg(max($"count").alias("max_count")).orderBy($"hours")
The above gives me the max count and 24 rows from 0-23 but there is no Zone column there. So the answer looks like this:
hour max_count
0 12
1 15
. .
. .
23 8
I would like the Zone column included so I know which zone had the max count for each of those hours. I was also looking into the window function to do rank but I wasn't sure how to use it.
After generating the dataframe with per-hour/zone "count", you could generate another dataframe with per-hour "max_count" and join the two dataframes on "hour" and "max_count":
val df = Seq(
(97, 0, "A"),
(49, 5, "B"),
(97, 0, "A"),
(10, 6, "D"),
(25, 5, "B"),
(97, 0, "A"),
(97, 3, "A"),
(10, 0, "C"),
(20, 5, "C")
).toDF("location", "hour", "zone")
val dfC = df.groupBy($"hour", $"zone").agg(count($"location").as("count"))
val dfM = dfC.groupBy($"hour".as("m_hour")).agg(max($"count").as("max_count"))
dfC.
join(dfM, dfC("hour") === dfM("m_hour") && dfC("count") === dfM("max_count")).
drop("m_hour", "count").
orderBy("hour").
show
// +----+----+---------+
// |hour|zone|max_count|
// +----+----+---------+
// | 0| A| 3|
// | 3| A| 1|
// | 5| B| 2|
// | 6| D| 1|
// +----+----+---------+
Alternatively, you could perform the per-hour/zone groupBy followed by a Window partitioning by "hour" to compute "max_count" for the where condition, as shown below:
import org.apache.spark.sql.expressions.Window
df.
groupBy($"hour", $"zone").agg(count($"location").as("count")).
withColumn("max_count", max($"count").over(Window.partitionBy("hour"))).
where($"count" === $"max_count").
drop("count").
orderBy("hour")
You can use Spark window functions for this task.
First, group the data to get the count per hour and zone:
val df = read_df.groupBy("hour", "zone").agg(count("*").as("count_order"))
Then create a window that partitions the data by hour and orders it by that count, and calculate the rank over this partition:
val byZoneName = Window.partitionBy($"hour").orderBy($"count_order".desc)
val rankZone = rank().over(byZoneName)
This will perform the operation and list out the rank of all the zones grouped by hour.
val result_df = df.select($"*", rankZone as "rank")
The output will be something like this:
+----+----+-----------+----+
|hour|zone|count_order|rank|
+----+----+-----------+----+
| 0| A| 3| 1|
| 0| C| 2| 2|
| 0| B| 1| 3|
| 3| A| 1| 1|
| 5| B| 2| 1|
| 6| D| 1| 1|
+----+----+-----------+----+
You can then filter the data to keep only the rows with rank 1.
result_df.filter($"rank" === 1).orderBy("hour").show()
You can check my code here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5114666914683617/1792645088721850/4927717998130263/latest.html

pyspark dataframe sum

I am trying to perform the following operation on a pyspark.sql.DataFrame:
from pyspark.sql.functions import sum as spark_sum
df = spark.createDataFrame([
    ('a', 1.0, 1.0), ('a', 1.0, 0.2), ('b', 1.0, 1.0),
    ('c', 1.0, 0.5), ('d', 0.55, 1.0), ('e', 1.0, 1.0)
])
>>> df.show()
+---+----+---+
| _1| _2| _3|
+---+----+---+
| a| 1.0|1.0|
| a| 1.0|0.2|
| b| 1.0|1.0|
| c| 1.0|0.5|
| d|0.55|1.0|
| e| 1.0|1.0|
+---+----+---+
Then, I am trying to do the following operation:
1) Select the rows where df['_2'] > df['_3']
2) For each row selected above, multiply df['_2'] * df['_3'], then take their sum
3) Divide the result from above by the sum of df['_3'] over the selected rows
Here is what I did:
>>> filter_df = df.where(df['_2'] > df['_3'])
>>> filter_df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| a|1.0|0.2|
| c|1.0|0.5|
+---+---+---+
>>> result = spark_sum(filter_df['_2'] * filter_df['_3']) / spark_sum(filter_df['_3'])
>>> df.select(result).show()
+--------------------------+
|(sum((_2 * _3)) / sum(_3))|
+--------------------------+
| 0.9042553191489361|
+--------------------------+
But the answer should be (1.0 * 0.2 + 1.0 * 0.5) / (0.2 + 0.5) = 1.0, so this is not correct. It seems that the aggregation was applied to the original df, not to filter_df. What is going on here?
You need to call it on filter_df:
>>> result = spark_sum(filter_df['_2'] * filter_df['_3']) / spark_sum(filter_df['_3'])
result is just a Column expression: it is a transformation that only gets evaluated against whatever dataframe you select it from (lazy evaluation). sum is an aggregate function, and when it is called without any grouping it aggregates over the whole dataset, so df.select(result) sums over every row of df, while filter_df.select(result) sums only over the filtered rows.
>>> filter_df.select(result).show()
+--------------------------+
|(sum((_2 * _3)) / sum(_3))|
+--------------------------+
| 1.0|
+--------------------------+
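Equivalently, the filter can be folded into the aggregation itself with a conditional expression, so the whole thing can be selected straight from df (a sketch):
from pyspark.sql import functions as F

cond = df['_2'] > df['_3']
# sum() skips the nulls produced by when() without otherwise(), so only matching rows count
df.select(
    (F.sum(F.when(cond, df['_2'] * df['_3'])) / F.sum(F.when(cond, df['_3']))).alias('result')
).show()
# (1.0*0.2 + 1.0*0.5) / (0.2 + 0.5) = 1.0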

fill na with random numbers in Pyspark

I'm using Pyspark DataFrame.
I'd like to update NA values in Age column with a random value in the range 14 to 46.
How can I do it?
Mara's answer is correct if you would like to replace the null values with the same random number for every row, but if you'd like a different random value for each row, you can combine coalesce and F.rand(), as illustrated below:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from random import randint
df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
df = (df
    .withColumn("x4", F.lit(None).cast(IntegerType()))
    .withColumn("x5", F.lit(None).cast(IntegerType()))
)
df.na.fill({'x4':randint(0,100)}).show()
df.withColumn('x5', F.coalesce(F.col('x5'), (F.round(F.rand()*100)))).show()
+---+---+-----+---+----+
| x1| x2| x3| x4| x5|
+---+---+-----+---+----+
| 1| a| 23.0| 9|null|
| 3| B|-23.0| 9|null|
+---+---+-----+---+----+
+---+---+-----+----+----+
| x1| x2| x3| x4| x5|
+---+---+-----+----+----+
| 1| a| 23.0|null|44.0|
| 3| B|-23.0|null| 2.0|
+---+---+-----+----+----+
The randint function is what you need: it generates a random integer between two numbers. Apply it with Spark's fillna function on the 'age' column:
from random import randint
df.fillna(randint(14, 46), 'age').show()
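Note that this fills every null with the same value, because randint runs once on the driver. If you want a different random value per row, still within 14-46, you can adapt the coalesce/F.rand() idea from the previous answer (a sketch that assumes the column is literally named 'age'):
from pyspark.sql import functions as F

# F.rand() is in [0, 1), so floor(rand() * 33 + 14) lands in 14..46 inclusive
df = df.withColumn(
    'age',
    F.coalesce(F.col('age'), F.floor(F.rand() * 33 + 14).cast('int'))
)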