I want to perform an operation in PySpark similar to what is possible with pandas pivot_table.
My dataframe (one sample row, shown column by column):
Year: 2021
win_loss_date: 2021-03-08 00:00:00
Deal: 1-2JZONGU
L2 GFCID Name: TEST GFCID CREATION
L2 GFCID: P-1-P1DO
GFCID: P-1-P5O
GFCID Name: TEST GFCID CREATION
Client Priority: None
Location: UNITED STATES
Deal Location: UNITED STATES
Revenue: 4567.0000000
Deal Conclusion: Won
New/Rebid: New
In pandas, the code to pivot is:
df = pd.pivot_table(deal_df_pandas,
index=['GFCID', 'GFCID Name', 'Client Priority'],
columns=['New/Rebid', 'Year', 'Deal Conclusion'],
aggfunc={'Deal':'count',
'Revenue':'sum',
'Location': lambda x: set(x),
'Deal Location': lambda x: set(x)}).reset_index()
columns=['New/Rebid', 'Year', 'Deal Conclusion'] are the columns being pivoted.
The output I get (and expect):
                 GFCID       GFCID Name            Client Priority  Deal                                     Revenue
New/Rebid                                                           New                 Rebid                New                                           Rebid
Year                                                                2020      2021      2020      2021      2020                  2021                     2020      2021
Deal Conclusion                                                     Lost Won  Lost Won  Lost Won  Lost Won  Lost Won              Lost       Won           Lost Won  Lost Won
0                0000000752  ARAMARK SERVICES INC  Bronze           NaN  1.0  1.0  2.0  NaN  NaN  NaN  NaN  NaN  1600000.0000000  20.0000000 20000.0000000 NaN  NaN  NaN  NaN
What I want is to convert the above code to PySpark. What I am trying is not working:
from pyspark.sql import functions as F

df_pivot2 = (df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid')
    .agg(F.first('Year'), F.first('Deal Conclusion'), F.count('Deal'), F.sum('Revenue')))

because this operation is not possible in PySpark (pivot accepts only a single column, plus an optional list of values):

(df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid', 'Year', 'Deal Conclusion'))  # error
You can concatenate the multiple columns into a single column, which can then be used within pivot.
Consider the following example.
data_sdf.show()
# +---+-----+--------+--------+
# | id|state| time|expected|
# +---+-----+--------+--------+
# | 1| A|20220722| 1|
# | 1| A|20220723| 1|
# | 1| B|20220724| 2|
# | 2| B|20220722| 1|
# | 2| C|20220723| 2|
# | 2| B|20220724| 3|
# +---+-----+--------+--------+
from pyspark.sql import functions as func

data_sdf. \
    withColumn('pivot_col', func.concat_ws('_', 'state', 'time')). \
    groupBy('id'). \
    pivot('pivot_col'). \
    agg(func.sum('expected')). \
    fillna(0). \
    show()
# +---+----------+----------+----------+----------+----------+
# | id|A_20220722|A_20220723|B_20220722|B_20220724|C_20220723|
# +---+----------+----------+----------+----------+----------+
# | 1| 1| 1| 0| 2| 0|
# | 2| 0| 0| 1| 3| 2|
# +---+----------+----------+----------+----------+----------+
The input dataframe had 2 fields, state and time, that were to be pivoted. They were concatenated with a '_' delimiter and used within pivot. After that, you can use multiple aggregations within the agg, per your requirements.
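Applied to the dataframe from your question, a sketch could look like the following (assuming the column names shown above; collect_set stands in for the pandas set(x) lambda, and the result gets column names like New_2021_Won_Deal):

from pyspark.sql import functions as F

df_pivot2 = (df_d1
    # build a single pivot key such as 'New_2021_Won' from the three columns
    .withColumn('pivot_col', F.concat_ws('_', F.col('New/Rebid'),
                                          F.col('Year').cast('string'),
                                          F.col('Deal Conclusion')))
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('pivot_col')
    .agg(F.count('Deal').alias('Deal'),
         F.sum('Revenue').alias('Revenue'),
         F.collect_set('Location').alias('Location'),
         F.collect_set('Deal Location').alias('Deal Location')))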
I have a PySpark dataframe:
user_id  item_id  last_watch_dt  total_dur  watched_pct
1        1        2021-05-11     4250       72
1        2        2021-05-11     80         99
2        3        2021-05-11     1000       80
2        4        2021-05-11     5000       40
I used this code:
df_new = df.pivot(index='user_id', columns='item_id', values='watched_pct')
To get this:
user_id   1   2   3   4
1         72  99  0   0
2         0   0   80  40
But I got an error:
AttributeError: 'DataFrame' object has no attribute 'pivot'
What did I do wrong?
You can only do .pivot on objects that have a pivot attribute (method or property). You tried to do df.pivot, so it would only work if df had such an attribute. You can inspect all the attributes of df (it's an object of the pyspark.sql.DataFrame class) in the API documentation. You will see many attributes there, but none of them is called pivot. That's why you get an AttributeError.
pivot is a method of the pyspark.sql.GroupedData object. This means that, in order to use it, you must somehow create a pyspark.sql.GroupedData object from your pyspark.sql.DataFrame object. In your case, that is done with .groupBy():
df.groupBy("user_id").pivot("item_id")
This creates yet another pyspark.sql.GroupedData object. In order to make a dataframe out of it, you will want to use one of the methods of the GroupedData class. agg is the method that you need. Inside it, you have to provide a Spark aggregation function that will be applied to all the grouped elements (e.g. sum, first, etc.).
df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(1, 1, '2021-05-11', 4250, 72),
(1, 2, '2021-05-11', 80, 99),
(2, 3, '2021-05-11', 1000, 80),
(2, 4, '2021-05-11', 5000, 40)],
['user_id', 'item_id', 'last_watch_dt', 'total_dur', 'watched_pct'])
df = df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
df.show()
# +-------+----+----+----+----+
# |user_id| 1| 2| 3| 4|
# +-------+----+----+----+----+
# | 1| 72| 99|null|null|
# | 2|null|null| 80| 40|
# +-------+----+----+----+----+
If you want to replace nulls with 0, use fillna of pyspark.sql.DataFrame class.
df = df.fillna(0)
df.show()
# +-------+---+---+---+---+
# |user_id| 1| 2| 3| 4|
# +-------+---+---+---+---+
# | 1| 72| 99| 0| 0|
# | 2| 0| 0| 80| 40|
# +-------+---+---+---+---+
I am working on datasets (having 20k distinct records) and need to join two data frames on an identifier column id_text:
df1.join(df2, df1.id_text == df2.id_text, "inner").select(df1['*'], df2['Name'].alias('DName'))
df1 has the following sample values in the identifier column id_text:
X North
Y South
Z West
Whereas df2 has the following sample values from identifier column id_text:
North X
South Y
West Z
Logically, the different values of id_text are correct; they just have the two parts in a different order. Hardcoding those values for 10k records is not a feasible solution. Is there any way id_text in df2 can be modified to match df1?
You can use a column expression directly inside join (it will not create an additional column). In this example, I used regexp_replace to switch places of both elements.
from pyspark.sql import functions as F
df1 = spark.createDataFrame([('X North', 1), ('Y South', 1), ('Z West', 1)], ['id_text', 'val1'])
df2 = spark.createDataFrame([('North X', 2), ('South Y', 2), ('West Z', 2)], ['id_text', 'Name'])
# df1 df2
# +-------+----+ +-------+----+
# |id_text|val1| |id_text|Name|
# +-------+----+ +-------+----+
# |X North| 1| |North X| 2|
# |Y South| 1| |South Y| 2|
# | Z West| 1| | West Z| 2|
# +-------+----+ +-------+----+
df = (df1
.join(df2, df1.id_text == F.regexp_replace(df2.id_text, r'(.+) (.+)', '$2 $1'), 'inner')
.select(df1['*'], df2.Name))
df.show()
# +-------+----+----+
# |id_text|val1|Name|
# +-------+----+----+
# |X North| 1| 2|
# |Y South| 1| 2|
# | Z West| 1| 2|
# +-------+----+----+
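If you would rather avoid the regex, the same normalization can be done by splitting on the space and swapping the two tokens (a sketch, assuming id_text always contains exactly two space-separated parts):

from pyspark.sql import functions as F

# split df2's id_text into its two tokens and rebuild it in reversed order
parts = F.split(df2.id_text, ' ')
df = (df1
      .join(df2, df1.id_text == F.concat_ws(' ', parts[1], parts[0]), 'inner')
      .select(df1['*'], df2.Name))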
I have 2 dataframes:
df contains all train routes with origin and arrival columns (both ids and names)
df_relation contains Station (Gare) name, relation and API number.
Goal: I need to join df_relation to df twice, once on the origin column and once on the arrival column.
I tried this:
df.groupBy("origin", "origin_id", "arrival", "direction") \
.agg({'time_travelled': 'avg'}) \
.filter(df.direction == 0) \
.join(df_relation, df.origin == df_relation.Gare, "inner") \
.join(df_relation, df.arrival == df_relation.Gare, "inner") \
.orderBy("Relation")
.show()
But I got the following AnalysisException
AnalysisException: Column Gare#1708 are ambiguous.
It's probably because you joined several Datasets together, and some of these Datasets are the same.
This column points to one of the Datasets but Spark is unable to figure out which one.
Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id").
You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
How can I rewrite this?
I have unsuccessfully tried to blindly follow the error's recommendation, like this:
.as("a").join(df_relation.as("b"), $"a.arrival" == $"b.Gare", "inner")
This is my first dataframe (df):
+--------------------+----------+--------------------+---------+-------------------+
| origin| origin_id| arrival|direction|avg(time_travelled)|
+--------------------+----------+--------------------+---------+-------------------+
| Gare du Nord|IDFM:71410|La Plaine Stade d...| 1.0| 262.22222222222223|
| Gare du Nord|IDFM:71410|Aéroport CDG 1 (T...| 1.0| 1587.7551020408164|
|Villeparisis - Mi...|IDFM:68916| Mitry - Claye| 1.0| 240.0|
| Villepinte|IDFM:73547|Parc des Expositions| 1.0| 90.33898305084746|
| Le Blanc-Mesnil|IDFM:72648| Aulnay-sous-Bois| 1.0| 105.04273504273505|
|Aéroport CDG 1 (T...|IDFM:73596|Aéroport Charles ...| 1.0| 145.27777777777777|
This is my second dataframe (df_relation):
+-----------------------------------------+--------+--------+
|Gare |Relation|Gare Api|
+-----------------------------------------+--------+--------+
|Aéroport Charles de Gaulle 2 (Terminal 2)|1 |87001479|
|Aéroport CDG 1 (Terminal 3) - RER |2 |87271460|
|Parc des Expositions |3 |87271486|
|Villepinte |4 |87271452|
And this is what I am trying to achieve:
+--------------------+-----------+--------------------+---------+-------------------+--------+----------+-----------+
| origin| origin_id| arrival|direction|avg(time_travelled)|Relation|Api origin|Api arrival|
+--------------------+-----------+--------------------+---------+-------------------+--------+----------+-----------+
|Aéroport Charles ...| IDFM:73699|Aéroport CDG 1 (T...| 0.0| 110.09345794392523| 1| 87001479| 87271460|
|Aéroport CDG 1 (T...| IDFM:73596|Parc des Expositions| 0.0| 280.17543859649123| 2| 87271460| 87271486|
|Aéroport CDG 1 (T...| IDFM:73596| Gare du Nord| 0.0| 1707.4| 2| 87271460| 87271007|
|Parc des Expositions| IDFM:73568| Villepinte| 0.0| 90.17543859649123| 3| 87271486| 87271452|
| Villepinte| IDFM:73547| Sevran Beaudottes| 0.0| 112.45614035087719| 4| 87271452| 87271445|
| Sevran Beaudottes| IDFM:73491| Aulnay-sous-Bois| 0.0| 168.24561403508773| 5| 87271445| 87271411|
| Mitry - Claye| IDFM:69065|Villeparisis - Mi...| 0.0| 210.51724137931035| 6| 87271528| 87271510|
|Villeparisis - Mi...| IDFM:68916| Vert Galant| 0.0| 150.0| 7| 87271510| 87271510|
You take the original df and join df_relation to it twice. This way you create duplicate columns for every column in df_relation; the column "Gare" just happens to be the first of them, so it is the one shown in the error message.
To avoid the error, you will have to create aliases for your dataframes. Notice how I create them
df_agg.alias("agg")
df_relation.alias("rel_o")
df_relation.alias("rel_a")
and how I later reference them to qualify every column.
from pyspark.sql import functions as F
df_agg = (
df.filter(F.col("direction") == 0)
.groupBy("origin", "origin_id", "arrival", "direction")
.agg({"time_travelled": "avg"})
)
df_result = (
df_agg.alias("agg")
.join(df_relation.alias("rel_o"), F.col("agg.origin") == F.col("rel_o.Gare"), "inner")
.join(df_relation.alias("rel_a"), F.col("agg.arrival") == F.col("rel_a.Gare"), "inner")
.orderBy("rel_o.Relation")
.select(
"agg.*",
"rel_o.Relation",
F.col("rel_o.Gare Api").alias("Api origin"),
F.col("rel_a.Gare Api").alias("Api arrival"),
)
)
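If you prefer not to juggle aliases, another option is to rename the columns of each df_relation copy before joining, so that no duplicate column names reach the joins at all (a sketch, assuming the column names shown above):

from pyspark.sql import functions as F

# each copy of df_relation gets unambiguous column names up front
rel_o = df_relation.select(
    F.col("Gare").alias("Gare_o"),
    F.col("Relation"),
    F.col("Gare Api").alias("Api origin"),
)
rel_a = df_relation.select(
    F.col("Gare").alias("Gare_a"),
    F.col("Gare Api").alias("Api arrival"),
)

df_result = (df_agg
    .join(rel_o, df_agg.origin == rel_o.Gare_o, "inner")
    .join(rel_a, df_agg.arrival == rel_a.Gare_a, "inner")
    .orderBy("Relation")
    .drop("Gare_o", "Gare_a"))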
I am using Scala with Spark and am having a hard time understanding how to calculate the maximum count of pickups from a location for each hour. Currently I have a df with three columns (Location, hour, Zone), where Location is an integer, hour is an integer 0-23 signifying the hour of the day, and Zone is a string. Something like this:
Location hour Zone
97 0 A
49 5 B
97 0 A
10 6 D
25 5 B
97 0 A
97 3 A
What I need to do is find out for each hour of the day 0-23, what zone has the largest number of pickups from a particular location
So the answer should look something like this:
hour Zone max_count
0 A 3
1 B 4
2 A 6
3 D 1
. . .
. . .
23 D 8
What I first tried was to use an intermediate step to figure out the counts per zone and hour
val df_temp = df.select("Location","hour","Zone")
.groupBy("hour","Zone").agg(count($"Location").alias("count"))
This gives me a dataframe that looks like this:
hour Zone count
3 A 5
8 B 9
3 B 2
23 F 8
23 A 1
23 C 4
3 D 12
. . .
. . .
I then tried doing the following:
val df_final = df_temp.select("hours","Zone","count")
.groupBy("hours","Zone").agg(max($"count").alias("max_count")).orderBy($"hours")
This doesn't do anything except group by hours and Zone again, and I still have thousands of rows. I also tried:
val df_final = df_temp.select("hours","Zone","count")
.groupBy("hours").agg(max($"count").alias("max_count")).orderBy($"hours")
The above gives me the max count and 24 rows from 0-23 but there is no Zone column there. So the answer looks like this:
hour max_count
0 12
1 15
. .
. .
23 8
I would like the Zone column included so I know which zone had the max count for each of those hours. I was also looking into the window function to do rank but I wasn't sure how to use it.
After generating the dataframe with per-hour/zone "count", you could generate another dataframe with per-hour "max_count" and join the two dataframes on "hour" and "max_count":
val df = Seq(
(97, 0, "A"),
(49, 5, "B"),
(97, 0, "A"),
(10, 6, "D"),
(25, 5, "B"),
(97, 0, "A"),
(97, 3, "A"),
(10, 0, "C"),
(20, 5, "C")
).toDF("location", "hour", "zone")
val dfC = df.groupBy($"hour", $"zone").agg(count($"location").as("count"))
val dfM = dfC.groupBy($"hour".as("m_hour")).agg(max($"count").as("max_count"))
dfC.
join(dfM, dfC("hour") === dfM("m_hour") && dfC("count") === dfM("max_count")).
drop("m_hour", "count").
orderBy("hour").
show
// +----+----+---------+
// |hour|zone|max_count|
// +----+----+---------+
// | 0| A| 3|
// | 3| A| 1|
// | 5| B| 2|
// | 6| D| 1|
// +----+----+---------+
Alternatively, you could perform the per-hour/zone groupBy followed by a Window partitioning by "hour" to compute "max_count" for the where condition, as shown below:
import org.apache.spark.sql.expressions.Window
df.
groupBy($"hour", $"zone").agg(count($"location").as("count")).
withColumn("max_count", max($"count").over(Window.partitionBy("hour"))).
where($"count" === $"max_count").
drop("count").
orderBy("hour")
You can use Spark window functions for this task.
First, you can group the data to get a count per hour and zone.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, rank}

val df = read_df.groupBy("hour", "zone").agg(count("*").as("count_order"))
Then create a window that partitions the data by hour and orders it by the total count, and calculate the rank over this partition of the data.
val byZoneName = Window.partitionBy($"hour").orderBy($"count_order".desc)
val rankZone = rank().over(byZoneName)
This will perform the operation and list out the rank of all the zones grouped by hour.
val result_df = df.select($"*", rankZone as "rank")
The output will be something like this:
+----+----+-----------+----+
|hour|zone|count_order|rank|
+----+----+-----------+----+
| 0| A| 3| 1|
| 0| C| 2| 2|
| 0| B| 1| 3|
| 3| A| 1| 1|
| 5| B| 2| 1|
| 6| D| 1| 1|
+----+----+-----------+----+
You can then filter out the data with rank 1.
result_df.filter($"rank" === 1).orderBy("hour").show()
You can check my code here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5114666914683617/1792645088721850/4927717998130263/latest.html
Currently, I have a table consisting of an encounter_id and a date field, like so:
+---------------------------+--------------------------+
|encounter_id |date |
+---------------------------+--------------------------+
|random_id34234 |2018-09-17 21:53:08.999999|
|this_can_be_anything2432432|2018-09-18 18:37:57.000000|
|423432 |2018-09-11 21:00:36.000000|
+---------------------------+--------------------------+
encounter_id is a random string.
I'm aiming to create a column which consists of the total number of encounters in the past 30 days.
+---------------------------+--------------------------+---------------------------+
|encounter_id |date | encounters_in_past_30_days|
+---------------------------+--------------------------+---------------------------+
|random_id34234 |2018-09-17 21:53:08.999999| 2 |
|this_can_be_anything2432432|2018-09-18 18:37:57.000000| 3 |
|423432 |2018-09-11 21:00:36.000000| 1 |
+---------------------------+--------------------------+---------------------------+
Currently, I'm thinking of somehow using window functions and specifying an aggregate function.
Thanks for the time.
Here is one possible solution; I added some sample data. It indeed uses a window function, as you suggested yourself. Hope this helps!
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = sqlContext.createDataFrame(
[
('A','2018-10-01 00:15:00'),
('B','2018-10-11 00:30:00'),
('C','2018-10-21 00:45:00'),
('D','2018-11-10 00:00:00'),
('E','2018-12-20 00:15:00'),
('F','2018-12-30 00:30:00')
],
("encounter_id","date")
)
df = df.withColumn('timestamp',F.col('date').astype('Timestamp').cast("long"))
w = Window.orderBy('timestamp').rangeBetween(-60*60*24*30,0)
df = df.withColumn('encounters_past_30_days',F.count('encounter_id').over(w))
df.show()
Output:
+------------+-------------------+----------+-----------------------+
|encounter_id| date| timestamp|encounters_past_30_days|
+------------+-------------------+----------+-----------------------+
| A|2018-10-01 00:15:00|1538345700| 1|
| B|2018-10-11 00:30:00|1539210600| 2|
| C|2018-10-21 00:45:00|1540075500| 3|
| D|2018-11-10 00:00:00|1541804400| 2|
| E|2018-12-20 00:15:00|1545261300| 1|
| F|2018-12-30 00:30:00|1546126200| 2|
+------------+-------------------+----------+-----------------------+
EDIT: If you want to have days as the granularity, you could first convert your date column to the Date type. Example below, assuming that a window of five days means today and the four days before; if it should be today and the past five days, just remove the -1.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
n_days = 5
df = sqlContext.createDataFrame(
[
('A','2018-10-01 23:15:00'),
('B','2018-10-02 00:30:00'),
('C','2018-10-05 05:45:00'),
('D','2018-10-06 00:15:00'),
('E','2018-10-07 00:15:00'),
('F','2018-10-10 21:30:00')
],
("encounter_id","date")
)
df = df.withColumn('timestamp',F.to_date(F.col('date')).astype('Timestamp').cast("long"))
w = Window.orderBy('timestamp').rangeBetween(-60*60*24*(n_days-1),0)
df = df.withColumn('encounters_past_n_days',F.count('encounter_id').over(w))
df.show()
Output:
+------------+-------------------+----------+----------------------+
|encounter_id| date| timestamp|encounters_past_n_days|
+------------+-------------------+----------+----------------------+
| A|2018-10-01 23:15:00|1538344800| 1|
| B|2018-10-02 00:30:00|1538431200| 2|
| C|2018-10-05 05:45:00|1538690400| 3|
| D|2018-10-06 00:15:00|1538776800| 3|
| E|2018-10-07 00:15:00|1538863200| 3|
| F|2018-10-10 21:30:00|1539122400| 3|
+------------+-------------------+----------+----------------------+