Pyspark create a summary table with calculated values - apache-spark-sql

I have a data frame that looks like this:
+--------------------+---------------------+-------------+------------+-----+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|total_amount|isDay|
+--------------------+---------------------+-------------+------------+-----+
| 2019-01-01 09:01:00|  2019-01-01 08:53:20|          1.5|        2.00| true|
| 2019-01-01 21:59:59|  2019-01-01 21:18:59|          2.6|        5.00|false|
| 2019-01-01 10:01:00|  2019-01-01 08:53:20|          1.5|        2.00| true|
| 2019-01-01 22:59:59|  2019-01-01 21:18:59|          2.6|        5.00|false|
+--------------------+---------------------+-------------+------------+-----+
and I want to create a summary table which calculates the trip_rate for all the night trips and all the day trips (total_amount column divided by trip_distance). So the end result should look like this:
+------------+-----------+
| day_night  | trip_rate |
+------------+-----------+
| Day        |      1.33 |
| Night      |      1.92 |
+------------+-----------+
Here is what I'm trying to do:
df2 = spark.createDataFrame(
    [
        ('2019-01-01 09:01:00', '2019-01-01 08:53:20', '1.5', '2.00', 'true'),   # day
        ('2019-01-01 21:59:59', '2019-01-01 21:18:59', '2.6', '5.00', 'false'),  # night
        ('2019-01-01 10:01:00', '2019-01-01 08:53:20', '1.5', '2.00', 'true'),   # day
        ('2019-01-01 22:59:59', '2019-01-01 21:18:59', '2.6', '5.00', 'false'),  # night
    ],
    ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance', 'total_amount', 'day_night']  # add your column labels here
)
day_trip_rate = df2.where(df2.day_night == 'Day').withColumn("trip_rate", F.sum("total_amount")/F.sum("trip_distance"))
night_trip_rate = df2.where(df2.day_night == 'Night').withColumn("trip_rate", F.sum("total_amount")/F.sum("trip_distance"))
I don't believe I'm even approaching it the right way, and I'm getting this error:
raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: "grouping expressions sequence is empty, and 'tpep_pickup_datetime' is not an aggregate function.
Can someone help me know how to approach this to get that summary table?

from pyspark.sql import functions as F
from pyspark.sql.functions import *
df2.groupBy("day_night").agg(F.round(F.sum("total_amount")/F.sum("trip_distance"),2).alias('trip_rate'))\
.withColumn("day_night", F.when(col("day_night")=="true", "Day").otherwise("Night")).show()
+---------+---------+
|day_night|trip_rate|
+---------+---------+
| Day| 1.33|
| Night| 1.92|
+---------+---------+
Without rounding off:
df2.groupBy("day_night").agg(F.sum("total_amount")/F.sum("trip_distance")).alias('trip_rate')\
.withColumn("day_night", F.when(col("day_night")=="true", "Day").otherwise("Night")).show()
(You have day_night in the df2 construction code but isDay in the displayed table; I'm treating the field name as day_night here.)
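If you need to derive the day/night flag from the pickup time rather than a precomputed isDay/day_night column, here is a minimal sketch; it assumes "day" means a pickup hour between 6 and 19, so adjust the boundaries to your own definition:
df3 = df2.withColumn(
    "day_night",
    F.when(F.hour(F.to_timestamp("tpep_pickup_datetime")).between(6, 19), "Day")  # assumed day window: 06:00-19:59
     .otherwise("Night")
)
df3.groupBy("day_night").agg(F.round(F.sum("total_amount")/F.sum("trip_distance"), 2).alias("trip_rate")).show()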

Related

How to apply condition on a spark dataframe as per need?

I have a Spark dataframe with the sample data below.
+--------------+--------------+
| item_cd | item_nbr |
+--------------+--------------+
|20-10767-58V| 98003351|
|20-10087-58V| 87003872|
|20-10087-58V| 97098411|
|20-10i72-YTW| 99003351|
|27-1o121-YTW| 89659352|
|27-10991-YTW| 98678411|
| At81kk00| 98903458|
| Avp12225| 85903458|
| Akb12226| 99003458|
| Ahh12829| 98073458|
| Aff12230| 88803458|
| Ar412231| 92003458|
| Aju12244| 98773458|
+--------------+--------------+
I want to write a condition such that each item_cd containing a hyphen (-) is left unchanged, and each item_cd without a hyphen (-) gets 4 trailing 0's appended. Then put the rows that are duplicated on both columns (item_cd, item_nbr) into one dataframe and the unique rows into another dataframe in pyspark.
Could anyone please help me with this in pyspark?
Here is how it could be done:
import pyspark.sql.functions as F
from pyspark.sql import Window

data = [("20-10767-58V", "98003351"), ("20-10087-58V", "87003872"), ("At81kk00", "98903458"), ("Ahh12829", "98073458"), ("20-10767-58V", "98003351")]
cols = ["item_cd", "item_nbr"]
df = spark.createDataFrame(data, cols)
df.show()

# append four trailing zeros only to codes without a hyphen
df = df.withColumn("item_cd", F.when(~df.item_cd.contains("-"), F.concat(df.item_cd, F.lit("0000"))).otherwise(df.item_cd))
df.show()

unique_df = df.select("*").distinct()
unique_df.show()

# count occurrences of each (item_cd, item_nbr) pair and keep the ones seen more than once
w = Window.partitionBy(df.columns)
duplicate_df = df.select("*", F.count("*").over(w).alias("cnt"))\
    .where("cnt > 1")\
    .drop("cnt")
duplicate_df.show()
Input df (added duplicate):
+------------+--------+
| item_cd|item_nbr|
+------------+--------+
|20-10767-58V|98003351|
|20-10087-58V|87003872|
| At81kk00|98903458|
| Ahh12829|98073458|
|20-10767-58V|98003351|
+------------+--------+
Unique df:
+------------+--------+
| item_cd|item_nbr|
+------------+--------+
|Ahh128290000|98073458|
|20-10767-58V|98003351|
|20-10087-58V|87003872|
|At81kk000000|98903458|
+------------+--------+
Duplicates df:
+------------+--------+
| item_cd|item_nbr|
+------------+--------+
|20-10767-58V|98003351|
|20-10767-58V|98003351|
+------------+--------+
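Note that distinct() keeps one copy of every row, including rows that appear multiple times. If "unique" is meant as rows that occur exactly once, a sketch under that interpretation reuses the same window:
unique_only_df = df.select("*", F.count("*").over(w).alias("cnt"))\
    .where("cnt = 1")\
    .drop("cnt")
unique_only_df.show()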

Create new column with fuzzy-score across two string columns in the same dataframe

I'm trying to calculate a fuzzy score (preferable partial_ratio score) across two columns in the same dataframe.
| column1     | column2     |
| ----------- | ----------- |
| emmett holt | holt        |
| greenwald   | christopher |
It would need to look something like this:
| column1     | column2     | partial_ratio |
| ----------- | ----------- | ------------- |
| emmett holt | holt        | 100           |
| greenwald   | christopher | 22            |
| schaefer    | schaefer    | 100           |
With the help of another question on this website, I worked towards the following code:
compare=pd.MultiIndex.from_product([ dataframe['column1'],dataframe ['column2'] ]).to_series()
def metrics(tup):
    return pd.Series([fuzz.partial_ratio(*tup)], ['partial_ratio'])
df['partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(x['original_title'], x['title']), axis=1)
But the problem already starts with the first line of the code that returns the following error notification:
Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
You can say I'm kind of stuck here so any advice on this is appreciated!
You need a UDF to use fuzzywuzzy:
from fuzzywuzzy import fuzz
import pyspark.sql.functions as F
@F.udf
def fuzzyudf(original_title, title):
    return fuzz.partial_ratio(original_title, title)
df2 = df.withColumn('partial_ratio', fuzzyudf('column1', 'column2'))
df2.show()
+-----------+-----------+-------------+
| column1| column2|partial_ratio|
+-----------+-----------+-------------+
|emmett holt| holt| 100|
| greenwald|christopher| 22|
+-----------+-----------+-------------+
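A UDF declared with a bare @F.udf returns a string column by default. If you want the score as an integer (e.g. for numeric filtering or sorting), the return type can be declared explicitly; here is a small sketch of that variant (fuzzyudf_int is just an illustrative name):
from fuzzywuzzy import fuzz
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

@F.udf(returnType=IntegerType())
def fuzzyudf_int(original_title, title):
    # same logic as above, but Spark now treats the result as an integer column
    return fuzz.partial_ratio(original_title, title)

df2 = df.withColumn('partial_ratio', fuzzyudf_int('column1', 'column2'))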

Aggregate while dropping duplicates in pyspark

I want to groupby-aggregate a pyspark dataframe while removing duplicates (keeping the last value) based on another column of this dataframe.
In summary, I would like to apply a dropDuplicates to a GroupedData object. So, for each group, I could keep only one row by some column, dynamically.
Example
The straightforward group aggregation, for the dataframe below, would be:
from pyspark.sql import functions

dataframe = spark.createDataFrame(
    [
        (1, "2020-01-01", 1, 1),
        (2, "2020-01-01", 2, 1),
        (3, "2020-01-02", 1, 1),
        (2, "2020-01-02", 1, 1)
    ],
    ("id", "ts", "feature", "h3")
).withColumn("ts", functions.col("ts").cast("timestamp"))
# +---+-------------------+-------+---+
# | id| ts|feature| h3|
# +---+-------------------+-------+---+
# | 1|2020-01-01 00:00:00| 1| 1|
# | 2|2020-01-01 00:00:00| 2| 1|
# | 3|2020-01-02 00:00:00| 1| 1|
# | 2|2020-01-02 00:00:00| 1| 1|
# +---+-------------------+-------+---+
aggregated = dataframe.groupby(
    "h3",
    functions.window(
        timeColumn="ts",
        windowDuration="3 days",
        slideDuration="1 day",
    )
).agg(
    functions.sum("feature")
)
aggregated.show(truncate=False)
resulting in the following dataframe:
+---+------------------------------------------+------------+
|h3 |window |sum(feature)|
+---+------------------------------------------+------------+
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|3 |
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|5 |
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|5 |
|1 |[2020-01-02 00:00:00, 2020-01-05 00:00:00]|2 |
+---+------------------------------------------+------------+
The problem
I want the aggregation to use only the latest state of each id. In this case, id=2 has been updated to feature=1 at ts=2020-01-02 00:00:00, so all aggregations with a base timestamp later than 2020-01-02 00:00:00 should use only this state of column feature for id=2. The expected aggregated dataframe is:
+---+------------------------------------------+------------+
|h3 |window |sum(feature)|
+---+------------------------------------------+------------+
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|3 |
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|3 |
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|3 |
|1 |[2020-01-02 00:00:00, 2020-01-05 00:00:00]|2 |
+---+------------------------------------------+------------+
How can I do this with pyspark?
Update
I have assumed that a MapType variable should not have duplicate keys in Spark. With that assumption, I thought I could aggregate the column creating a map id -> feature and then just aggregate the map values with sum (or whatever the final aggregation should be).
So I did:
aggregated = dataframe.groupby(
    "h3",
    functions.window(
        timeColumn="ts",
        windowDuration="3 days",
        slideDuration="1 day",
    )
).agg(
    functions.map_from_entries(
        functions.collect_list(
            functions.struct("id", "feature")
        )
    ).alias("id_feature")
)
aggregated.show(truncate=False)
But then I've found that maps can have duplicate keys:
+---+------------------------------------------+--------------------------------+
|h3 |window |id_feature |
+---+------------------------------------------+--------------------------------+
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|[1 -> 1, 2 -> 2, 3 -> 1, 2 -> 1]|
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|[1 -> 1, 2 -> 2, 3 -> 1, 2 -> 1]|
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|[1 -> 1, 2 -> 2] |
|1 |[2020-01-02 00:00:00, 2020-01-05 00:00:00]|[3 -> 1, 2 -> 1] |
+---+------------------------------------------+--------------------------------+
so it doesn't solve my problem; instead, it surfaced another one: when using the display function in a Databricks notebook, the MapType column is shown without the duplicated keys.
First, you can find the latest record for each id and time window, and then join the original dataframe against those latest records.
from pyspark.sql.functions import window, col, max, sum

time_window = window(timeColumn="ts", windowDuration="3 days", slideDuration="1 day")
# df here is the question's input dataframe
df2 = df.groupBy("h3", time_window, "id").agg(max("ts").alias("latest"))

df2.alias("a").join(df.alias("b"), (col("a.id") == col("b.id")) & (col("a.latest") == col("b.ts")), "left") \
    .select("a.*", "feature") \
    .groupBy("h3", "window") \
    .agg(sum("feature")) \
    .orderBy("window") \
    .show(truncate=False)
Then, the result is the same as your expected one.
+---+------------------------------------------+------------+
|h3 |window |sum(feature)|
+---+------------------------------------------+------------+
|1 |[2019-12-29 00:00:00, 2020-01-01 00:00:00]|3 |
|1 |[2019-12-30 00:00:00, 2020-01-02 00:00:00]|3 |
|1 |[2019-12-31 00:00:00, 2020-01-03 00:00:00]|3 |
|1 |[2020-01-01 00:00:00, 2020-01-04 00:00:00]|2 |
+---+------------------------------------------+------------+
Since you are using Spark 2.4+, one way you can try is to use the Spark SQL aggregate function; see below:
aggregated = dataframe.groupby(
    "h3",
    functions.window(
        timeColumn="ts",
        windowDuration="3 days",
        slideDuration="1 day",
    )
).agg(
    functions.sort_array(functions.collect_list(
        functions.struct("ts", "id", "feature")
    ), False).alias("id_feature")
)
I added the ts field to the structs collected by functions.collect_list and used functions.sort_array to sort the list by ts in descending order (to keep the latest record when duplicates exist). In the following aggregate function, we set the zero_value to a named_struct with two fields: ids (a MapType) to cache all processed ids, and total to do the sum only when a new id does not already exist in the cached ids.
aggregated.selectExpr("h3", "window", """
    aggregate(
        id_feature,
        /* zero_value */
        (map() as ids, 0L as total),
        /* merge */
        (acc, y) -> named_struct(
            /* add y.id into the ids map */
            'ids', map_concat(acc.ids, map(y.id, 1)),
            /* add to total only when y.id doesn't exist in acc.ids map */
            'total', acc.total + IF(acc.ids[y.id] is null, y.feature, 0)
        ),
        /* finish: take only acc.total, discard the acc.ids map */
        acc -> acc.total
    ) as id_feature
""").show()
+---+--------------------+----------+
| h3| window|id_feature|
+---+--------------------+----------+
| 1|[2020-01-01 00:00...| 3|
| 1|[2019-12-31 00:00...| 3|
| 1|[2019-12-30 00:00...| 3|
| 1|[2020-01-02 00:00...| 2|
+---+--------------------+----------+
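For reference, the aggregate higher-order function simply folds the array: it starts from the zero value, applies the merge lambda to every element, and then applies the finish lambda to the accumulator. A minimal, self-contained illustration on a plain integer array (unrelated to the data above):
spark.sql("SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x) AS total").show()
# +-----+
# |total|
# +-----+
# |    6|
# +-----+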

Pyspark dataframe - Illegal values appearing in the column?

So I have a table (sample)
I'm using the pyspark dataframe APIs to filter out the 'NOC's that have never won a gold medal, and here's the code I wrote.
First part of my code
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SQLContext(sc)
df1 = spark.read.format("csv").options(header = 'true').load("D:\\datasets\\athlete_events.csv")
df = df1.na.replace('NA', '-')
countgdf = gdf.groupBy('NOC').agg(count('Medal').alias('No of Gold medals')).select('NOC').show()
It generates this output:
+---+
|NOC|
+---+
|POL|
|JAM|
|BRA|
|ARM|
|MOZ|
|JOR|
|CUB|
|FRA|
|ALG|
|BRN|
+---+
only showing top 10 rows
The next part of the code is something like
allgdf = df.select('NOC').distinct()
This displays the output:
+-----------+
| NOC|
+-----------+
| DeRuyter|
| POL|
| Russia|
| JAM|
| BUR|
| BRA|
| ARM|
| MOZ|
| CUB|
| JOR|
| Sweden|
| FRA|
| ALG|
| SOM|
| IVB|
|Philippines|
| BRN|
| MAL|
| COD|
| FSM|
+-----------+
Notice the values that are longer than 3 characters? Those are supposed to be values of the 'Team' column, but I'm not sure why they are showing up in the 'NOC' column. It's hard to figure out why this is happening, i.e. why illegal values appear in the column.
When I write the final code
final = allgdf.subtract(countgdf).show()
The same thing happens: illegal values appear in the final dataframe's column.
Any help would be appreciated. Thanks.
You should specify a delimiter for your CSV file. By default Spark uses a comma (,) as the separator.
This can be done, for example, with:
.option("delimiter", ";")

Get distinct rows by creation date

I am working with a dataframe like this:
DeviceNumber | CreationDate | Name
1001 | 1.1.2018 | Testdevice
1001 | 30.06.2019 | Device
1002 | 1.1.2019 | Lamp
I am using Databricks and pyspark for the ETL process. How can I reduce the dataframe so that I have only a single row per "DeviceNumber", and that row is the one with the highest "CreationDate"? In this example I want the result to look like this:
DeviceNumber | CreationDate | Name
1001 | 30.06.2019 | Device
1002 | 1.1.2019 | Lamp
You can create an additional dataframe with DeviceNumber and its latest/max CreationDate.
import pyspark.sql.functions as psf

max_df = df\
    .groupBy('DeviceNumber')\
    .agg(psf.max('CreationDate').alias('max_CreationDate'))
and then join max_df with the original dataframe.
joining_condition = [ df.DeviceNumber == max_df.DeviceNumber, df.CreationDate == max_df.max_CreationDate ]
df.join(max_df,joining_condition,'left_semi').show()
A left_semi join is useful when you want the second dataframe only as a lookup and don't need any of its columns in the result.
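One caveat, assuming CreationDate is stored as a dd.MM.yyyy string as in the sample: psf.max would then compare text rather than dates, so it is safer to parse the column first, as sketched here:
df = df.withColumn('CreationDate', psf.to_date('CreationDate', 'dd.MM.yyyy'))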
You can use PySpark windowing functionality:
from pyspark.sql.window import Window
from pyspark.sql import functions as f
# make sure that creation is a date data-type
df = df.withColumn('CreationDate', f.to_timestamp('CreationDate', format='dd.MM.yyyy'))
# partition on device and get a row number by (descending) date
win = Window.partitionBy('DeviceNumber').orderBy(f.col('CreationDate').desc())
df = df.withColumn('rownum', f.row_number().over(win))
# finally take the first row in each group
df.filter(df['rownum']==1).select('DeviceNumber', 'CreationDate', 'Name').show()
+------------+------------+------+
|DeviceNumber|CreationDate| Name|
+------------+------------+------+
| 1002| 2019-01-01| Lamp|
| 1001| 2019-06-30|Device|
+------------+------------+------+