Get distinct rows by creation date - dataframe

I am working with a dataframe like this:
DeviceNumber | CreationDate | Name
1001 | 1.1.2018 | Testdevice
1001 | 30.06.2019 | Device
1002 | 1.1.2019 | Lamp
I am using databricks and pyspark to do the ETL process. How can I reduce the dataframe in a way that I will only have a single row per "DeviceNumber" and that this will be the row with the highest "CreationDate"? In this example I want the result to look like this:
DeviceNumber | CreationDate | Name
1001 | 30.06.2019 | Device
1002 | 1.1.2019 | Lamp

You can create a additional dataframe with DeviceNumber & it's latest/max CreationDate.
import pyspark.sql.functions as psf
max_df = df\
.groupBy('DeviceNumber')\
.agg(psf.max('CreationDate').alias('max_CreationDate'))
and then join max_df with original dataframe.
joining_condition = [ df.DeviceNumber == max_df.DeviceNumber, df.CreationDate == max_df.max_CreationDate ]
df.join(max_df,joining_condition,'left_semi').show()
left_semi join is useful when you want second dataframe as lookup and does need any column from second dataframe.

You can use PySpark windowing functionality:
from pyspark.sql.window import Window
from pyspark.sql import functions as f
# make sure that creation is a date data-type
df = df.withColumn('CreationDate', f.to_timestamp('CreationDate', format='dd.MM.yyyy'))
# partition on device and get a row number by (descending) date
win = Window.partitionBy('DeviceNumber').orderBy(f.col('CreationDate').desc())
df = df.withColumn('rownum', f.row_number().over(win))
# finally take the first row in each group
df.filter(df['rownum']==1).select('DeviceNumber', 'CreationDate', 'Name').show()
------------+------------+------+
|DeviceNumber|CreationDate| Name|
+------------+------------+------+
| 1002| 2019-01-01| Lamp|
| 1001| 2019-06-30|Device|
+------------+------------+------+

Related

How to apply condition on a spark dataframe as per need?

I am having a spark dataframe with below sample data.
+--------------+--------------+
| item_cd | item_nbr |
+--------------+--------------+
|20-10767-58V| 98003351|
|20-10087-58V| 87003872|
|20-10087-58V| 97098411|
|20-10i72-YTW| 99003351|
|27-1o121-YTW| 89659352|
|27-10991-YTW| 98678411|
| At81kk00| 98903458|
| Avp12225| 85903458|
| Akb12226| 99003458|
| Ahh12829| 98073458|
| Aff12230| 88803458|
| Ar412231| 92003458|
| Aju12244| 98773458|
+--------------+--------------+
I want to write a condition like for each item_cd which are having hypen(-) should do nothing and for which not having hypen(-) should add 4 trailing 0's to each item_cd. Then take duplicates on both columns(item_cd, item_nbr) into to one dataframe and unique into other dataframe in pyspark.
could anyone please me with this in pyspark?
Here is how it could be done:
import pyspark.sql.functions as F
from pyspark.sql import Window
data = [("20-10767-58V", "98003351"), ("20-10087-58V", "87003872"), ("At81kk00", "98903458"), ("Ahh12829", "98073458"), ("20-10767-58V", "98003351")]
cols = ["item_cd", "item_nbr"]
df = spark.createDataFrame(data, cols)
df.show()
df = df.withColumn("item_cd", when(~df.item_cd.contains("-"), F.concat(df.item_cd, F.lit("0000"))).otherwise(df.item_cd))
df.show()
unique_df = df.select("*").distinct()
unique_df.show()
w = Window.partitionBy(df.columns)
duplicate_df = df.select("*", F.count("*").over(w).alias("cnt"))\
.where("cnt > 1")\
.drop("cnt")
duplicate_df.show()
Input df (added duplicate):
+------------+--------+
| item_cd|item_nbr|
+------------+--------+
|20-10767-58V|98003351|
|20-10087-58V|87003872|
| At81kk00|98903458|
| Ahh12829|98073458|
|20-10767-58V|98003351|
+------------+--------+
Unique df:
+------------+--------+
| item_cd|item_nbr|
+------------+--------+
|Ahh128290000|98073458|
|20-10767-58V|98003351|
|20-10087-58V|87003872|
|At81kk000000|98903458|
+------------+--------+
Duplicates df:
+------------+--------+
| item_cd|item_nbr|
+------------+--------+
|20-10767-58V|98003351|
|20-10767-58V|98003351|
+------------+--------+

Create new column with fuzzy-score across two string columns in the same dataframe

I'm trying to calculate a fuzzy score (preferable partial_ratio score) across two columns in the same dataframe.
| column1 | column2|
| -------- | -------------- |
| emmett holt| holt
| greenwald| christopher
It would need to look something like this:
| column1 | column2|partial_ratio|
| -------- | -------------- |-----------|
| emmett holt| holt|100|
| greenwald| christopher|22|
|schaefer|schaefer|100|
With the help of another question on this website, I worked towards the following code:
compare=pd.MultiIndex.from_product([ dataframe['column1'],dataframe ['column2'] ]).to_series()
def metrics (tup):
return pd.Series([fuzz.partial_ratio(*tup)], ['partial_ratio'])
df['partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(x['original_title'], x['title']), axis=1)
But the problem already starts with the first line of the code that returns the following error notification:
Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
You can say I'm kind of stuck here so any advice on this is appreciated!
You need a UDF to use fuzzywuzzy:
from fuzzywuzzy import fuzz
import pyspark.sql.functions as F
#F.udf
def fuzzyudf(original_title, title):
return fuzz.partial_ratio(original_title, title)
df2 = df.withColumn('partial_ratio', fuzzyudf('column1', 'column2'))
df2.show()
+-----------+-----------+-------------+
| column1| column2|partial_ratio|
+-----------+-----------+-------------+
|emmett holt| holt| 100|
| greenwald|christopher| 22|
+-----------+-----------+-------------+

Loading csv with pandas, wrong columns

I loaded a csv into a DataFrame with pandas.
The format is the following:
Timestamp | 1014.temperature | 1014.humidity | 1015.temperature | 1015.humidity ....
-------------------------------------------------------------------------------------
2017-... | 23.12 | 12.2 | 25.10 | 10.34 .....
The problem is that the '1014' or '1015' numbers are supposed to be ID's that are supposed to be in a special column.
I would like to end up with the following format for my DF:
TimeStamp | ID | Temperature | Humidity
-----------------------------------------------
. | | |
.
.
.
The CSV is tab separated.
Thanks in advance guys!
import pandas as pd
from io import StringIO
# create sample data frame
s = """Timestamp|1014.temperature|1014.humidity|1015.temperature|1015.humidity
2017|23.12|12.2|25.10|10.34"""
df = pd.read_csv(StringIO(s), sep='|')
df = df.set_index('Timestamp')
# split columns on '.' with list comprehension
l = [col.split('.') for col in df.columns]
# create multi index columns
df.columns = pd.MultiIndex.from_tuples(l)
# stack column level 0, reset the index and rename level_1
final = df.stack(0).reset_index().rename(columns={'level_1': 'ID'})
Timestamp ID humidity temperature
0 2017 1014 12.20 23.12
1 2017 1015 10.34 25.10

Pyspark dataframe - Illegal values appearing in the column?

So I have a table (sample)
I'm using pyspark dataframe APIs to filter out the 'NOC's that has never won a gold medal and here's the code I write
First part of my code
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SQLContext(sc)
df1 = spark.read.format("csv").options(header = 'true').load("D:\\datasets\\athlete_events.csv")
df = df1.na.replace('NA', '-')
countgdf = gdf.groupBy('NOC').agg(count('Medal').alias('No of Gold medals')).select('NOC').show()
It will generate the output
+---+
|NOC|
+---+
|POL|
|JAM|
|BRA|
|ARM|
|MOZ|
|JOR|
|CUB|
|FRA|
|ALG|
|BRN|
+---+
only showing top 10 rows
The next part of the code is something like
allgdf = df.select('NOC').distinct()
This display the output
+-----------+
| NOC|
+-----------+
| DeRuyter|
| POL|
| Russia|
| JAM|
| BUR|
| BRA|
| ARM|
| MOZ|
| CUB|
| JOR|
| Sweden|
| FRA|
| ALG|
| SOM|
| IVB|
|Philippines|
| BRN|
| MAL|
| COD|
| FSM|
+-----------+
Notice the values that are more than 3 characters? Those are supposed to be the values of the column 'Team' but I'm not sure why those values are getting displayed in 'NOC' column. It's hard to figure out why this is happening i.e illegal values in the column.
When I write the final code
final = allgdf.subtract(countgdf).show()
The same happens as illegal values appear in the final dataframe column.
Any help would be appericiated. Thanks.
You should specify a delimiter for your CSV file. By default Spark is using comma separators (,)
This can be done, for example, with :
.option("delimiter",";")

Joining two pyspark dataframes by unique values in a column

Let's say, I have two pyspark dataframes, users and shops. A few sample rows for both the dataframes are shown below.
users dataframe:
+---------+-------------+---------+
| idvalue | day-of-week | geohash |
+---------+-------------+---------+
| id-1 | 2 | gcutjjn |
| id-1 | 3 | gcutjjn |
| id-1 | 5 | gcutjht |
+---------+-------------+---------+
shops dataframe
+---------+-----------+---------+
| shop-id | shop-name | geohash |
+---------+-----------+---------+
| sid-1 | kfc | gcutjjn |
| sid-2 | mcd | gcutjhq |
| sid-3 | starbucks | gcutjht |
+---------+-----------+---------+
I need to join both of these dataframes on the geohash column. I can do a naive equi-join for sure, but the users dataframe is huge, containing billions of rows, and geohashes are likely to repeat, within and across idvalues. So, I was wondering if there's a way to perform joins on unique geohashes in the users dataframe and geohashes in the shops dataframe. If we can do that, then it's easy to replicate the shops entries for matching geohashes in resultant dataframe.
Probably it can be achieved with a pandas udf, where I would perform a groupby on users.idvalue, do a join with shops within the udf by only taking the first row from the group (because all ids are same anyway within the group), and creating a one row dataframe. Logically it feels like this should work, but not sure sure on the performance aspect as udf(s) are usually slower than spark native transformations. Any ideas are welcome.
You said that your Users dataframe is huge and that "geohashes are likely to repeat, within and across idvalues". You didn't referred however if there might be duplicated geohashes in your shops dataframe.
If there are no repeated hashes in the latter, I think that a simple join would solve your problem:
val userDf = Seq(("id-1",2,"gcutjjn"),("id-2",2,"gcutjjn"),("id-1",3,"gcutjjn"),("id-1",5,"gcutjht")).toDF("idvalue","day_of_week","geohash")
val shopDf = Seq(("sid-1","kfc","gcutjjn"),("sid-2","mcd","gcutjhq"),("sid-3","starbucks","gcutjht")).toDF("shop_id","shop_name","geohash")
userDf.show
+-------+-----------+-------+
|idvalue|day_of_week|geohash|
+-------+-----------+-------+
| id-1| 2|gcutjjn|
| id-2| 2|gcutjjn|
| id-1| 3|gcutjjn|
| id-1| 5|gcutjht|
+-------+-----------+-------+
shopDf.show
+-------+---------+-------+
|shop_id|shop_name|geohash|
+-------+---------+-------+
| sid-1| kfc|gcutjjn|
| sid-2| mcd|gcutjhq|
| sid-3|starbucks|gcutjht|
+-------+---------+-------+
shopDf
.join(userDf,Seq("geohash"),"inner")
.groupBy($"geohash",$"shop_id",$"idvalue")
.agg(collect_list($"day_of_week").alias("days"))
.show
+-------+-------+-------+------+
|geohash|shop_id|idvalue| days|
+-------+-------+-------+------+
|gcutjjn| sid-1| id-1|[2, 3]|
|gcutjht| sid-3| id-1| [5]|
|gcutjjn| sid-1| id-2| [2]|
+-------+-------+-------+------+
If you have repeated hash values in your shops dataframe, a possible approach would be to remove those repeated hashes from your shops dataframe (if your requirements allow this), and then perform the same join operation.
val userDf = Seq(("id-1",2,"gcutjjn"),("id-2",2,"gcutjjn"),("id-1",3,"gcutjjn"),("id-1",5,"gcutjht")).toDF("idvalue","day_of_week","geohash")
val shopDf = Seq(("sid-1","kfc","gcutjjn"),("sid-2","mcd","gcutjhq"),("sid-3","starbucks","gcutjht"),("sid-4","burguer king","gcutjjn")).toDF("shop_id","shop_name","geohash")
userDf.show
+-------+-----------+-------+
|idvalue|day_of_week|geohash|
+-------+-----------+-------+
| id-1| 2|gcutjjn|
| id-2| 2|gcutjjn|
| id-1| 3|gcutjjn|
| id-1| 5|gcutjht|
+-------+-----------+-------+
shopDf.show
+-------+------------+-------+
|shop_id| shop_name|geohash|
+-------+------------+-------+
| sid-1| kfc|gcutjjn| << Duplicated geohash
| sid-2| mcd|gcutjhq|
| sid-3| starbucks|gcutjht|
| sid-4|burguer king|gcutjjn| << Duplicated geohash
+-------+------------+-------+
//Dataframe with hashes to exclude:
val excludedHashes = shopDf.groupBy("geohash").count.filter("count > 1")
excludedHashes.show
+-------+-----+
|geohash|count|
+-------+-----+
|gcutjjn| 2|
+-------+-----+
//Create a dataframe of shops without the ones with duplicated hashes
val cleanShopDf = shopDf.join(excludedHashes,Seq("geohash"),"left_anti")
cleanShopDf.show
+-------+-------+---------+
|geohash|shop_id|shop_name|
+-------+-------+---------+
|gcutjhq| sid-2| mcd|
|gcutjht| sid-3|starbucks|
+-------+-------+---------+
//Perform the same join operation
cleanShopDf.join(userDf,Seq("geohash"),"inner")
.groupBy($"geohash",$"shop_id",$"idvalue")
.agg(collect_list($"day_of_week").alias("days"))
.show
+-------+-------+-------+----+
|geohash|shop_id|idvalue|days|
+-------+-------+-------+----+
|gcutjht| sid-3| id-1| [5]|
+-------+-------+-------+----+
The code provided was written in Scala but it can be easily converted to Python.
Hope this helps!
This is an idea if it possible you used pyspark SQL to select distinct geohash and create to the tempory table. Then join from this table instead of dataframes.