Select oldest id for grouped Spark dataframe

Given a dataframe (df) with the following columns:
id,
created_date,
name
I need to ensure that all rows with the same name have the same id. I can create a mapping from old id to new id (selected at 'random' using max).
from pyspark.sql import functions as func

df.groupBy('name')\
    .agg(
        func.max('id').alias('new_id'),
        func.collect_set('id').alias('grouped_ids'))\
    .filter(func.size('grouped_ids') > 1)\
    .select(func.explode('grouped_ids').alias('old_id'), 'new_id')\
    .filter('new_id != old_id')
I can then left outer join this to the original df (on id = old_id) and swap the ids where a new_id is available (a sketch of this swap step follows the example below).
However, I need to ensure that the new_id selected is the one with the oldest created_date in the dataframe (rather than just selecting the max).
How best to go about this?
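One possible direction (a sketch, not necessarily the full answer): swap func.max('id') for the minimum of a (created_date, id) struct, so that within each name group the id attached to the earliest created_date wins. The mapping name below is just for illustration:
from pyspark.sql import functions as func

# sketch: structs compare field by field, so min(struct(created_date, id))
# picks the id that goes with the earliest created_date in each name group
mapping = (df.groupBy('name')
    .agg(
        func.min(func.struct('created_date', 'id')).getField('id').alias('new_id'),
        func.collect_set('id').alias('grouped_ids'))
    .filter(func.size('grouped_ids') > 1)
    .select(func.explode('grouped_ids').alias('old_id'), 'new_id')
    .filter('new_id != old_id'))
Note this still picks the earliest date within each name group; the example below spells out the stricter requirement of the earliest date anywhere in the dataframe among the grouped ids.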
e.g. Given the data
id, created_date, name
---
17a, 2019-01-05, Jeff
17a, 2019-01-03, Jeremy
d21, 2019-01-04, Jeremy
u45, 2019-01-04, Jeremy
d21, 2019-01-02, Scott
x22, 2019-01-01, Julian
Rows 2, 3 and 4 group on Jeremy, so they should all get the same id. Among the grouped ids (17a, d21, u45), the one with the oldest created_date anywhere in the dataframe is d21, because on row 5 its created_date is 2019-01-02. So d21 should be selected and applied to every row in the dataframe that carries one of the other grouped ids, and we end up with:
id, created_date, name
---
d21, 2019-01-05, Jeff
d21, 2019-01-03, Jeremy
d21, 2019-01-04, Jeremy
d21, 2019-01-04, Jeremy
d21, 2019-01-02, Scott
x22, 2019-01-01, Julian
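For the join-and-swap step described above, here is a minimal sketch (assuming the old_id/new_id mapping dataframe built earlier is called mapping; the name is only illustrative):
from pyspark.sql import functions as func

# left outer join the mapping on id = old_id, then keep new_id where one exists
result = (df.join(mapping, df.id == mapping.old_id, 'left_outer')
            .withColumn('id', func.coalesce('new_id', 'id'))
            .drop('old_id', 'new_id'))
As the example shows (17a is remapped to d21 via Jeremy), ids can chain across names, so the mapping may need to be applied repeatedly until nothing changes.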
UPDATE:
@Charles Du - Cheers, I tried your code but it didn't work out: the oldest id was selected from within the grouped names, not from the df as a whole, and the new_id was not applied throughout the df.
Result:
0 = {Row} Row(name='Scott', created_date='2019-01-02', new_ID='d21', id='d21', created_date='2019-01-02')
1 = {Row} Row(name='Julian', created_date='2019-01-01', new_ID='x22', id='x22', created_date='2019-01-01')
2 = {Row} Row(name='Jeremy', created_date='2019-01-03', new_ID='17a', id='17a', created_date='2019-01-03')
3 = {Row} Row(name='Jeremy', created_date='2019-01-03', new_ID='17a', id='d21', created_date='2019-01-04')
4 = {Row} Row(name='Jeremy', created_date='2019-01-03', new_ID='17a', id='u45', created_date='2019-01-04')
5 = {Row} Row(name='Jeff', created_date='2019-01-05', new_ID='17a', id='17a', created_date='2019-01-05')

My spitball here
from pyspark.sql import functions as F
new_df = df.groupBy('name').agg(F.min('created_date').alias('created_date'))
new_df = new_df.join(df, on=['name', 'created_date'], how='inner')
# This should give you a df with a single record for each name with the oldest ID.
new_df = new_df.withColumnRenamed('id', 'new_ID')
# you'll need to decide on a naming convention for your date column since you'll have two if you don't rename
res = new_df.join(df, on='name', how='inner')
that should match up your id with the oldest date.

Related

Need to sort the pivot table based on the columns passed in the index attribute. It's a MultiIndex

I can't sort the pivot table based on the columns passed in the index attribute in ascending order.
When the df is printed, 'Deepthy' comes first in the Name column; I need 'aarathy' to come first.
Please check the attached image of the printed output.
df = pd.DataFrame({'Name': ['aarathy', 'Deepthy', 'aarathy', 'aarathy'],
                   'Ship': ['everest', 'Oasis of the Seas', 'everest', 'everest'],
                   'Tracking': ['TESTTRACK003', 'TESTTRACK008', 'TESTTRACK009', 'TESTTRACK005'],
                   'Bag': ['123', '127', '129', '121']})
df=pd.pivot_table(df,index=["Name","Ship","Tracking","Bag"]).sort_index(axis=1,ascending=True)
I tried passing sort_values and sort_index(axis=1, ascending=True) but it doesn't work.
You need to convert the values to lowercase and, for the first level of sorting, use the key parameter:
# helper column so your pivot_table call runs
df['new'] = 1
df = (pd.pivot_table(df, index=["Name", "Ship", "Tracking", "Bag"])
        .sort_index(level=0, ascending=True, key=lambda x: x.str.lower()))
print(df)
                                            new
Name    Ship              Tracking     Bag
aarathy everest           TESTTRACK003 123    1
                          TESTTRACK005 121    1
                          TESTTRACK009 129    1
Deepthy Oasis of the Seas TESTTRACK008 127    1

Add a categorical column with three values assigned to each row in a PySpark df, then apply aggregate functions to 30 columns

I have a dataframe as a result of validation codes:
df = spark.createDataFrame(
    list(zip(['c_1', 'c_1', 'c_1', 'c_2', 'c_3', 'c_1', 'c_2', 'c_2'],
             ['valid', 'valid', 'invalid', 'missing', 'invalid', 'valid', 'valid', 'missing'],
             ['missing', 'valid', 'invalid', 'invalid', 'valid', 'valid', 'missing', 'missing'],
             ['invalid', 'valid', 'valid', 'missing', 'missing', 'valid', 'invalid', 'missing'])),
    ['clinic_id', 'name', 'phone', 'city'])
I counted the number of valid, invalid and missing values per column, grouped by clinic_id, using an aggregation in PySpark:
from pyspark.sql.functions import col, when, sum

agg_table = (
    df
    .groupBy('clinic_id')
    .agg(
        # name
        sum(when(col('name') == 'valid', 1).otherwise(0)).alias('validname')
        , sum(when(col('name') == 'invalid', 1).otherwise(0)).alias('invalidname')
        , sum(when(col('name') == 'missing', 1).otherwise(0)).alias('missingname')
        # phone
        , sum(when(col('phone') == 'valid', 1).otherwise(0)).alias('validphone')
        , sum(when(col('phone') == 'invalid', 1).otherwise(0)).alias('invalidphone')
        , sum(when(col('phone') == 'missing', 1).otherwise(0)).alias('missingphone')
        # city
        , sum(when(col('city') == 'valid', 1).otherwise(0)).alias('validcity')
        , sum(when(col('city') == 'invalid', 1).otherwise(0)).alias('invalidcity')
        , sum(when(col('city') == 'missing', 1).otherwise(0)).alias('missingcity')
    ))
display(agg_table)
output:
clinic_id validname invalidname missingname ... invalidcity missingcity
--------- --------- ----------- ----------- ... ----------- -----------
c_1 3 1 0 ... 1 0
c_2 1 0 2 ... 1 2
c_3 0 1 0 ... 0 1
The resulting aggregated table is fine, but it is not ideal for further analysis. I tried pivoting within PySpark to get something like the table below:
#note: counts below are just made up, not the actual count from above, but I hope you get what I mean.
clinic_id category name phone city
-------- ------- ---- ------- ----
c_1 valid 3 1 3
c_1 invalid 1 0 2
c_1 missing 0 2 3
c_2 valid 3 1 3
c_2 invalid 1 0 2
c_2 missing 0 2 3
c_3 valid 3 1 3
c_3 invalid 1 0 2
c_3 missing 0 2 3
I initially searched pivot/unpivot, but I learned it is called unstack in pyspark and I also came across mapping.
I tried the suggested approach in How to unstack dataset (using pivot)? but it is showing me only one column and I cannot get the desired result when I try applying it to my dataframe of 30 columns.
I also tried the following using the validated table/dataframe
from pyspark.sql.functions import expr

expression = ""
cnt = 0
for column in agg_table.columns:
    if column != 'clinc_id':
        cnt += 1
        expression += f"'{column}' , {column},"
exprs = f"stack({cnt}, {expression[:-1]}) as (Type,Value)"
unpivoted = agg_table.select('clinic_id', expr(exprs))
I get an error that just points to that line, possibly about the return value.
I also tried grouping the results by id and category, but that is where I am stuck. If I group by one of the aggregated columns, say validname, the aggregate only counts the values in that column and does not carry over to the other count columns. So I thought of inserting a column with .withColumn, assigning the three categories to each id, so that the aggregated counts could be grouped by id and category as in the table above, but I have not had any luck finding a solution for that either.
Also, maybe a SQL approach would be easier?
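On the SQL idea, a possible sketch (just one assumed route, using the original df above rather than the aggregated table) is to unpivot the three status columns with stack and then pivot the counts:
# unpivot name/phone/city into (column_name, status) pairs, then count per combination
long_df = df.selectExpr(
    "clinic_id",
    "stack(3, 'name', name, 'phone', phone, 'city', city) as (column_name, status)")

counts = (long_df
    .groupBy('clinic_id', 'status')
    .pivot('column_name', ['name', 'phone', 'city'])
    .count()
    .fillna(0))
counts.show()
This yields one row per clinic_id and status with a column per original field, which matches the layout sketched above.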
I found the right search phrase: "column to row in pyspark"
One of the suggested answers that fits my dataframe is this function:
from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
long_df = (to_long(df, ["clinic_id"])
           .withColumnRenamed("key", "column_names")
           .withColumnRenamed("val", "status"))
This created a dataframe of three columns: clinic_id, column_names and status (valid, invalid, missing).
Then I created my aggregated table grouped by clinic_id, status:
display(long_df.groupBy('clinic_id', 'status')
    .agg(
        sum(when(col('column_names') == 'name', 1).otherwise(0)).alias('name')
        , sum(when(col('column_names') == 'phone', 1).otherwise(0)).alias('phone')
        , sum(when(col('column_names') == 'city', 1).otherwise(0)).alias('city')
    ))
I got my intended table.

Create new column based on date column Pandas

I am trying to create a new column of students' grade levels based on their DOB. The cut-off dates for 1st grade would be 2014/09/02 - 2015/09/01. Is there a simple solution for this besides writing a long if/elif chain? Thanks.
Name   DOB
Sally  2011/06/20
Mike   2009/02/19
Kevin  2012/12/22
You can use pd.cut(), which also supports custom bins.
import pandas as pd
dob = {
'Sally': '2011/06/20',
'Mike': '2009/02/19',
'Kevin': '2012/12/22',
'Ron': '2009/09/01',
}
dob = pd.Series(dob).astype('datetime64[ns]').rename("DOB").to_frame()
grades = [
'2008/9/1',
'2009/9/1',
'2010/9/1',
'2011/9/1',
'2012/9/1',
'2013/9/1',
]
grades = pd.Series(grades).astype('datetime64[ns]')
dob['grade'] = pd.cut(dob['DOB'], grades, labels = [5, 4, 3, 2, 1])
print(dob.sort_values('DOB'))
DOB grade
Mike 2009-02-19 5
Ron 2009-09-01 5
Sally 2011-06-20 3
Kevin 2012-12-22 1
I sorted the data frame by date of birth, to show that oldest students are in the highest grades.
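If the September 1 cut-offs follow a fixed yearly pattern, the bin edges could also be generated instead of typed out; a small sketch (an addition, assuming the same dob frame as above):
import pandas as pd

# yearly September 1 cut-offs from 2008 to 2013, the same edges as the list above
grades = pd.date_range('2008-09-01', periods=6, freq=pd.DateOffset(years=1))
dob['grade'] = pd.cut(dob['DOB'], grades, labels=[5, 4, 3, 2, 1])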

second max value in column for each group in pandas

US_Sales=pd.read_excel("C:\\Users\\xxxxx\\Desktop\\US_Sales.xlsx")
US_Sales
US_Sales.State.nlargest(2,'Sales').groupby(['Sales'])
I want the second highest sales for each state.
No sample data was provided, so I simulated some. Sort, shift and take the first value to get the result you want:
import pandas as pd

df = pd.DataFrame([{"state": "Florida", "sales": [22, 4, 5, 6, 7, 8]},
                   {"state": "California", "sales": [99, 9, 10, 11]}]).explode("sales").reset_index(drop=True)
df.sort_values(["state", "sales"], ascending=[1, 0]).groupby("state").agg({"sales": lambda x: x.shift(-1).values[0]})
            sales
state
California     11
Florida         8
utility function
import functools

def nlargest(x, n=2):
    return x.sort_values(ascending=False).shift((n-1)*-1).values[0]

df.groupby("state", as_index=False).agg({"sales": functools.partial(nlargest, n=2)})
You can sort the Sales column descending, then take the 2nd row in each group with pandas.core.groupby.GroupBy.nth(). Note that n in nth() is zero-indexed.
US_Sales.sort_values(['State', 'Sales'], ascending=[True, False]).groupby('State').nth(1).reset_index()
You can also choose the largest 2 values then keep the last by various methods:
largest2 = US_Sales.sort_values(['State', 'Sales'], ascending=[True, False]).groupby('State')['Sales'].nlargest(2)
# Method 1
# Drop duplicates by `State`, keep the last one
largest2.reset_index().drop('level_1', axis=1).drop_duplicates(['State'], keep='last')
# Method 2
# Group by `State`, keep the last one
largest2.groupby('State').tail(1).reset_index().drop('level_1', axis=1)

Create a new column in Pandas dataframe by arbitrary function over rows

I have a Pandas dataframe. I want to add a column to the dataframe, where the value in the new column is dependent on other values in the row.
What is an efficient way to go about this?
Example
Begin
Start with this dataframe (let's call it df), and a dictionary of people's roles.
first last
--------------------------
0 Jon McSmith
1 Jennifer Foobar
2 Dan Raizman
3 Alden Lowe
role_dict = {
"Raizman": "sales",
"McSmith": "analyst",
"Foobar": "analyst",
"Lowe": "designer"
}
End
We end up with a dataframe where we've 'iterated' over each row, used the last column to look up the value in our role_dict, and added that value to each row as role.
first last role
--------------------------------------
0 Jon McSmith analyst
1 Jennifer Foobar analyst
2 Dan Raizman sales
3 Alden Lowe designer
One solution is to use the Series map function, since the role lookup is a dictionary:
df['role'] = df.loc[:, 'last'].map(role_dict)
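One caveat worth noting: last values missing from role_dict come back as NaN from map. If a default is wanted, a fillna can be chained (a sketch; the 'unknown' label is just an example):
# fall back to a placeholder role for last names not present in role_dict
df['role'] = df['last'].map(role_dict).fillna('unknown')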
Try this using merge:
import pandas as pd

df = pd.DataFrame([["Jon", "McSmith"],
                   ["Jennifer", "Foobar"],
                   ["Dan", "Raizman"],
                   ["Alden", "Lowe"]], columns=["first", "last"])
role_dict = {
    "Raizman": "sales",
    "McSmith": "analyst",
    "Foobar": "analyst",
    "Lowe": "designer"
}
df_2 = pd.DataFrame(role_dict.items(), columns=["last", "role"])
result = pd.merge(df, df_2, on=["last"], how="left")
output
first last role
0 Jon McSmith analyst
1 Jennifer Foobar analyst
2 Dan Raizman sales
3 Alden Lowe designer