Second max value in a column for each group in pandas

US_Sales=pd.read_excel("C:\\Users\\xxxxx\\Desktop\\US_Sales.xlsx")
US_Sales
US_Sales.State.nlargest(2,'Sales').groupby(['Sales'])
I want the second-highest sales value for each city.

No sample data was provided, so I simulated some.
Sorting, shifting, and taking the first value gives the result you want:
df = pd.DataFrame([{"state":"Florida","sales":[22,4,5,6,7,8]},
{"state":"California","sales":[99,9,10,11]}]).explode("sales").reset_index(drop=True)
df.sort_values(["state","sales"], ascending=[1,0]).groupby("state").agg({"sales":lambda x: x.shift(-1).values[0]})
            sales
state
California     11
Florida         8
Utility function:
import functools
def nlargest(x, n=2):
    return x.sort_values(ascending=False).shift((n-1)*-1).values[0]
df.groupby("state", as_index=False).agg({"sales":functools.partial(nlargest, n=2)})

You can sort the Sales column in descending order, then take the 2nd row in each group with pandas.core.groupby.GroupBy.nth(). Note that n in nth() is zero-indexed.
US_Sales.sort_values(['State', 'Sales'], ascending=[True, False]).groupby('State').nth(1).reset_index()
You can also choose the largest 2 values then keep the last by various methods:
largest2 = US_Sales.sort_values(['State', 'Sales'], ascending=[True, False]).groupby('State')['Sales'].nlargest(2)
# Method 1
# Drop duplicates by `State`, keep the last one
largest2.reset_index().drop('level_1', axis=1).drop_duplicates(['State'], keep='last')
# Method 2
# Group by `State`, keep the last one
largest2.groupby('State').tail(1).reset_index().drop('level_1', axis=1)
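As a self-contained illustration of the sort-then-pick idea above, here is a minimal sketch on made-up data. The `sales` frame and its values are assumptions, since the question's spreadsheet was not shared. `cumcount` is used here instead of `nth` only to sidestep differences in how `nth` shapes its result across pandas versions.

```python
import pandas as pd

# Hypothetical data standing in for the US_Sales spreadsheet
sales = pd.DataFrame({
    "State": ["Florida"] * 6 + ["California"] * 4,
    "Sales": [22, 4, 5, 6, 7, 8, 99, 9, 10, 11],
})

# Sort each state's sales from largest to smallest
ranked = sales.sort_values(["State", "Sales"], ascending=[True, False])

# cumcount() numbers the rows inside each group from 0, so 1 is the second row
second = ranked[ranked.groupby("State").cumcount() == 1].reset_index(drop=True)
print(second)
```

If the largest value can occur more than once in a group, this returns that same value again; dropping duplicates on `["State", "Sales"]` before ranking is one way to get the second distinct value instead.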

Related

Need to sort the pivot table based on the columns passed in the index attribute. It's a MultiIndex

I can't sort the pivot table in ascending order based on the columns passed in the index attribute.
When the df is printed, 'Deepthy' comes first in the Name column; I need 'aarathy' to come first.
Please see the image of the printed output.
df = pd.DataFrame({'Name': ['aarathy', 'Deepthy','aarathy','aarathy'],'Ship': ['everest', 'Oasis of the Seas','everest','everest'], 'Tracking': ['TESTTRACK003', 'TESTTRACK008', 'TESTTRACK009','TESTTRACK005'],'Bag': ['123', '127', '129','121'],})
df=pd.pivot_table(df,index=["Name","Ship","Tracking","Bag"]).sort_index(axis=1,ascending=True)
I tried passing sort_values and sort_index(axis=1,ascending=True), but it doesn't work.
You need to convert the values to lowercase; for the first level of sorting, use the key parameter:
# helper column so that your code runs
df['new'] = 1
df=(pd.pivot_table(df,index=["Name","Ship","Tracking","Bag"])
.sort_index(level=0,ascending=True, key=lambda x: x.str.lower()))
print (df)
                                             new
Name    Ship              Tracking     Bag
aarathy everest           TESTTRACK003 123     1
                          TESTTRACK005 121     1
                          TESTTRACK009 129     1
Deepthy Oasis of the Seas TESTTRACK008 127     1
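To make the effect of the key parameter visible in isolation, the following sketch compares the default ordering with the case-insensitive one on a tiny MultiIndex. The Series `s` and its values are hypothetical and chosen only for illustration; the `key` argument of `sort_index` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical two-level index with mixed-case names
idx = pd.MultiIndex.from_tuples(
    [("Deepthy", "Oasis of the Seas"), ("aarathy", "everest")],
    names=["Name", "Ship"],
)
s = pd.Series([1, 2], index=idx)

# Default sort is case-sensitive: uppercase letters order before lowercase ones
default_order = s.sort_index().index.get_level_values("Name").tolist()

# The key function is applied to each index level before comparison
folded_order = (
    s.sort_index(level="Name", key=lambda lvl: lvl.str.lower())
    .index.get_level_values("Name")
    .tolist()
)

print(default_order)
print(folded_order)
```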

DataFrame append to DataFrame row by row and reset if condition is matched

I have a DataFrame which I want to slice into many DataFrames by adding rows by one until the sum of column Score of the DataFrame is greater than 50,000. Once that condition is met, then I want a new slice to begin.
Here is an example of what this might look like:
Sum Score cumulatively, floor divide it by 50,000, and shift it up one cell (since you want each group to be > 50,000 and not < 50,000).
import pandas as pd
import numpy as np
# Generating DataFrame with random data
df = pd.DataFrame(np.random.randint(1,60000,15))
# Creating new column that's a cumulative sum with each
# value floor divided by 50000
df['groups'] = df[0].cumsum() // 50000
# Values shifted up one and missing values filled with the maximum value
# so that values at the bottom are included in the last DataFrame slice
df.groups = df.groups.shift(-1, fill_value=df.groups.max())
Then you can use pandas.DataFrame.groupby in a list comprehension to return a list of the split DataFrames.
df_list = [df_slice for _, df_slice in df.groupby(['groups'])]
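Note that the floor-division approach compares the running total against fixed multiples of 50,000 rather than restarting the total for each slice, so it only approximates the reset behaviour described in the question. If a strict reset is needed, an explicit loop is one option. The sketch below is only an illustration under assumptions: the helper name `split_on_running_total`, the column name `Score`, and the sample scores are hypothetical.

```python
import pandas as pd

def split_on_running_total(df, column, threshold):
    """Split df into consecutive slices; a slice ends once its own sum exceeds threshold."""
    labels = []
    running = 0
    label = 0
    for value in df[column]:
        labels.append(label)
        running += value
        if running > threshold:
            # Condition met: the next row starts a new slice with a fresh total
            label += 1
            running = 0
    groups = pd.Series(labels, index=df.index)
    return [part for _, part in df.groupby(groups)]

# Hypothetical scores
scores = pd.DataFrame({"Score": [30000, 30000, 10000, 45000, 20000]})
slices = split_on_running_total(scores, "Score", 50000)
for part in slices:
    print(part["Score"].tolist())
```

A Python-level loop is slower than the vectorised version, so this trade-off is mainly worthwhile when the exact reset semantics matter.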

Update Column Value for previous entries of row value, based on new entry

I have a dataframe that is updated monthly, with a new row for each employee.
If an employee decides to change their gender (for example, employee 20215 changed from M to F in April 2022), I want all previous entries for employee number 20215 to be switched to F as well.
This is for a database with roughly 15 million entries, and multiple such changes every month, so I was hoping for a scalable solution (I cannot simply put df['Gender'] = 'F', for example).
Since we didn't receive a df or any code from you, I needed to generate something myself in order to test it. Please provide enough code and a sample next time as well.
Here is the generated df, in case someone comes up with a better answer:
import pandas as pd, numpy as np
length=100
df = pd.DataFrame({'ID': np.random.randint(1001,1020,length),
'Ticket': np.random.randint(length),
'salary_grade' : np.random.randint(0,10,size=length),
'date': np.arange(length),
'genre': 'M' })
df['date']=pd.to_numeric(df['date'])
df['date']=pd.to_datetime(df['date'],dayfirst=True,unit='D',origin='15.04.2022')
That is the base DF; now I needed to simulate some gender changes:
test_id=df.groupby(['ID'])['genre'].count().idxmax() # gives me the employee with the most entries
test_id
df[df['ID']==test_id].loc[:,'genre'] # all entries for test_id, for a test change and for checking later
df[df['ID']==test_id] # rows of test_id, whose indexes are used for the gender change
id_lst=[]
for idx in df[df['ID']==test_id].index:
    if idx>28: # <-- change this value for your generated df (roughly the middle of the list)
        id_lst.append(idx) # collects the indexes where the gender change will happen
df.loc[id_lst,'genre']='F' # applying a gender change
Finally, to answer your question:
finder=df.groupby(['ID']).agg({'genre' : lambda x: len(list(pd.unique(x)))>1 , 'date' : 'min'}) # returns True for every ID with more than one genre
finder[finder['genre']] # returns the IDs that meet the condition above
Next steps...
Now, with the ID, you just need to determine whether the change is M-->F or F-->M (new_genre) and assign the new genre to ID_found (an int or a list).
df.loc[df['ID'].isin(np.atleast_1d(ID_found)),'genre']=new_genre # select rows by the ID column, not by index label
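If the most recent entry for each employee is taken to be the correct one, the detection and assignment steps can be collapsed into a single vectorised pass. This is a minimal sketch under that assumption; the column names follow the generated df above (`ID`, `date`, `genre`), and the sample rows are made up.

```python
import pandas as pd

# Hypothetical monthly records; employee 20215 switches from M to F in April
df = pd.DataFrame({
    "ID": [20215, 20215, 20215, 30001, 30001],
    "date": pd.to_datetime(
        ["2022-02-01", "2022-03-01", "2022-04-01", "2022-03-01", "2022-04-01"]
    ),
    "genre": ["M", "M", "F", "M", "M"],
})

# For every row, look up the value from that ID's latest-dated row.
# transform returns a Series indexed like the sorted frame, so the
# assignment below realigns it with the original row order.
latest = df.sort_values("date").groupby("ID")["genre"].transform("last")
df["genre"] = latest
print(df)
```

Because this avoids Python-level loops, it should scale reasonably to millions of rows, though that has not been measured here.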

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) of the 10 top values of a specific column for every year in my dataframe.
so far I ran the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values for each year of that specific column and I lose the other columns. How can I do this operation and having the corresponding values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?
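We usually use head after sort_values:
df_sorted = df.sort_values('totaldemand', ascending=False)
# group the sorted frame by its own index so the keys line up with the rows;
# no column is selected, so all columns are kept
df = df_sorted.groupby(df_sorted.index.year).head(10)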
nlargest can be applied to each group, passing the column in which to look for the largest values. So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
Get the index of your query and use it to select rows from your original df:
# the result has a (year, timestamp) MultiIndex; keep only the timestamp level
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that effect; I can't test now without any test data)
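Since none of the answers came with test data, here is a small runnable sketch of the index-lookup idea. The frame, the extra `temperature` column, and the choice of the top 2 rows per year are assumptions made for brevity, and the lookup relies on the timestamps in the index being unique.

```python
import pandas as pd

# Hypothetical data with a DatetimeIndex spanning two years
df = pd.DataFrame(
    {
        "totaldemand": [5, 9, 7, 3, 8, 6],
        "temperature": [20, 21, 22, 23, 24, 25],
    },
    index=pd.to_datetime(
        [
            "2020-01-01 00:00", "2020-06-01 12:00", "2020-12-31 23:00",
            "2021-01-01 00:00", "2021-06-01 12:00", "2021-12-31 23:00",
        ]
    ),
)

# nlargest per year yields a (year, timestamp) MultiIndex; keep the timestamps
top_idx = (
    df.groupby(df.index.year)["totaldemand"]
    .nlargest(2)
    .index.get_level_values(-1)
)

# Look the timestamps up in the original frame to recover every column
top_rows = df.loc[top_idx]
print(top_rows)
```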

Taking second last observed row

I am new to pandas. I know how to use drop_duplicates to take the last observed row in a dataframe. Is there any way to use it to take only the second-last observed row, or any other way of doing it?
For example:
I would like to go from
df = pd.DataFrame(data={'A':[1,1,1,2,2,2],'B':[1,2,3,4,5,6]}) to
df1 = pd.DataFrame(data={'A':[1,2],'B':[2,5]})
The idea is to group the data by the duplicated column and check the length of each group. If a group has two or more rows, you can take its second element; if it has only one row, the value is not duplicated, so take index 0, the only element in the group.
df.groupby(df['A']).apply(lambda x : x.iloc[1] if len(x) >= 2 else x.iloc[0])
The first answer I think was on the right track, but possibly not quite right. I have extended your data to include 'A' groups with two observations, and an 'A' group with one observation, for the sake of completeness.
import pandas as pd
df = pd.DataFrame(data={'A':[1,1,1,2,2,2, 3, 3, 4],'B':[1,2,3,4,5,6, 7, 8, 9]})
def user_apply_func(x):
    if len(x) == 2:
        return x.iloc[0]
    if len(x) > 2:
        return x.iloc[-2]
    return
df.groupby('A').apply(user_apply_func)
Out[7]:
     A    B
A
1    1    2
2    2    5
3    3    7
4  NaN  NaN
For your reference the apply method automatically passes the data frame as the first argument.
Also, as you are always going to be reducing each group of data to a single observation you could also use the agg method (aggregate). apply is more flexible in terms of the length of the sequences that can be returned whereas agg must reduce the data to a single value.
df.groupby('A').agg(user_apply_func)
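Another option that avoids a custom function is to chain tail and head on the groups. This is a sketch rather than a drop-in replacement: unlike the function above, it keeps the only row of a single-observation group instead of returning NaN, which may or may not be what you want.

```python
import pandas as pd

df = pd.DataFrame(data={'A': [1, 1, 1, 2, 2, 2, 3, 3, 4],
                        'B': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# tail(2) keeps at most the last two rows of each group;
# head(1) then keeps the first of those, i.e. the second-last row
second_last = df.groupby('A').tail(2).groupby('A').head(1)
print(second_last)
```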