Sort an alphanumeric column in pandas and replace it with the original column of the dataset [duplicate] - pandas

I have a data frame like this:
print(df)
0 1 2
0 354.7 April 4.0
1 55.4 August 8.0
2 176.5 December 12.0
3 95.5 February 2.0
4 85.6 January 1.0
5 152 July 7.0
6 238.7 June 6.0
7 104.8 March 3.0
8 283.5 May 5.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
As you can see, the months are not in calendar order. So I created a second column holding the month number corresponding to each month (1-12). From there, how can I sort this data frame according to calendar month order?

Use sort_values to sort the df by a specific column's values:
In [18]:
df.sort_values('2')
Out[18]:
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152.0 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
If you want to sort by two columns, pass a list of column labels to sort_values with the column labels ordered according to sort priority. If you use df.sort_values(['2', '0']), the result would be sorted by column 2 then column 0. Granted, this does not really make sense for this example because each value in df['2'] is unique.
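To illustrate, a minimal, self-contained sketch of a two-column sort (demo is a made-up frame, not the OP's data):
import pandas as pd
demo = pd.DataFrame({'2': [2.0, 1.0, 1.0], '0': [9.5, 3.2, 1.1]})
# Sorted by column '2' first; the tie between the two 1.0 rows is broken by column '0'
print(demo.sort_values(['2', '0']))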

I tried the solutions above without success, so I found a different solution that works for me. Passing ascending=False orders the dataframe in descending order; by default it is True. I am using Python 3.6.6 and pandas 0.23.4.
final_df = df.sort_values(by=['2'], ascending=False)
You can find more details in the pandas documentation.

Using the column name worked for me.
sorted_df = df.sort_values(by=['Column_name'], ascending=True)

pandas' sort_values does the work.
There are various parameters one can pass, such as ascending (bool or list of bool):
Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
As ascending is the default and the OP's goal is to sort ascending, one doesn't need to specify that parameter (see the last note below for descending order), so one can use one of the following approaches:
Performing the operation in-place, and keeping the same variable name. This requires one to pass inplace=True as follows:
df.sort_values(by=['2'], inplace=True)
# or
df.sort_values(by = '2', inplace = True)
# or
df.sort_values('2', inplace = True)
If doing the operation in-place is not a requirement, one can assign the change (sort) to a variable:
With the same name of the original dataframe, df as
df = df.sort_values(by=['2'])
With a different name, such as df_new, as
df_new = df.sort_values(by=['2'])
All of the previous operations give the following output:
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
Finally, one can reset the index with pandas.DataFrame.reset_index, to get the following
df.reset_index(drop = True, inplace = True)
# or
df = df.reset_index(drop = True)
[Out]:
0 1 2
0 85.6 January 1.0
1 95.5 February 2.0
2 104.8 March 3.0
3 354.7 April 4.0
4 283.5 May 5.0
5 238.7 June 6.0
6 152 July 7.0
7 55.4 August 8.0
8 212.7 September 9.0
9 249.6 October 10.0
10 278.8 November 11.0
11 176.5 December 12.0
A one-liner that sorts ascending and resets the index:
df = df.sort_values(by=['2']).reset_index(drop = True)
[Out]:
0 1 2
0 85.6 January 1.0
1 95.5 February 2.0
2 104.8 March 3.0
3 354.7 April 4.0
4 283.5 May 5.0
5 238.7 June 6.0
6 152 July 7.0
7 55.4 August 8.0
8 212.7 September 9.0
9 249.6 October 10.0
10 278.8 November 11.0
11 176.5 December 12.0
Notes:
If one is not doing the operation in-place, forgetting to assign the result back to a variable may lead one to not get the expected result.
There are strong opinions on using inplace; it is worth reading up on the trade-offs before relying on it.
This assumes that column 2 is numeric. If it is a string, one will have to convert it:
Using pandas.to_numeric
df['2'] = pd.to_numeric(df['2'])
Using pandas.Series.astype
df['2'] = df['2'].astype(float)
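As an aside on that conversion: if the column might contain strings that are not valid numbers, pandas.to_numeric accepts errors='coerce' to turn them into NaN instead of raising an error. A small self-contained sketch:
import pandas as pd
s = pd.Series(['1.0', '2.0', 'oops'])
print(pd.to_numeric(s, errors='coerce'))  # 'oops' becomes NaN instead of raising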
If one wants in descending order, one needs to pass ascending=False as
df = df.sort_values(by=['2'], ascending=False)
# or
df.sort_values(by = '2', ascending=False, inplace=True)
[Out]:
0 1 2
2 176.5 December 12.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
1 55.4 August 8.0
5 152 July 7.0
6 238.7 June 6.0
8 283.5 May 5.0
0 354.7 April 4.0
7 104.8 March 3.0
3 95.5 February 2.0
4 85.6 January 1.0

Just as another solution:
Instead of creating the second column, you can categorize your string data (month names) and sort by that, like this:
df.rename(columns={1:'month'},inplace=True)
df['month'] = pd.Categorical(df['month'],categories=['December','November','October','September','August','July','June','May','April','March','February','January'],ordered=True)
df = df.sort_values('month',ascending=False)
This gives you the data ordered by month name, in the order you specified while creating the Categorical object.
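A variant of the same idea: the calendar module can supply the month names in calendar order, so a plain ascending sort puts January first. A self-contained sketch (the column name 'month' follows the rename above; month_name is locale-dependent, English in the default locale):
import calendar
import pandas as pd
demo = pd.DataFrame({'month': ['April', 'January', 'December']})
# list(calendar.month_name)[1:] is ['January', ..., 'December']
demo['month'] = pd.Categorical(demo['month'], categories=list(calendar.month_name)[1:], ordered=True)
print(demo.sort_values('month'))  # January, April, December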

Just adding some more operations on the data. Suppose we have a dataframe df; we can do several operations to get the desired output:
ID cost tax label
1 216590 1600 test
2 523213 1800 test
3 250 1500 experiment
(df['label'].value_counts().to_frame().reset_index()).sort_values('label', ascending=False)
will give the sorted label counts as a dataframe:
index label
0 test 2
1 experiment 1
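Worth noting: value_counts already returns counts in descending order by default, so the extra sort_values is often redundant. A minimal sketch:
import pandas as pd
demo = pd.DataFrame({'label': ['test', 'test', 'experiment']})
print(demo['label'].value_counts())  # already sorted by count, descending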

This worked for me
df.sort_values(by='Column_name', inplace=True, ascending=False)

You probably need to reset the index after sorting:
df = df.sort_values('2')
df = df.reset_index(drop=True)

Here is the signature of sort_values according to the pandas documentation:
DataFrame.sort_values(by, axis=0,
                      ascending=True,
                      inplace=False,
                      kind='quicksort',
                      na_position='last',
                      ignore_index=False, key=None)
In this case it would be:
df.sort_values(by=['2'])
API Reference pandas.DataFrame.sort_values
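Since the signature above includes key (added in pandas 1.1.0), the month names from the question could even be sorted without the helper column. A sketch with a made-up frame, assuming string column labels as in the accepted answer:
import calendar
import pandas as pd
demo = pd.DataFrame({'0': [354.7, 55.4, 85.6], '1': ['April', 'August', 'January']})
month_num = {m: i for i, m in enumerate(calendar.month_name) if m}  # 'January' -> 1, ...
print(demo.sort_values('1', key=lambda s: s.map(month_num)))  # January, April, August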

Just adding a few more insights:
df = raw_df['2'].sort_values()  # sorts only the one column (i.e. '2') and returns it as a Series
but,
df = raw_df.sort_values(by=['2'], ascending=False)  # sorts the whole df in descending order on the basis of column '2'

If you want to sort a column in a dynamic, non-alphabetical order, and a plain pd.sort_values() call won't give you that order, you can try the solution below.
Problem: sort column "col1" in the sequence ['A', 'C', 'D', 'B']
import pandas as pd
import numpy as np
## Sample DataFrame ##
df = pd.DataFrame({'col1': ['A', 'B', 'D', 'C', 'A']})
>>> df
col1
0 A
1 B
2 D
3 C
4 A
## Solution ##
conditions = []
values = []
for i, j in enumerate(['A', 'C', 'D', 'B']):
    conditions.append(df['col1'] == j)
    values.append(i)
df['col1_Num'] = np.select(conditions, values)
df.sort_values(by='col1_Num', inplace=True)
>>> df
col1 col1_Num
0 A 0
4 A 0
3 C 1
2 D 2
1 B 3
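The same custom order can be built without the loop by mapping each value to its position. A sketch of an alternative, not the answerer's code:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'B', 'D', 'C', 'A']})
order = {v: i for i, v in enumerate(['A', 'C', 'D', 'B'])}
df['col1_Num'] = df['col1'].map(order)  # A -> 0, C -> 1, D -> 2, B -> 3
print(df.sort_values(by='col1_Num'))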

This one worked for me:
df = df.sort_values(by=[2])
whereas:
df = df.sort_values(by=['2'])
did not.
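Whether the integer 2 or the string '2' is the right label depends on how the frame was built; checking df.columns removes the guesswork. A small sketch with a made-up frame:
import pandas as pd
demo = pd.DataFrame([[354.7, 'April', 4.0]])
print(demo.columns)  # RangeIndex -> integer labels, so sort_values(by=[2]) works
demo.columns = ['0', '1', '2']
print(demo.columns)  # Index(['0', '1', '2']) -> string labels, so sort_values(by=['2'])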

Example:
Assume you have a column with values 1 and 0 and you want to separate and use only one value; then:
# 'furniture' is one of the columns in the csv file (pan is the pandas import alias)
allrooms = data.groupby('furniture')['furniture'].agg('count')
allrooms
myrooms1 = pan.DataFrame(allrooms, columns=['furniture'], index=[1])
myrooms2 = pan.DataFrame(allrooms, columns=['furniture'], index=[0])
print(myrooms1); print(myrooms2)
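As a side note, a single group's count can be pulled straight from the aggregated Series with .loc, without building a new DataFrame. A self-contained sketch with made-up data:
import pandas as pd
data = pd.DataFrame({'furniture': [1, 0, 1, 1, 0]})
allrooms = data.groupby('furniture')['furniture'].agg('count')
print(allrooms.loc[1])  # count of rows where furniture == 1
print(allrooms.loc[0])  # count of rows where furniture == 0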

Related

Cumulative Deviation of 2 Columns in Pandas DF

I have a rather simple request and have not found a suitable solution online. I have a DF that looks like the one below, and I need to compute the cumulative deviation shown in a new column. My DF looks like this:
year month Curr Yr LT Avg
0 2022 1 667590.5985 594474.2003
1 2022 2 701655.5967 585753.1173
2 2022 3 667260.5368 575550.6112
3 2022 4 795338.8914 562312.5309
4 2022 5 516510.1103 501330.4306
5 2022 6 465717.9192 418087.1358
6 2022 7 366100.4456 344854.2453
7 2022 8 355089.157 351539.9371
8 2022 9 468479.4396 496831.2979
9 2022 10 569234.4156 570767.1723
10 2022 11 719505.8569 594368.6991
11 2022 12 670304.78 576495.7539
And, I need the cumulative deviation new column in this DF to look like this:
Cum Dev
0.122993392
0.160154637
0.159888559
0.221628609
0.187604073
0.178089327
0.16687643
0.152866293
0.129326033
0.114260993
0.124487107
0.128058305
In Excel, with the data in columns Z3:Z14 and AA3:AA14, the calculation for the first row would be =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1, for the next row =SUM(Z$3:Z4)/SUM(AA$3:AA4)-1, and so on, the last row being =SUM(Z$3:Z14)/SUM(AA$3:AA14)-1.
Thank you kindly for your help,
You can divide the cumulative sums of those 2 columns element-wise, and then subtract 1 at the end:
>>> (df["Curr Yr"].cumsum() / df["LT Avg"].cumsum()) - 1
0 0.122993
1 0.160155
2 0.159889
3 0.221629
4 0.187604
5 0.178089
6 0.166876
7 0.152866
8 0.129326
9 0.114261
10 0.124487
11 0.128058
dtype: float64
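To attach that result as the new column the question asks for, assign it back (same computation as above, continuing the answer's df):
df["Cum Dev"] = df["Curr Yr"].cumsum() / df["LT Avg"].cumsum() - 1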

Groupby count between multiple date ranges since last-contact date

I have customer data, and campaign data recording each time we have contacted them. We don't contact every customer in every campaign, so each customer's last-contacted (touched) date varies. How can I achieve a groupby count, but between two dates that vary for each cust_id?
import pandas as pd
import io
tempCusts=u"""cust_id, lastBookedDate
1, 10-02-2022
2, 20-04-2022
3, 25-07-2022
4, 10-06-2022
5, 10-05-2022
6, 10-08-2022
7, 01-01-2021
8, 02-06-2022
9, 11-12-2021
10, 10-05-2022
"""
tempCamps=u"""cust_id,campaign_id,campaignMonth,campaignYear,touch,campaignDate,campaignNum
1,CN2204,4,2022,1,01-04-2022,1
2,CN2204,4,2022,1,01-04-2022,1
3,CN2204,4,2022,1,01-04-2022,1
4,CN2204,4,2022,1,01-04-2022,1
5,CN2204,4,2022,1,01-04-2022,1
6,CN2204,4,2022,1,01-04-2022,1
7,CN2204,4,2022,1,01-04-2022,1
8,CN2204,4,2022,1,01-04-2022,1
9,CN2204,4,2022,1,01-04-2022,1
10,CN2204,4,2022,1,01-04-2022,1
1,CN2205,5,2022,1,01-05-2022,2
2,CN2205,5,2022,1,01-05-2022,2
3,CN2205,5,2022,1,01-05-2022,2
4,CN2205,5,2022,1,01-05-2022,2
5,CN2205,5,2022,1,01-05-2022,2
6,CN2206,6,2022,1,01-06-2022,3
7,CN2206,6,2022,1,01-06-2022,3
8,CN2206,6,2022,1,01-06-2022,3
9,CN2206,6,2022,1,01-06-2022,3
10,CN2206,6,2022,1,01-06-2022,3"""
campaignDets = pd.read_csv(io.StringIO(tempCamps), parse_dates=True)
customerDets = pd.read_csv(io.StringIO(tempCusts), parse_dates=True)
Campaign details (campaignDets) lists any customer who was part of a campaign; some (most) appear in multiple campaigns as they continue to be contacted. cust_id is therefore duplicated, but not within a single campaign. The customer details (customerDets) show if/when each customer last had an appointment.
cust_id 1: lastBooked 10-02-2022, so touchCount since then == 2
cust_id 2: lastBooked 20-04-2022, so touchCount since then == 1
...
This is what I'm attempting to achieve:
desired=u"""cust_id,lastBookedDate, touchesSinceBooked
1,10-02-2022,2
2,20-04-2022,1
3,25-07-2022,0
4,10-06-2022,0
5,10-05-2022,0
6,10-08-2022,0
7,01-01-2021,3
8,02-06-2022,0
9,11-12-2021,3
10,10-05-2022,1
"""
desiredDf = pd.read_csv(io.StringIO(desired), parse_dates=True)
>>> desiredDf
cust_id lastBookedDate touchesSinceBooked
0 1 10-02-2022 2
1 2 20-04-2022 1
2 3 25-07-2022 0
3 4 10-06-2022 0
4 5 10-05-2022 0
5 6 10-08-2022 0
6 7 01-01-2021 2
7 8 02-06-2022 0
8 9 11-12-2021 3
9 10 10-05-2022 1
I've attempted to work around the guidance given on not-dissimilar problems, but these either rely on a fixed date to group on, or haven't worked within the constraints here (unless I'm missing something). I have not yet been able to cross-relate previous questions, and I'm sure that the simplicity of what I'm after can't really require some awful groupby split by user into a list of dfs, pulling them back out and looping through a max() of each user_id's campaignDate. Surely not. Can I apply pd.merge_asof within this?
Examples I've taken advice from that are along the same lines:
44010314/count-number-of-rows-groupby-within-a-groupby-between-two-dates-in-pandas-datafr
31772863/count-number-of-rows-between-two-dates-by-id-in-a-pandas-groupby-dataframe/31773404
Constraints?
None. Am happy to use any available library and/or helper cols.
Neither data source/df is especially large (custDets is ~120k rows, campaignDets ~600k), and I have time, so optimised approaches, though welcome, are secondary to actual solutions.
First, format as datetime:
customerDets['lastBookedDate'] = pd.to_datetime(customerDets[' lastBookedDate'], dayfirst=True)  # note the leading space in ' lastBookedDate': it comes from the CSV header
campaignDets['campaignDate'] = pd.to_datetime(campaignDets['campaignDate'], dayfirst=True)
Then, filter on when the campaign date is greater than last booked:
df = campaignDets[(campaignDets['campaignDate']>campaignDets['cust_id'].map(customerDets.set_index('cust_id')['lastBookedDate']))]
Finally, add your new column:
customerDets['touchesSinceBooked'] = customerDets['cust_id'].map(df.groupby('cust_id')['touch'].sum()).fillna(0)
You'll get
cust_id lastBookedDate touchesSinceBooked
0 1 10-02-2022 2.0
1 2 20-04-2022 1.0
2 3 25-07-2022 0.0
3 4 10-06-2022 0.0
4 5 10-05-2022 0.0
5 6 10-08-2022 0.0
6 7 01-01-2021 2.0
7 8 02-06-2022 0.0
8 9 11-12-2021 2.0
9 10 10-05-2022 1.0
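If you prefer whole-number counts rather than the 2.0 / 1.0 floats shown above, cast after filling the gaps (a small extension, not part of the original answer):
customerDets['touchesSinceBooked'] = customerDets['touchesSinceBooked'].astype(int)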

Rolling Rows in pandas.DataFrame

I have a dataframe that looks like this:
year  month  valueCounts
2019      1    73.411285
2019      2    53.589128
2019      3    71.103842
2019      4    79.528084
I want valueCounts column's values to be rolled like:
year  month  valueCounts
2019      1    53.589128
2019      2    71.103842
2019      3    79.528084
2019      4          NaN
I can do this by dropping the first row of the dataframe and assigning NaN to the last row, but that doesn't look efficient. Is there a simpler method?
Thanks.
Assuming your dataframe is already sorted, use shift:
df['valueCounts'] = df['valueCounts'].shift(-1)
print(df)
# Output
year month valueCounts
0 2019 1 53.589128
1 2019 2 71.103842
2 2019 3 79.528084
3 2019 4 NaN
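If the real data spans several years and the roll should not leak across year boundaries, the same shift can be applied per group (a hedged variant, assuming a 'year' column as above):
df['valueCounts'] = df.groupby('year')['valueCounts'].shift(-1)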

How to get the groupby nth row directly in the row as an item?

I have Date, Time, Open, High, Low, Close data on a minute basis for a stock, arranged in ascending order (date-wise). I want to make a new column and, for every day (for each of its rows), insert the previous day's price, i.e. the price in the second row of the last date. So, for instance, I have put the price 18812.3 against 11 Jan because the last date was 10 Jan and its second row has a price of 18812.3. Similarly, I have done it for the day before yesterday. I tried using nth of a groupby object, but for that I have to create a groupby object. The code below produces a new DataFrame, whereas I would like to create a column directly holding the desired values.
test = bn_futures.groupby('Date')['Open','High','Low','Close'].nth(1).reset_index()
Try the following (the comments explain each step):
# Convert Date to datetime64 and set it as index
df = df.assign(Date=pd.to_datetime(df['Date'], dayfirst=True)).set_index('Date')
# Find second value for each day
prices = df.groupby(level=0)['Open'].nth(1).squeeze()
# Find last row for each day
mask = ~df.index.duplicated(keep='last')
# Create new columns
df.loc[mask, 'price at yesterday'] = prices.shift(1)
df.loc[mask, 'price 2d ago'] = prices.shift(2)
Output:
>>> df
Open price at yesterday price 2d ago
Date
2015-01-09 1 NaN NaN
2015-01-09 2 NaN NaN
2015-01-09 3 NaN NaN
2015-01-10 4 NaN NaN
2015-01-10 5 NaN NaN
2015-01-10 6 2.0 NaN
2015-01-11 7 NaN NaN
2015-01-11 8 NaN NaN
2015-01-11 9 5.0 2.0
Set up an MRE:
df = pd.DataFrame({'Date': ['09-01-2015', '09-01-2015', '09-01-2015',
'10-01-2015', '10-01-2015', '10-01-2015',
'11-01-2015', '11-01-2015', '11-01-2015'],
'Open': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

Pandas - calculate rolling average of group excluding current row

For example:
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
'Date' : [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5],
'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
This works to calculate the rolling average, inclusive of the current row:
df['avg'] = df.groupby(['Platoon'])['Casualties'].transform(lambda x: x.rolling(2, 1).mean())
Which gives:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 2.5
A 3 5 4.5
A 4 7 6.0
......
What I want to get is:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 1.0
A 3 5 2.5
A 4 7 4.5
......
I suspect I can use shift here but I can't figure it out!
You need shift with bfill:
df.groupby(['Platoon'])['Casualties'].apply(lambda x: x.rolling(2, 1).mean().shift().bfill())
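To assign the result back as a column, transform keeps the original index aligned with the frame (a sketch; 'avg' is the column name used earlier in the question):
df['avg'] = df.groupby(['Platoon'])['Casualties'].transform(lambda x: x.rolling(2, 1).mean().shift().bfill())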