Groupby and sum based on column name - pandas

I have a dataframe:
df = pd.DataFrame({
'BU': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
'Line_Item': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'201901': [100, 120, 130, 200, 190, 210],
'201902': [100, 120, 130, 200, 190, 210],
'201903': [200, 250, 450, 120, 180, 190],
'202001': [200, 250, 450, 120, 180, 190],
'202002': [200, 250, 450, 120, 180, 190],
'202003': [200, 250, 450, 120, 180, 190]
})
The columns represent years and months respectively. I would like to sum the month columns into a new column for each year. The result should look like the following:
df = pd.DataFrame({
'BU': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
'Line_Item': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'201901': [100, 120, 130, 200, 190, 210],
'201902': [100, 120, 130, 200, 190, 210],
'201903': [200, 250, 450, 120, 180, 190],
'202001': [200, 250, 450, 120, 180, 190],
'202002': [200, 250, 450, 120, 180, 190],
'202003': [200, 250, 450, 120, 180, 190],
'2019': [400, 490, 710, 520, 560, 610],
'2020': [600, 750, 1350, 360, 540, 570]
})
My actual dataset has a number of years, each with 12 months. I'm hoping not to have to add the columns manually.

Try creating a DataFrame that contains just the month columns, and convert the column names with to_datetime:
data_df = df.iloc[:, 2:]
data_df.columns = pd.to_datetime(data_df.columns, format='%Y%m')
2019-01-01 2019-02-01 2019-03-01 2020-01-01 2020-02-01 2020-03-01
0 100 100 200 200 200 200
1 120 120 250 250 250 250
2 130 130 450 450 450 450
3 200 200 120 120 120 120
4 190 190 180 180 180 180
5 210 210 190 190 190 190
Then resample along the columns to sum by year, and rename the columns to just the year values:
data_df = (
data_df.resample('Y', axis=1).sum().rename(columns=lambda c: c.year)
)
2019 2020
0 400 600
1 490 750
2 710 1350
3 520 360
4 560 540
5 610 570
Then join back to the original DataFrame:
new_df = df.join(data_df)
new_df:
BU Line_Item 201901 201902 201903 202001 202002 202003 2019 2020
0 AA Revenues 100 100 200 200 200 200 400 600
1 AA EBT 120 120 250 250 250 250 490 750
2 AA Expenses 130 130 450 450 450 450 710 1350
3 BB Revenues 200 200 120 120 120 120 520 360
4 BB EBT 190 190 180 180 180 180 560 540
5 BB Expenses 210 210 190 190 190 190 610 570
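As a side note, resampling along axis=1 has been deprecated in recent pandas releases. A sketch of an alternative that avoids datetime conversion entirely: group the month columns by their four-character year prefix (this assumes all year-month columns follow the YYYYMM naming shown above) and sum via a double transpose:

```python
import pandas as pd

df = pd.DataFrame({
    'BU': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'Line_Item': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    '201901': [100, 120, 130, 200, 190, 210],
    '201902': [100, 120, 130, 200, 190, 210],
    '201903': [200, 250, 450, 120, 180, 190],
    '202001': [200, 250, 450, 120, 180, 190],
    '202002': [200, 250, 450, 120, 180, 190],
    '202003': [200, 250, 450, 120, 180, 190],
})

# Group the month columns by their first four characters (the year) and sum.
# Transposing twice sidesteps the deprecated columns-axis groupby.
month_cols = df.columns[2:]
year_sums = df[month_cols].T.groupby(month_cols.str[:4]).sum().T
new_df = df.join(year_sums)
```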

Are the columns you are summing always the same? That is, are there always three 2019 columns with those same names, and three 2020 columns with those names? If so, you can just hardcode the new columns.
df['2019'] = df['201901'] + df['201902'] + df['201903']
df['2020'] = df['202001'] + df['202002'] + df['202003']

Related

Getting positions from two given numpy arrays

I have two given ranges, (100, 110) and (20, 30), and I generate the numbers in each:
X = np.arange(100, 110)
Y = np.arange(20, 30)
print (X)
print (Y)
[100 101 102 103 104 105 106 107 108 109]
[20 21 22 23 24 25 26 27 28 29]
I want all pairwise combinations of them, which I get as follows.
xy = np.array( [(x,y) for x in X for y in Y])
print (xy)
X_result = xy[:,0]
Y_result = xy[:,1]
The results are correct. However, I'm wondering whether they could be obtained more directly and faster. The expected results are the same as shown by the prints of X_result and Y_result.
print (X_result)
print (Y_result)
[100 100 100 100 100 100 100 100 100 100 101 101 101 101 101 101 101 101
101 101 102 102 102 102 102 102 102 102 102 102 103 103 103 103 103 103
103 103 103 103 104 104 104 104 104 104 104 104 104 104 105 105 105 105
105 105 105 105 105 105 106 106 106 106 106 106 106 106 106 106 107 107
107 107 107 107 107 107 107 107 108 108 108 108 108 108 108 108 108 108
109 109 109 109 109 109 109 109 109 109]
[20 21 22 23 24 25 26 27 28 29 20 21 22 23 24 25 26 27 28 29 20 21 22 23
24 25 26 27 28 29 20 21 22 23 24 25 26 27 28 29 20 21 22 23 24 25 26 27
28 29 20 21 22 23 24 25 26 27 28 29 20 21 22 23 24 25 26 27 28 29 20 21
22 23 24 25 26 27 28 29 20 21 22 23 24 25 26 27 28 29 20 21 22 23 24 25
26 27 28 29]
Edit: I noticed that what I want is:
X_result, Y_result = np.meshgrid(X, Y)
print (X_result.flatten())
print (Y_result.flatten())
Please let me know if there are other, better ways of doing it.
You can use numpy.meshgrid:
np.meshgrid(X, Y, indexing='ij')
[array([[100, 100, 100, 100, 100, 100, 100, 100, 100, 100],
[101, 101, 101, 101, 101, 101, 101, 101, 101, 101],
[102, 102, 102, 102, 102, 102, 102, 102, 102, 102],
[103, 103, 103, 103, 103, 103, 103, 103, 103, 103],
[104, 104, 104, 104, 104, 104, 104, 104, 104, 104],
[105, 105, 105, 105, 105, 105, 105, 105, 105, 105],
[106, 106, 106, 106, 106, 106, 106, 106, 106, 106],
[107, 107, 107, 107, 107, 107, 107, 107, 107, 107],
[108, 108, 108, 108, 108, 108, 108, 108, 108, 108],
[109, 109, 109, 109, 109, 109, 109, 109, 109, 109]]), array([[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29]])]
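If only the flattened arrays are needed, a sketch using np.repeat and np.tile builds them directly, without the intermediate 2-D grids that meshgrid allocates:

```python
import numpy as np

X = np.arange(100, 110)
Y = np.arange(20, 30)

# Repeat each element of X once per element of Y,
# and tile the whole of Y once per element of X.
X_result = np.repeat(X, len(Y))
Y_result = np.tile(Y, len(X))
```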

Merge first two rows of a dataframe and create new header

I have a dataframe
df = pd.DataFrame({
'Col1': ['abc', 'qrt', 130, 200, 190, 210],
'Col2': ['xyz','tuv', 130, 200, 190, 210],
'Col3': ['pqr', 'set', 130, 200, 190, 210],})
I wish to take the first two rows of the dataframe, join them with an underscore, and convert the result into a new header. I tried
df.columns = np.concatenate(df.iloc[0], df.iloc[1])
df.columns = new_header
But that does not seem to work. The output should look like
df = pd.DataFrame({
'abc_qrt': [ 130, 200, 190, 210],
'xyz_tuv': [130, 200, 190, 210],
'pqr_set': [ 130, 200, 190, 210],})
Try with
df = df.T.set_index([0,1]).T
df.columns = df.columns.map('_'.join)
df
Out[308]:
abc_qrt xyz_tuv pqr_set
2 130 130 130
3 200 200 200
4 190 190 190
5 210 210 210
You can take the first two rows, join them with _, and then set that as the columns of the remaining rows:
df.iloc[2:].set_axis(df.iloc[:2].agg("_".join), axis=1)
to get
abc_qrt xyz_tuv pqr_set
2 130 130 130
3 200 200 200
4 190 190 190
5 210 210 210
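Both answers keep the original row labels 2–5. If you also want a fresh 0-based index, as in the expected output, a reset_index(drop=True) can be chained on (a sketch reusing the set_axis approach above):

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['abc', 'qrt', 130, 200, 190, 210],
    'Col2': ['xyz', 'tuv', 130, 200, 190, 210],
    'Col3': ['pqr', 'set', 130, 200, 190, 210],
})

# Join the first two rows with "_" to form the header,
# keep the remaining rows, and renumber them from 0.
out = (df.iloc[2:]
         .set_axis(df.iloc[:2].agg("_".join), axis=1)
         .reset_index(drop=True))
```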

Sorting pandas dataframe by groups

I would like to sort a dataframe by certain priority rules.
I've achieved this in the code below but I think this is a very hacky solution.
Is there a more proper Pandas way of doing this?
import pandas as pd
import numpy as np
df=pd.DataFrame({"Primary Metric":[80,100,90,100,80,100,80,90,90,100,90,90,80,90,90,80,80,80,90,90,100,80,80,100,80],
"Secondary Metric Flag":[0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0],
"Secondary Value":[15, 59, 70, 56, 73, 88, 83, 64, 12, 90, 64, 18, 100, 79, 7, 71, 83, 3, 26, 73, 44, 46, 99,24, 20],
"Final Metric":[222, 883, 830, 907, 589, 93, 479, 498, 636, 761, 851, 349, 25, 405, 132, 491, 253, 318, 183, 635, 419, 885, 305, 258, 924]})
Primary_List=list(np.unique(df['Primary Metric']))
Primary_List.sort(reverse=True)
df_sorted=pd.DataFrame()
for p in Primary_List:
lol=df[df["Primary Metric"]==p]
lol.sort_values(["Secondary Metric Flag"],ascending = False)
pt1=lol[lol["Secondary Metric Flag"]==1].sort_values(by=['Secondary Value', 'Final Metric'], ascending=[False, False])
pt0=lol[lol["Secondary Metric Flag"]==0].sort_values(["Final Metric"],ascending = False)
df_sorted=df_sorted.append(pt1)
df_sorted=df_sorted.append(pt0)
df_sorted
The priority rules are:
- First sort by the 'Primary Metric', then by the 'Secondary Metric Flag'.
- If the 'Secondary Metric Flag' == 1, sort by 'Secondary Value' then the 'Final Metric'.
- If == 0, go straight to the 'Final Metric'.
Appreciate any feedback.
You do not need a for loop or groupby here; just split the frame and sort_values each part:
df1=df.loc[df['Secondary Metric Flag']==1].sort_values(by=['Primary Metric','Secondary Value', 'Final Metric'], ascending=[True,False, False])
df0=df.loc[df['Secondary Metric Flag']==0].sort_values(["Primary Metric","Final Metric"],ascending = [True,False])
df=pd.concat([df1,df0]).sort_values('Primary Metric', ascending=False, kind='stable')
sorted with loc
def k(t):
p, s, v, f = df.loc[t]
return (-p, -s, -s * v, -f)
df.loc[sorted(df.index, key=k)]
Primary Metric Secondary Metric Flag Secondary Value Final Metric
9 100 1 90 761
5 100 1 88 93
1 100 1 59 883
3 100 1 56 907
23 100 1 24 258
20 100 0 44 419
13 90 1 79 405
19 90 1 73 635
7 90 1 64 498
11 90 1 18 349
10 90 0 64 851
2 90 0 70 830
8 90 0 12 636
18 90 0 26 183
14 90 0 7 132
15 80 1 71 491
21 80 1 46 885
17 80 1 3 318
24 80 0 20 924
4 80 0 73 589
6 80 0 83 479
22 80 0 99 305
16 80 0 83 253
0 80 0 15 222
12 80 0 100 25
sorted with itertuples
def k(t):
_, p, s, v, f = t
return (-p, -s, -s * v, -f)
idx, *tups = zip(*sorted(df.itertuples(), key=k))
pd.DataFrame(dict(zip(df, tups)), idx)
lexsort
p = df['Primary Metric']
s = df['Secondary Metric Flag']
v = df['Secondary Value']
f = df['Final Metric']
a = np.lexsort([
-p, -s, -s * v, -f
][::-1])
df.iloc[a]
Construct New DataFrame
df.mul([-1, -1, 1, -1]).assign(
**{'Secondary Value': lambda d: d['Secondary Metric Flag'] * d['Secondary Value']}
).pipe(
lambda d: df.loc[d.sort_values([*d]).index]
)

Pandas resample at a custom start timestamp

I'm trying to resample data at a given start time.
My program:
sales = [{'Timestamp': '2018-06-22 15:15:00', 'Jan': 150, 'Feb': 200, 'Mar': 140},
{'Timestamp': '2018-06-22 15:44:00', 'Jan': 250, 'Feb': 250, 'Mar': 250},
{'Timestamp': '2018-06-22 15:46:00', 'Jan': 200, 'Feb': 210, 'Mar': 215},
{'Timestamp': '2018-06-22 16:16:00', 'Jan': 200, 'Feb': 210, 'Mar': 215},
{'Timestamp': '2018-06-22 16:18:00', 'Jan': 200, 'Feb': 210, 'Mar': 215},
{'Timestamp': '2018-06-22 16:20:00', 'Jan': 50, 'Feb': 90, 'Mar': 95 }]
df = pd.DataFrame(sales)
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.set_index('Timestamp')
ResampledDF = pd.DataFrame()
ResampledDF['J'] = df.Jan.resample("30T").max()
ResampledDF['F'] = df.Feb.resample("30T").max()
ResampledDF['M'] = df.Mar.resample("30T").max()
print(ResampledDF)
Output:
J F M
Timestamp
2018-06-22 15:00:00 150 200 140
2018-06-22 15:30:00 250 250 250
2018-06-22 16:00:00 200 210 215
Here the output bins automatically start at 15:00:00, whereas I want the first row at 15:15:00, the second at 15:45:00, and so on, like below:
Required Output:
J F M
Timestamp
2018-06-22 15:15:00 250 250 250
2018-06-22 15:45:00 200 210 215
2018-06-22 16:15:00 200 210 215
Use the base parameter:
In [233]: df.resample('30T', base=15).max()
Out[233]:
Feb Jan Mar
Timestamp
2018-06-22 15:15:00 250 250 250
2018-06-22 15:45:00 210 200 215
2018-06-22 16:15:00 210 200 215
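Note that base was deprecated in pandas 1.1 and later removed; if the snippet above errors for you, a sketch for recent pandas versions uses the offset parameter (origin works similarly) to shift the bin edges:

```python
import pandas as pd

sales = [{'Timestamp': '2018-06-22 15:15:00', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'Timestamp': '2018-06-22 15:44:00', 'Jan': 250, 'Feb': 250, 'Mar': 250},
         {'Timestamp': '2018-06-22 15:46:00', 'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'Timestamp': '2018-06-22 16:16:00', 'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'Timestamp': '2018-06-22 16:18:00', 'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'Timestamp': '2018-06-22 16:20:00', 'Jan': 50, 'Feb': 90, 'Mar': 95}]
df = pd.DataFrame(sales)
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.set_index('Timestamp')

# 'offset' shifts the bin edges by 15 minutes, replacing the removed 'base'.
out = df.resample('30min', offset='15min').max()
```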

How can the name of the index column of a pandas DataFrame be changed?

I've got a DataFrame that gets set up such that a column of country names is set as the index column. I want to change the title of that index column. This seems like a simple thing to do, but I can't find how to actually do it. How can it be done? How can the index "foods" column here be changed to "countries"?
import pandas as pd
df = pd.DataFrame(
[
["alcoholic drinks" , 375, 135, 458, 475],
["beverages" , 57, 47, 53, 73],
["carcase meat" , 245, 267, 242, 227],
["cereals" , 1472, 1494, 1462, 1582],
["cheese" , 105, 66, 103, 103],
["confectionery" , 54, 41, 62, 64],
["fats and oils" , 193, 209, 184, 235],
["fish" , 147, 93, 122, 160],
["fresh fruit" , 1102, 674, 957, 1137],
["fresh potatoes" , 720, 1033, 566, 874],
["fresh Veg" , 253, 143, 171, 265],
["other meat" , 685, 586, 750, 803],
["other veg." , 488, 355, 418, 570],
["processed potatoes", 198, 187, 220, 203],
["processed veg." , 360, 334, 337, 365],
["soft drinks" , 1374, 1506, 1572, 1256],
["sugars" , 156, 139, 147, 175]
],
columns = [
"foods",
"England",
"Northern Ireland",
"Scotland",
"Wales"
]
)
df = df.set_index("foods")
df = df.transpose()
df = df.rename({"foods": "countries"})
df
Try this:
df = df.rename_axis("countries", axis=0).rename_axis(None, axis=1)
Demo:
In [10]: df
Out[10]:
alcoholic drinks beverages carcase meat ...
countries
England 375 57 245
Northern Ireland 135 47 267
Scotland 458 53 242
Wales 475 73 227
foods is your column index name, not your index name.
You can set it explicitly like this:
df.index.name = 'countries'
Output:
foods alcoholic drinks beverages carcase meat cereals cheese \
countries
England 375 57 245 1472 105
Northern Ireland 135 47 267 1494 66
Scotland 458 53 242 1462 103
Wales 475 73 227 1582 103
And, to remove food from column index name:
df.columns.name = None
Output:
alcoholic drinks beverages carcase meat cereals cheese \
countries
England 375 57 245 1472 105
Northern Ireland 135 47 267 1494 66
Scotland 458 53 242 1462 103
Wales 475 73 227 1582 103
Pandas has an Index.rename() method. Works like this:
import pandas as pd
df = pd.DataFrame(
[
["alcoholic drinks", 375, 135, 458, 475],
["beverages", 57, 47, 53, 73],
["carcase meat", 245, 267, 242, 227],
["cereals", 1472, 1494, 1462, 1582],
["cheese", 105, 66, 103, 103],
["confectionery", 54, 41, 62, 64],
["fats and oils", 193, 209, 184, 235],
["fish", 147, 93, 122, 160],
["fresh fruit", 1102, 674, 957, 1137],
["fresh potatoes", 720, 1033, 566, 874],
["fresh Veg", 253, 143, 171, 265],
["other meat", 685, 586, 750, 803],
["other veg.", 488, 355, 418, 570],
["processed potatoes", 198, 187, 220, 203],
["processed veg.", 360, 334, 337, 365],
["soft drinks", 1374, 1506, 1572, 1256],
["sugars", 156, 139, 147, 175]
],
columns=[
"foods",
"England",
"Northern Ireland",
"Scotland",
"Wales"
]
)
df.set_index('foods', inplace=True)
df = df.transpose()
print(df.head())
foods confectionery fats and oils fish fresh fruit ...
England 54 193 147 1102
Northern Ireland 41 209 93 674
Scotland 62 184 122 957
Wales 64 235 160 1137
Renaming the index of the DataFrame:
df.index.rename('Countries', inplace=True)
print(df.head())
foods confectionery fats and oils fish fresh fruit ...
Countries
England 54 193 147 1102
Northern Ireland 41 209 93 674
Scotland 62 184 122 957
Wales 64 235 160 1137
The columns Index now has a name ("foods") because of the transpose(). All we need to do is rename it to an empty string:
df.columns.rename('', inplace=True)
print(df.head())
confectionery fats and oils fish fresh fruit ...
Countries
England 54 193 147 1102
Northern Ireland 41 209 93 674
Scotland 62 184 122 957
Wales 64 235 160 1137
I don't prefer this over @MaxU's answer because it's slower, but it is shorter code, for whatever that's worth.
df.stack().rename_axis(['countries', None]).unstack()
alcoholic drinks beverages carcase meat cereals
countries
England 375 57 245 1472
Northern Ireland 135 47 267 1494
Scotland 458 53 242 1462
Wales 475 73 227 1582