Pandas dataframe column math when a row condition is met - pandas

I have a dataframe containing the following data. I would like to query the age column of each dataframe (1-4) for values between 295.0 and 305.0. For each dataframe there will be a single age value in this range and a corresponding subsidence value. I would like to take the subsidence value and add it to the remaining values in the dataframe.
For instance, in the first dataframe, at age 300.0 subsidence = 274.057861. In this case, 274.057861 would be added to the rest of the subsidence values in dataframe 1.
In the second dataframe, at age 299.0 subsidence = 77.773720. So, 77.773720 would be added to the rest of the subsidence values in dataframe 2, and so on. Is it possible to do this easily in Pandas, or am I better off working towards an alternative solution?
Thanks :)
1 2 3 4 \
age subsidence age subsidence age subsidence age
0 0.0 -201.538712 0.0 -235.865433 0.0 134.728821 0.0
1 10.0 -77.446548 8.0 -102.183365 10.0 88.796074 10.0
2 20.0 44.901043 18.0 35.316868 20.0 35.871178 20.0
3 31.0 103.172806 28.0 98.238434 30.0 -17.901653 30.0
4 41.0 124.625687 38.0 124.719254 40.0 -13.381897 40.0
5 51.0 122.877541 48.0 130.725235 50.0 -25.396996 50.0
6 61.0 138.810898 58.0 140.301117 60.0 -37.057205 60.0
7 71.0 119.818176 68.0 137.433670 70.0 -11.587639 70.0
8 81.0 77.867607 78.0 96.285652 80.0 21.854662 80.0
9 91.0 33.612885 88.0 32.740803 90.0 67.754501 90.0
10 101.0 15.885051 98.0 8.626043 100.0 150.172699 100.0
11 111.0 118.089211 109.0 88.812439 100.0 150.172699 100.0
12 121.0 247.301956 119.0 212.000061 110.0 124.367874 110.0
13 131.0 268.748627 129.0 253.204819 120.0 157.066010 120.0
14 141.0 231.799255 139.0 292.828461 130.0 145.811783 130.0
15 151.0 259.626343 149.0 260.067993 140.0 175.388763 140.0
16 161.0 288.704651 159.0 240.051605 150.0 265.435791 150.0
17 171.0 249.121857 169.0 203.727097 160.0 336.471924 160.0
18 181.0 339.038055 179.0 245.738480 170.0 283.483582 170.0
19 191.0 395.920410 189.0 318.751160 180.0 381.575500 180.0
20 201.0 404.843445 199.0 338.245209 190.0 491.534424 190.0
21 211.0 461.865784 209.0 418.997559 200.0 495.025604 200.0
22 221.0 518.710632 219.0 446.496216 200.0 495.025604 200.0
23 231.0 483.963867 224.0 479.213287 210.0 571.982361 210.0
24 239.0 445.292389 229.0 492.352905 220.0 611.698608 220.0
25 249.0 396.609497 239.0 445.322144 230.0 645.545776 230.0
26 259.0 321.553558 249.0 429.429932 240.0 596.046265 240.0
27 269.0 306.150177 259.0 297.355103 250.0 547.157654 250.0
28 279.0 259.717468 269.0 174.210785 260.0 457.071472 260.0
29 289.0 301.114410 279.0 114.175957 270.0 438.705170 270.0
30 300.0 274.057861 289.0 91.768898 280.0 397.985535 280.0
31 310.0 216.760361 299.0 77.773720 290.0 426.858276 290.0
32 320.0 192.317093 309.0 73.767090 300.0 410.508331 300.0
33 330.0 179.511917 319.0 63.295345 300.0 410.508331 300.0
34 340.0 231.126053 329.0 -4.296405 310.0 355.303558 310.0
35 350.0 142.894958 339.0 -62.745190 320.0 284.932892 320.0
36 360.0 51.547047 350.0 -60.224789 330.0 251.817078 330.0
37 370.0 -39.064964 360.0 -85.826874 340.0 302.303925 340.0
38 380.0 -54.111374 370.0 -81.139206 350.0 207.799942 350.0
39 390.0 -68.999535 380.0 -40.080212 360.0 77.729439 360.0
40 400.0 -47.595322 390.0 -29.945852 370.0 -127.037209 370.0
41 410.0 13.159509 400.0 -26.656607 380.0 -109.327545 380.0
42 NaN NaN 410.0 -13.723764 390.0 -127.160942 390.0
43 NaN NaN NaN NaN 400.0 -61.404510 400.0
44 NaN NaN NaN NaN 410.0 13.058900 410.0

For the first dataframe, select the single subsidence value whose age lies in the range and add it to the column:
df1['subsidence'] += df1.loc[(df1.age > 295) & (df1.age < 305), 'subsidence'].iloc[0]
Repeat this for each of the other dataframes.
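As a rough sketch of how this could be applied to all sub-dataframes at once, the example below assumes the data is a single dataframe with two-level columns (1, 2, ... on top, 'age' and 'subsidence' underneath), as the printout in the question suggests. The miniature data and the helper name add_reference_offset are made up for illustration. It follows one reading of "the rest", in which the matching row itself stays unchanged.

```python
import pandas as pd

# Hypothetical miniature of the question's layout: top-level labels 1 and 2,
# each holding an 'age' and a 'subsidence' column.
df = pd.DataFrame({
    (1, "age"): [289.0, 300.0, 310.0],
    (1, "subsidence"): [301.114410, 274.057861, 216.760361],
    (2, "age"): [289.0, 299.0, 309.0],
    (2, "subsidence"): [91.768898, 77.773720, 73.767090],
})

def add_reference_offset(block, low=295.0, high=305.0):
    """Add the subsidence found at the age in [low, high] to every other row."""
    block = block.copy()
    in_range = block["age"].between(low, high)
    if in_range.sum() != 1:
        raise ValueError("expected exactly one age inside the range")
    offset = block.loc[in_range, "subsidence"].iloc[0]
    # Leave the reference row itself unchanged; shift all remaining rows.
    block.loc[~in_range, "subsidence"] += offset
    return block

# Process each top-level group separately, then reassemble side by side.
groups = df.columns.get_level_values(0).unique()
result = pd.concat({key: add_reference_offset(df[key]) for key in groups}, axis=1)
print(result)
```

Rows padded with NaN fall outside the range check and stay NaN. To shift every row including the reference row, replace the masked assignment with block["subsidence"] += offset.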

Related

creating multi index from data grouped by month in Pandas

Consider this sample data:
Month Location Products Sales Profit
JAN 1 43 32 20
JAN 2 82 54 25
JAN 3 64 43 56
FEB 1 37 28 78
FEB 2 18 15 34
FEB 3 5 2 4
MAR 1 47 40 14
The multi-index transformation I am trying to achieve is this:
JAN FEB MAR
Location Products Sales Profit Products Sales Profit Products Sales Profit
1 43 32 20 37 28 78 47 40 14
2 82 54 25 18 15 34 null null null
3 64 43 56 5 2 4 null null null
I tried versions of this:
df.stack().to_frame().T
It put all the data into one row. So, that's not the goal.
I presume I am close in that it should be a stacking or unstacking, melting or unmelting, but my attempts have all resulted in data oatmeal at this point. Appreciate your time trying to solve this one.
You can use pivot with reorder_levels and sort_index():
df.pivot(index='Location',columns='Month').reorder_levels(order=[1,0],axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
In case you are interested, this answer elaborates on the difference between swaplevel and reorder_levels.
Use pivot:
>>> df.pivot(index='Location', columns='Month').swaplevel(axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
To preserve the month order, first convert the Month column to an ordered CategoricalDtype:
df['Month'] = df['Month'].astype(pd.CategoricalDtype(df['Month'].unique(), ordered=True))
out = df.pivot(index='Location', columns='Month').swaplevel(axis=1).sort_index(axis=1)
print(out)
# Output:
Month JAN FEB MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 43.0 20.0 32.0 37.0 78.0 28.0 47.0 14.0 40.0
2 82.0 25.0 54.0 18.0 34.0 15.0 NaN NaN NaN
3 64.0 56.0 43.0 5.0 4.0 2.0 NaN NaN NaN
Update 2
Try to force the order of level 2 columns:
df1 = df.set_index(['Month', 'Location'])
df1.columns = pd.CategoricalIndex(df1.columns, ordered=True)
df1 = df1.unstack('Month').swaplevel(axis=1).sort_index(axis=1)
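If the goal is the exact column order from the question (months in order of appearance, then Products, Sales, Profit), another option is to build the wanted column index explicitly and reindex, rather than relying on sort_index. This is a sketch of that alternative, not what the answers above do; the names value_cols, month_order and wanted are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["JAN", "JAN", "JAN", "FEB", "FEB", "FEB", "MAR"],
    "Location": [1, 2, 3, 1, 2, 3, 1],
    "Products": [43, 82, 64, 37, 18, 5, 47],
    "Sales": [32, 54, 43, 28, 15, 2, 40],
    "Profit": [20, 25, 56, 78, 34, 4, 14],
})

value_cols = ["Products", "Sales", "Profit"]
month_order = list(df["Month"].unique())  # order of first appearance

# Pivot puts the value names on the outer level; swap so Month is outermost.
wide = df.pivot(index="Location", columns="Month", values=value_cols)
wide = wide.swaplevel(axis=1)

# Spell out the column order instead of sorting alphabetically.
wanted = pd.MultiIndex.from_product([month_order, value_cols], names=["Month", None])
wide = wide.reindex(columns=wanted)
print(wide)
```

Missing combinations such as MAR for locations 2 and 3 come out as NaN, which is also why the values are shown as floats.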

Changing index type from a value_counts()

I am trying to change the index type from int to string after a value_counts()
df['value'].value_counts().sort_index()
output:
40.0 1448
45.0 28558
50.0 83675
55.0 96377
60.0 47351
65.0 13226
70.0 2602
75.0 568
80.0 72
100.0 52
105.0 53
Name: value, dtype: int64
expected output:
40.0 1448
45.0 28558
50.0 83675
55.0 96377
60.0 47351
65.0 13226
70.0 2602
75.0 568
80.0 72
100.0 52
105.0 53
Name: value, dtype: string
If you need to convert the sorted index values (like 40.0) to strings, use rename:
df['value'].value_counts().sort_index().rename(index=str)
If you need to convert the count values (like 1448) to strings, use Series.astype:
df['value'].value_counts().sort_index().astype(str)
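To make the difference between the two calls concrete, here is a small sketch on made-up data (the series s is hypothetical). rename(index=str) changes the labels and leaves the counts alone, while astype(str) changes the counts and leaves the labels alone.

```python
import pandas as pd

# Hypothetical stand-in for df['value'].
s = pd.Series([45.0, 40.0, 45.0, 50.0, 45.0], name="value")
counts = s.value_counts().sort_index()

# Index labels 40.0 -> '40.0'; the counts remain integers.
labelled = counts.rename(index=str)

# Counts 1 -> '1'; the index remains float.
as_text = counts.astype(str)

print(labelled.index.tolist(), labelled.tolist())
print(as_text.index.tolist(), as_text.tolist())
```

Sorting before renaming matters: once the labels are strings, sort_index would order them lexically, so '100.0' would come before '40.0'.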

Group by Index of Row in Pandas

I want to group and sum every 7 rows together (to get a total for each week). There are currently two columns: one for the date and the other for a float.
1/22/2020 NaN
1/23/2020 0.0
1/24/2020 1.0
1/25/2020 0.0
1/26/2020 3.0
1/27/2020 0.0
1/28/2020 0.0
1/29/2020 0.0
1/30/2020 0.0
1/31/2020 2.0
2/1/2020 1.0
2/2/2020 0.0
2/3/2020 3.0
2/4/2020 0.0
2/5/2020 0.0
2/6/2020 0.0
2/7/2020 0.0
2/8/2020 0.0
2/9/2020 0.0
2/10/2020 0.0
2/11/2020 1.0
2/12/2020 0.0
2/13/2020 1.0
2/14/2020 0.0
2/15/2020 0.0
2/16/2020 0.0
2/17/2020 0.0
2/18/2020 0.0
2/19/2020 0.0
2/20/2020 0.0
... ...
2/28/2020 0.0
2/29/2020 8.0
3/1/2020 6.0
3/2/2020 23.0
3/3/2020 20.0
3/4/2020 31.0
3/5/2020 68.0
3/6/2020 45.0
3/7/2020 119.0
3/8/2020 114.0
3/9/2020 64.0
3/10/2020 194.0
3/11/2020 397.0
3/12/2020 452.0
3/13/2020 590.0
3/14/2020 710.0
3/15/2020 61.0
3/16/2020 1389.0
3/17/2020 1789.0
3/18/2020 906.0
3/19/2020 3068.0
3/20/2020 4009.0
3/21/2020 4017.0
3/23/2020 25568.0
3/24/2020 10074.0
3/25/2020 12043.0
3/26/2020 18058.0
3/27/2020 17822.0
3/28/2020 19825.0
3/29/2020 19408.0
Assuming your date column is called dt and your value column is val:
import numpy as np
import pandas as pd
# in case it's not already in datetime format:
df["dt"] = pd.to_datetime(df["dt"])
# your data looks sorted, but in case it's not, sorting is a prerequisite here:
df = df.sort_values("dt")
# integer division of the row position by 7 labels each block of 7 rows
df = df.groupby(np.arange(len(df)) // 7).agg({"dt": ["min", "max"], "val": "sum"})
The aggregation for dt is only there so you can see which interval each total covers; it might be enough to take just the min, or to omit it altogether.
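Alternatively, set the date column as the index and use resample. Note that '1W' bins by calendar week ending on Sunday, so the totals only match fixed blocks of 7 rows when the data starts on a Monday and has no missing days: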
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.resample('1W').sum()
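The two approaches do not always give the same totals. The sketch below uses made-up data (14 consecutive days starting on Wednesday 2020-01-22, with the column names dt and val assumed as above) to show how row blocks and calendar weeks can differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 14 consecutive days starting on a Wednesday.
df = pd.DataFrame({
    "dt": pd.date_range("2020-01-22", periods=14, freq="D"),
    "val": np.arange(14, dtype=float),
})

# Blocks of 7 consecutive rows, regardless of weekday.
by_rows = df.groupby(np.arange(len(df)) // 7)["val"].sum()

# Calendar weeks: 'W' bins end on Sunday, so the first and last bins are partial.
by_week = df.set_index("dt")["val"].resample("W").sum()

print(by_rows)
print(by_week)
```

Here the row-based grouping yields two totals, while resample yields three bins labelled by the Sunday that closes each week. Which one is wanted depends on whether "week" means seven rows or a calendar week.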

Pandas df.shift(axis=1) adds extra entries, why?

Here is a sample of the original table.
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83 11
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41 11
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91 11
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68 11
I need to shift the entire table over by 1 column to produce this:
z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68
Here is the code that ingests the data and tries to shift it over by one column
wind_rass_table_df=pd.read_csv(file_path, header=j+3, engine='python', nrows=77,sep=r'\s{2,}',skip_blank_lines=False,index_col=False)
wind_rass_table_df=wind_rass_table_df.shift(periods=1,axis=1)
Supposedly df.shift(axis=1) should shift the dataframe over by 1 column but it does more than that, it does this:
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC]
0 NaN NaN 2.83 181.0 0.05 2.83 40.0 -0.20 -0.20 2.24
1 NaN NaN 2.41 184.8 0.20 2.40 50.0 -0.01 -0.01 2.47
2 NaN NaN 1.92 192.4 0.41 1.88 60.0 0.25 0.25 2.46
3 NaN NaN 1.75 201.7 0.65 1.63 70.0 0.50 0.50 2.47
The shift function has taken the first column, inserted into the 7th column, shifted the 7th into the 8th and repeated the 8th, shifting the 9th over and so on.
What is the correct way of shifting a dataframe over by one column?
Many thanks!
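shift moves the values and leaves NaN behind; what you want here is to drop the last column of values and relabel the rest. You can use iloc and build another dataframe. Pass the values as a NumPy array, because if a DataFrame is passed together with columns=, pandas selects columns by label instead of relabelling them by position:
df = pd.DataFrame(data=df.iloc[:, :-1].to_numpy(), columns=df.columns[1:], index=df.index)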
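A minimal sketch of the relabelling idea on made-up data follows. It assumes, as in the question, that the '#' marker in the header took up a column slot, so every value sits one label too far left and the last column is surplus. The second variant uses set_axis, which keeps each column's dtype instead of going through a single NumPy array.

```python
import pandas as pd

# Hypothetical parse result: '#' holds the z values, 'z' holds speed, and so on;
# the last column holds a surplus value.
df = pd.DataFrame(
    [[40, 2.83, 181.0, 11], [50, 2.41, 184.8, 11]],
    columns=["#", "z", "speed", "dir"],
)

# Variant 1: rebuild from a raw array so labels are assigned by position.
fixed = pd.DataFrame(df.iloc[:, :-1].to_numpy(), columns=df.columns[1:], index=df.index)

# Variant 2: drop the surplus column, then assign the shifted labels.
fixed2 = df.iloc[:, :-1].set_axis(df.columns[1:], axis=1)

print(fixed2)
```

In variant 1 the integer z values become floats because the array has a single dtype; variant 2 leaves them as integers.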

Pandas resample with percentage change

I am trying to resample my df to get yearly data, filling the missing years by interpolating between the known values.
Here is my dataframe.
data = {'year': ['2000', '2000', '2003', '2003', '2005', '2005'],
'country':['UK', 'US', 'UK','US','UK','US'],
'sales': [0, 10, 30, 25, 40, 45],
'cost': [0, 100, 300, 250, 400, 450]
}
df=pd.DataFrame(data)
dfL=df.copy()
dfL.year=dfL.year.astype('str') + '-01-01 00:00:00.00000'
dfL.year=pd.to_datetime(dfL.year)
dfL=dfL.set_index('year')
dfL
country sales cost
year
2000-01-01 UK 0 0
2000-01-01 US 10 100
2003-01-01 UK 30 300
2003-01-01 US 25 250
2005-01-01 UK 40 400
2005-01-01 US 45 450
I would like to get an output like the one below:
country sales cost
year
2000-01-01 UK 0 0
2001-01-01 UK 10 100
2002-01-01 UK 20 200
2003-01-01 UK 30 300
2004-01-01 UK 35 350
2005-01-01 UK 40 400
2000-01-01 US 10 100
2001-01-01 US 15 150
2002-01-01 US 20 200
2003-01-01 US 25 250
2004-01-01 US 35 350
2005-01-01 US 45 450
I assume I need to resample by year, but I am not sure which function to apply.
Can anyone help?
Use resample + interpolate, with stack and unstack to reshape:
dfL=dfL.set_index('country',append=True).unstack().resample('YS').interpolate().stack().reset_index(level=1)
dfL
Out[309]:
country cost sales
year
2000-01-01 UK 0.0 0.0
2000-01-01 US 100.0 10.0
2001-01-01 UK 100.0 10.0
2001-01-01 US 150.0 15.0
2002-01-01 UK 200.0 20.0
2002-01-01 US 200.0 20.0
2003-01-01 UK 300.0 30.0
2003-01-01 US 250.0 25.0
2004-01-01 UK 350.0 35.0
2004-01-01 US 350.0 35.0
2005-01-01 UK 400.0 40.0
2005-01-01 US 450.0 45.0
I'd use a pivot_table to do this and then resample:
In [11]: res = dfL.pivot_table(index="year", columns="country", values=["sales", "cost"])
In [12]: res
Out[12]:
cost sales
country UK US UK US
year
2000-01-01 0 100 0 10
2003-01-01 300 250 30 25
2005-01-01 400 450 40 45
In [13]: res.resample("YS").interpolate()
Out[13]:
cost sales
country UK US UK US
year
2000-01-01 0.0 100.0 0.0 10.0
2001-01-01 100.0 150.0 10.0 15.0
2002-01-01 200.0 200.0 20.0 20.0
2003-01-01 300.0 250.0 30.0 25.0
2004-01-01 350.0 350.0 35.0 35.0
2005-01-01 400.0 450.0 40.0 45.0
Personally I'd keep it in this format, but if you want to stack it back, you can stack and reset_index:
In [14]: res.resample("YS").interpolate().stack(level=1).reset_index(level=1)
Out[14]:
country cost sales
year
2000-01-01 UK 0.0 0.0
2000-01-01 US 100.0 10.0
2001-01-01 UK 100.0 10.0
2001-01-01 US 150.0 15.0
2002-01-01 UK 200.0 20.0
2002-01-01 US 200.0 20.0
2003-01-01 UK 300.0 30.0
2003-01-01 US 250.0 25.0
2004-01-01 UK 350.0 35.0
2004-01-01 US 350.0 35.0
2005-01-01 UK 400.0 40.0
2005-01-01 US 450.0 45.0
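For comparison, the same result can be reached without stack and unstack by handling each country separately. This is a sketch of that alternative rather than either answer's method; the helper fill_years is a name made up for the example. The default interpolate is linear, which is what reproduces the desired table; it is not a percentage-change fill.

```python
import pandas as pd

df = pd.DataFrame({
    "year": ["2000", "2000", "2003", "2003", "2005", "2005"],
    "country": ["UK", "US", "UK", "US", "UK", "US"],
    "sales": [0, 10, 30, 25, 40, 45],
    "cost": [0, 100, 300, 250, 400, 450],
})
df["year"] = pd.to_datetime(df["year"], format="%Y")

def fill_years(group):
    # Insert the missing year starts as NaN rows, then interpolate linearly.
    yearly = group.set_index("year")[["sales", "cost"]].resample("YS").asfreq()
    return yearly.interpolate()

# One filled frame per country, stacked under a 'country' index level.
filled = pd.concat(
    {country: fill_years(group) for country, group in df.groupby("country")},
    names=["country"],
).reset_index()
print(filled)
```

The rows come out grouped by country (all UK years, then all US years), which matches the layout requested in the question.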