pandas reset_index of certain level removes entire level of multiindex

I have DataFrame like this:
performance
year month week
2015 1 2 4.170358
3 3.423766
4 -1.835888
5 8.157457
2 6 -3.276887
... ...
2018 7 30 -1.045241
31 -0.870845
8 31 0.950555
32 6.757876
33 -2.203334
I want the week to be in range(0 or 1, n), where n is the number of weeks in the current year and month.
Well, the easy way, I thought, was to use
df.reset_index(level=2, drop=True)
But that was a mistake, as I realized later; in the best-case scenario I would get
performance
year month week
2015 1 0 4.170358
1 3.423766
2 -1.835888
3 8.157457
2 4 -3.276887
... ...
2018 7 n-4 -1.045241
n-3 -0.870845
8 n-2 0.950555
n-1 6.757876
n -2.203334
But after I did that, I got unexpected behaviour:
close
timestamp timestamp
2015 1 4.170358
1 3.423766
1 -1.835888
1 8.157457
2 -3.276887
... ...
2018 7 -1.045241
7 -0.870845
8 0.950555
8 6.757876
8 -2.203334
I lost the entire 2nd level of the index! Why? I thought it would be 0 to n for each 'cluster' (yes, that was a mistake, as I mentioned above)...
I solved my problem with something like this:
df.groupby(level = [0, 1]).apply(lambda x: x.reset_index(drop=True))
And got the DataFrame in my desired form:
performance
year month
2015 1 0 4.170358
1 3.423766
2 -1.835888
3 8.157457
2 0 -3.276887
... ...
2018 7 3 -1.045241
4 -0.870845
8 0 0.950555
1 6.757876
2 -2.203334
But WHY? Why does reset_index on a certain level just drop it? That's the main question!

reset_index with drop=True adds a default index only when you are resetting the whole index. If you're resetting just a single level of a multi-level index, it will just remove it.
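To illustrate, here is a minimal sketch (the index levels and column name are taken from the question; the small sample values are made up), which also shows one way to renumber the innermost level per group, as an alternative to the groupby/apply workaround above:
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2015, 1, 2), (2015, 1, 3), (2015, 2, 6)],
    names=['year', 'month', 'week'],
)
df = pd.DataFrame({'performance': [4.17, 3.42, -3.28]}, index=idx)

# Resetting a single level with drop=True simply removes that level:
print(df.reset_index(level=2, drop=True))   # index is now (year, month) only

# To renumber the innermost level 0..n-1 within each (year, month) group
# instead, one option is groupby(...).cumcount():
df.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values(0),
     df.index.get_level_values(1),
     df.groupby(level=[0, 1]).cumcount()],
    names=['year', 'month', 'week'],
)
print(df)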

Related

Python Lambda Apply Function Multiple Conditions using OR

I've searched for this one and cannot find a solution. I have two data conditions, and when either condition is met the row should be counted. In my dataset, I have used "apply" and a lambda function for a single condition (<, >). However, I have a continuous data column where the count is based on either a low value OR a high value. I have tried variations of the code below but keep getting a "ValueError":
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Let's say my data (dfdata) looks like this:
Site data month day year
A 4 1 1 2021
A 17 1 2 2021
A 8 1 3 2021
A 7 1 1 2022
A 0 1 2 2022
A 2 1 3 2022
B 3 1 1 2021
B 16 1 2 2021
B 9 1 3 2021
B 2 1 1 2022
B 18 1 2 2022
B 5 1 3 2022
I've used a for loop that should give the result below, evaluating the "data" column and counting the instances where the value is < 4 OR > 15. I think that the "|" operator might do this, but I just get a True/False...
sites = ['A','B']
n = len(sites)
dft = pd.DataFrame();
for n in sites:
    dft.loc[:,n] = dfdata[dfdata['Site']==n].groupby(["month", "day"])["data"].apply(lambda x: (x < 4) or (x > 15).sum())
The expected result:
month day A B
1 1 0 2
1 2 2 2
1 3 1 0
Thanks for your help.
You don't have to use (and should avoid) loops in pandas. Aside from being slow, they also make your intention harder to read.
Here's one solution using pandas functions:
dft = (
    dfdata.query("data < 4 or data > 15")
    .groupby(["month", "day", "Site"])["data"]
    .count()
    .unstack(fill_value=0)
)
The query filters for rows whose data is < 4 or > 15. The rest is just counting the matching rows per group and reshaping the resulting dataframe.
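Alternatively, staying closer to the lambda approach in the question, the "|" operator can combine the two boolean conditions, and the True values can then be summed per group. A sketch, assuming the same dfdata columns:
dft = (
    dfdata.assign(flag=(dfdata["data"] < 4) | (dfdata["data"] > 15))
          .groupby(["month", "day", "Site"])["flag"]
          .sum()                      # summing booleans counts the True values
          .unstack(fill_value=0)      # move Site into columns
)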

Cumulative Deviation of 2 Columns in Pandas DF

I have a rather simple request and have not found a suitable solution online. I have a DF and need to find the cumulative deviation, to be added as a new column to the DF. My DF looks like this:
year month Curr Yr LT Avg
0 2022 1 667590.5985 594474.2003
1 2022 2 701655.5967 585753.1173
2 2022 3 667260.5368 575550.6112
3 2022 4 795338.8914 562312.5309
4 2022 5 516510.1103 501330.4306
5 2022 6 465717.9192 418087.1358
6 2022 7 366100.4456 344854.2453
7 2022 8 355089.157 351539.9371
8 2022 9 468479.4396 496831.2979
9 2022 10 569234.4156 570767.1723
10 2022 11 719505.8569 594368.6991
11 2022 12 670304.78 576495.7539
And, I need the cumulative deviation new column in this DF to look like this:
Cum Dev
0.122993392
0.160154637
0.159888559
0.221628609
0.187604073
0.178089327
0.16687643
0.152866293
0.129326033
0.114260993
0.124487107
0.128058305
In Excel, with the data in columns Z3:Z14 and AA3:AA14, the calculation for the first row would be =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1, for the next row =SUM(Z$3:Z4)/SUM(AA$3:AA4)-1, and so on, with the last row being =SUM(Z$3:Z14)/SUM(AA$3:AA14)-1.
Thank you kindly for your help,
You can divide the cumulative sums of those 2 columns element-wise, and then subtract 1 at the end:
>>> (df["Curr Yr"].cumsum() / df["LT Avg"].cumsum()) - 1
0 0.122993
1 0.160155
2 0.159889
3 0.221629
4 0.187604
5 0.178089
6 0.166876
7 0.152866
8 0.129326
9 0.114261
10 0.124487
11 0.128058
dtype: float64
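To store this as the new column described in the question (a small follow-up sketch; the column name "Cum Dev" is taken from the question):
df["Cum Dev"] = df["Curr Yr"].cumsum() / df["LT Avg"].cumsum() - 1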

Print Pandas Unique Rows by Column Condition

I am trying to print the rows where a data condition is met in a pandas DF, based on the unique values in the DF. For example, I have data that looks like this:
DF:
site temp month day
A 15 7 18
A 11 6 12
A 22 9 3
B 9 4 23
B 3 2 11
B -1 5 18
I need the result to print the rows where the max in the 'temp' column occurs such as this for the final result:
A 15
B 9
I have tried this but it is not working correctly:
for i in DF['site'].unique():
    print(DF.temp.max())
I get the same answer of:
22
22
but the answer should be:
site temp month day
A 22 9 3
B 9 4 23
thank you!
A possible solution:
df.groupby('site', as_index=False)['temp'].max()
Output:
site temp
0 A 22
1 B 9
In case you want to use a for loop:
for i in df['site'].unique():
    print(df.loc[df['site'].eq(i), 'temp'].max())
Output:
22
9
df.groupby('site').max()
Output:
temp month day
site
A 22 9 18
B 9 5 23
Let us do sort_values + drop_duplicates
df = df.sort_values('temp',ascending=False).drop_duplicates('site')
Out[190]:
site temp month day
2 A 22 9 3
3 B 9 4 23
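Another common option (not from the original answers) for getting the full row where each group's maximum occurs is idxmax:
df.loc[df.groupby('site')['temp'].idxmax()]
This keeps the month and day columns of the max-temp rows as well.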

Can I reference a prior row's value and populate it in the current row in a new column?

I have the following data frame:
      Month  Day  Year     Open      High      Low    Close  Week Close  Week
0         1    1  2003   46.593    46.656   46.405   46.468      45.593     1
1         1    2  2003   46.538     46.66    46.47   46.673      45.593     1
2         1    3  2003   46.717    46.781    46.53   46.750      45.593     1
3         1    4  2003   46.815    46.843    46.68   46.750      45.593     1
4         1    5  2003   46.935    47.000    46.56   46.593      45.593     1
...     ...  ...   ...      ...       ...      ...      ...         ...   ...
7257     10   26  2022  381.619  387.5799  381.350  382.019     389.019    43
7258     10   27  2022   383.07    385.00  379.329   379.98     389.019    43
7259     10   28  2022  379.869   389.519   379.67  389.019     389.019    43
7260     10   31  2022   386.44   388.399   385.26  386.209      385.24    44
7261     11    1  2022   390.14    390.39   383.29  384.519      385.24    44
I want to create a new column titled 'Prior_Week_Close' which will reference the prior week's 'Week Close' value (and the last week of the prior year for the first week of every year). For example, row 7260's value for Prior_Week_Close should equal 389.019
I'm trying:
SPY['prior_week_close'] = np.where(SPY['Week'].shift(1) == (SPY['Week'] - 1), SPY['Week_Close'].shift(1), np.nan)
TypeError: boolean value of NA is ambiguous
I thought about just using shift and creating a new column but some weeks only have 4 days and that would lead to inaccurate values.
Any help is greatly appreciated!
I was able to solve this by creating a new column called 'Overall_Week' (the week number in the entire data set, not just the calendar year) and using the following code:
import numpy as np

def fn(s):
    # Look up the 'Week_Close' of the previous overall week
    result = SPY[SPY.Overall_Week == (s.iloc[0] - 1)]['Week_Close']
    if result.shape[0] > 0:
        return np.broadcast_to(result.iloc[0], s.shape)
    else:
        return np.broadcast_to(np.nan, s.shape)

SPY['Prior_Week_Close'] = SPY.groupby('Overall_Week')['Overall_Week'].transform(fn)
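Under the same 'Overall_Week' assumption, a shorter alternative sketch (not from the original answer) would map each week onto the previous week's close:
week_close = SPY.groupby('Overall_Week')['Week_Close'].first()        # one close per week
SPY['Prior_Week_Close'] = (SPY['Overall_Week'] - 1).map(week_close)   # NaN for the first week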

iterrows() of 2 columns and save results in one column

In my data frame I want to iterate over two columns with iterrows() but save the result in one column. For example, df is:
x y
5 10
30 445
70 32
expected output is
points sequence
5 1
10 2
30 1
445 2
I know about iterrows(), but it saves the output in two different columns. How can I get the expected output, and is there any way to generate the sequence number according to a condition? Any help will be appreciated.
First, never use iterrows, because it is really slow.
If you want a 1, 2 sequence based on the number of columns, convert the values to a numpy array with DataFrame.to_numpy and flatten them with numpy.ravel, then for the sequence use numpy.tile:
df = pd.DataFrame({'points': df.to_numpy().ravel(),
                   'sequence': np.tile([1,2], len(df))})
print (df)
points sequence
0 5 1
1 10 2
2 30 1
3 445 2
4 70 1
5 32 2
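For reference, a roughly equivalent sketch using DataFrame.stack, which also flattens the frame row by row (an alternative phrasing, not part of the original answer; it assumes df still refers to the original x/y frame):
out = pd.DataFrame({'points': df[['x', 'y']].stack().to_numpy(),
                    'sequence': np.tile([1, 2], len(df))})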
Do it this way:
>>> pd.DataFrame([i[1] for i in df.iterrows()])
points sequence
0 5 1
1 10 2
2 30 1
3 445 2