MultiIndex: add zero values for missing index entries in a pandas dataframe - pandas

I have a pandas (v0.23.4) dataframe with a MultiIndex ('date', 'class').
                  Col_values
date       class
2019-04-30 0             324
           1            6874
           2              44
           3               5
           4              15
2019-05-31 0             393
           1            6534
           2              64
           3               1
           4              22
2019-06-30 0             325
           1            5899
           2              48
           4               7
In '2019-06-30' class 3 is missing because there is no data for it.
What I want is to add class 3 to the MultiIndex with a zero value in the Col_values column, automatically.

Use DataFrame.unstack with fill_value=0, followed by DataFrame.stack:
df = df.unstack(fill_value=0).stack()
print (df)
                  Col_values
date       class
2019-04-30 0             324
           1            6874
           2              44
           3               5
           4              15
2019-05-31 0             393
           1            6534
           2              64
           3               1
           4              22
2019-06-30 0             325
           1            5899
           2              48
           3               0
           4               7
Another solution is to use DataFrame.reindex with MultiIndex.from_product:
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0)
print (df)
                  Col_values
date       class
2019-04-30 0             324
           1            6874
           2              44
           3               5
           4              15
2019-05-31 0             393
           1            6534
           2              64
           3               1
           4              22
2019-06-30 0             325
           1            5899
           2              48
           3               0
           4               7
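Both approaches can be checked against each other on a small toy frame (hypothetical data, reduced to one missing (date, class) pair): the unstack/stack round-trip and the reindex against the full Cartesian product of the index levels give the same zero-filled result here.

```python
import pandas as pd

# Toy MultiIndex frame: class 3 exists for the first date but is
# missing for the second one.
df = pd.DataFrame(
    {"Col_values": [324, 44, 325]},
    index=pd.MultiIndex.from_tuples(
        [("2019-04-30", 0), ("2019-04-30", 3), ("2019-06-30", 0)],
        names=["date", "class"],
    ),
)

# Approach 1: unstack fills the holes with 0, stack restores the shape.
filled1 = df.unstack(fill_value=0).stack()

# Approach 2: reindex against the full product of the observed levels.
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
filled2 = df.reindex(mux, fill_value=0)

print(filled1.loc[("2019-06-30", 3), "Col_values"])  # 0
```

One practical difference: with reindex you build the target MultiIndex yourself, so you can also add classes that never appear anywhere in the data, which the unstack/stack round-trip cannot do.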

Related

Python - Sort column ascending - using groupby

The following code:
import pandas as pd
df_original = pd.DataFrame({
    'race_num': [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'race_position': [2, 3, 0, 1, 0, 0, 2, 3, 0],
    'percentage_place': [77, 55, 88, 50, 34, 56, 99, 12, 75]
})
Gives an output of:
   race_num  race_position  percentage_place
0         1              2                77
1         1              3                55
2         1              0                88
3         2              1                50
4         2              0                34
5         2              0                56
6         2              2                99
7         3              3                12
8         3              0                75
I need to manipulate this dataframe to keep race_num grouped but sort percentage_place within each group from highest to lowest, with race_position staying aligned with its original percentage_place.
Desired output is:
race_num  race_position  percentage_place
1         0              88
1         2              77
1         3              55
2         2              99
2         0              56
2         1              50
2         0              34
3         0              75
3         3              12
My attempt is:
df_new = df_1.groupby(['race_num','race_position'])['percentage_place'].nlargest().reset_index()
Thank you in advance.
Look into sort_values
In [137]: df_original.sort_values(['race_num', 'percentage_place'], ascending=[True, False])
Out[137]:
   race_num  race_position  percentage_place
2         1              0                88
0         1              2                77
1         1              3                55
6         2              2                99
5         2              0                56
3         2              1                50
4         2              0                34
8         3              0                75
7         3              3                12
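To make the sorted frame read like the desired table, the original row labels can also be dropped with reset_index(drop=True); a minimal sketch of the full chain:

```python
import pandas as pd

df_original = pd.DataFrame({
    'race_num': [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'race_position': [2, 3, 0, 1, 0, 0, 2, 3, 0],
    'percentage_place': [77, 55, 88, 50, 34, 56, 99, 12, 75],
})

# Sort by race_num ascending, then percentage_place descending within
# each race, and renumber the rows 0..n-1.
df_new = (df_original
          .sort_values(['race_num', 'percentage_place'],
                       ascending=[True, False])
          .reset_index(drop=True))

print(df_new.loc[0].tolist())  # [1, 0, 88]
```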

Winsorize within groups of dataframe

I have a dataframe like this:
df = pd.DataFrame([[1, 2],
                   [1, 4],
                   [1, 5],
                   [2, 65],
                   [2, 34],
                   [2, 23],
                   [2, 45]], columns=['label', 'score'])
Is there an efficient way to create a column score_winsor that winsorises the score column within the groups at the 1% level?
I tried this with no success:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))
You could use scipy's implementation of winsorize:
from scipy.stats.mstats import winsorize

df["score_winsor"] = df.groupby('label')['score'].transform(lambda row: winsorize(row, limits=[0.01, 0.01]))
Output
>>> df
   label  score  score_winsor
0      1      2             2
1      1      4             4
2      1      5             5
3      2     65            65
4      2     34            34
5      2     23            23
6      2     45            45
This works:
import numpy as np

df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))
Output
print(df.to_string())
   label  score  score_winsor
0      1      2          2.04
1      1      4          4.00
2      1      5          4.98
3      2     65         64.40
4      2     34         34.00
5      2     23         23.33
6      2     45         45.00
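The difference between the two answers is worth noting: scipy's winsorize replaces a whole number of extreme values per tail (floor(n * limit)), which is 0 for groups of 3-4 rows at a 1% limit, so the scipy output is unchanged; the quantile approach instead caps values at the interpolated 1st/99th percentiles. The capping can also be sketched more idiomatically with Series.clip, which gives the same result as the np.maximum/np.minimum chain:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [1, 5],
                   [2, 65], [2, 34], [2, 23], [2, 45]],
                  columns=['label', 'score'])

# Per group, cap scores at the 1st and 99th percentiles.
df['score_winsor'] = df.groupby('label')['score'].transform(
    lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99))
)
```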

Delete row of a dataframe on a condition

here is my first dataframe df1
269 270 271 346
0 1 153.00 2.14 1
1 1 153.21 3.89 2
2 1 153.90 2.02 1
3 1 154.18 3.02 1
4 1 154.47 2.30 1
5 1 154.66 2.73 1
6 1 155.35 2.82 1
7 1 155.70 2.32 1
8 1 220.00 15.50 1
9 0 152.64 1.44 1
10 0 152.04 2.20 1
11 0 150.48 1.59 1
12 0 149.88 1.73 1
13 0 129.00 0.01 1
here is my second dataframe df2
269 270 271 346
0 0 149.88 2.0 1
I would like the row at index 12 to be removed, because it has the same values as df2 in columns '269' and '270'.
Hope the solutions below match your requirement.
Using anti_join from dplyr
library(dplyr)
anti_join(df1, df2, by = c("269", "270"))
Using %in% operator
df1[!(df1$`269` %in% df2$`269` & df1$`270` %in% df2$`270`), ]
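The snippets above are R, although the question is about pandas. On the pandas side, the same anti-join can be sketched with merge and indicator=True, keeping only rows of df1 whose ('269', '270') pair does not appear in df2 (toy frames mirroring a few rows of the post):

```python
import pandas as pd

df1 = pd.DataFrame({'269': [1, 0, 0],
                    '270': [153.00, 149.88, 129.00],
                    '271': [2.14, 1.73, 0.01],
                    '346': [1, 1, 1]})
df2 = pd.DataFrame({'269': [0], '270': [149.88],
                    '271': [2.0], '346': [1]})

# Left-merge on the two key columns; _merge tells us which rows of df1
# found a match in df2.
merged = df1.merge(df2[['269', '270']], on=['269', '270'],
                   how='left', indicator=True)
result = df1[merged['_merge'].eq('left_only').to_numpy()]

print(result['270'].tolist())  # [153.0, 129.0]
```

Unlike the %in% version, which tests each column independently, this drops a row only when the pair of values matches together.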

Sorting a python pandas dataframe by matching values of two different columns and calculating the mean

I have a text file that imported as pandas dataframe looks like:
a b c d e
index
0 18 1 1 -30.47 0.746
1 19 1 1 -30.47 0.751
2 20 1 1 -30.47 0.801
3 33 2 1 -30.47 1.451
4 34 2 1 -30.47 1.534
5 35 2 1 -30.47 1.551
6 49 3 1 -30.47 2.297
7 50 3 1 -30.47 2.301
8 51 3 1 -30.47 2.351
9 64 4 1 -30.47 3.001
10 65 4 1 -30.47 3.085
11 66 4 1 -30.47 3.101
12 346 1 2 -28.47 0.601
13 347 1 2 -20 0.682
14 348 1 2 -28.47 0.701
15 362 2 2 -28.47 1.445
16 363 2 2 -28.47 1.451
17 364 2 2 -28.47 1.501
18 377 3 2 -28.47 2.151
19 378 3 2 -28.47 2.233
20 379 3 2 -28.47 2.251
21 392 4 2 -28.47 2.901
22 393 4 2 -28.47 2.996
23 394 4 2 -28.47 3.001
24 675 1 3 -25 0.596
25 676 1 3 -26 0.601
26 677 1 3 -22 0.651
27 690 2 3 -26.47 1.301
28 691 2 3 -26.47 1.384
29 692 2 3 -26.47 1.401
30 705 3 3 -26.47 2.051
31 706 3 3 -26.47 2.147
32 707 3 3 -26.47 2.151
33 721 4 3 -26.47 2.851
34 722 4 3 -26.47 2.935
35 723 4 3 -26.47 2.951
I have been trying to reorganize the dataframe as follows: each value in col two has multiple corresponding values in columns three and four. For example, value 1 (col two) corresponds to: value 1 (col three) with -30.47 (col four) three times; value 2 (col three) with -28.47, -20, -28.47 (col four); value 3 (col three) with -25, -26, -22 (col four); and so on. I would like to create a new dataframe where for value 1 there are three other corresponding columns, 1, 2 and 3, containing the mean of the three values of the original column four. The output should look like:
col 1  col 2                         col 3                      col 4
1      mean(-30.47, -30.47, -30.47)  mean(-28.47, -20, -28.47)  mean(-25, -26, -22)
The output should contain all the values of column two, in this case 1, 2, 3 and 4 (a 4x3 table). I am not an expert in Python; I have no idea how I should approach this task besides matching values in pairs. Any help is more than welcome!
IIUC:
df.groupby([2,3])[4].mean().reset_index(name='Mean').pivot(columns=3,index=2,values='Mean')
Output:
3 1 2 3
2
1 -30.47 -25.646667 -24.333333
2 -30.47 -28.470000 -26.470000
3 -30.47 -28.470000 -26.470000
4 -30.47 -28.470000 -26.470000
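The groupby/pivot chain can also be collapsed into a single pivot_table call, which groups and reshapes in one step; a sketch on a toy frame with integer column labels 2, 3, 4 standing in for the grouping, pivot and value columns of the post:

```python
import pandas as pd

# Toy data: one group value (1) in column 2, two pivot values in
# column 3, and the measurements in column 4.
df = pd.DataFrame({2: [1, 1, 1, 1, 1, 1],
                   3: [1, 1, 1, 2, 2, 2],
                   4: [-30.47, -30.47, -30.47, -28.47, -20.0, -28.47]})

# index=2 becomes the rows, columns=3 the columns, cells are means of 4.
out = df.pivot_table(index=2, columns=3, values=4, aggfunc='mean')

print(round(out.loc[1, 2], 4))  # -25.6467
```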

Set value from another dataframe

Having a data frame exex as
EXEX I J
1 702 2 3
2 3112 2 4
3 1360 2 5
4 702 3 2
5 221 3 5
6 591 3 11
7 3112 4 2
8 394 4 5
9 3416 4 11
10 1360 5 2
11 221 5 3
12 394 5 4
13 108 5 11
14 591 11 3
15 3416 11 4
16 108 11 5
is there a more efficient pandas approach to update an existing all-zero dataframe df with the values exex.EXEX, where exex.I gives the index and exex.J gives the column? Is there a way to update the data by specifying the labels instead of the row positions? If the label fields changed, the row positions would differ and could lead to an erroneous result.
I get it by:
df = pd.DataFrame(0, index=range(1, 908), columns=range(1, 908))
for index, row in exex12.iterrows():
    df.set_value(row[1], row[2], row[0])
Assign to df.values
df.values[exex.I.values - 1, exex.J.values - 1] = exex.EXEX.values
print(df.iloc[:5, :5])
1 2 3 4 5
1 0 0 0 0 0
2 0 0 702 3112 1360
3 0 702 0 0 221
4 0 3112 0 0 394
5 0 1360 221 394 0
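Note that DataFrame.set_value from the question's loop was deprecated in pandas 0.21 and removed in 1.0. If a loop is wanted at all, a label-based version can be sketched with .at, which addresses cells by index/column label rather than position (toy exex frame below; the vectorized df.values assignment above remains the faster option):

```python
import pandas as pd

# Toy version of the exex frame: value, row label, column label.
exex = pd.DataFrame({'EXEX': [702, 3112],
                     'I': [2, 2],
                     'J': [3, 4]})

df = pd.DataFrame(0, index=range(1, 6), columns=range(1, 6))
for row in exex.itertuples(index=False):
    df.at[row.I, row.J] = row.EXEX  # label-based scalar setter

print(df.at[2, 3])  # 702
```

Because .at works on labels, the result stays correct even if the frame's index is not the positions 0..n-1.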