Column values of multilevel indexed DataFrame are not properly updated - pandas

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(30).reshape(6,5), index=[list('aaabbb'), list('XYZXYZ')])
print(df)
df.loc[pd.IndexSlice['a'], 3] /= 10
print(df)
From the above code I expected the table below:
       0   1   2     3   4
a X    0   1   2   0.3   4
  Y    5   6   7   0.8   9
  Z   10  11  12   1.3  14
b X   15  16  17    18  19
  Y   20  21  22    23  24
  Z   25  26  27    28  29
But the actual result is the table below:
       0   1   2     3   4
a X    0   1   2   NaN   4
  Y    5   6   7   NaN   9
  Z   10  11  12   NaN  14
b X   15  16  17  18.0  19
  Y   20  21  22  23.0  24
  Z   25  26  27  28.0  29
What went wrong in the code?

You need to specify the second level with `:` to select all of its values; that way the selection keeps the full MultiIndex, so the assignment aligns correctly:
df.loc[pd.IndexSlice['a', :], 3] /= 10
print(df)
       0   1   2     3   4
a X    0   1   2   0.3   4
  Y    5   6   7   0.8   9
  Z   10  11  12   1.3  14
b X   15  16  17  18.0  19
  Y   20  21  22  23.0  24
  Z   25  26  27  28.0  29
Solution with `slice`:
df.loc[(slice('a'), slice(None)), 3] /= 10
print(df)
       0   1   2     3   4
a X    0   1   2   0.3   4
  Y    5   6   7   0.8   9
  Z   10  11  12   1.3  14
b X   15  16  17  18.0  19
  Y   20  21  22  23.0  24
  Z   25  26  27  28.0  29
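As an aside (my reading, not stated in the answer): the NaN comes from index alignment. df.loc['a', 3] returns a Series indexed only by the second level (X, Y, Z), and when it is assigned back, pandas aligns it against the full two-level index, finds no matching labels, and fills NaN. A sketch of one more workaround that bypasses alignment entirely, assuming the same df as above:
# .values strips the index, so the assignment cannot misalign
df.loc['a', 3] = df.loc['a', 3].values / 10
print(df)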

Related

How can I plot two lines in one graph where the values of the lines do not exist for the same x-axis?

I would like to plot SupDem (variable) where e_boix_regime==1 and SupDem where e_boix_regime==0.
My data:
year  SupDem  e_boix_regime
1997    0.98              1
1998    0.75              0
My code:
dem = dem_aut[dem_aut["e_boix_regime"]==1].SupDem
aut = dem_aut[dem_aut["e_boix_regime"]==0].SupDem
year = dem_aut["year"]
plt.plot(year, dem, label="Support for Democracy in Democracies")
plt.plot(year, aut, label="Support for Democracy in Autocracies")
plt.show()
The error is the following: x and y must have same first dimension, but have shapes (53,) and (28,)
I just wanted to plot two lines together.
This can help you solve the problem. I hope you can reproduce the code with it:
two (or more) graphs in one plot with different x-axis AND y-axis scales in python
Issue
Your issue is with the shapes of x and y. To plot a graph, the x-values and y-values must have the same number of data points.
Solution
Take each year with the same dem_aut["e_boix_regime"]==1 and dem_aut["e_boix_regime"]==2 conditions that you apply to SupDem.
Source Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {
        "SupDem": np.random.randint(1, 11, 30),
        "year": np.random.randint(10, 21, 30),
        "e_boix_regime": np.random.randint(1, 3, 30),
    }
)  # see DataFrame below
df["e_boix_regime"].value_counts()  # 1 = 18, 2 = 12
df[df["e_boix_regime"] == 2][["SupDem", "year"]]  # see below
# you need the same number of data points for both the x and y axes, i.e. `year` and `SupDem`
plt.plot(
    df[df["e_boix_regime"] == 1]["year"], df[df["e_boix_regime"] == 1]["SupDem"],
    marker="o", label="e_boix_regime==1",
)
# hence applying the same condition for grabbing `year` that is applied to `SupDem`
plt.plot(
    df[df["e_boix_regime"] == 2]["year"], df[df["e_boix_regime"] == 2]["SupDem"],
    marker="o", label="e_boix_regime==2",
)
plt.xlabel("Year")
plt.ylabel("SupDem")
plt.legend()
plt.show()
Output
PS: Ignore the plotted data points; they are generated from random values.
DataFrame Outputs
SupDem year e_boix_regime
0 1 12 2
1 10 10 1
2 5 19 2
3 4 14 2
4 8 14 2
5 4 17 2
6 2 15 2
7 10 11 1
8 8 11 2
9 6 19 2
10 5 15 1
11 8 17 1
12 9 10 2
13 1 14 2
14 8 18 1
15 3 13 2
16 6 16 2
17 1 16 1
18 7 13 1
19 8 15 2
20 2 17 2
21 5 10 2
22 1 19 2
23 5 20 2
24 7 16 1
25 10 14 1
26 2 11 2
27 1 18 1
28 5 16 1
29 10 18 2
df[df["e_boix_regime"] == 2][["SupDem", "year"]]
SupDem year
0 1 12
2 5 19
3 4 14
4 8 14
5 4 17
6 2 15
8 8 11
9 6 19
12 9 10
13 1 14
15 3 13
16 6 16
19 8 15
20 2 17
21 5 10
22 1 19
23 5 20
26 2 11
29 10 18
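For completeness, a sketch of the same fix applied to the original dem_aut data (assuming the columns shown in the question):
# filter `year` with the same mask used for `SupDem`, so the shapes match
dem = dem_aut[dem_aut["e_boix_regime"] == 1]
aut = dem_aut[dem_aut["e_boix_regime"] == 0]
plt.plot(dem["year"], dem["SupDem"], label="Support for Democracy in Democracies")
plt.plot(aut["year"], aut["SupDem"], label="Support for Democracy in Autocracies")
plt.legend()
plt.show()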

`groupby` - `qcut` but with condition

I have a dataframe as follows:
key1 key2 val
0 a x 8
1 a x 6
2 a x 7
3 a x 4
4 a x 9
5 a x 1
6 a x 2
7 a x 3
8 a x 10
9 a x 5
10 a y 4
11 a y 9
12 a y 1
13 a y 2
14 b x 17
15 b x 15
16 b x 18
17 b x 19
18 b x 12
19 b x 20
20 b x 14
21 b x 13
22 b x 16
23 b x 11
24 b y 2
25 b y 3
26 b y 10
27 b y 5
28 b y 4
29 b y 24
30 b y 22
What I need to do is:
Access each group by key1.
In each group of key1, do qcut on the observations where key2 == x.
For observations that fall outside the bin range, assign them to the lowest or highest bin.
According to the dataframe above, the first group key1 = a spans index 0-13. However, only index 0-9 is used to create the bins (thresholds), which are then applied to index 0-13.
The second group key1 = b spans index 14-30. Only index 14-23 is used to create the bins, which are then applied to index 14-30.
However, index 24-28 and index 29-30 fall outside the bin range, so index 24-28 is assigned to the smallest bin and index 29-30 to the largest bin.
The output looks like this:
key1 key2 val labels
0 a x 8 1
1 a x 6 1
2 a x 7 1
3 a x 4 0
4 a x 9 1
5 a x 1 0
6 a x 2 0
7 a x 3 0
8 a x 10 1
9 a x 5 0
10 a y 4 0
11 a y 9 1
12 a y 1 0
13 a y 2 0
14 b x 17 1
15 b x 15 0
16 b x 18 1
17 b x 19 1
18 b x 12 0
19 b x 20 1
20 b x 14 0
21 b x 13 0
22 b x 16 1
23 b x 11 0
24 b y 2 0
25 b y 3 0
26 b y 10 0
27 b y 5 0
28 b y 4 0
29 b y 24 1
30 b y 22 1
My solution: I create a dict to contain the bins (for simplicity, take q=2):
dict_bins = {}
key_unique = data['key1'].unique()
for k in key_unique:
    sub = data[(data['key1'] == k) & (data['key2'] == 'x')].copy()
    dict_bins[k] = pd.qcut(sub['val'], 2, labels=False, retbins=True)[1]
Then I intend to use groupby with apply, but I get stuck on accessing dict_bins:
data['sort_key1'] = data.groupby(['key1'])['val'].apply(lambda g: --- stuck---)
Any other solution, or a modification to my solution, is appreciated.
Thank you.
A first approach is to create a custom function:
def discretize(df):
    # bin edges are computed from the key2 == 'x' rows only
    bins = pd.qcut(df.loc[df['key2'] == 'x', 'val'], 2, labels=False, retbins=True)[1]
    # replace the outer edges with -inf/+inf so out-of-range values
    # fall into the first/last bin instead of becoming NaN
    bins = [-np.inf] + bins[1:-1].tolist() + [np.inf]
    return pd.cut(df['val'], bins, labels=False)
df['label'] = df.groupby('key1').apply(discretize).droplevel(0)
Output:
>>> df
key1 key2 val label
0 a x 8 1
1 a x 6 1
2 a x 7 1
3 a x 4 0
4 a x 9 1
5 a x 1 0
6 a x 2 0
7 a x 3 0
8 a x 10 1
9 a x 5 0
10 a y 4 0
11 a y 9 1
12 a y 1 0
13 a y 2 0
14 b x 17 1
15 b x 15 0
16 b x 18 1
17 b x 19 1
18 b x 12 0
19 b x 20 1
20 b x 14 0
21 b x 13 0
22 b x 16 1
23 b x 11 0
24 b y 2 0
25 b y 3 0
26 b y 10 0
27 b y 5 0
28 b y 4 0
29 b y 24 1
30 b y 22 1
You need to drop the first level of the index to align the indexes:
>>> df.groupby('key1').apply(discretize)
key1 # <- you have to drop this index level
a 0 1
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 0
11 1
12 0
13 0
b 14 1
15 0
16 1
17 1
18 0
19 1
20 0
21 0
22 1
23 0
24 0
25 0
26 0
27 0
28 0
29 1
30 1
Name: val, dtype: int64
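To see what the edge replacement buys you, here is a small sketch using group b's key2 == 'x' values (11-20) from the question; the edge 15.5 is what pd.qcut computes for these numbers:
import numpy as np
import pandas as pd

vals = pd.Series([17, 15, 18, 19, 12, 20, 14, 13, 16, 11])
edges = pd.qcut(vals, 2, retbins=True)[1]           # [11.0, 15.5, 20.0]
bins = [-np.inf] + edges[1:-1].tolist() + [np.inf]  # [-inf, 15.5, inf]
# out-of-range values now clip to the outer bins instead of becoming NaN
print(pd.cut(pd.Series([2, 24]), bins, labels=False))  # 2 -> 0, 24 -> 1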

Required data frame after explode, or another option to fill a running difference between two columns in a pandas dataframe

Input data frame as given below:
data = {
    'labels': ["A","B","A","B","A","B","M","B","M","B","M"],
    'start': [0,9,13,23,47,77,81,92,100,104,118],
    'stop': [9,13,23,47,77,81,92,100,104,118,145],
}
df = pd.DataFrame.from_dict(data)
labels start stop
0 A 0 9
1 B 9 13
2 A 13 23
3 B 23 47
4 A 47 77
5 B 77 81
6 M 81 92
7 B 92 100
8 M 100 104
9 B 104 118
10 M 118 145
The required output data frame is as below.
Try this:
df['start'] = df.apply(lambda x: range(x['start'] + 1, x['stop'] + 1), axis=1)
df = df.explode('start')
Output:
>>> df
labels start stop
0 A 1 9
0 A 2 9
0 A 3 9
0 A 4 9
0 A 5 9
0 A 6 9
0 A 7 9
0 A 8 9
0 A 9 9
1 B 10 13
1 B 11 13
1 B 12 13
1 B 13 13
2 A 14 23
2 A 15 23
2 A 16 23
2 A 17 23
2 A 18 23
2 A 19 23
2 A 20 23
2 A 21 23
2 A 22 23
2 A 23 23
...
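In case it helps to see the mechanics on a single row (say the first one, A / 0 / 9):
import pandas as pd

row = pd.DataFrame({'labels': ['A'], 'start': [0], 'stop': [9]})
row['start'] = row.apply(lambda x: range(x['start'] + 1, x['stop'] + 1), axis=1)
# range(1, 10) -> explode emits nine rows with start = 1..9 and stop = 9
print(row.explode('start'))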

Keep only the first value on duplicated column (set 0 to others)

Supposing I have the following situation:
A dataframe where the first column, ['ID'], will sometimes have duplicated values.
import pandas as pd

df = pd.DataFrame({"ID": [1,2,3,4,4,5,5,5,6,6],
                   "l_1": [10,12,32,45,45,20,20,20,20,20],
                   "l_2": [11,12,32,11,21,27,38,12,9,6],
                   "l_3": [5,9,32,12,21,21,18,12,8,1],
                   "l_4": [6,21,12,77,77,2,2,2,8,8]})
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 45 21 21 77
5 20 27 21 2
5 20 38 18 2
5 20 12 12 2
6 20 9 8 8
6 20 6 1 8
When duplicated IDs occur:
I need to keep only the first value in columns l_1 and l_4 (the other duplicated rows must be zero).
Columns 'l_2' and 'l_3' must stay the same.
When duplicated IDs occur, the values in those rows in columns l_1 and l_4 are also duplicated.
Expected output:
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 0 21 21 0
5 20 27 21 2
5 0 38 18 0
5 0 12 12 0
6 20 9 8 8
6 0 6 1 0
Is there a straightforward way using pandas or numpy to accomplish this?
I could only accomplish it by doing all these steps:
x1 = df[df.duplicated(subset=['ID'], keep=False)].copy()
x1.loc[x1.groupby('ID')['l_1'].apply(lambda x: (x.shift(1) == x)), 'l_1'] = 0
x1.loc[x1.groupby('ID')['l_4'].apply(lambda x: (x.shift(1) == x)), 'l_4'] = 0
df = df.drop_duplicates(subset=['ID'], keep=False)
df = pd.concat([df, x1])
Isn't this just:
df.loc[df.duplicated('ID'), ['l_1','l_4']] = 0
Output:
ID l_1 l_2 l_3 l_4
0 1 10 11 5 6
1 2 12 12 9 21
2 3 32 32 32 12
3 4 45 11 12 77
4 4 0 21 21 0
5 5 20 27 21 2
6 5 0 38 18 0
7 5 0 12 12 0
8 6 20 9 8 8
9 6 0 6 1 0
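It works because duplicated('ID') (with the default keep='first') is False on the first occurrence of each ID and True on every repeat, so .loc zeroes l_1/l_4 only on the repeats:
# IDs: [1,2,3,4,4,5,5,5,6,6]
print(df.duplicated('ID').tolist())
# [False, False, False, False, True, False, True, True, False, True]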

Adding a column to a pandas dataframe that is the sum of 3 different rows in another column AND slides those rows down like in Excel

I have a data frame like:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 15
5 16 17 18
6 19 20 21
7 22 23 24
8 25 26 27
I'd like to add a column d that is the sum of column a row 0, column a row 2, and column a row 5.
I figured out how to do:
df['d'] = df.loc[0,'a'] + df.loc[2,'a'] + df.loc[5,'a']
But the result is a static d tied to only those rows. I'd like a dynamic d that slides down, such that column d, row 1 is the sum of column a rows 1, 3, and 6, and so on.
The end result should be:
a b c d
0 1 2 3 24
1 4 5 6 33
2 7 8 9 42
3 10 11 12 ---And so on
4 13 14 15 ---
5 16 17 18 ---
6 19 20 21 ---
7 22 23 24 ---
8 25 26 27 ---
Thanks for any help!
This is `shift`:
df.a+df.a.shift(-2)+df.a.shift(-5)
Out[412]:
0 24.0
1 33.0
2 42.0
3 51.0
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
Name: a, dtype: float64
df['d']=df.a+df.a.shift(-2)+df.a.shift(-5)
df
Out[414]:
a b c d
0 1 2 3 24.0
1 4 5 6 33.0
2 7 8 9 42.0
3 10 11 12 51.0
4 13 14 15 NaN
5 16 17 18 NaN
6 19 20 21 NaN
7 22 23 24 NaN
8 25 26 27 NaN
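A side note (my addition, not part of the answer): the trailing NaNs appear because shift(-2) and shift(-5) pull values up and leave holes at the end, and any sum involving NaN is NaN. If zeros are acceptable there, fill_value is one option:
# assumption: missing shifted values may be treated as 0
df['d'] = df.a + df.a.shift(-2, fill_value=0) + df.a.shift(-5, fill_value=0)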