Make bins coarser on pandas dataframe, and sum in counting columns - pandas

I have a dataframe with a variable (E), where the value in the dataframe is the left edge of the bin, plus a set of occupancies for each bin (n) and the squared uncertainties (v). At the moment these are binned from 200 to 2000 in steps of 100 (usually), with a final bin from 2000 to +inf. However, these bins are too fine for the plotting I need to perform, so I need to rebin them into 200, 300, 400, 600, 1000, +inf.
Key Point: Because I am reading several sets of data like this from a source, not all my dataframes have entries e.g. for bin 600-700, i.e. some rows will be missing from one dataframe, while another may have entries for them. I need to rebin and sum n and v based on the new bins, while accounting for the fact that my dataframes aren't "regular".
Here's an example dataframe:
E n v
0 200.0 26.0 1.3
1 300.0 56.0 2.2
2 400.0 62.0 2.5
3 500.0 55.0 2.2
4 600.0 24.0 1.7
5 800.0 12.0 1.3
6 900.0 8.0 0.9
7 1000.0 4.0 0.6
8 1100.0 1.0 0.2
And here is my desired output:
E n v
0 200.0 26.0 1.3
1 300.0 56.0 2.2
2 400.0 117.0 4.7
3 600.0 44.0 3.9
4 1000.0 5.0 0.8
Any help or guidance is much appreciated.

You can use pd.cut together with groupby/agg:
import numpy as np
import pandas as pd

s = df.groupby(pd.cut(df.E, [200, 300, 400, 600, 1000, np.inf], right=False)).agg({'E': 'first', 'n': 'sum', 'v': 'sum'})
s.E = s.index.map(lambda x: x.left)
s.reset_index(drop=True, inplace=True)
s
E n v
0 200.0 26.0 1.3
1 300.0 56.0 2.2
2 400.0 117.0 4.7
3 600.0 44.0 3.9
4 1000.0 5.0 0.8
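A self-contained sketch of this approach, using the data from the question; passing observed=False keeps coarse bins that have no source rows (their sums come out as 0), which covers the "irregular" dataframes mentioned in the key point:

```python
import numpy as np
import pandas as pd

# Example dataframe from the question: E is the left bin edge,
# n the occupancy, v the squared uncertainty.
df = pd.DataFrame({
    'E': [200.0, 300.0, 400.0, 500.0, 600.0, 800.0, 900.0, 1000.0, 1100.0],
    'n': [26.0, 56.0, 62.0, 55.0, 24.0, 12.0, 8.0, 4.0, 1.0],
    'v': [1.3, 2.2, 2.5, 2.2, 1.7, 1.3, 0.9, 0.6, 0.2],
})

edges = [200, 300, 400, 600, 1000, np.inf]
# observed=False keeps coarse bins with no source rows,
# so all dataframes end up with the same set of bins.
s = (df.groupby(pd.cut(df.E, edges, right=False), observed=False)
       .agg({'n': 'sum', 'v': 'sum'}))
s.insert(0, 'E', [iv.left for iv in s.index])  # left edge of each new bin
s = s.reset_index(drop=True)
print(s)
```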

Related

find minimum value in a column based on condition in an another column of a dataframe?

I want to select the minimum value of a column range based on another column condition
0 1 2 3 4 Capacity Fixed Cost
80.0 270.0 250.0 160.0 180.0 NaN NaN
4.0 5.0 6.0 8.0 10.0 500.0 1000.0
6.0 4.0 3.0 5.0 8.0 500.0 1000.0
9.0 7.0 4.0 3.0 4.0 500.0 1000.0
I get the minimum value of the column with dv.loc[1:, i].min(), but I want to exclude rows where the Capacity is 0.
IIUC use:
df[df['Capacity'].ne(0)].min()
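A minimal sketch of that filter, with a hypothetical extra row whose Capacity is 0 so the exclusion is visible (the sample in the question has no such row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    0: [80.0, 4.0, 6.0, 9.0],
    1: [270.0, 5.0, 4.0, 7.0],
    2: [250.0, 6.0, 3.0, 4.0],
    3: [160.0, 8.0, 5.0, 3.0],
    4: [180.0, 10.0, 8.0, 4.0],
    'Capacity': [np.nan, 500.0, 500.0, 0.0],   # last Capacity set to 0 for illustration
    'Fixed Cost': [np.nan, 1000.0, 1000.0, 1000.0],
})

# Rows with Capacity == 0 are dropped before taking the column minimums.
mins = df[df['Capacity'].ne(0)].min()
print(mins)
```

Note that `.ne(0)` keeps rows where Capacity is NaN; add `.notna()` if those should be dropped too.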

Based on some rules, how to expand data in Pandas?

Please forgive my English; I hope I can put this clearly.
Assume we have this data:
>>> data = {'Span':[3,3.5], 'Low':[6.2,5.16], 'Medium':[4.93,4.1], 'High':[3.68,3.07], 'VeryHigh':[2.94,2.45], 'ExtraHigh':[2.48,2.06], '0.9':[4.9,3.61], '1.5':[3.23,2.38], '2':[2.51,1.85]}
>>> df = pd.DataFrame(data)
>>> df
Span Low Medium High VeryHigh ExtraHigh 0.9 1.5 2
0 3.0 6.20 4.93 3.68 2.94 2.48 4.90 3.23 2.51
1 3.5 5.16 4.10 3.07 2.45 2.06 3.61 2.38 1.85
I want to get this data:
Span Wind Snow MaxSpacing
0 3.0 Low 0.0 6.20
1 3.0 Medium 0.0 4.93
2 3.0 High 0.0 3.68
3 3.0 VeryHigh 0.0 2.94
4 3.0 ExtraHigh 0.0 2.48
5 3.0 0 0.9 4.90
6 3.0 0 1.5 3.23
7 3.0 0 2.0 2.51
8 3.5 Low 0.0 5.16
9 3.5 Medium 0.0 4.10
10 3.5 High 0.0 3.07
11 3.5 VeryHigh 0.0 2.45
12 3.5 ExtraHigh 0.0 2.06
13 3.5 0 0.9 3.61
14 3.5 0 1.5 2.38
15 3.5 0 2.0 1.85
The principles applied to df:
Span is expanded by the combinations of Wind and Snow to get MaxSpacing.
Wind and Snow are mutually exclusive: when Wind is one of 'Low', 'Medium', 'High', 'VeryHigh', 'ExtraHigh', Snow is zero; when Snow is one of 0.9, 1.5, 2, Wind is zero.
Please help. Thank you.
Use DataFrame.melt to unpivot, then sort by the original indices; create the Snow column with to_numeric plus Series.fillna inside DataFrame.insert, and finally set Wind to 0 where the label was numeric:
df = (df.melt('Span', ignore_index=False, var_name='Wind', value_name='MaxSpacing')
.sort_index(ignore_index=True))
s = pd.to_numeric(df['Wind'], errors='coerce')
df.insert(2, 'Snow', s.fillna(0))
df.loc[s.notna(), 'Wind'] = 0
print (df)
Span Wind Snow MaxSpacing
0 3.0 Low 0.0 6.20
1 3.0 Medium 0.0 4.93
2 3.0 High 0.0 3.68
3 3.0 VeryHigh 0.0 2.94
4 3.0 ExtraHigh 0.0 2.48
5 3.0 0 0.9 4.90
6 3.0 0 1.5 3.23
7 3.0 0 2.0 2.51
8 3.5 Low 0.0 5.16
9 3.5 Medium 0.0 4.10
10 3.5 High 0.0 3.07
11 3.5 VeryHigh 0.0 2.45
12 3.5 ExtraHigh 0.0 2.06
13 3.5 0 0.9 3.61
14 3.5 0 1.5 2.38
15 3.5 0 2.0 1.85
Alternative solution with DataFrame.set_index and DataFrame.stack:
df = df.set_index('Span').rename_axis('Wind', axis=1).stack().reset_index(name='MaxSpacing')
s = pd.to_numeric(df['Wind'], errors='coerce')
df.insert(2, 'Snow', s.fillna(0))
df.loc[s.notna(), 'Wind'] = 0
print (df)
(same output as the first solution)
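An end-to-end run of the melt solution, using the data from the question:

```python
import pandas as pd

data = {'Span': [3, 3.5], 'Low': [6.2, 5.16], 'Medium': [4.93, 4.1],
        'High': [3.68, 3.07], 'VeryHigh': [2.94, 2.45], 'ExtraHigh': [2.48, 2.06],
        '0.9': [4.9, 3.61], '1.5': [3.23, 2.38], '2': [2.51, 1.85]}
df = pd.DataFrame(data)

# Unpivot, keeping the original row index so rows belonging to the
# same Span stay together after the (stable) index sort.
out = (df.melt('Span', ignore_index=False, var_name='Wind', value_name='MaxSpacing')
         .sort_index(ignore_index=True))
s = pd.to_numeric(out['Wind'], errors='coerce')   # numeric labels are Snow values
out.insert(2, 'Snow', s.fillna(0))
out.loc[s.notna(), 'Wind'] = 0
print(out)
```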

Python: group by with sum special columns and keep the initial rows too

I have a df:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 0.0 3.0 0.0 0.0
I would like to change the values of row #6 (Passat) by adding the values from rows #2, #3 and #4 (Golf, Tiguan, Touareg), while keeping rows #2, #3 and #4 unchanged.
Passat includes Golf, Touareg and Tiguan, which is why I need to add the Golf, Touareg and Tiguan rows to the Passat row.
I tried the following code:
car_list = ['Golf', 'Tiguan', 'Touareg']
for car in car_list:
    df['Car'][df['Car'] == car] = 'Passat'
and then used groupby on Car with sum():
df1 = df.groupby(['Car'])[['Jan17', 'Jun18', 'Dec18', 'Apr19']].sum().reset_index()
As a result, df1 doesn't have the initial (Golf, Tiguan, Touareg) rows, so this way is wrong.
Expected result is df1:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 1.0 4.7 9.0 11.4
I'd appreciate for any idea. Thanks)
First we use .isin to select the relevant cars, then .filter to pick the value columns, and finally we sum those values into the variable sums.
Then we select the Passat row and add the values to that row:
sums = df[df['Car'].isin(car_list)].filter(regex=r'\w{3}\d{2}').sum()
df.loc[df['Car'].eq('Passat'), 'Jan17':] += sums
Output
ID Car Jan17 Jun18 Dec18 Apr19
0 0 Nissan 0.0 1.7 3.7 0.0
1 1 Porsche 10.0 0.0 2.8 3.5
2 2 Golf 0.0 1.7 3.0 2.0
3 3 Tiguan 1.0 0.0 3.0 5.2
4 4 Touareg 0.0 0.0 3.0 4.2
5 5 Mercedes 0.0 0.0 0.0 7.2
6 6 Passat 1.0 4.7 9.0 11.4
The same idea wrapped in a function:
car_list = ['Golf', 'Tiguan', 'Touareg', 'Passat']

def updateCarInfoBySum(df, car_list, name, id):
    req = df[df['Car'].isin(car_list)].copy()
    req.set_index(['Car', 'ID'], inplace=True)
    req.loc[('new_value', '000'), :] = req.sum(axis=0)
    req.reset_index(inplace=True)
    req = req[req.Car != name]
    req.loc[req['Car'] == 'new_value', 'Car'] = name
    req.loc[req['ID'] == '000', 'ID'] = id
    req.set_index(['Car', 'ID'], inplace=True)
    df_final = df.copy()
    df_final.set_index(['Car', 'ID'], inplace=True)
    df_final.update(req)
    return df_final
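A quick end-to-end check of the isin/filter answer above, with the data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': range(7),
    'Car': ['Nissan', 'Porsche', 'Golf', 'Tiguan', 'Touareg', 'Mercedes', 'Passat'],
    'Jan17': [0.0, 10.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    'Jun18': [1.7, 0.0, 1.7, 0.0, 0.0, 0.0, 3.0],
    'Dec18': [3.7, 2.8, 3.0, 3.0, 3.0, 0.0, 0.0],
    'Apr19': [0.0, 3.5, 2.0, 5.2, 4.2, 7.2, 0.0],
})

car_list = ['Golf', 'Tiguan', 'Touareg']
# Sum the month columns (names matching three letters + two digits)
# over the included cars, then add the totals to the Passat row.
sums = df[df['Car'].isin(car_list)].filter(regex=r'\w{3}\d{2}').sum()
df.loc[df['Car'].eq('Passat'), 'Jan17':] += sums
print(df)
```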

How to have uniform xticks from multiple dataframes with different order

I want to plot several different dataframes on one graph, each line coming from a different dataframe. The dataframes look something like this (they have many more columns):
a b c d e f g h i j
0 27.0 1.74 12.63 50.0 Sejuani 18.0 5.28 1.17 6.48
1 22.0 1.58 17.00 56.0 Kalista 12.0 10.92 2.58 3.94
2 20.0 1.77 18.25 58.3 Gangplank 11.0 9.56 1.71 4.77
3 17.0 1.72 16.87 60.5 Ryze 12.0 10.75 2.56 4.02
7 27.0 1.61 11.08 54.2 Braum 11.0 2.68 0.25 6.32
8 28.0 1.73 16.36 53.3 Azir 13.0 10.35 3.07 3.13
9 29.0 1.83 16.56 55.4 Gnar 11.0 9.49 1.71 3.83
16 35.0 1.23 17.72 52.1 Ezreal 11.0 10.5 2.23 4.10
By default, the dataframes are in descending order according to a column that contains float values. I want to plot a column with the x-axis of the names. Current code I have is:
indices = [x for x in range(0, 30)]
x_tick = list(df_LCK['name'])
plt.xticks(indices, x_tick, rotation='vertical')
plt.plot(indices, df_LCK['pb'][:30])
plt.show()
Since the top 30 items of each dataframe are different, the xticks I declared above are only correct for one plot and don't match the other four.
Ideally I could reorder the dataframes so they all follow one consistent order (alphabetical by name, rather than descending by a column).
How can I make the x-ticks match the y-values for the rest of the plots?
Should this be done while plotting, or is there a way to reorder the dataframes' indices beforehand?
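One way to sketch the reordering idea from the question (df_a and df_b below are hypothetical stand-ins for the df_LCK-style frames, each with a name and a pb column): fix a single name order, then reindex every frame to it, so each x position refers to the same name in every plot and one plt.xticks call labels all lines correctly.

```python
import pandas as pd

# Hypothetical stand-ins for the question's dataframes: each has a
# 'name' column and a float column 'pb', in its own descending order.
df_a = pd.DataFrame({'name': ['Azir', 'Braum', 'Gnar'], 'pb': [3.1, 2.7, 1.2]})
df_b = pd.DataFrame({'name': ['Gnar', 'Azir', 'Braum'], 'pb': [2.9, 2.0, 0.8]})

# Fix a single name order (e.g. the first frame's top 30), then
# reindex every other frame onto it before plotting.
order = list(df_a['name'])
aligned_b = df_b.set_index('name').reindex(order)['pb']
print(aligned_b)
# Plotting then becomes, for each frame:
#   plt.plot(range(len(order)), aligned['pb'])
# with a single plt.xticks(range(len(order)), order, rotation='vertical').
```

Names missing from a frame's top 30 simply become NaN after reindex, which matplotlib leaves as gaps in that line.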

Pandas Add Values in Interval By Specific Increment

Is there a way using pandas functions to add values/rows by a particular increment?
For example:
This is what I have:
df = pd.DataFrame([1.1,2,2.8])
df
value other1 other2
zebra 0.3 250
bunny 0.7 10
rat 1.0 35
cat 1.1 100
dog 2.0 150
mouse 2.8 125
EDIT 1:
This is what I want; ideally the inserted rows' indices are whatever is easiest, but the previous row names are preserved.
df_goal = pd.DataFrame([1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8])
df_goal
value other1 other2
zebra 0.3 250
1 0.4
2 0.5
3 0.6
bunny 0.7 10
5 0.8
6 0.9
rat 1.0 35
cat 1.1 100
1 1.2
2 1.3
3 1.4
4 1.5
5 1.6
6 1.7
7 1.8
8 1.9
dog 2.0 150
10 2.1
11 2.2
12 2.3
13 2.4
14 2.5
15 2.6
16 2.7
mouse 2.8 125
EDIT 2:
Also, I would like to keep the values the other columns already had; any new rows should simply be empty or zero there.
I think you can use reindex with an index built by numpy.arange:
import numpy as np

#create a float index from the value column
df = df.reset_index().set_index('value')
#reindex onto a regular float grid
step = 0.1
a = np.arange(df.index.min(), df.index.max() + step, step=step)
df = df.reindex(a, tolerance=step / 2., method='nearest')
#replace NaN in the other columns
cols = df.columns.difference(['index'])
df[cols] = df[cols].fillna('')
#fill the missing row labels from a range
s = pd.Series(np.arange(len(df.index)), index=df.index)
df['index'] = df['index'].combine_first(s)
#swap the column back with the index
df = df.reset_index().set_index('index')
print (df)
value other1 other2
index
zebra 0.3 250
1 0.4
2 0.5
3 0.6
bunny 0.7 10
5 0.8
6 0.9
rat 1.0 35
cat 1.1 100
9 1.2
10 1.3
11 1.4
12 1.5
13 1.6
14 1.7
15 1.8
16 1.9
dog 2.0 150
18 2.1
19 2.2
20 2.3
21 2.4
22 2.5
23 2.6
24 2.7
mouse 2.8 125
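A self-contained run of this answer (data as in the question, keeping just one extra column for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [0.3, 0.7, 1.0, 1.1, 2.0, 2.8],
                   'other1': [250, 10, 35, 100, 150, 125]},
                  index=['zebra', 'bunny', 'rat', 'cat', 'dog', 'mouse'])

df = df.reset_index().set_index('value')   # float index from the value column
step = 0.1
grid = np.arange(df.index.min(), df.index.max() + step, step=step)
# 'nearest' with a half-step tolerance snaps grid points onto existing
# rows; anything farther away becomes a new, empty row.
df = df.reindex(grid, tolerance=step / 2., method='nearest')
cols = df.columns.difference(['index'])
df[cols] = df[cols].fillna('')             # blank out the other columns
df['index'] = df['index'].combine_first(   # number the new rows
    pd.Series(np.arange(len(df.index)), index=df.index))
df = df.reset_index().set_index('index')
print(df)
```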