pandas groupby: apply a custom function that needs 2 columns as input to get one column as output

I have a dataframe with dates and a value per day. I want to see the gradient of the value: whether it is growing or declining. The best way is to apply a linear regression with day as x and value as y:
import pandas as pd

df = pd.DataFrame({'customer': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'day': [1, 2, 4, 2, 3, 4],
                   'value': [1.5, 2.4, 3.6, 1.5, 1.3, 1.1]})
df:
  customer  day  value
0        a    1    1.5
1        a    2    2.4
2        a    4    3.6
3        b    2    1.5
4        b    3    1.3
5        b    4    1.1
By hand I can do a linear regression:
from sklearn.linear_model import LinearRegression

def gradient(x, y):
    return LinearRegression().fit(x, y).coef_[0]

xa = df[df.customer == 'a'].day.values.reshape(-1, 1)
ya = df[df.customer == 'a'].value.values.reshape(-1, 1)
xb = df[df.customer == 'b'].day.values.reshape(-1, 1)
yb = df[df.customer == 'b'].value.values.reshape(-1, 1)
print(gradient(xa, ya), gradient(xb, yb))
result: [0.68571429] [-0.2]
But I would like to use a groupby as in
df.groupby('customer').agg({'value':['mean','sum','gradient']})
with an output like:
         value
          mean  sum  gradient
customer
a          2.5  7.5     0.685
b          1.3  3.9      -0.2
The issue is that the gradient needs 2 columns as input.

You can do:
# calculate gradient; pass y as 1-D so gradient() returns a scalar
v = (df
     .groupby('customer')
     .apply(lambda x: gradient(x['day'].to_numpy().reshape(-1, 1),
                               x['value'].to_numpy())))
v.name = 'gradient'
# calculate mean, sum
d1 = df.groupby('customer').agg({'value': ['mean', 'sum']})
# flatten the MultiIndex columns, then join the gradient
d1.columns = d1.columns.map(''.join)
d1 = d1.join(v)
print(d1)
          valuemean  valuesum  gradient
customer
a               2.5       7.5  0.685714
b               1.3       3.9 -0.200000
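A hedged alternative sketch: since only the slope is needed, np.polyfit can stand in for LinearRegression, and a single apply that returns a Series produces all three columns in one pass (the helper name summarize is my choice):

import numpy as np

def summarize(g):
    # slope of a degree-1 polynomial fit, i.e. the regression coefficient
    slope = np.polyfit(g['day'], g['value'], 1)[0]
    return pd.Series({'mean': g['value'].mean(),
                      'sum': g['value'].sum(),
                      'gradient': slope})

print(df.groupby('customer').apply(summarize))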


Quickly replace values in a Pandas DataFrame

I have the following dataframe:
df = pd.DataFrame({'A': [1, 2],
                   'B': [3, 4]}, index=['1', '2'])
df.loc[:, 'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)
#      A  B  Sum
# 1    1  3    4
# 2    2  4    6
# Sum  3  7   10
I want to:
replace 1 by 3*4/10
replace 2 by 3*6/10
replace 3 by 4*7/10
replace 4 by 7*6/10
What is the easiest way to do this? I want the solution to be able to extend to n number of rows and columns. Been cracking my head over this. TIA!
If I understood you correctly:
import numpy as np

df = pd.DataFrame({'A': [1, 2],
                   'B': [3, 4]}, index=['1', '2'])
df.loc[:, 'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)

conditions = [(df == 1), (df == 2), (df == 3), (df == 4)]
values = [(3*4)/10, (3*6)/10, (4*7)/10, (7*6)/10]
df[df.columns] = np.select(conditions, values, df)
Output:
       A    B   Sum
1    1.2  2.8   4.2
2    1.8  4.2   6.0
Sum  2.8  7.0  10.0
Let us try to create it from the original df, before you do the sum and assign:
import numpy as np

v = np.multiply.outer(df.sum(1).values, df.sum().values) / df.sum().sum()
out = pd.DataFrame(v, index=df.index, columns=df.columns)
out
Out[20]:
     A    B
1  1.2  2.8
2  1.8  4.2
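If the Sum row and column from the original layout are wanted back, they can be re-added the same way as before (a small sketch; by construction the outer-product result keeps the original row and column totals):

out.loc[:, 'Sum'] = out.sum(axis=1)
out.loc['Sum'] = out.sum(axis=0)
print(out)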

Calculate nearest distance to certain points in python

I have a dataset as shown below; each sample has X and Y values and a corresponding result:

Sr.  X  Y   Result
1    2  12  Positive
2    4  3   Positive
....
Visualization (grid size is 12 * 8): red = Positive, blue = Negative.

How can I calculate the nearest distance for each sample to the red (Positive) points?
Sr.  X  Y   Result    Nearest-distance-red
1    2  23  Positive  ?
2    4  3   Negative  ?
....
It's a lot easier when there is sample data; make sure to include that next time.
I generate random data:
import numpy as np
import pandas as pd

x = np.linspace(1, 50)
y = np.linspace(1, 50)
GRID = np.meshgrid(x, y)
grid_colors = 1 * (np.random.random(GRID[0].size) > .8)
sample_data = pd.DataFrame({'X': GRID[0].flatten(),
                            'Y': GRID[1].flatten(),
                            'grid_color': grid_colors})
sample_data.plot.scatter(x='X', y='Y', c='grid_color', colormap='bwr', figsize=(10, 10))
BallTree (or KDTree) can create a tree to query with
from sklearn.neighbors import BallTree
red_points = sample_data[sample_data.grid_color == 1]
blue_points = sample_data[sample_data.grid_color != 1]
tree = BallTree(red_points[['X','Y']], leaf_size=15, metric='minkowski')
and use it with
distance, index = tree.query(sample_data[['X','Y']], k=1)
now add it to the DataFrame (query returns (n, 1) arrays, so take column 0)
sample_data['nearest_point_distance'] = distance[:, 0]
sample_data['nearest_point_X'] = red_points.X.values[index[:, 0]]
sample_data['nearest_point_Y'] = red_points.Y.values[index[:, 0]]
which gives
     X    Y  grid_color  nearest_point_distance  nearest_point_X  \
0  1.0  1.0           0                     2.0              3.0
1  2.0  1.0           0                     1.0              3.0
2  3.0  1.0           1                     0.0              3.0
3  4.0  1.0           0                     1.0              3.0
4  5.0  1.0           1                     0.0              5.0

   nearest_point_Y
0              1.0
1              1.0
2              1.0
3              1.0
4              1.0
Modification so that red points do not find themselves: find the nearest k=2 instead of k=1;
distance, index = tree.query(sample_data[['X','Y']], k=2)
And, with the help of numpy indexing, make red points use the second hit instead of the first:
sample_size = GRID[0].size
# grid_color is 0 (blue) or 1 (red), so red points select the second-nearest column
sample_data['nearest_point_distance'] = distance[np.arange(sample_size), sample_data.grid_color]
sample_data['nearest_point_X'] = red_points.X.values[index[np.arange(sample_size), sample_data.grid_color]]
sample_data['nearest_point_Y'] = red_points.Y.values[index[np.arange(sample_size), sample_data.grid_color]]
The output type is the same, but due to randomness it won't agree with the picture made earlier.
scipy's cKDTree can calculate that distance for you. Something along these lines should work (query returns a (distance, index) pair):
from scipy.spatial import cKDTree

dist, _ = cKDTree(coordinates_of_red_points).query(df[['x', 'y']], k=1)
df['Distance_To_Red'] = dist

pandas groupby and agg operation of selected columns and row

I have a dataframe as below:
I am not sure if it is possible to use pandas to make an output as below:
difference = Response[Time == "pre"] - Response.min(), for each group
If pre is always the first row per group and the value should be repeated across the whole group:
df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: x.iat[0] - x.min())
To keep only the first value per group, it is possible to replace the repeated values with empty strings, but that mixes numeric and string values, which may be a problem for further processing:
df['diff'] = df['diff'].mask(df['diff'].duplicated(), '')
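A hedged variant of the same masking idea: leaving the repeats as NaN instead of '' keeps the column numeric for later processing:

df['diff'] = df['diff'].mask(df['diff'].duplicated())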
EDIT:
df = pd.DataFrame({'Response': [2, 5, 0.4, 2, 1, 4],
                   'Time': [7, 'pre', 9, 4, 2, 'pre'],
                   'IDs': list('aaabbb')})
#print (df)

d = df[df.Time == 'pre'].set_index('IDs')['Response'].to_dict()
print(d)
{'a': 5.0, 'b': 4.0}

df['diff'] = df.groupby('IDs')['Response'].transform(lambda x: d[x.name] - x.min())
print(df)
   Response Time IDs  diff
0       2.0    7   a   4.6
1       5.0  pre   a   4.6
2       0.4    9   a   4.6
3       2.0    4   b   3.0
4       1.0    2   b   3.0
5       4.0  pre   b   3.0
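A minimal sketch of the same idea without the dictionary, assuming exactly one pre row per IDs group: broadcast the pre value over each group with where and transform, then subtract the group minimum:

pre = df['Response'].where(df['Time'] == 'pre').groupby(df['IDs']).transform('max')
df['diff'] = pre - df.groupby('IDs')['Response'].transform('min')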

How do I use grouped data to plot rainfall averages in specific hourly ranges

I extracted the following data from a dataframe:
https://i.imgur.com/rCLfV83.jpg
The question is, how do I plot a graph, probably a histogram type, where the horizontal axis is the hours as bins [16:00, 17:00, 18:00, ... 24:00] and the bars are the average rainfall during each of those hours.
I just don't know enough pandas yet to get this off the ground, so I need some help. Sample data below, as requested.
Date        Hours  Precip
1996-07-30  21     1
1996-08-17  16     1
            18     1
1996-08-30  16     1
            17     1
            19     5
            22     1
1996-09-30  19     5
            20     5
1996-10-06  20     1
            21     1
1996-10-19  18     4
1996-10-30  19     1
1996-11-05  20     3
1996-11-16  16     1
            19     1
1996-11-17  16     1
1996-11-29  16     1
1996-12-04  16     9
            17     27
            19     1
1996-12-12  19     1
1996-12-30  19     10
            22     1
1997-01-18  20     1
It seems df is a multi-index DataFrame after a groupby.
Transform the index to a DatetimeIndex:
date_hour_idx = df.reset_index()[['Date', 'Hours']] \
    .apply(lambda x: '{} {}:00'.format(x['Date'], x['Hours']), axis=1)
precip_series = df.reset_index()['Precip']
precip_series.index = pd.to_datetime(date_hour_idx)
Resample to hours using 'H'
# This will show NaN for hours without an entry
resampled_nan = precip_series.resample('H').asfreq()
# This will fill NaN with 0s
resampled_fillna = precip_series.resample('H').asfreq().fillna(0)
If you want this to be the mean per hour, change your groupby(...).sum() to groupby(...).mean()
You can resample to other intervals too; more about resampling the DatetimeIndex is in the pandas docs: https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html
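To get the bar chart the question describes (hours of the day on the x-axis, average rainfall as bars), a short sketch assuming df is the grouped frame shown in the question:

import matplotlib.pyplot as plt

hourly_mean = df.reset_index().groupby('Hours')['Precip'].mean()
hourly_mean.plot.bar()
plt.xlabel('Hour of day')
plt.ylabel('Average precipitation')
plt.show()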
It seems to be easy when you have data. I generate artificial data with pandas for this example:
import pandas as pd
import radar
import random

'''>>> date'''
r2 = ()
for a in range(1, 51):
    t = (str(radar.random_datetime(start='1985-05-01', stop='1985-05-04')),)
    r2 = r2 + t
r3 = list(r2)
r3.sort()
#print(r3)

'''>>> variable'''
x = [random.randint(0, 16) for x in range(50)]
df = pd.DataFrame({'date': r3, 'measurement': x})
print(df)

'''order'''
col1 = df.join(df['date'].str.partition(' ')[[0, 2]]).rename({0: 'daty', 2: 'godziny'}, axis=1)
col2 = df['measurement'].rename('pomiary')
p3 = pd.concat([col1, col2], axis=1, sort=False)
p3 = p3.drop(['measurement'], axis=1)
p3 = p3.drop(['date'], axis=1)
Time for the mean and plot:
dx = p3.groupby(['daty'])['pomiary'].mean()  # select the numeric column before mean
print(dx)

import matplotlib.pyplot as plt
dx.plot.bar()
plt.show()
Plot of the mean measurements
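To bin by hour of day instead (what the original question asks for), the same frame can be grouped on the hour part of godziny; a sketch assuming the 'HH:MM:SS' strings produced above:

dh = p3.groupby(p3['godziny'].str.slice(0, 2))['pomiary'].mean()
dh.plot.bar()
plt.show()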

Pandas Creating Normal Dist series

I'm trying to convert an excel "normal distribution" formula into python.
(1-NORM.DIST(a+col,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
For example, here's my given df:
Id    a    b      c
ijk   4  3.5  12.53
xyz  12    3  10.74
My goal:
Id    a    b      c  0    1    2    3
ijk   4  3.5  12.53  1  .93  .87  .81
xyz  12    3  10.74  1  .87  .76  .66
Here's the math behind it:
column 0: always 1
column 1: (1-NORM.DIST(a+1,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
column 2: (1-NORM.DIST(a+2,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
column 3: (1-NORM.DIST(a+3,b,c,TRUE))/(1-NORM.DIST(a,b,c,TRUE))
This is what I have so far:
df1 = pd.DataFrame(df, columns=np.arange(0,4))
result = pd.concat([df, df1], axis=1, join_axes=[df.index])
result[0] = 1
I'm not sure what to do after this.
This is how I use the normal distribution function:
https://support.office.com/en-us/article/normdist-function-126db625-c53e-4591-9a22-c9ff422d6d58
Many many thanks!
NORM.DIST(..., TRUE) means the cumulative distribution function and 1 - NORM.DIST(..., TRUE) means the survival function. These are available under scipy's stats module (see ss.norm). For example,
import scipy.stats as ss
ss.norm.cdf(4, 3.5, 12.53)
Out:
0.51591526057026538
For your case, you can first define a function:
def normalize(a, b, c, col):
    return ss.norm.sf(a + col, b, c) / ss.norm.sf(a, b, c)
and call that function with apply:
for col in range(4):
    df[col] = df.apply(lambda x: normalize(x.a, x.b, x.c, col), axis=1)
df
Out:
    Id   a    b      c    0         1         2         3
0  ijk   4  3.5  12.53  1.0  0.934455  0.869533  0.805636
1  xyz  12  3.0  10.74  1.0  0.875050  0.760469  0.656303
This is not the most efficient approach, as it calculates the survival function for the same values repeatedly and involves two loops. One level of looping can be removed by passing an array of values to ss.norm.sf:
import numpy as np

out = df.apply(
    lambda x: pd.Series(
        ss.norm.sf(x.a + np.arange(4), x.b, x.c) / ss.norm.sf(x.a, x.b, x.c)
    ), axis=1
)
Out:
     0         1         2         3
0  1.0  0.934455  0.869533  0.805636
1  1.0  0.875050  0.760469  0.656303
And you can use join to add this to your original DataFrame:
df.join(out)
Out:
    Id   a    b      c    0         1         2         3
0  ijk   4  3.5  12.53  1.0  0.934455  0.869533  0.805636
1  xyz  12  3.0  10.74  1.0  0.875050  0.760469  0.656303
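Both loops can be removed entirely with numpy broadcasting, evaluating the survival function on an (n_rows, 4) grid in a single call; a hedged sketch of that idea:

import numpy as np
import pandas as pd
import scipy.stats as ss

offsets = np.arange(4)
# column vectors of shape (n, 1) broadcast against the four offsets
a = df['a'].to_numpy()[:, None]
b = df['b'].to_numpy()[:, None]
c = df['c'].to_numpy()[:, None]
ratios = ss.norm.sf(a + offsets, b, c) / ss.norm.sf(a, b, c)
result = df.join(pd.DataFrame(ratios, index=df.index))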