Finding greatest fall and rise in a dynamic rolling window based on index - pandas

I have a df of readings as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1000, size=100), index=range(100), columns = ['reading'])
I want to find the greatest rise and the greatest fall for each row based on its index, which theoretically may be achieved using the formula...
How can this be coded?
I tried:
df.assign(gr8Rise=df.rolling(df.index).apply(lambda x: x[-1]-x[0], raw=True).max())
...and failed with ValueError: window must be an integer, since rolling expects a fixed integer window size rather than an Index object.
UPDATE: Based on @jezrael's dataset, the expected output for gr8Rise is as shown in the answer below:

Use:
np.random.seed(2019)
df = pd.DataFrame(np.random.randint(100, size=10), index=range(10), columns = ['reading'])
df['gr8Rise'] = [df['reading'].rolling(x).apply(lambda w: w[0]-w[-1], raw=True).max()
                 for x in range(1, len(df)+1)]
df.loc[0, 'gr8Rise']= np.nan
print (df)
   reading  gr8Rise
0       72      NaN
1       31     41.0
2       37     64.0
3       88     59.0
4       62     73.0
5       24     76.0
6       29     72.0
7       15     57.0
8       12     60.0
9       16     56.0
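The greatest fall can be derived the same way by reversing the difference inside each window. A minimal sketch of that variant (the gr8Fall column name is my own, not from the original answer):
# reversed difference: last element minus first element of each window
df['gr8Fall'] = [df['reading'].rolling(x).apply(lambda w: w[-1]-w[0], raw=True).max()
                 for x in range(1, len(df)+1)]
df.loc[0, 'gr8Fall'] = np.nan
Keep in mind that this runs one full rolling pass per row, so it is quadratic in the length of the series.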

index compatibility of dataframe with multiindex result from apply on group

We have to apply an algorithm to columns in a DataFrame; the data has to be grouped by a key, and the result shall form a new column in the DataFrame. Since this is a common use case, we wonder whether we have chosen a correct approach.
The following code reflects our approach to the problem in a simplified manner.
import numpy as np
import pandas as pd
np.random.seed(42)
N = 100
key = np.random.randint(0, 2, N).cumsum()
x = np.random.rand(N)
data = dict(key=key, x=x)
df = pd.DataFrame(data)
This generates a DataFrame as follows.
key x
0 0 0.969585
1 1 0.775133
2 1 0.939499
3 1 0.894827
4 1 0.597900
.. ... ...
95 53 0.036887
96 54 0.609564
97 55 0.502679
98 56 0.051479
99 56 0.278646
Exemplary methods are then applied to the DataFrame groups.
def magic(x, const):
    return (x + np.abs(np.random.rand(len(x))) + float(const)).round(1)

def pandas_confrom_magic(df_per_key, const=1):
    index = df_per_key['x'].index  # preserve index
    x = df_per_key['x'].to_numpy()
    y = magic(x, const)  # perform some pandas-incompatible magic
    return pd.Series(y, index=index)  # reconstruct index
g = df.groupby('key')
y_per_g = g.apply(lambda df: pandas_confrom_magic(df, const=5))
When assigning the result to a new column with df['y'] = y_per_g, it throws a TypeError.
TypeError: incompatible index of inserted column with frame index
Thus a compatible MultiIndex needs to be introduced first.
df.index.name = 'index'
df = df.set_index('key', append=True).reorder_levels(['key', 'index'])
df['y'] = y_per_g
df.reset_index('key', inplace=True)
Which yields the intended result.
       key         x    y
index
0        0  0.969585  6.9
1        1  0.775133  6.0
2        1  0.939499  6.1
3        1  0.894827  6.4
4        1  0.597900  6.6
...    ...       ...  ...
95      53  0.036887  6.0
96      54  0.609564  6.0
97      55  0.502679  6.5
98      56  0.051479  6.0
99      56  0.278646  6.1
Now we wonder whether there is a more straightforward way of dealing with the index and whether we have generally chosen a favorable approach.
Use Series.droplevel to remove the first level of the MultiIndex, so that the result has the same index as df; the assignment then works well:
g = df.groupby('key')
df['y'] = g.apply(lambda df: pandas_confrom_magic(df, const=5)).droplevel('key')
print (df)
    key         x    y
0     0  0.969585  6.9
1     1  0.775133  6.0
2     1  0.939499  6.1
3     1  0.894827  6.4
4     1  0.597900  6.6
..  ...       ...  ...
95   53  0.036887  6.0
96   54  0.609564  6.0
97   55  0.502679  6.5
98   56  0.051479  6.0
99   56  0.278646  6.1

[100 rows x 3 columns]
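As a side note (my addition, not part of the answer above), the extra key level can also be avoided up front by telling groupby not to add it; a minimal sketch, assuming a pandas version whose groupby accepts the group_keys argument:
# group_keys=False keeps the original index on the applied result,
# so the Series can be assigned directly without droplevel
g = df.groupby('key', group_keys=False)
df['y'] = g.apply(lambda d: pandas_confrom_magic(d, const=5))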

How do I use grouped data to plot rainfall averages in specific hourly ranges

I extracted the following data from a dataframe .
https://i.imgur.com/rCLfV83.jpg
The question is: how do I plot a graph, probably a histogram type, where the horizontal axis shows the hours as bins [16:00 17:00 18:00 ...24:00] and the bars are the average rainfall during each of those hours?
I just don't know enough pandas yet to get this off the ground, so I need some help. Sample data below, as requested.
Date        Hours  Precip
1996-07-30  21     1
1996-08-17  16     1
            18     1
1996-08-30  16     1
            17     1
            19     5
            22     1
1996-09-30  19     5
            20     5
1996-10-06  20     1
            21     1
1996-10-19  18     4
1996-10-30  19     1
1996-11-05  20     3
1996-11-16  16     1
            19     1
1996-11-17  16     1
1996-11-29  16     1
1996-12-04  16     9
            17     27
            19     1
1996-12-12  19     1
1996-12-30  19     10
            22     1
1997-01-18  20     1
It seems df is a multi-index DataFrame after a groupby.
Transform the index to a DatetimeIndex
date_hour_idx = df.reset_index()[['Date', 'Hours']] \
    .apply(lambda x: '{} {}:00'.format(x['Date'], x['Hours']), axis=1)
precip_series = df.reset_index()['Precip']
precip_series.index = pd.to_datetime(date_hour_idx)
Resample to hours using 'H'
# This will show NaN for hours without an entry
resampled_nan = precip_series.resample('H').asfreq()
# This will fill NaN with 0s
resampled_fillna = precip_series.resample('H').asfreq().fillna(0)
If you want this to be the mean per hour, change your groupby(...).sum() to groupby(...).mean()
You can resample to other intervals too; see the pandas resampling documentation for more about resampling a DatetimeIndex: https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html
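To get from the resampled series to the requested bar chart, here is a minimal sketch (my addition), assuming the mean precipitation per hour of day over the whole record is what is wanted:
import matplotlib.pyplot as plt

# average the hourly series by hour of day and plot the means as bars
hourly_mean = resampled_fillna.groupby(resampled_fillna.index.hour).mean()
hourly_mean.plot.bar()
plt.xlabel('Hour of day')
plt.ylabel('Mean precipitation')
plt.show()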
It is easier to show with data in hand, so I generate artificial data for this example:
import pandas as pd
import radar
import random
'''>>> date'''
r2 = ()
for a in range(1, 51):
    t = (str(radar.random_datetime(start='1985-05-01', stop='1985-05-04')),)
    r2 = r2 + t
r3 = list(r2)
r3.sort()
#print(r3)

'''>>> variable'''
x = [random.randint(0, 16) for x in range(50)]
df = pd.DataFrame({'date': r3, 'measurement': x})
print(df)

'''order'''
col1 = df.join(df['date'].str.partition(' ')[[0, 2]]).rename({0: 'daty', 2: 'godziny'}, axis=1)
col2 = df['measurement'].rename('pomiary')
p3 = pd.concat([col1, col2], axis=1, sort=False)
p3 = p3.drop(['measurement'], axis=1)
p3 = p3.drop(['date'], axis=1)
Time to take the mean and plot:
dx = p3.groupby(['daty']).mean()
print(dx)
import matplotlib.pyplot as plt
dx.plot.bar()
plt.show()
Plot of the mean measurements
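Since the question asks for hourly averages, the same data can be grouped by the hour part of the 'godziny' time string instead of the date; a minimal sketch (my addition, not from the original answer):
# take the HH part of the time string and average the measurements per hour
dh = p3.groupby(p3['godziny'].str[:2])['pomiary'].mean()
dh.plot.bar()
plt.show()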

How to create a function to convert monthly data to daily, weekly in pandas dataframe?

I have the below monthly data in the dataframe and I need to convert the data to weekly, daily, biweekly.
date        chair_price  vol_chair
01-09-2018           23         30
01-10-2018           53         20
daily: price as same and vol_chair divided by days of the month
weekly: price as same and vol_chair divided by number of weeks in a month
expected output:
daily:
date        chair_price  vol_chair
01-09-2018           23          1
02-09-2018           23          1
03-09-2018           23          1
..
30-09-2018           23          1
01-10-2018           53       0.64
..
31-10-2018           53       0.64
weekly:
date        chair_price  vol_chair
02-09-2018           23          6
09-09-2018           23          6
16-09-2018           23          6
23-09-2018           23          6
30-09-2018           23          6
07-10-2018           53          5
14-10-2018           53          5
..
I am using the code below for the vol column. Is there any quick way to do it all together, i.e. keep the price the same while acting on vol, and find the number of weeks in a month?
df.resample('W').ffill().agg(lambda x: x/4)
df.resample('D').ffill().agg(lambda x: x/30)
and I need to use calendar.monthrange(2012, 1)[1] to identify the number of days in a month.
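For reference, calendar.monthrange returns a (weekday of the first day, number of days in the month) tuple:
import calendar

print(calendar.monthrange(2012, 1))     # (6, 31): January 2012 starts on a Sunday
print(calendar.monthrange(2012, 1)[1])  # 31 days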
def func_count_number_of_weeks(df):
    return len(calendar.monthcalendar(df['DateRange'].year, df['DateRange'].month))

def func_convert_from_monthly(df, col, category, columns):
    if category == "Daily":
        df['number_of_days'] = df['DateRange'].dt.daysinmonth
        for column in columns:
            df[column] = df[column] / df['number_of_days']
        df.drop('number_of_days', axis=1, inplace=True)
    elif category == "Weekly":
        df['number_of_weeks'] = df.apply(func_count_number_of_weeks, axis=1)
        for column in columns:
            df[column] = df[column] / df['number_of_weeks']
        df.drop('number_of_weeks', axis=1, inplace=True)
    return df

def func_resample_from_monthly(df, col, category):
    df.set_index(col, inplace=True)
    df.index = pd.to_datetime(df.index, dayfirst=True)
    if category == "Monthly":
        df = df.resample('MS').ffill()
    elif category == "Weekly":
        df = df.resample('W').ffill()
    return df
Use:
#convert to datetimeindex
df.index = pd.to_datetime(df.index, dayfirst=True)
#add new next month for correct resample
idx = df.index[-1] + pd.offsets.MonthBegin(1)
df = df.append(df.iloc[[-1]].rename({df.index[-1]: idx}))  # in pandas >= 2.0 use pd.concat instead
#resample with forward filling values, remove last helper row
#df1 = df.resample('D').ffill().iloc[:-1]
df1 = df.resample('W').ffill().iloc[:-1]
#divide by size of months
df1['vol_chair'] /= df1.resample('MS')['vol_chair'].transform('size')
print (df1)
            chair_price  vol_chair
date
2018-09-02           23        6.0
2018-09-09           23        6.0
2018-09-16           23        6.0
2018-09-23           23        6.0
2018-09-30           23        6.0
2018-10-07           53        5.0
2018-10-14           53        5.0
2018-10-21           53        5.0
2018-10-28           53        5.0
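The commented-out daily resample works analogously; a minimal sketch of that variant (my addition, not from the original answer), dividing by the number of days in each month via the daysinmonth attribute of the DatetimeIndex:
# resample daily, drop the helper row, then divide by days in the month
df1 = df.resample('D').ffill().iloc[:-1]
df1['vol_chair'] /= df1.index.daysinmonth
print (df1)
For September this divides 30 by 30 (giving 1.0) and for October 20 by 31 (giving roughly 0.645), matching the expected output.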

assigning title to intervals in pandas

import numpy as np
import pandas as pd
xlist = np.arange(1, 100).tolist()
df = pd.DataFrame(xlist,columns=['Numbers'],dtype=int)
pd.cut(df['Numbers'],5)
How do I assign a column name to each distinct interval created?
IIUC, you can use the pd.concat function and join them in a new DataFrame based on indexes:
# get indexes
l = df.index.tolist()
n = 20
indexes = [l[i:i + n] for i in range(0, len(l), n)]
# create new data frame
new_df = pd.concat([df.iloc[x].reset_index(drop=True) for x in indexes], axis=1)
new_df.columns = ['Numbers'+str(x) for x in range(new_df.shape[1])]
print(new_df)
   Numbers0  Numbers1  Numbers2  Numbers3  Numbers4
0         1        21        41        61      81.0
1         2        22        42        62      82.0
2         3        23        43        63      83.0
3         4        24        44        64      84.0
4         5        25        45        65      85.0
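Alternatively (my addition, a sketch), the intervals from pd.cut can be labeled directly and the labels used to spread the values into columns:
# tag each value with a bin label, then spread the bins into columns
binned = pd.cut(df['Numbers'], 5, labels=['Numbers' + str(i) for i in range(5)])
new_df = pd.DataFrame({name: grp.reset_index(drop=True)
                       for name, grp in df['Numbers'].groupby(binned)})
print(new_df.head())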

Condition in Pandas

I have a very peculiar problem in Pandas: one condition works but the other does not. You may download the linked file to test my code. Thanks!
I have a file (stars.txt) that I read in with Pandas. I would like to create two groups: (1) with Log_G < 4.0 and (2) with Log_G > 4.0. In my code (see below) I can successfully get rows for group (1):
    Kepler_ID            RA           Dec   Teff  Log_G       g       H
3     2305372  19 27 57.679  +37 40 21.90   5664  3.974  14.341  12.201
14    2708156  19 21 08.906  +37 56 11.44  11061  3.717  10.672  10.525
19    2997455  19 32 31.296  +38 07 40.04   4795  3.167  14.694  11.500
34    3352751  19 36 17.249  +38 25 36.91   7909  3.791  13.541  12.304
36    3440230  19 21 53.100  +38 31 42.82   7869  3.657  13.706  12.486
But for some reason I cannot get (2); the code prints the following instead, which I take for an error:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 108
Data columns (total 7 columns):
Kepler_ID    90  non-null values
RA           90  non-null values
Dec          90  non-null values
Teff         90  non-null values
Log_G        90  non-null values
g            90  non-null values
H            90  non-null values
dtypes: float64(4), int64(1), object(2)
Here's my code:
#------------------------------------------
# IMPORT STATEMENTS
#------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#------------------------------------------
# READ FILE AND ASSOCIATE COMPONENTS
#------------------------------------------
star_file = 'stars.txt'
header_row = ['Kepler_ID', 'RA','Dec','Teff', 'Log_G', 'g', 'H']
df = pd.read_csv(star_file, names=header_row, skiprows=2)
#------------------------------------------
# ASSOCIATE VARIABLES
#------------------------------------------
Kepler_ID = df['Kepler_ID']
#RA = df['RA']
#Dec = df['Dec']
Teff = df['Teff']
Log_G = df['Log_G']
g = df['g']
H = df['H']
#------------------------------------------
# SUBSTITUTE MISSING DATA WITH NAN
#------------------------------------------
df = df.replace('', np.nan)
#------------------------------------------
# CHANGE DATA TYPE OF THE REST OF DATA TO FLOAT
#------------------------------------------
df[['Teff', 'Log_G', 'g', 'H']] = df[['Teff', 'Log_G', 'g', 'H']].astype(float)
#------------------------------------------
# SORTING SPECTRA TYPES FOR GIANTS
#------------------------------------------
# FIND GIANTS IN THE SAMPLE
giants = df[(df['Log_G'] < 4.)]
#print giants
# FIND DWARFS IN THE SAMPLE
dwarfs = df[(df['Log_G'] > 4.)]
print dwarfs
This is not an error. You are seeing a summarized view of the DataFrame:
In [11]: df = pd.DataFrame([[2, 1], [3, 4]])

In [12]: df
Out[12]:
   0  1
0  2  1
1  3  4

In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
0    2 non-null values
1    2 non-null values
dtypes: int64(2)
What is displayed is decided by several display options, for example max_rows:
In [14]: pd.options.display.max_rows
Out[14]: 60
In [15]: pd.options.display.max_rows = 120
In pandas 0.13 this behaviour changed, so you will see the first max_rows rows followed by an ellipsis (...).
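If the goal is to inspect every row of dwarfs rather than the summary, the limit can be raised temporarily; a minimal sketch, assuming a reasonably recent pandas:
# raise the row limit, print the full frame, then restore the default
pd.set_option('display.max_rows', None)
print(dwarfs)
pd.reset_option('display.max_rows')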