How do I use grouped data to plot rainfall averages in specific hourly ranges - pandas

I extracted the following data from a DataFrame:
https://i.imgur.com/rCLfV83.jpg
The question is: how do I plot a graph, probably a histogram-style chart, where the horizontal axis shows the hours as bins [16:00 17:00 18:00 ...24:00] and the bars show the average rainfall during each of those hours?
I just don't know enough pandas yet to get this off the ground, so I need some help. Sample data below, as requested.
Date        Hours  Precip
1996-07-30  21          1
1996-08-17  16          1
            18          1
1996-08-30  16          1
            17          1
            19          5
            22          1
1996-09-30  19          5
            20          5
1996-10-06  20          1
            21          1
1996-10-19  18          4
1996-10-30  19          1
1996-11-05  20          3
1996-11-16  16          1
            19          1
1996-11-17  16          1
1996-11-29  16          1
1996-12-04  16          9
            17         27
            19          1
1996-12-12  19          1
1996-12-30  19         10
            22          1
1997-01-18  20          1

It seems df is a multi-index DataFrame after a groupby.
Transform the index to a DatetimeIndex:
import pandas as pd
# combine the 'Date' and 'Hours' index levels into strings like '1996-07-30 21:00'
date_hour_idx = df.reset_index()[['Date', 'Hours']] \
    .apply(lambda x: '{} {}:00'.format(x['Date'], x['Hours']), axis=1)
precip_series = df.reset_index()['Precip']
precip_series.index = pd.to_datetime(date_hour_idx)
Resample to hours using 'H' (pandas 2.2 and later prefer the lowercase alias 'h'):
# This will show NaN for hours without an entry
resampled_nan = precip_series.resample('H').asfreq()
# This will fill NaN with 0s
resampled_fillna = precip_series.resample('H').asfreq().fillna(0)
If you want the mean per hour rather than the sum, change your groupby(...).sum() to groupby(...).mean().
You can resample to other intervals too. More about resampling a DatetimeIndex is in the pandas resample documentation: https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html
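For the chart the question describes (one bar per hour of day, averaged across dates), grouping on the Hours level of the index may be more direct than resampling. The following is a minimal sketch, not the answerer's method. It assumes df has a (Date, Hours) MultiIndex and a Precip column as in the sample; the six rows are copied from that sample, and the 16 to 23 hour range is an assumption about which bins should appear.

```python
import pandas as pd

# Six rows copied from the question's sample, indexed by (Date, Hours).
idx = pd.MultiIndex.from_tuples(
    [("1996-08-30", 16), ("1996-08-30", 17), ("1996-08-30", 19),
     ("1996-12-04", 16), ("1996-12-04", 17), ("1996-12-04", 19)],
    names=["Date", "Hours"],
)
df = pd.DataFrame({"Precip": [1, 1, 5, 9, 27, 1]}, index=idx)

# Average precipitation per hour of day, across all dates.
hourly_mean = df.groupby(level="Hours")["Precip"].mean()

# Show every hour bin from 16 to 23, using 0 for hours with no entries
# (the hour range is an assumption for this sketch).
hourly_mean = hourly_mean.reindex(range(16, 24), fill_value=0).rename_axis("Hours")

if __name__ == "__main__":
    import matplotlib.pyplot as plt

    ax = hourly_mean.plot.bar()
    ax.set_xlabel("Hour of day")
    ax.set_ylabel("Mean precipitation")
    plt.show()
```

Note that this averages only over the dates on which an hour has an entry. If hours without an entry should count as zero rainfall, the zeros would need to be filled in before taking the mean.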

It is straightforward once you have the data.
I generate artificial data for this example (radar is a third-party package that produces random datetimes):
import pandas as pd
import radar
import random
# dates: 50 random timestamps as strings, sorted
r3 = sorted(str(radar.random_datetime(start='1985-05-01', stop='1985-05-04'))
            for _ in range(50))
# variable: 50 random integer measurements
x = [random.randint(0, 16) for _ in range(50)]
df = pd.DataFrame({'date': r3, 'measurement': x})
print(df)
# order: split 'date' into 'daty' (dates) and 'godziny' (hours, i.e. the time part)
# and rename 'measurement' to 'pomiary' (measurements)
col1 = df.join(df['date'].str.partition(' ')[[0, 2]]).rename({0: 'daty', 2: 'godziny'}, axis=1)
col2 = df['measurement'].rename('pomiary')
p3 = pd.concat([col1, col2], axis=1, sort=False)
p3 = p3.drop(['measurement', 'date'], axis=1)
Time for the mean and the plot:
# select the numeric column so the string column 'godziny' is not averaged
dx = p3.groupby('daty')[['pomiary']].mean()
print(dx)
import matplotlib.pyplot as plt
dx.plot.bar()
plt.show()
Plot of the mean measurements
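The example above averages per date. For averages per hour of day, a variant that needs only pandas and NumPy might look like the sketch below. The seed, the three-day date range, and the variable names are assumptions chosen to mirror the example, and the random timestamps are generated with NumPy instead of radar.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# 50 random timestamps within three days of 1985-05-01, sorted.
start = pd.Timestamp("1985-05-01")
offsets = np.sort(rng.integers(0, 3 * 24 * 3600, size=50))
df = pd.DataFrame({
    "date": start + pd.to_timedelta(offsets, unit="s"),
    "measurement": rng.integers(0, 17, size=50),  # integers from 0 to 16
})

# Group by the hour of day taken from the timestamp and average.
hourly = df.groupby(df["date"].dt.hour)["measurement"].mean().rename_axis("hour")

if __name__ == "__main__":
    import matplotlib.pyplot as plt

    hourly.plot.bar()
    plt.show()
```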

Related

Multi-indexed series into DataFrame and reformat

I have a correlation matrix of stock returns in a Pandas DataFrame and I want to extract the top/bottom 10 correlated pairs from the matrix.
Sample DataFrame:
import pandas as pd
import numpy as np
data = np.random.randint(5,30,size=500)
df = pd.DataFrame(data.reshape((50,10)))
corr = df.corr()
This is my function to get the top/bottom 10 correlated pairs by 1) first returning a multi-indexed series (high) for highest correlated pairs, and then 2) unstacking back into a DataFrame (high_df):
def get_rankings(corr_matrix):
    # the matrix is symmetric, so extract the upper triangle without the diagonal (k=1)
    # (np.bool was removed from NumPy; the builtin bool is used instead)
    ranked_corr = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                   .stack()
                   .sort_values(ascending=False))
    high = ranked_corr[:10]
    high_df = high.unstack().fillna("")
    return high_df
get_rankings(corr)
My current DF output looks something like this:
6 4 5 7 8 3 9
3 0.359 0.198
1 0.275
4 0.257
2 0.176 0.154
0 0.153 0.164
5 0.156
But I want it to look like this, in either 2 or 3 columns:
ID1 ID2 Corr
0 9 0.304471
2 8 0.271009
2 3 0.147702
7 9 0.146176
0 7 0.144549
7 8 0.111888
4 6 0.098619
1 7 0.092338
1 4 0.09091
3 6 0.079688
It needs to be in a DataFrame so I can pass it to a grid widget, which only accepts DataFrames. Can anyone help me reshape the unstacked DF?
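One possible approach is to skip the unstack step and turn the multi-indexed series into columns directly. This is a sketch under assumptions: the function name top_pairs, the seeded generator, and the use of rename_axis plus reset_index(name=...) are choices made here, not something taken from the post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df = pd.DataFrame(rng.integers(5, 30, size=(50, 10)))
corr = df.corr()

def top_pairs(corr_matrix, n=10):
    """Return the n most correlated pairs as a flat ID1/ID2/Corr DataFrame."""
    # Boolean mask for the upper triangle, excluding the diagonal.
    upper = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    ranked = (corr_matrix.where(upper)
              .stack()
              .dropna()  # drop masked cells if stack() kept them
              .sort_values(ascending=False))
    # Name the two index levels, then move them into ordinary columns.
    return (ranked.head(n)
            .rename_axis(["ID1", "ID2"])
            .reset_index(name="Corr"))

high_df = top_pairs(corr)
print(high_df)
```

For the bottom pairs, the same idea should work with ranked.tail(n) in place of ranked.head(n).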

Can't get y-axis on matplotlib histogram to display the right numbers

I have this simple DataFrame that I am trying to plot as a histogram:
   Hour  Count  Average Count
2     6      4       0.129032
4     7      1       0.032258
1    12      9       0.290323
3    16      3       0.096774
0    20   2022      65.225806
What I want is Hour on the x-axis and Average Count on the y-axis. But when I tried this:
fig, hour = plt.subplots(1, 1)
hour.hist(test.Hour)
hour.set_xlabel('Time in 24 Hours')
hour.set_ylabel('Frequency')
plt.show()
I got a different plot instead. I have tried using test.Count and test['Average Count'], but both only affect the x-axis.
Are you looking for something like this?
'df' is the name of the DataFrame.
df.plot(x='Hour', y='Average Count', kind='bar')
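Axes.hist counts how often each value occurs, which is why passing test.Hour only bins the hour values themselves. To draw values that are already computed, a bar chart on the same Axes object is one option. A minimal sketch, assuming the frame is named test as in the question and rebuilding it here from the printed rows:

```python
import pandas as pd

# Rebuilt from the rows printed in the question.
test = pd.DataFrame(
    {
        "Hour": [6, 7, 12, 16, 20],
        "Count": [4, 1, 9, 3, 2022],
        "Average Count": [0.129032, 0.032258, 0.290323, 0.096774, 65.225806],
    },
    index=[2, 4, 1, 3, 0],
)

# Sort by hour so the bars appear in time order.
by_hour = test.sort_values("Hour")

if __name__ == "__main__":
    import matplotlib.pyplot as plt

    fig, hour = plt.subplots(1, 1)
    # bar() draws the given heights directly instead of counting occurrences.
    hour.bar(by_hour["Hour"], by_hour["Average Count"])
    hour.set_xlabel("Time in 24 Hours")
    hour.set_ylabel("Average Count")
    plt.show()
```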

pandas – "multiplication table" for each row with custom function

I have a DataFrame with city coordinates, like this (example):
    x   y
A  10  20
B  20  30
C  15  60
I want to calculate their distance from each other, sqrt(x^2 + y^2), as a sort of multiplication table (example):
    A   B   C
A   0  20  30
B  20   0  25
C  30  25   0
How can I do this? I've tried using apply function but need some guidance.
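You can make use of vectorized arithmetic in pandas, together with .apply(). Note that this gives one value per row (each point's distance from the origin), not the pairwise table:
import numpy as np
df['distance'] = (df['x'] ** 2 + df['y'] ** 2).apply(np.sqrt)
For the pairwise table, the easiest way is to use distance_matrix from scipy:
import pandas as pd
from scipy.spatial import distance_matrix
# coordinates as in the question (C has x=15, which matches the output below)
df = pd.DataFrame({'x': [10, 20, 15], 'y': [20, 30, 60]}, index=list('ABC'))
pd.DataFrame(distance_matrix(df, df), index=df.index, columns=df.index)
Output:
           A          B          C
A   0.000000  14.142136  40.311289
B  14.142136   0.000000  30.413813
C  40.311289  30.413813   0.000000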
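If SciPy is not available, the same pairwise table can be sketched with NumPy broadcasting alone. The variable names pts, diff, and dist are chosen here for illustration, and the coordinates are the ones from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 15], "y": [20, 30, 60]}, index=list("ABC"))

pts = df[["x", "y"]].to_numpy(dtype=float)  # shape (3, 2)

# Broadcasting (3, 1, 2) against (1, 3, 2) gives every pairwise difference.
diff = pts[:, None, :] - pts[None, :, :]    # shape (3, 3, 2)

# Euclidean distance: square, sum over the coordinate axis, take the root.
dist = pd.DataFrame(
    np.sqrt((diff ** 2).sum(axis=-1)),
    index=df.index,
    columns=df.index,
)
print(dist)
```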

Finding greatest fall and rise in a dynamic rolling window based on index

Have a df of readings as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1000, size=100), index=range(100), columns = ['reading'])
Want to find the greatest rise and the greatest fall for each row based on its index, which theoretically may be achieved using the formula...
How can this be coded?
Tried:
df.assign(gr8Rise=df.rolling(df.index).apply(lambda x: x[-1]-x[0], raw=True).max())
...and failed with ValueError: window must be an integer
UPDATE: Based on @jezrael's dataset, the output for gr8Rise is expected as follows:
Use:
import numpy as np
import pandas as pd
np.random.seed(2019)
df = pd.DataFrame(np.random.randint(100, size=10), index=range(10), columns=['reading'])
# one rolling pass per window size; w is the raw NumPy array for each window
df['gr8Rise'] = [df['reading'].rolling(x).apply(lambda w: w[0] - w[-1], raw=True).max()
                 for x in range(1, len(df) + 1)]
df.loc[0, 'gr8Rise'] = np.nan
print(df)
   reading  gr8Rise
0       72      NaN
1       31     41.0
2       37     64.0
3       88     59.0
4       62     73.0
5       24     76.0
6       29     72.0
7       15     57.0
8       12     60.0
9       16     56.0
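Reading the code above, row k ends up holding the largest value of reading[j-k] - reading[j] over all valid j, that is, the largest drop between two readings k rows apart (positive when the later reading is lower, despite the column name). Whether that matches the intended formula is not confirmed by the post. Under that reading, the same column can be sketched with array slicing instead of one rolling pass per window size. The readings below are copied from the printed output rather than regenerated from the seed.

```python
import numpy as np
import pandas as pd

# Readings copied from the printed output above.
reading = np.array([72, 31, 37, 88, 62, 24, 29, 15, 12, 16])
df = pd.DataFrame({"reading": reading})

n = len(reading)
# For each lag k, compare every reading with the one k rows later
# and keep the largest (earlier - later) difference.
largest = [float((reading[:-k] - reading[k:]).max()) for k in range(1, n)]

# Row 0 has no lag to compare against, so it stays NaN.
df["gr8Rise"] = [np.nan] + largest
print(df)
```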

Best way of showing dates in a Bar plot with Pandas

I create a bar plot like this:
But since each x-axis label is one day of January (for example 1, 3, 4, 5, 7, 8, etc.), I think the best way of showing this is something like
__________________________________________________ x axis
Jan 1 3 4 5 7 8 ...
2019
But I don't know how to do this with pandas.
Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('solved.xlsx', sheet_name="Jan")
fig, ax = plt.subplots()
df_plot = df.plot(x='Date', y=['Solved', 'POT'], ax=ax, kind='bar',
                  width=0.9, color=('#ffc114', '#0098c9'), label=('Solved', 'POT'))
def line_format(label):
    """
    Convert time label to the format of pandas line plot
    """
    month = label.month_name()[:3]
    if month == 'Jan':
        month += f'\n{label.year}'
    return month
ax.set_xticklabels(map(lambda x: line_format(x), df.Date))
The function comes from a solution provided here: Pandas bar plot changes date format
I don't know how to modify it to get the axis I want.
My data example solved.xlsx:
Date Q A B Solved POT
2019-01-04 Q4 11 9 14 5
2019-01-05 Q4 9 11 14 5
2019-01-08 Q4 11 18 10 6
2019-01-09 Q4 18 19 18 5
I have found a solution:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('solved.xlsx', sheet_name="Jan")
fig, ax = plt.subplots()
df_plot = df.plot(x='Date', y=['Solved', 'POT'], ax=ax, kind='bar',
                  width=0.9, color=('#ffc114', '#0098c9'), label=('Solved', 'POT'))
def line_format(label):
    """
    Show the day of the month; add the month name under the first day
    """
    day = label.day
    if day == 2:
        day = str(day) + f'\n{label.month_name()[:3]}'
    return day
# pass a list so the number of labels can be checked against the ticks
ax.set_xticklabels([line_format(d) for d in df.Date])
plt.show()
In my particular case I didn't have the date 2019-01-01, so the first day for me was Jan 2.
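The check for day == 2 only works when the data happens to start on the 2nd. A variant that labels whichever date comes first in each month could look like the sketch below. The helper name day_labels is hypothetical, and the dates are the four from the sample data.

```python
import pandas as pd

def day_labels(dates):
    """Day-of-month labels; the first date of each month also gets month and year."""
    labels = []
    seen = None  # (year, month) of the last date that received the extra lines
    for d in pd.to_datetime(pd.Series(dates)):
        text = str(d.day)
        if (d.year, d.month) != seen:
            text += f"\n{d.month_name()[:3]}\n{d.year}"
            seen = (d.year, d.month)
        labels.append(text)
    return labels

dates = ["2019-01-04", "2019-01-05", "2019-01-08", "2019-01-09"]
labels = day_labels(dates)
print(labels)
```

With the plot from the solution above, the labels would then be applied with ax.set_xticklabels(day_labels(df.Date)).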