numpy: aggregate 4D array by groups

I have a numpy array with shape [t, z, x, y] representing an hourly time series of three-dimensional data. The axes of the array are time, vertical coordinate, horizontal coordinate 1, horizontal coordinate 2. There is also a t-element list of hourly datetime.datetime timestamps.
I want to calculate the mid-day mean for each day. This will be an [nday, Z, X, Y] array.
I'm trying to find a pythonic way to do this. I've written something with a bunch of for loops that works but seems slow, inflexible, and verbose.
It appears to me that Pandas is not a solution for me because my time series data are three-dimensional. I'd be happy to be proven wrong.
I've come up with this, using itertools, to find mid-day timestamps and group them by date, and now I'm coming up short trying to apply map to find the means.
import numpy as np
import pandas as pd
import itertools
from datetime import datetime

# create 72 hours of pseudo-data with 3 vertical levels and a 4 by 4
# horizontal grid.
data = np.zeros((72, 3, 4, 4))
t = pd.date_range(datetime(2008, 7, 1), freq='1H', periods=72)
for i in range(data.shape[0]):
    data[i, ...] = i
# find the timestamps that are "midday" in North America. We'll
# define midday as between 15:00 and 23:00 UTC, which is 10:00 EST to
# 15:00 PST.
def is_midday(this_t):
    return (this_t.hour >= 15) and (this_t.hour <= 23)

# group the midday timestamps by date
for dt, grp in itertools.groupby(filter(is_midday, t),
                                 key=lambda x: x.date()):
    print('date ' + str(dt))
    for g in grp:
        print(g)
# find means of mid-day data by date
data_list = np.split(data, data.shape[0])
grps = itertools.groupby(filter(is_midday, t),
                         key=lambda x: x.date())
# how to apply map (or something else) to data_list and grps?
# Or somehow split data along axis 0 according to grps?
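A minimal pure-NumPy sketch of one way to do this grouping, for comparison (it assumes the is_midday test above; the mask-and-stack approach here is illustrative, not from the thread):
midday = np.array([is_midday(ts) for ts in t])
dates = np.array([ts.date() for ts in t])
# select each date's mid-day rows along axis 0, then average them,
# giving one [z, x, y] field per day
daily_means = np.stack([data[midday & (dates == d)].mean(axis=0)
                        for d in sorted(set(dates[midday]))])
print(daily_means.shape)  # (3, 3, 4, 4) for the 72-hour example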

You can shove pretty much any object into a pandas structure. Normally not recommended, but in this case it might work for you.
Create a Series indexed by time, with each element a 3-d numpy array:
In [117]: s = pd.Series([data[i] for i in range(data.shape[0])], index=t)
In [118]: s
Out[118]:
2008-07-01 00:00:00 [[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], ...
2008-07-01 01:00:00 [[[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], ...
2008-07-01 02:00:00 [[[2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0], ...
2008-07-01 03:00:00 [[[3.0, 3.0, 3.0, 3.0], [3.0, 3.0, 3.0, 3.0], ...
2008-07-01 04:00:00 [[[4.0, 4.0, 4.0, 4.0], [4.0, 4.0, 4.0, 4.0], ...
2008-07-01 05:00:00 [[[5.0, 5.0, 5.0, 5.0], [5.0, 5.0, 5.0, 5.0], ...
2008-07-01 06:00:00 [[[6.0, 6.0, 6.0, 6.0], [6.0, 6.0, 6.0, 6.0], ...
2008-07-01 07:00:00 [[[7.0, 7.0, 7.0, 7.0], [7.0, 7.0, 7.0, 7.0], ...
2008-07-01 08:00:00 [[[8.0, 8.0, 8.0, 8.0], [8.0, 8.0, 8.0, 8.0], ...
2008-07-01 09:00:00 [[[9.0, 9.0, 9.0, 9.0], [9.0, 9.0, 9.0, 9.0], ...
2008-07-01 10:00:00 [[[10.0, 10.0, 10.0, 10.0], [10.0, 10.0, 10.0,...
2008-07-01 11:00:00 [[[11.0, 11.0, 11.0, 11.0], [11.0, 11.0, 11.0,...
2008-07-01 12:00:00 [[[12.0, 12.0, 12.0, 12.0], [12.0, 12.0, 12.0,...
2008-07-01 13:00:00 [[[13.0, 13.0, 13.0, 13.0], [13.0, 13.0, 13.0,...
2008-07-01 14:00:00 [[[14.0, 14.0, 14.0, 14.0], [14.0, 14.0, 14.0,...
...
2008-07-03 09:00:00 [[[57.0, 57.0, 57.0, 57.0], [57.0, 57.0, 57.0,...
2008-07-03 10:00:00 [[[58.0, 58.0, 58.0, 58.0], [58.0, 58.0, 58.0,...
2008-07-03 11:00:00 [[[59.0, 59.0, 59.0, 59.0], [59.0, 59.0, 59.0,...
2008-07-03 12:00:00 [[[60.0, 60.0, 60.0, 60.0], [60.0, 60.0, 60.0,...
2008-07-03 13:00:00 [[[61.0, 61.0, 61.0, 61.0], [61.0, 61.0, 61.0,...
2008-07-03 14:00:00 [[[62.0, 62.0, 62.0, 62.0], [62.0, 62.0, 62.0,...
2008-07-03 15:00:00 [[[63.0, 63.0, 63.0, 63.0], [63.0, 63.0, 63.0,...
2008-07-03 16:00:00 [[[64.0, 64.0, 64.0, 64.0], [64.0, 64.0, 64.0,...
2008-07-03 17:00:00 [[[65.0, 65.0, 65.0, 65.0], [65.0, 65.0, 65.0,...
2008-07-03 18:00:00 [[[66.0, 66.0, 66.0, 66.0], [66.0, 66.0, 66.0,...
2008-07-03 19:00:00 [[[67.0, 67.0, 67.0, 67.0], [67.0, 67.0, 67.0,...
2008-07-03 20:00:00 [[[68.0, 68.0, 68.0, 68.0], [68.0, 68.0, 68.0,...
2008-07-03 21:00:00 [[[69.0, 69.0, 69.0, 69.0], [69.0, 69.0, 69.0,...
2008-07-03 22:00:00 [[[70.0, 70.0, 70.0, 70.0], [70.0, 70.0, 70.0,...
2008-07-03 23:00:00 [[[71.0, 71.0, 71.0, 71.0], [71.0, 71.0, 71.0,...
Freq: H, Length: 72
Define your aggregating function. You need to access .values, which returns the stored objects; concatenating coerces them back into an actual numpy array, which you can then aggregate (mean in this case):
In [119]: def f(g, grp):
   .....:     return np.concatenate(grp.values).mean()
   .....:
Since I'm not sure what your end output should look like, I just create a time-based grouper manually (this is essentially a resample) and don't do anything with the final results; it's just a list of the aggregated values:
In [121]: [ f(g,grp) for g, grp in s.groupby(pd.Grouper(freq='D')) ]
Out[121]: [11.5, 35.5, 59.5]
You can get reasonably fancy here and, say, return a pandas object (and potentially concat them).
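If the end output should be the full [nday, Z, X, Y] array, one variant (a sketch building on the Series above) is to filter to the mid-day hours first and stack each day's fields before averaging over the time axis:
midday_s = s[[is_midday(ts) for ts in s.index]]
# one [z, x, y] field per day: stack the day's 3-d arrays, average over time
result = np.stack([np.stack(grp.values).mean(axis=0)
                   for _, grp in midday_s.groupby(pd.Grouper(freq='D'))])
print(result.shape)  # (nday, z, x, y) = (3, 3, 4, 4) here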

Reindex pandas DataFrame to match index with another DataFrame

I have two pandas DataFrames with different (float) indices.
I want to reindex the second dataframe to match the first dataframe's index, interpolating its values using the index.
This is the code I have:
from pandas import DataFrame

df1 = DataFrame([
    {'time': 0.2, 'v': 1},
    {'time': 0.4, 'v': 2},
    {'time': 0.6, 'v': 3},
    {'time': 0.8, 'v': 4},
    {'time': 1.0, 'v': 5},
    {'time': 1.2, 'v': 6},
    {'time': 1.4, 'v': 7},
    {'time': 1.6, 'v': 8},
    {'time': 1.8, 'v': 9},
    {'time': 2.0, 'v': 10}
]).set_index('time')
df2 = DataFrame([
    {'time': 0.25, 'v': 1},
    {'time': 0.5, 'v': 2},
    {'time': 0.75, 'v': 3},
    {'time': 1.0, 'v': 4},
    {'time': 1.25, 'v': 5},
    {'time': 1.5, 'v': 6},
    {'time': 1.75, 'v': 7},
    {'time': 2.0, 'v': 8},
    {'time': 2.25, 'v': 9}
]).set_index('time')
df2 = df2.reindex(df1.index.union(df2.index)).interpolate(method='index').reindex(df1.index)
print(df2)
Output:
v
time
0.2 NaN
0.4 1.6
0.6 2.4
0.8 3.2
1.0 4.0
1.2 4.8
1.4 5.6
1.6 6.4
1.8 7.2
2.0 8.0
That's correct and what I need. However, it seems a more complicated statement than it needs to be.
Is there a more concise way to do the same, requiring fewer intermediate steps?
Also, is there a way to both interpolate and extrapolate? For example, in the example data above, the linearly extrapolated value for index 0.2 could be 0.8 instead of NaN. I know I could curve_fit, but again I feel that's more complicated than it may need to be.
One idea with numpy.interp, if values in both indices are increasing and only one column v needs processing:
df1['v1'] = np.interp(df1.index, df2.index, df2['v'])
print(df1)
v v1
time
0.2 1 1.0
0.4 2 1.6
0.6 3 2.4
0.8 4 3.2
1.0 5 4.0
1.2 6 4.8
1.4 7 5.6
1.6 8 6.4
1.8 9 7.2
2.0 10 8.0
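Note that np.interp clamps to the endpoint values outside df2's index range, which is why 0.2 maps to 1.0 above rather than an extrapolated value. If linear extrapolation is also wanted, one option (a sketch, assuming SciPy is available) is scipy.interpolate.interp1d:
from scipy.interpolate import interp1d

# linear interpolation that also extrapolates beyond df2's index range
f = interp1d(df2.index, df2['v'], fill_value='extrapolate')
df1['v1'] = f(df1.index)  # index 0.2 now gets 0.8 instead of being clamped to 1.0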

Is there a better way to yield stats for grade == 0, 0 < grade < 60, 60 <= grade < 70 and so on?

I'm trying to yield stats for several "bins" of sorts. Namely, how many students got a grade of 0, how many got a grade greater than 0 and less than 60, and so on.
I'm not sure they are really bins, as they are not equally segmented.
grade == 0
0 < grade < 60
60 <= grade < 70
...
Here is the code:
import pandas as pd
grade_list = [87.5, 87.5, 65.0, 90.0, 72.5, 65.0, 0.0, 65.0, 72.5, 65.0, 72.5, 65.0, 90.0, 90.0, 87.5, 87.5, 87.5, 65.0, 87.5, 65.0, 65.0, 90.0, 99.0, 65.0, 87.5, 65.0, 87.5, 90.0, 87.5, 90.0, 90.0, 0.0, 90.0, 99.0, 65.0, 87.5, 72.5, 72.5, 90.0, 0.0, 65.0, 72.5, 90.0, 90.0, 65.0, 90.0, 90.0, 65.0, 65.0, 0.0, 90.0, 90.0, 100.0, 99.0, 65.0, 90.0, 90.0, 0.0, 99.0, 90.0, 100.0, 87.5, 65.0, 99.0, 0.0, 90.0, 65.0, 90.0, 65.0, 99.0, 90.0, 65.0, 100.0, 65.0, 90.0, 99.0]
df = pd.DataFrame({'grade': grade_list})
print(len(df[df['grade']==0]))
print(len(df[(df['grade']>0)&(df['grade']<60)]))
print(len(df[(df['grade']>=60)&(df['grade']<70)]))
print(len(df[(df['grade']>=70)&(df['grade']<80)]))
print(len(df[(df['grade']>=80)&(df['grade']<90)]))
print(len(df[(df['grade']>=90)]))
I got what I want. The code seems ugly though. Is there a better way to do the job?
Try this:
import numpy as np

df['category'] = (df['grade'] / 10).astype(int)
# this bit converts categories between 0 and 6 (exclusive) into 1, so the
# categories you now have are 0, 1, 6, 7, ..., 10
df['category'] = np.where((df.category > 0) & (df.category < 6), 1, df.category)
for i in range(max(df.category) + 1):
    if len(df[df['category'] == i]) > 0:
        print(i, len(df[df['category'] == i]))
This will give you the categories you want and print out the number of rows in each. The if statement is just there to skip empty categories, as in your snippet; you can remove it.
Output:
The dataframe:
grade category
0 87.5 8
1 87.5 8
2 65.0 6
3 90.0 9
4 72.5 7
.. ... ...
71 65.0 6
72 100.0 10
73 65.0 6
74 90.0 9
75 99.0 9
Sizes of each bin:
0 6
6 21
7 6
8 11
9 29
10 3
IIUC, you can use pandas.cut, tweaking it a bit to handle 0 as a separate group:
df = pd.DataFrame({'grade': grade_list})
bins = [0, 60, 70, 80, 90]
labels = [f'≥{x}' if x > 0 else f'>{x}' for x in bins]
df['bin'] = pd.cut(df['grade'].replace(0, -1),
                   bins=[float('-inf')] + bins + [float('inf')],
                   labels=['0'] + labels,
                   right=False)
output (added two points for the example):
grade bin
0 87.5 ≥80
1 87.5 ≥80
2 65.0 ≥60
3 90.0 ≥90
4 72.5 ≥70
.. ... ...
73 65.0 ≥60
74 90.0 ≥90
75 99.0 ≥90
76 10.0 >0
77 0.0 0
[78 rows x 2 columns]
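To get the bin sizes from this, value_counts on the resulting categorical works; sort=False keeps the bins in order (a brief usage note):
print(df['bin'].value_counts(sort=False))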

How to create a nested dictionary from a pandas dataframe and convert it back to a dataframe?

import pandas as pd
import numpy as np
d = {
    'Fruit': ['Guava', 'Orange', 'Lemon'],
    'ID1': [1, 2, 11],
    'ID2': [3, 4, 12],
    'ID3': [5, 6, np.nan],
    'ID4': [7, 8, 14],
    'ID5': [9, 10, np.nan],
    'ID6': [11, np.nan, np.nan],
    'ID7': [13, np.nan, np.nan],
    'ID8': [15, np.nan, np.nan],
    'ID9': [17, np.nan, np.nan],
    'Category': ['Myrtaceae', 'Citrus', 'Citrus']
}
df = pd.DataFrame(data=d)
df
df
How to convert the above dataframe to the following dictionary?
Expected Output :
{
    'Myrtaceae': {'Guava': {1, 3, 5, 7, 9, 11, 13, 15, 17}},
    'Citrus': {'Orange': {2, 4, 6, 8, 10, np.nan, np.nan, np.nan, np.nan},
               'Lemon': {11, 12, np.nan, 14, np.nan, np.nan, np.nan, np.nan, np.nan}},
}
How to convert the dictionary back to a dataframe?
Use list comprehension with groupby:
d = {k: v.set_index('Fruit').T.to_dict('list')
     for k, v in df.set_index('Category').groupby(level=0)}
print(d)
{'Citrus': {'Orange': [2.0, 4.0, 6.0, 8.0, 10.0, nan, nan, nan, nan],
'Lemon': [11.0, 12.0, nan, 14.0, nan, nan, nan, nan, nan]},
'Myrtaceae': {'Guava': [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0]}}
Or:
d = {k: v.drop('Category', axis=1).set_index('Fruit').T.to_dict('list')
     for k, v in df.groupby('Category')}
And then:
df = (pd.concat({k: pd.DataFrame(v) for k, v in d.items()}, axis=1)
        .T
        .rename_axis(('Category', 'Fruit'))
        .rename(columns=lambda x: f'ID{x+1}')
        .reset_index())
print(df)
Category Fruit ID1 ID2 ID3 ID4 ID5 ID6 ID7 ID8 ID9
0 Citrus Orange 2.0 4.0 6.0 8.0 10.0 NaN NaN NaN NaN
1 Citrus Lemon 11.0 12.0 NaN 14.0 NaN NaN NaN NaN NaN
2 Myrtaceae Guava 1.0 3.0 5.0 7.0 9.0 11.0 13.0 15.0 17.0

Retain values in a Pandas dataframe

Consider the following Pandas Dataframe:
import pandas as pd

_df = pd.DataFrame([
    [4.0, "Diastolic Blood Pressure", 1.0, "2017-01-15", 68],
    [4.0, "Diastolic Blood Pressure", 5.0, "2017-04-15", 60],
    [4.0, "Diastolic Blood Pressure", 8.0, "2017-06-18", 68],
    [4.0, "Heart Rate", 1.0, "2017-01-15", 85],
    [4.0, "Heart Rate", 5.0, "2017-04-15", 72],
    [4.0, "Heart Rate", 8.0, "2017-06-18", 81],
    [6.0, "Diastolic Blood Pressure", 1.0, "2017-01-18", 114],
    [6.0, "Diastolic Blood Pressure", 6.0, "2017-02-18", 104],
    [6.0, "Diastolic Blood Pressure", 9.0, "2017-03-18", 124]
], columns=['ID', 'VSname', 'Visit', 'VSdate', 'VSres'])
I'd like to create the 'Flag' variable in this df: for each ID and VSName, show the difference from baseline (visit 1) at each visit.
I tried different approaches and I'm stuck.
I come from a background of SAS programming, and it would be very easy in SAS to retain values from one row to another and then subtract. I'm sure my mind is polluted by SAS (and the title is clearly wrong), but this has to be doable with Pandas, one way or another. Any idea?
Thanks a lot for your help.
Kind regards,
Nicolas
Assuming the DataFrame is ordered so that each group's visit 1 row comes first (i.e. the rows for visits 5 and 8 come directly after the row for visit 1), you can label the groups with cumsum:
c = (_df.Visit == 1).cumsum()
Then subtract the first VSres entry of each group from VSres:
_df.VSres - _df.groupby(c).VSres.transform("first")
I tried the answers kindly given; none worked, and I got errors I couldn't fix. Not sure why... I managed to produce something close using the following:
baseline = df[df["Visit"] == 1.0]
baseline = baseline.rename(columns={'VSres': 'baseline'})
df = pd.merge(df, baseline, on = ["ID", "VSname"], how='left')
df["chg"] = df["VSres"] - df["baseline"]
That's not very beautiful, I know...
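For reference, a slightly tidier variant of that merge (a sketch; it restricts the baseline frame to the key and value columns so the merge doesn't duplicate Visit and VSdate with _x/_y suffixes):
baseline = (_df.loc[_df['Visit'] == 1, ['ID', 'VSname', 'VSres']]
              .rename(columns={'VSres': 'baseline'}))
out = _df.merge(baseline, on=['ID', 'VSname'], how='left')
out['chg'] = out['VSres'] - out['baseline']  # difference from the visit-1 baseline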

Extract histogram values from tensorboard and plot it with matplotlib

I want to plot histograms from TensorBoard myself, to publish them. I wrote this extraction function to get the histogram values:
# assuming the event_accumulator module from the TensorBoard package
from tensorboard.backend.event_processing import event_accumulator

def _load_hist_from_tfboard(path):
    event_acc = event_accumulator.EventAccumulator(path)
    event_acc.Reload()
    vec_dict = {}
    for tag in sorted(event_acc.Tags()["distributions"]):
        hist_dict = {}
        for hist_event in event_acc.Histograms(tag):
            hist_dict.update({hist_event.step: (hist_event.histogram_value.bucket_limit,
                                                hist_event.histogram_value.bucket)})
        vec_dict[tag] = hist_dict
    return vec_dict
The function collects all histograms of an event file. The output of one bucket_limit and bucket pair is as follows:
[0.0, 1e-12, 0.0005418219168477906, 0.0005960041085325697, 0.0020575678275470133, 0.0022633246103017147, 0.004009617609950718, 0.00441057937094579, 0.005336801038844407, 0.005870481142728848, 0.007813610400972098, 0.008594971441069308, 0.022293142370048362, 0.0245224566070532, 0.026974702267758523, 0.035903328718386605, 0.03949366159022527, 0.043443027749247805, 0.04778733052417259, 0.052566063576589855, 0.057822669934248845, 0.06360493692767373, 0.06996543062044111, 0.07696197368248522, 0.24153964213356663, 0.2656936063469233, 0.29226296698161564, 0.3214892636797772, 0.35363819004775493, 0.38900200905253046, 0.42790220995778355, 0.47069243095356195, 0.5177616740489182, 0.56953784145381, 0.6264916255991911, 0.6891407881591103, 0.7580548669750213, 0.8338603536725235, 0.917246389039776, 1.0089710279437536]
[0.0, 3999936.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 4.0, 0.0, 8.0, 8.0, 0.0, 4.0, 4.0, 0.0, 8.0, 4.0, 0.0, 9.0, 45.0, 50.0, 48.0, 85.0, 100.0, 109.0, 114.0, 15908.0, 74.0, 15856.0, 11908.0, 3973.0, 42.0, 7951679.0]
Can someone help me interpret these numbers as a histogram?
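One way to render such an event with matplotlib (a sketch): bucket_limit[i] is the right edge of bucket i and bucket[i] is its count, so the counts can be drawn as variable-width bars. Treating 0.0 as the left edge of the first bucket is an assumption here; TensorFlow's first bucket nominally extends to -inf.
import matplotlib.pyplot as plt
import numpy as np

def plot_tf_histogram(bucket_limits, bucket_counts):
    limits = np.asarray(bucket_limits, dtype=float)
    counts = np.asarray(bucket_counts, dtype=float)
    lefts = np.concatenate(([0.0], limits[:-1]))  # assumed left edges
    plt.bar(lefts, counts, width=limits - lefts, align='edge', edgecolor='k')
    plt.xlabel('bucket limit')
    plt.ylabel('count')
    plt.show()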