Pandas subsampling - pandas

I have some event data that is measured in time, so the data format looks like
Time(s) Pressure Humidity
0 10 5
0 9.9 5.1
0 10.1 5
1 10 4.9
2 11 6
Here the first column is Time elapsed since the start of the experiment, in seconds. The other two cols are some observations. A row is created when certain conditions are true, these conditions are beyond the scope of the discussion here. Each set of 3 numbers separated by a semi colon is a row of data. Since the lowest granularity of resolution in time here is only seconds, you could have two rows with the same timestamp but but will different observations. Basically these were two distinct events that time could not distinguish.
Now my problem is to roll up the data series, by subsampling it say every 10 or 100 seconds, or 1000 seconds. So I want a skimmed data series from the original higher granularity data series. There are a few ways to decide which row you would use, for instance say you are subsampling at every 10 seconds, when 10 seconds elapse, you could have multiple rows with the time stamp of 10 seconds. You could either take
1) first row
2) mean of all rows with the same timestamp of 10
3) some other technique
I am looking to do this in pandas, any ideas or way to start would be very appreciated. Thanks.

Here is a simple example that shows how to perform
the operations requested with pandas.
One uses data binning to group samples and
resample data.
import pandas as pd
# Creation of the dataframe
df = pd.DataFrame({\
'Time(s)':[0 ,0 ,0 ,1 ,2],\
'Pressure':[10, 9.9, 10.1, 10, 11],\
'Humidity':[5 ,5.1 ,5 ,4.9 ,6]})
# Select time increment
delta_t = 1
timeCol = 'Time(s)'
# Creation of the time sampling
v = xrange(df[timeCol].min()-delta_t,df[timeCol].max()+delta_t,delta_t)
# Pandas magic instructions with cut and groupby
df_binned = df.groupby(pd.cut(df[timeCol],v))
# Display the first element
dfFirst = df_binned.head(1)
# Evaluate the mean of each group
dfMean = df_binned.mean()
# Evaluate the median of each group
dfMedian = df_binned.median()
# Find the max of each group
dfMax = df_binned.max()
# Find the min of each group
dfMin = df_binned.min()
Result will look like this for dfFirst
Humidity Pressure Time(s)
Time(s)
(-1, 0] 0 5.0 10 0
(0, 1] 3 4.9 10 1
(1, 2] 4 6.0 11 2
Result will look like this for dfMean
Humidity Pressure Time(s)
Time(s)
(-1, 0] 5.033333 10 0
(0, 1] 4.900000 10 1
(1, 2] 6.000000 11 2

Related

Panda's dataframe not throwing error for rows containing lesser fields

I am facing one issue when reading rows containing lesser fields my dataset looks like below.
"Source1"~"schema1"~"table1"~"modifiedon"~"timestamp"~"STAGE"~15~NULL~NULL~FALSE~FALSE~TRUE
"Source1"~"schema2"~"table2"
and I am running below command to read the dataset.
tabdf = pd.read_csv('test_table.csv',sep='~',header = None)
But its not throwing any error though its supposed too.
The version we are using
pip show pandas
Name: pandas
Version: 1.0.1
My question is how to make the process failed if we will get incorrect data structure.
You could inspect the data first using Pandas and then either fail the process if bad data exists or just read the known-good rows.
Read full rows into a dataframe
df = pd.read_fwf('test_table.csv', widths=[999999], header=None)
print(df)
0
0 "Source1"~"schema1"~"table1"~"modifiedon"~"tim...
1 "Source1"~"schema2"~"table2"
Count number of separators
sep_count = df[0].str.count('~')
sep_count
0 11
1 2
Maybe just terminate the process if bad (short) rows
If the number of unique values are not 1.
sep_count.nunique()
2
Or just read the good rows
good_rows = sep_count.eq(11) # if you know what separator count should be. Or ...
good_rows = sep_count.eq(sep_count.max()) # if you know you have at least 1 good row
df = pd.read_csv('test_table.csv', sep='~', header=None).loc[good_rows]
print(df)
Result
0 1 2 3 4 5 6 7 8 9 10 11
0 Source1 schema1 table1 modifiedon timestamp STAGE 15.0 NaN NaN False False True

Find maximum between -300 seconds to -30 seconds ago?

To find the maximum over the last 300 seconds:
import pandas as pd
# 16 to 17 minutes of time-series data.
df = pd.DataFrame(range(10000))
df.index = pd.date_range(1, 1000000000000, 10000)
# maximum over last 300 seconds. (outputs 9999)
df[0].rolling('300s').max().tail(1)
How can I exclude the most recent 30s from the rolling calculation? I went the max between -300s and -30s.
So, instead of 9999 being outputted by the above, I want something like 9700 (thereabouts) to be displayed.
You can compute the rolling max for the last 271 seconds (271 instead of 270 if you need that 300th second included), then shift the results by 30 seconds, and merge them with the original dataframe. Since in your example the index is at the sub-second level, you will need to utilize merge_asof to find the desired matches (you can use the direction parameter of that function to select non-exact matches).
import pandas as pd
# 16 minutes and 40 seconds of time-series data.
df = pd.DataFrame(range(10_000))
df.index = pd.date_range(1, 1_000_000_000_000, 10_000)
roll_max = df[0].rolling('271s').max().shift(30, freq='s').rename('roll_max')
res = pd.merge_asof(df, roll_max, left_index=True, right_index=True)
print(res.tail(1))
# 0 roll_max
# 1970-01-01 00:16:40 9999 9699.0

Add column of .75 quantile based off groupby

I have df with index as date and also column called scores. Now I want to maintain the df as it is but add column which gives the 0.7 quantile of scores for that day. Method of quantile would need to be midpoint and also be rounded to nearest whole number.
I've outlined one approach you could take, below.
Note that to round a value to the nearest whole number you should use Python's built-in round() function. See round() in the Python documentation for details.
import pandas as pd
import numpy as np
# set random seed for reproducibility
np.random.seed(748)
# initialize base example dataframe
df = pd.DataFrame({"date":np.arange(10),
"score":np.random.uniform(size=10)})
duplicate_dates = np.random.choice(df.index, 5)
df_dup = pd.DataFrame({"date":np.random.choice(df.index, 5),
"score":np.random.uniform(size=5)})
# finish compiling example data
df = df.append(df_dup, ignore_index=True)
# calculate 0.7 quantile result with specified parameters
result = df.groupby("date").quantile(q=0.7, axis=0, interpolation='midpoint')
# print resulting dataframe
# contains one unique 0.7 quantile value per date
print(result)
"""
0.7 score
date
0 0.585087
1 0.476404
2 0.426252
3 0.363376
4 0.165013
5 0.927199
6 0.575510
7 0.576636
8 0.831572
9 0.932183
"""
# to apply the resulting quantile information to
# a new column in our original dataframe `df`
# we can apply a dictionary to our "date" column
# create dictionary
mapping = result.to_dict()["score"]
# apply to `df` to produce desired new column
df["quantile_0.7"] = [mapping[x] for x in df["date"]]
print(df)
"""
date score quantile_0.7
0 0 0.920895 0.585087
1 1 0.476404 0.476404
2 2 0.380771 0.426252
3 3 0.363376 0.363376
4 4 0.165013 0.165013
5 5 0.927199 0.927199
6 6 0.340008 0.575510
7 7 0.695818 0.576636
8 8 0.831572 0.831572
9 9 0.932183 0.932183
10 7 0.457455 0.576636
11 6 0.650666 0.575510
12 6 0.500353 0.575510
13 0 0.249280 0.585087
14 2 0.471733 0.426252
"""

Double grouping data by bins AND time with pandas

I am trying to bin values from a timeseries (hourly and subhourly temperature values) within a time window.
That is, from original hourly values, I'd like to extract binned values on a daily, weekly or monthly basis.
I have tried to combine groupby+TimeGrouper(" ") with pd.cut, with poor results.
I have came across a nice function from this tutorial, which suggests to map the data (associating to each value with its mapped range on the next column) and then grouping according to that.
def map_bin(x, bins):
kwargs = {}
if x == max(bins):
kwargs['right'] = True
bin = bins[np.digitize([x], bins, **kwargs)[0]]
bin_lower = bins[np.digitize([x], bins, **kwargs)[0]-1]
return '[{0}-{1}]'.format(bin_lower, bin)
df['Binned'] = df['temp'].apply(map_bin, bins=freq_bins)
However, applying this function results in an IndexError: index n is out of bounds for axis 0 with size n.
Ideally, I'd like make this work and apply it to achieve a double grouping at the same time: one by bins and one by timegrouper.
Update:
It appears that my earlier attempt was causing problems because of the double-indexed columns. I have simplified to something that seems to work much better.
import pandas as pd
import numpy as np
xaxis = np.linspace(0,50)
temps = pd.Series(data=xaxis,name='temps')
times = pd.date_range(start='2015-07-15',periods=50,freq='6H')
temps.index = times
bins = [0,10,20,30,40,50]
temps.resample('W').agg(lambda series:pd.value_counts(pd.cut(series,bins),sort=False)).unstack()
This outputs:
(0, 10] (10, 20] (20, 30] (30, 40] (40, 50]
2015-07-19 9 10 0 0 0
2015-07-26 0 0 10 10 8
2015-08-02 0 0 0 0 2

Taking second last observed row

I am new to pandas. I know how to use drop_duplicates and take the last observed row in a dataframe. Is there any way that I can use it to take only second last observed. Or any other way of doing it.
For example:
I would like to go from
df = pd.DataFrame(data={'A':[1,1,1,2,2,2],'B':[1,2,3,4,5,6]}) to
df1 = pd.DataFrame(data={'A':[1,2],'B':[2,5]})
The idea is that you'll group the data by the duplicate column , then check the length of group , if the length of group is greater than or equal 2 this mean that you can slice the second element of group , if the group has a length of one which mean that this value is not duplicated , then take index 0 which is the only element in the grouped data
df.groupby(df['A']).apply(lambda x : x.iloc[1] if len(x) >= 2 else x.iloc[0])
The first answer I think was on the right track, but possibly not quite right. I have extended your data to include 'A' groups with two observations, and an 'A' group with one observation, for the sake of completeness.
import pandas as pd
df = pd.DataFrame(data={'A':[1,1,1,2,2,2, 3, 3, 4],'B':[1,2,3,4,5,6, 7, 8, 9]})
def user_apply_func(x):
if len(x) == 2:
return x.iloc[0]
if len(x) > 2:
return x.iloc[-2]
return
df.groupby('A').apply(user_apply_func)
Out[7]:
A B
A
1 1 2
2 2 5
3 3 7
4 NaN NaN
For your reference the apply method automatically passes the data frame as the first argument.
Also, as you are always going to be reducing each group of data to a single observation you could also use the agg method (aggregate). apply is more flexible in terms of the length of the sequences that can be returned whereas agg must reduce the data to a single value.
df.groupby('A').agg(user_apply_func)