Loading a time series from CSV into a DataFrame

Is it possible to create a Daru DataFrame from a CSV in which the first column is a series of dates?
Take the following CSV, for instance:
time,min,max
2018-01-01,101,103
2018-01-02,102,105
2018-01-03,103,200
2018-01-04,104,109
2018-01-05,105,110
If loaded with Daru::DataFrame.from_csv it will create a 5x3 DataFrame with a 0-based numerical index, instead of a 5x2 DataFrame with a DateTimeIndex.
Is there a way to instruct Daru to use the first vector as a DateTimeIndex index?

df = Daru::DataFrame.from_csv("df.csv")
df.set_index "time"                            # make the "time" vector the index (drops it from the data)
df.index = Daru::DateTimeIndex.new(df.index)   # convert the plain index into a DateTimeIndex
df
<Daru::DataFrame(5x2)>
min max
2018-01-01 101 103
2018-01-02 102 105
2018-01-03 103 200
2018-01-04 104 109
2018-01-05 105 110

Find index range of a sequence in dataframe column

I have a timeseries:
Sales
2018-01-01 66.65
2018-01-02 66.68
2018-01-03 65.87
2018-01-04 66.79
2018-01-05 67.97
2018-01-06 96.92
2018-01-07 96.90
2018-01-08 96.90
2018-01-09 96.38
2018-01-10 95.57
Given an arbitrary sequence of values, let's say [66.79,67.97,96.92,96.90], how could I obtain the corresponding indices, for example: [2018-01-04, 2018-01-05,2018-01-06,2018-01-07]?
Use pandas.Series.isin to filter the Sales column, then DataFrame.index to return the row labels (a.k.a. the index, the dates in your df), and finally Index.to_list to build a list:
vals = [66.79,67.97,96.92,96.90]
result = df[df['Sales'].isin(vals)].index.to_list()
print(result)
# Output:
['2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08']
Note that '2018-01-08' appears as well: 96.90 occurs on both 2018-01-07 and 2018-01-08, and isin matches every row whose value is in vals.
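If you want exactly one index per queried value (the first occurrence, as in the expected output), a small variation is sketched below. It assumes df is the frame shown above, rebuilt here with a DatetimeIndex, and simply drops duplicated Sales values before the lookup:
import pandas as pd

# Rebuild the example frame (dates as a DatetimeIndex, one 'Sales' column).
idx = pd.date_range('2018-01-01', periods=10, freq='D')
sales = [66.65, 66.68, 65.87, 66.79, 67.97, 96.92, 96.90, 96.90, 96.38, 95.57]
df = pd.DataFrame({'Sales': sales}, index=idx)

vals = [66.79, 67.97, 96.92, 96.90]

# Keep only the first row of each repeated Sales value, then look up as before.
first_only = df[~df['Sales'].duplicated(keep='first')]
result = first_only[first_only['Sales'].isin(vals)].index.to_list()
print(result)  # the four timestamps 2018-01-04 .. 2018-01-07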

Time column interval filter

I have a dataframe with a "Fecha" column. I would like to reduce the DataFrame's size by filtering it, keeping only the rows whose time is a multiple of 10 minutes and discarding all the rest.
Any ideas?
Thanks
I have to guess some variable names. But assuming your dataframe name is df, the solution should look similar to:
df['Fecha'] = pd.to_datetime(df['Fecha'])
df = df[df['Fecha'].dt.minute % 10 == 0]
The first line ensures that your 'Fecha' column has a datetime dtype. The second line keeps only the rows whose minute is a multiple of 10: the .dt accessor exposes the minute of each timestamp and the modulo operator % tests for multiples of 10.
Since I'm not sure if this solves your problem, here's a minimal example that runs by itself:
import pandas as pd
idx = pd.date_range(pd.Timestamp(2020, 1, 1), periods=60, freq='1T')
series = pd.Series(1, index=idx)
series = series[series.index.minute % 10 == 0]
series
The first three lines construct a series with a 1 minute index, which is filtered in the fourth line.
Output:
2020-01-01 00:00:00 1
2020-01-01 00:10:00 1
2020-01-01 00:20:00 1
2020-01-01 00:30:00 1
2020-01-01 00:40:00 1
2020-01-01 00:50:00 1
dtype: int64
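For completeness, here is a self-contained sketch of the same filter applied to a DataFrame column; the column name 'Fecha' comes from the question, while the sample data and the 'valor' column are made up:
import pandas as pd

# Made-up sample: one row per minute for an hour.
df = pd.DataFrame({
    'Fecha': pd.date_range('2020-01-01', periods=60, freq='1T'),
    'valor': range(60),  # hypothetical value column
})

df['Fecha'] = pd.to_datetime(df['Fecha'])   # ensure datetime dtype
df = df[df['Fecha'].dt.minute % 10 == 0]    # keep only 10-minute multiples
print(df)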

How to sum up a selected range of rows via a condition?

I hope that with this additional information someone can find time to help me with this new issue.
Sample data here --> file
'Date as index' (datetime.date)
As I said, I'm trying to select a range in the dataframe every time x is in the interval [-20, -190], create a new dataframe with a new column which is the sum of the selected rows, and keep the last "encountered" date as the index.
EDIT: The "loop" starts at the first date (the beginning of the df); when a value which is less than 0 or -190 is found, it is summed up, and the search and summing continue, and so on.
But I still get values which are inside the interval (-190, 0).
Example and code below.
Thanks
import pandas as pd
df = pd.read_csv('http://www.sharecsv.com/s/0525f76a07fca54717f7962d58cac692/sample_file.csv', sep = ';')
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3
##### output #####
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 11:28:00 -154.35
3 2019-01-02 12:08:00 -4706.87
4 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-29 16:58:00 -0.38
833 2019-09-30 17:08:00 -129365.71
834 2019-09-30 17:13:00 -157.05
835 2019-10-01 08:58:00 -111911.98
########## expected output #############
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 12:08:00 -4706.87
3 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-30 17:08:00 -129365.71
833 2019-10-01 08:58:00 -111911.98
...
...
Use Series.where with Series.between to replace the Date values of rows outside the range with NaN, back-fill the missing dates, and then aggregate with sum. The next step is to filter out the rows whose summed x still falls inside the range using boolean indexing, and finally use DataFrame.resample and cast the Series to a one-column DataFrame with Series.to_frame:
#range -190, 0
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3 = df3[~df3['x'].between(-190, 0)]
df3 = df3.resample('D', on='Date')['x'].sum().to_frame()
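To make the idea concrete, here is a tiny synthetic sketch (the column names 'Date' and 'x' are from the question; the values are made up). Rows whose x lies outside [-190, 0] lose their date, the back-fill stamps them with the date of the next in-interval row, and the groupby then sums each block. The final filtering and resample steps from the answer are omitted here:
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01 13:38:00', '2019-01-01 13:43:00',
                            '2019-01-01 13:48:00', '2019-01-02 11:18:00',
                            '2019-01-02 11:23:00']),
    'x': [-500.0, -700.0, -100.0, -900.0, -50.0],
})

# Keep the Date only where x falls inside [-190, 0]; the other rows get NaN
# and are back-filled, so every row carries the date of the next in-interval row.
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
print(df.groupby('Date', as_index=False)['x'].sum())
#                  Date       x
# 0 2019-01-01 13:48:00 -1300.0
# 1 2019-01-02 11:23:00  -950.0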

Vectorize a function for a GroupBy Pandas Dataframe

I have a Pandas dataframe sorted by a datetime column. Several rows will have the same datetime, but the "report type" column value is different. I need to select just one of those rows based on a list of preferred report types. The list is in order of preference. So, if one of those rows has the first element in the list, then that is the row chosen to be appended to a new dataframe.
I've tried a GroupBy and the ever so slow Python for loops to process each group to find the preferred report type and append that row to a new dataframe. I thought about the numpy vectorize(), but I don't know how to incorporate the group by in it. I really don't know much about dataframes but am learning. Any ideas on how to make it faster? Can I incorporate the group by?
The example dataframe
OBSERVATIONTIME REPTYPE CIGFT
2000-01-01 00:00:00 AUTO 73300
2000-01-01 00:00:00 FM-15 25000
2000-01-01 00:00:00 FM-12 3000
2000-01-01 01:00:00 SAO 9000
2000-01-01 01:00:00 FM-16 600
2000-01-01 01:00:00 FM-15 5000
2000-01-01 01:00:00 AUTO 5000
2000-01-01 02:00:00 FM-12 12000
2000-01-01 02:00:00 FM-15 15000
2000-01-01 02:00:00 FM-16 8000
2000-01-01 03:00:00 SAO 700
2000-01-01 04:00:00 SAO 3000
2000-01-01 05:00:00 FM-16 5000
2000-01-01 06:00:00 AUTO 15000
2000-01-01 06:00:00 FM-12 12500
2000-01-01 06:00:00 FM-16 12000
2000-01-01 07:00:00 FM-15 20000
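For anyone who wants to reproduce this, a sketch that rebuilds the example frame from the table above (in the question the real data comes from platformcig.data_pull):
import pandas as pd
from io import StringIO

csv = StringIO("""OBSERVATIONTIME,REPTYPE,CIGFT
2000-01-01 00:00:00,AUTO,73300
2000-01-01 00:00:00,FM-15,25000
2000-01-01 00:00:00,FM-12,3000
2000-01-01 01:00:00,SAO,9000
2000-01-01 01:00:00,FM-16,600
2000-01-01 01:00:00,FM-15,5000
2000-01-01 01:00:00,AUTO,5000
2000-01-01 02:00:00,FM-12,12000
2000-01-01 02:00:00,FM-15,15000
2000-01-01 02:00:00,FM-16,8000
2000-01-01 03:00:00,SAO,700
2000-01-01 04:00:00,SAO,3000
2000-01-01 05:00:00,FM-16,5000
2000-01-01 06:00:00,AUTO,15000
2000-01-01 06:00:00,FM-12,12500
2000-01-01 06:00:00,FM-16,12000
2000-01-01 07:00:00,FM-15,20000""")

df = pd.read_csv(csv, parse_dates=['OBSERVATIONTIME'])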
#################################################
# The function to loop through and find the row
################################################
def select_the_one_ob(df):
    ''' select the preferred observation '''
    tophour_df = pd.DataFrame()
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12',
                       'SY-MT', 'SY-SA']
    grouped = df.groupby("OBSERVATIONTIME", as_index=False)
    for name, group in grouped:
        a_group_df = pd.DataFrame(grouped.get_group(name))
        for reptype in preferred_order:
            preferred_found = False
            for i in a_group_df.index.values:
                if a_group_df.loc[i, 'REPTYPE'] == reptype:
                    tophour_df = tophour_df.append(a_group_df.loc[i].transpose())
                    preferred_found = True
                    break
            if preferred_found:
                break
        del a_group_df
    return tophour_df
################################################
### The function which calls the above function
################################################
def process_ceiling(plat, network):
    platformcig.data_pull(CONNECT_SRC, PULL_CEILING)
    data_df = platformcig.df
    data_df = select_the_one_ob(data_df)
With the complete dataset of 300,000 rows, the function takes over 4 hours.
I need it to be much faster. Can I incorporate the group by in numpy vectorize()?
You can avoid using groupby. One way is to make the 'REPTYPE' column an ordered pd.Categorical and then use sort_values and drop_duplicates, such as:
def select_the_one_ob(df):
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']
    df.REPTYPE = pd.Categorical(df.REPTYPE, categories=preferred_order, ordered=True)
    return (df.sort_values(by=['OBSERVATIONTIME', 'REPTYPE'])
              .drop_duplicates(subset='OBSERVATIONTIME', keep='first'))
With your example you get:
OBSERVATIONTIME REPTYPE CIGFT
1 2000-01-01 00:00:00 FM-15 25000
5 2000-01-01 01:00:00 FM-15 5000
8 2000-01-01 02:00:00 FM-15 15000
10 2000-01-01 03:00:00 SAO 700
11 2000-01-01 04:00:00 SAO 3000
12 2000-01-01 05:00:00 FM-16 5000
13 2000-01-01 06:00:00 AUTO 15000
16 2000-01-01 07:00:00 FM-15 20000
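One caveat with this approach: pd.Categorical silently turns any REPTYPE that is not listed in preferred_order into NaN. If the data can contain other report types (as the asker's own solution below anticipates), a sketched variant appends them to the end of the category list first:
import pandas as pd

def select_the_one_ob(df):
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']
    # Append any report types not already in the preferred list so they are kept
    # with the lowest priority instead of being converted to NaN.
    extras = [r for r in df['REPTYPE'].unique() if r not in preferred_order]
    df['REPTYPE'] = pd.Categorical(df['REPTYPE'],
                                   categories=preferred_order + extras,
                                   ordered=True)
    return (df.sort_values(by=['OBSERVATIONTIME', 'REPTYPE'])
              .drop_duplicates(subset='OBSERVATIONTIME', keep='first'))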
I found that by creating a separate dataframe of the same shape, populated with each hour of the observation time, I could use the pandas DataFrame merge() and, after the first pass, combine_first(). This took only minutes instead of hours.
def select_the_one_ob(df):
    ''' select the preferred observation
    Parameters:
        df (Pandas Object), a Pandas dataframe
    Returns Pandas Dataframe
    '''
    dshelldict = {'DateTime': pd.date_range(BEG_POR, END_POR, freq='H')}
    dshell = pd.DataFrame(data=dshelldict)
    dshell['YEAR'] = dshell['DateTime'].dt.year
    dshell['MONTH'] = dshell['DateTime'].dt.month
    dshell['DAY'] = dshell['DateTime'].dt.day
    dshell['HOUR'] = dshell['DateTime'].dt.hour
    dshell = dshell.set_index(['YEAR', 'MONTH', 'DAY', 'HOUR'])
    df = df.set_index(['YEAR', 'MONTH', 'DAY', 'HOUR'])

    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']
    reptype_list = list(df.REPTYPE.unique())
    # remove the preferred report types from the unique ones
    for rep in preferred_order:
        if rep in reptype_list:
            reptype_list.remove(rep)
    # If there are any unique report types left, append them to the preferred list
    if len(reptype_list) > 0:
        preferred_order = preferred_order + reptype_list

    ## first_pass is a flag to make sure a report type is used to transfer columns
    ## to the new DataFrame (the merge has to happen before combine_first)
    first_pass = True
    for reptype in preferred_order:
        if first_pass:
            ## if there is data in the dataframe
            if df[(df['MINUTE'] == 00) & (df['REPTYPE'] == reptype)].shape[0] > 0:
                first_pass = False
                # Merge the shell with the first df that has data; the dataframe is sorted by
                # original obstime and any dups are dropped, keeping the first report chronologically
                tophour_df = dshell.merge(
                    df[(df['MINUTE'] == 00) & (df['REPTYPE'] == reptype)]
                      .sort_values(['OBSERVATIONTIME'], ascending=True)
                      .drop_duplicates(subset=['ROLLED_OBSERVATIONTIME'], keep='first'),
                    how='left', left_index=True, right_index=True).drop('DateTime', axis=1)
        else:
            # combine_first takes the original dataframe and fills any NaN values with data
            # from another dataframe of identical shape,
            # e.g. if df.loc[2, col1] is NaN, df2.loc[2, col1] fills it if not NaN
            tophour_df = tophour_df.combine_first(
                df[(df['MINUTE'] == 00) & (df['REPTYPE'] == reptype)]
                  .sort_values(['OBSERVATIONTIME'], ascending=True)
                  .drop_duplicates(subset=['ROLLED_OBSERVATIONTIME'], keep='first'))

    tophour_df = tophour_df.reset_index()
    return tophour_df

Re-sampling and interpolating data using pandas from a given date column to a different date column

Using pandas, I can mostly find conversions and down-/up-sampling from, e.g., a daily date range to a monthly one, or from a monthly/yearly date range to a daily one.
Is there a way that given data for some arbitrary days one can map them to different days using interpolation/extrapolation?
Index.union, reindex, and interpolate
MCVE
Create toy data. Three rows every other day.
tidx = pd.date_range('2018-01-01', periods=3, freq='2D')
df = pd.DataFrame(dict(A=[1, 3, 5]), tidx)
df
A
2018-01-01 1
2018-01-03 3
2018-01-05 5
New index for those days in between
other_tidx = pd.date_range(tidx.min(), tidx.max()).difference(tidx)
Solution
Create a new index that is the union of the old index and the new index
union_idx = other_tidx.union(df.index)
When we reindex with this we get
df.reindex(union_idx)
A
2018-01-01 1.0
2018-01-02 NaN
2018-01-03 3.0
2018-01-04 NaN
2018-01-05 5.0
We see the gaps we expected. Now we can use interpolate. But we need to use the argument method='index' to ensure we interpolate relative to the size of the gaps in the index.
df.reindex(union_idx).interpolate('index')
A
2018-01-01 1.0
2018-01-02 2.0
2018-01-03 3.0
2018-01-04 4.0
2018-01-05 5.0
And now those gaps are filled.
We can reindex again to reduce to just the other index values
df.reindex(union_idx).interpolate('index').reindex(other_tidx)
A
2018-01-02 2.0
2018-01-04 4.0