Time column interval filter - pandas

I have a dataframe with a "Fecha" column. I would like to reduce the dataframe size by filtering it, keeping only the rows whose timestamp falls on a multiple of 10 minutes and discarding all the others.
Any ideas?
Thanks

I have to guess some variable names, but assuming your dataframe is named df, the solution should look similar to:
df['Fecha'] = pd.to_datetime(df['Fecha'])
df = df[df['Fecha'].dt.minute % 10 == 0]
The first line guarantees that your 'Fecha' column is in datetime format. The second line keeps only the rows whose minute is a multiple of 10; to do this you use the dt accessor together with the modulo operator %.
Since I'm not sure if this solves your problem, here's a minimal example that runs by itself:
import pandas as pd
idx = pd.date_range(pd.Timestamp(2020, 1, 1), periods=60, freq='1T')
series = pd.Series(1, index=idx)
series = series[series.index.minute % 10 == 0]
series
The first three lines construct a series with a 1 minute index, which is filtered in the fourth line.
Output:
2020-01-01 00:00:00 1
2020-01-01 00:10:00 1
2020-01-01 00:20:00 1
2020-01-01 00:30:00 1
2020-01-01 00:40:00 1
2020-01-01 00:50:00 1
dtype: int64


Pandas add row to datetime indexed dataframe

I cannot find a solution for this problem. I would like to add future dates to a datetime indexed Pandas dataframe for model prediction purposes.
Here is where I am right now:
new_datetime = df2.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
And this is where I am stuck. The append examples I find online always seem to use ignore_index=True, and in my case I want to keep proper datetime indexing.
Suppose you have this df:
date value
0 2020-01-31 00:00:00 1
1 2020-02-01 00:00:00 2
2 2020-02-02 00:00:00 3
then an alternative for adding future days is
df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=6, freq='D', closed='right')}))
which returns
date value
0 2020-01-31 00:00:00 1.0
1 2020-02-01 00:00:00 2.0
2 2020-02-02 00:00:00 3.0
0 2020-02-03 00:00:00 NaN
1 2020-02-04 00:00:00 NaN
2 2020-02-05 00:00:00 NaN
3 2020-02-06 00:00:00 NaN
4 2020-02-07 00:00:00 NaN
where the frequency is D (days) and periods is 6; because of closed='right' the start date itself is excluded, so 5 new rows are added.
I think I was making this more difficult than necessary because I was using a datetime index instead of the typical integer index. By leaving the 'date' field as a regular column instead of an index, adding the rows is straightforward.
One thing I did do was add a reset_index call so I did not end up with wonky duplicate index values:
df = df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=21, freq='D', closed='right')}))
df = df.reset_index(drop=True) # resets index without keeping the old one as a column
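Note: on pandas 2.0 and later, DataFrame.append has been removed and date_range's closed= argument has been replaced by inclusive=, so the snippet above needs a small rewrite. A minimal sketch, assuming the same df with 'date' as a regular column:
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2020-01-31', '2020-02-01', '2020-02-02']),
                   'value': [1, 2, 3]})

# build the future dates and concatenate them; ignore_index avoids duplicate labels
future = pd.DataFrame({'date': pd.date_range(start=df['date'].iloc[-1],
                                             periods=21, freq='D', inclusive='right')})
df = pd.concat([df, future], ignore_index=True)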
I also needed this. I solved it by merging the code you shared with the code from this other answer, "add to a dataframe as I go with datetime index", and ended up with the following code, which works for me.
data=raw.copy()
new_datetime = data.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
today_df = pd.DataFrame({'value': 301.124},index=new_datetime)
data = data.append(today_df)
data.tail()
Here 'value' is the column header of your own dataframe.
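If you are on a pandas release where append has been removed, the same step can be written with pd.concat, a small sketch reusing the variables above:
data = pd.concat([data, today_df])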

Drop Duplicates based on Nearest Datetime condition

import pandas as pd
def nearest(items, pivot):
return min(items, key=lambda x: abs(x - pivot))
df = pd.read_csv("C:/Files/input.txt", dtype=str)
duplicatesDf = df[df.duplicated(subset=['CLASS_ID', 'START_TIME', 'TEACHER_ID'], keep=False)]
duplicatesDf['START_TIME'] = pd.to_datetime(duplicatesDf['START_TIME'], format='%Y/%m/%d %H:%M:%S.%f')
print(duplicatesDf)
print(df['START_TIME'].dt.date)
df:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
2,123,2020/06/01 20:47:26.000,o1,2020/06/04 20:47:26.000
3,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:47:26.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
So, I've got a dataframe like the one above. As you can see, I have multiple records with the same CLASS_ID, START_TIME and TEACHER_ID. Whenever multiple records like these are present, I would like to retain only one record, based on the condition that the retained record should have its END_TIME nearest to its START_TIME (with minute-level precision).
In this case,
for CLASS_ID 123, the record with ID 1 will be retained, as its END_TIME 2020/06/02 00:00:00.000 is nearest to its START_TIME 2020/06/01 20:47:26.000, compared to the record with ID 2 whose END_TIME is 2020/06/04 20:47:26.000. Similarly, for CLASS_ID 789, the record with ID 4 will be retained.
Hence the expected output will be:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
I've been going through the following links,
https://stackoverflow.com/a/32237949,
https://stackoverflow.com/a/33043374
to find a solution but have unfortunately reached an impasse.
Hence, would some kind soul mind helping me out a bit? Many thanks.
IIUC, we can use .loc and idxmin(): after creating a helper column that measures the elapsed time between the start and the end, we apply idxmin() as a groupby operation on your CLASS_ID column. (This assumes both START_TIME and END_TIME have already been converted with pd.to_datetime.)
df.loc[
    df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
      .groupby("CLASS_ID")["mins"]
      .idxmin()
]
ID CLASS_ID START_TIME TEACHER_ID END_TIME
0 1 123 2020-06-01 20:47:26 o1 2020-06-02 00:00:00
4 5 456 2020-06-01 20:47:26 o5 2020-06-08 20:00:26
3 4 789 2020-06-01 20:47:26 o3 2020-06-03 14:40:00
In steps:
Time delta:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))[['CLASS_ID','mins']])
CLASS_ID mins
0 123 0 days 03:12:34
1 123 3 days 00:00:00
2 789 1 days 18:00:00
3 789 1 days 17:52:34
4 456 6 days 23:13:00
Minimum index from the time delta column whilst grouping by CLASS_ID:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
        .groupby("CLASS_ID")["mins"]
        .idxmin())
CLASS_ID
123 0
456 4
789 3
Name: mins, dtype: int64
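As an alternative, a sketch with sort_values and drop_duplicates (again assuming START_TIME and END_TIME are already datetime columns): sort by the elapsed time and keep the first, i.e. smallest, row per CLASS_ID.
out = (df.assign(mins=df["END_TIME"] - df["START_TIME"])
         .sort_values("mins")                             # smallest elapsed time first
         .drop_duplicates(subset="CLASS_ID", keep="first")
         .drop(columns="mins")
         .sort_index())                                    # restore the original row order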

Compare two data frames for different values in a column

I have two dataframes. Please tell me how I can compare them by operator name: if a name matches, add that row's count and time values to the first dataframe.
In [2]: df1
Out[2]:
     Name  count     time
0     Bob    123  4:12:10
1   Alice     99  1:01:12
2  Sergei     78  0:18:01
85 rows x 3 columns

In [3]: df2
Out[3]:
   Name  count     time
0  Rick      9  0:13:00
1  Jone      7  0:24:21
2   Bob     10  0:15:13
105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
Use set_index and add them together. Finally, update the result back into df1.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume the time columns in both df1 and df2 are already timedeltas. If they are strings, you need to convert them before running the commands above, as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)
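Putting it together as a small, self-contained sketch with made-up data matching the example above (the time columns converted to timedeltas first):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Bob', 'Alice', 'Sergei'],
                    'count': [123, 99, 78],
                    'time': pd.to_timedelta(['4:12:10', '1:01:12', '0:18:01'])})
df2 = pd.DataFrame({'Name': ['Rick', 'Jone', 'Bob'],
                    'count': [9, 7, 10],
                    'time': pd.to_timedelta(['0:13:00', '0:24:21', '0:15:13'])})

df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))  # names only in df2 (Rick, Jone) are left untouched
df1 = df1.reset_index()
print(df1)  # Bob becomes 133 / 0 days 04:27:23; Alice and Sergei are unchanged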

Sorting date values in a dataframe doesn't work

I have the column 'Created At' in this form.
The dates are in the format '%d/%m/%Y' (day, month, year):
import pandas as pd

obj = {'Created At': ['01/01/2017', '01/02/2017', '02/01/2017', '02/02/2017',
                      '03/01/2017', '03/02/2017', '04/01/2017'],
       'Text': [1, 70, 14, 17, 84, 76, 32]}
df = pd.DataFrame(data=obj)
I tried this, but it doesn't work:
df.sort_values(by='Created At', inplace=True)
It seems that it sorts only the days and disregards the month. What do I do?
It does sort properly: your dates are strings here, and strings are sorted lexicographically. That means it compares the first character, and only if that is the same does it move on to the second character, and so on.
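For example, in a plain string sort, '01/02/2017' (1 February) comes before '02/01/2017' (2 January), even though 2 January is the earlier date:
sorted(['02/01/2017', '01/02/2017'])
# ['01/02/2017', '02/01/2017']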
You therefore might want to convert the column first to datetime objects:
df['Created At'] = pd.to_datetime(df['Created At'], format='%d/%m/%Y')
then we can sort the dataframe, and obtain:
>>> df.sort_values(by='Created At', inplace=True)
>>> df
Created At Text
0 2017-01-01 1
2 2017-01-02 14
4 2017-01-03 84
6 2017-01-04 32
1 2017-02-01 70
3 2017-02-02 17
5 2017-02-03 76

How can a pandas dataframe with a TimedeltaIndex be grouped by nearest whole day?

I've got a pandas DataFrame with an index of pd.Timedelta values, some of which are fractions of days. I'd like to use df.groupby to group the rows by whole days (ignoring the fractional part) so that I can calculate the mean.
Here's an example of what I'd like to do:
import pandas as pd
import numpy as np
data = [[1,2,3], [2,3,4], [3,4,5], [1,2,3], [2,3,4], [3,4,5]]
idx = [pd.Timedelta('1.2 days'), pd.Timedelta('1.2 days'),
       pd.Timedelta('3.8 days'), pd.Timedelta('3.8 days'),
       pd.Timedelta('4.2 days'), pd.Timedelta('4.2 days')]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df.index = idx
df
Out:
a b c
1 days 04:48:00 1 2 3
1 days 04:48:00 2 3 4
3 days 19:12:00 3 4 5
3 days 19:12:00 1 2 3
4 days 04:48:00 2 3 4
4 days 04:48:00 3 4 5
The line below produces the desired result; however, it creates an extra row for every day in between, so there are rows full of NaNs, which I subsequently remove with dropna(). Is there a better approach to doing this?
df.groupby(pd.Grouper(freq='D')).aggregate(np.mean).dropna()
Your approach is fine, or you can just group by df.index.days as below:
In [196]: df.groupby(df.index.days).mean()
Out[196]:
a b c
1 1.5 2.5 3.5
3 2.0 3.0 4.0
4 2.5 3.5 4.5
The difference between the two methods is how rows on the margins get grouped. With yours, a row at 2 days 02:00:00 would get grouped with the 1-day rows, since pd.Grouper starts its daily bins at the first index value, whereas with mine it gets its own group, since .days simply truncates each timedelta to whole days.
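A small sketch of that difference, reusing df from the question and adding one row just past the 2-day mark (output omitted; whether the extra row lands with the 1-day rows or in its own group depends on the method):
extra = pd.DataFrame([[9, 9, 9]], columns=['a', 'b', 'c'],
                     index=[pd.Timedelta('2 days 02:00:00')])
df2 = pd.concat([df, extra]).sort_index()

print(df2.groupby(pd.Grouper(freq='D')).mean().dropna())  # daily bins anchored at the first index value
print(df2.groupby(df2.index.days).mean())                 # groups by the whole-day component only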