How to convert duration strings to seconds? - pandas

I have the following column in a pandas DataFrame:
duration
1 day 22:12:15.778543
2 days 10:09:07.118723
00:18:23.985112
I would like to convert these durations to seconds.
How can I do this? I am not sure whether it is even possible because of the mixed string format I have (1 day, 2 days, etc.).

Use to_timedelta with Series.dt.total_seconds:
df['s'] = pd.to_timedelta(df['duration']).dt.total_seconds()
print(df)
                  duration              s
0    1 day 22:12:15.778543  166335.778543
1  2 days 10:09:07.118723   209347.118723
2          00:18:23.985112    1103.985112
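If the column can also contain malformed strings, to_timedelta accepts errors='coerce' to map those to NaT (their total_seconds then come out as NaN) instead of raising:
# errors='coerce' turns unparseable strings into NaT rather than raising.
df['s'] = pd.to_timedelta(df['duration'], errors='coerce').dt.total_seconds()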

Related

Converting duration in varchar to number type and minutes

I'm struggling with this.
I have a column in Snowflake called DURATION, it is VARCHAR type.
The values are essentially numbers with units of days, hours, minutes, and seconds. A value may contain a single number with one unit of time (day, hour, minute, or second), such as 3 hours or 14 minutes or 3 seconds, or it may combine several units, such as 1 day 3 hours 35 minutes or 1 hour 9 minutes or 45 minutes 1 second.
A value could also be blank or invalid (e.g. free text), or it could name a day, hour, or minute without a number (see the last 3 rows in the table below).
I would greatly appreciate it if you guys could help me with the following:
in SNOWFLAKE, convert all valid values to a number type and normalize them to minutes (e.g. the resulting value for 7 Hours 13 Minutes would be 433).
Thanks a lot, guys!
DURATION
1 Second
10 Seconds
1 Minute
3 Minutes
20 Minutes
1 Hour
2 Hours
7 Hours 13 Minutes
1 Hour 1 Minute
1 Day
1 Day 1 Hour
1 Day 1 Hour 1 Minute
1 Day 10 Hours
2 Days 1 Hour
3 Days 9 Hours
1 Day 3 Hours 45 Minutes
Duration (invalid)
Days
Day Minute
Minutes
I tried many things using REGEXP_SUBSTR, TRY_TO_NUMBER, and COALESCE in CASE statements, but I'm getting either 0 or NULL for all values. Very frustrating.
I think you would want to use STRTOK_TO_ARRAY in a CTE subquery, or put the result into a temp table. Then you could use ARRAY_POSITION to find the unit labels; the element at the index one less than each label should be its value. Those values could be pulled into separate columns with a CASE expression for each label. The CASE expressions could be computed columns if you insert the results of the first query into a table. From there you can concatenate the parts with colons, cast to a time type and use DATEDIFF, or just do the arithmetic to calculate the minutes.
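Before committing to the SQL, it can help to prototype the normalization outside Snowflake. A minimal pandas sketch of the same goal, using regex extraction per unit rather than the STRTOK_TO_ARRAY approach described above (the DataFrame and column names are illustrative):
import pandas as pd

df = pd.DataFrame({'DURATION': ['1 Second', '7 Hours 13 Minutes',
                                '1 Day 3 Hours 45 Minutes', 'Days']})

# Pull the number in front of each unit; a missing unit contributes 0.
units = {'Day': 1440, 'Hour': 60, 'Minute': 1, 'Second': 1 / 60}
minutes = sum(
    pd.to_numeric(df['DURATION'].str.extract(rf'(\d+)\s*{unit}', expand=False)).fillna(0) * factor
    for unit, factor in units.items()
)
# Rows with no digits at all ('Days', 'Day Minute', ...) would otherwise
# come out as 0; mark them as invalid (NaN) instead.
df['MINUTES'] = minutes.where(df['DURATION'].str.contains(r'\d'))
print(df)  # '7 Hours 13 Minutes' -> 433.0, 'Days' -> NaN
The same shape carries over to SQL: one regex per unit, one multiplier per unit, and a sum.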

How to get the day of the week from a timestamp?

I am using Hive 0.14 to analyze the following input data in timestamp format (the # column is irrelevant to the explanation):
#  Datetime
1  2022-03-01 00:13:08
2  2022-03-31 23:52:24
3  2022-02-28 23:32:40
and I want to get the day of the week on which each record took place (either as a number representing the day from 0 to 6, or the day name itself), similar to the following format:
#  Day of the week
1  Tuesday or 2
2  Thursday or 4
3  Monday or 1
I have tried to use unix_timestamp to transform the timestamp into an integer, like this:
select cast(from_unixtime(unix_timestamp(datetime,'yyyy-MM-dd'),'yyyyMMdd') as int) as dayint from yellowtaxi;
with the idea of then using from_unixtime(dayint, 'u') to get the day of the week. However, this results in dayint being 20220301 for every row, and in every day coming out as 7 when using from_unixtime(dayint, 'u').
What am I doing wrong, or is there an easier way to do it?
I have already tried date_format() and dayofweek(), but neither seems to be available in my Hive version.
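As a sanity check outside Hive, the expected output for those three timestamps can be reproduced in pandas (a side check only, not a Hive solution; the shift maps pandas' Monday=0 numbering onto the Sunday=0 scheme used above):
import pandas as pd

ts = pd.to_datetime(pd.Series(['2022-03-01 00:13:08',
                               '2022-03-31 23:52:24',
                               '2022-02-28 23:32:40']))
print(ts.dt.day_name())           # Tuesday, Thursday, Monday
print((ts.dt.dayofweek + 1) % 7)  # 2, 4, 1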

Pandas resample by integration over time with non equidistant data

I have a DataFrame with a DatetimeIndex with non-equidistant timestamps. I want to get the mean for each hour, but resample(...).mean() does not take the time distance between the timestamps into account.
How can I resample a DataFrame with a DatetimeIndex so that the values in a column are integrated over time?
Given the following data:
time   data
00:15     5
00:55     1
00:56     1
00:57     1
resample('1h').mean() would give (5 + 1 + 1 + 1) / 4 = 2 here, but the value 1 was only in effect for about 3 of the 60 minutes, so a time-weighted mean should land much closer to 5.
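To make the goal concrete, here is a minimal sketch of such a time-weighted (zero-order-hold) mean for the hour above, assuming each value holds until the next timestamp and the last value holds until the end of the hour (the date is invented):
import pandas as pd

idx = pd.to_datetime(['2021-01-01 00:15', '2021-01-01 00:55',
                      '2021-01-01 00:56', '2021-01-01 00:57'])
s = pd.Series([5, 1, 1, 1], index=idx)

start = s.index.to_series()
# Each value is valid until the next timestamp; the last one until 01:00.
end = start.shift(-1).fillna(start.dt.floor('h') + pd.Timedelta(hours=1))
weights = (end - start).dt.total_seconds()
print((s * weights).sum() / weights.sum())  # ~4.56 rather than the unweighted 2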

Count rows in a dataframe where date is in the past 7 days

I have this dataframe with a "date" column in it. I want to count the rows where the date is in the past 7 days. What's the best way to do this? I feel like using an If and a counter isn't very pandas-esque.
Also, I'm importing the data from a SQL db. Should I just load it already filtered with a query? What's the most efficient way?
Suppose your dataframe is something like this:
df = pd.DataFrame({'date': ['2021-12-03', '2021-12-02', '2021-12-01', '2021-11-30'], 'data': [1, 2, 3, 4]})
         date  data
0  2021-12-03     1
1  2021-12-02     2
2  2021-12-01     3
3  2021-11-30     4
If you want to filter the data between the dates 2021-11-30 and 2021-12-02, you can use the following command (the slice runs from the later date to the earlier one because the index is sorted in descending order):
df_filtered = df.set_index('date').loc['2021-12-02':'2021-11-30'].reset_index()
         date  data
0  2021-12-02     2
1  2021-12-01     3
2  2021-11-30     4
In the first step you set date as the index, and after that use the .loc accessor to slice the desired dates. In the final step you can count the number of rows with len(df_filtered).
My suggestion:
First, calculate the interval endpoints: today and the date 7 days ago.
import datetime
today = datetime.date.today()
past7 = today - datetime.timedelta(days=7)
Use them to filter your dataframe (this assumes df['date'] holds actual date/datetime values; if it holds strings, convert with pd.to_datetime first):
df_filtered = df[(df['date'] >= past7) & (df['date'] <= today)]
Get the df_filtered length:
print(len(df_filtered))
Or, as a one-liner:
len(df[pd.to_datetime(df['date']) > pd.Timestamp.today() - pd.Timedelta(days=7)])

Is there a way to fix or bypass weird time formats in a specific column in a dataframe?

I am working with a SLURM dataset in pandas that has time formats like this in the 'Elapsed' column:
00:00:00
00:26:51
However, some entries are greater than 24 hours, and those are displayed like so:
1-00:02:00
3-01:25:02
I want to find the mean of the entire column, but the to_timedelta conversion mishandles the entries above 24 hours shown above. One example:
Before to_timedelta: 3-01:25:02
After to_timedelta: -13 days +10:34:58
I cannot simply convert the column into a new format, because entries under 24 hours have no leading day count (there is never, for example, a 0-20:00:00); otherwise that method would be the easiest.
Is there a way to fix this conversion, or any other ideas on approaching this?
One way around it is to replace the - with days:
pd.to_timedelta(df['time'].str.replace('-','days'))
Output (for the four example values above):
0   0 days 00:00:00
1   0 days 00:26:51
2   1 days 00:02:00
3   3 days 01:25:02
Name: time, dtype: timedelta64[ns]
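From there the mean the question asked for is direct; a short follow-up sketch (column name taken from the answer above, with spaces added around 'days' purely for readability):
td = pd.to_timedelta(df['time'].str.replace('-', ' days '))
print(td.mean())  # 1 days 00:28:28.250000 for the four values above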