Converting a multi-time formatted string into seconds (pandas) - pandas

I've been searching for a solution to this and the closest I got was here:
Convert time string expressed as <number>[m|h|d|s|w] to seconds in Python
However, none of the solutions work because the time format is inconsistent throughout the column: each value contains a different subset of the units, sometimes only one. e.g.
['4h 30m 24s', '13w 5d', '11w']
When I .apply() this over the entire column it fails. How can I convert all of these rows into seconds? I tried df['time_value'].str.split(), but this is a very messy and seemingly inefficient way to do it. Is there a better way?

How about applying this method?
def convert_to_seconds(s):
    seconds = 0
    seconds_per_unit = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
    for part in s.split():
        number = int(part[:-1])
        unit = part[-1]
        seconds += number * seconds_per_unit[unit]
    return seconds
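As a minimal usage sketch, the function can be applied to the column from the question. The DataFrame below is hypothetical sample data built from the three example strings, and the column name time_value is taken from the question.

```python
import pandas as pd

SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def convert_to_seconds(s):
    # Sum every "<number><unit>" token in a whitespace-separated string.
    seconds = 0
    for part in s.split():
        seconds += int(part[:-1]) * SECONDS_PER_UNIT[part[-1]]
    return seconds

# Hypothetical sample data using the strings from the question.
df = pd.DataFrame({"time_value": ["4h 30m 24s", "13w 5d", "11w"]})
df["seconds"] = df["time_value"].apply(convert_to_seconds)
print(df)
```

Each row yields a single total, regardless of how many units the string contains.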

You can use stack, then map each unit to its number of seconds:
s=pd.Series(l)
s=s.str.split(expand=True).stack().to_frame('ALL')
s['v']=s['ALL'].str[:-1].astype(int)
s['t']=s['ALL'].str[-1]
seconds_per_unit = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
(s.t.map(seconds_per_unit)*s.v).unstack()
Out[625]:
           0         1     2
0    14400.0    1800.0  24.0
1  7862400.0  432000.0   NaN
2  6652800.0       NaN   NaN
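The table above still holds one column per unit, so the parts have to be summed to get one value per row. A possible vectorised sketch follows; note that it swaps the split/stack step for str.extractall with a regular expression, which is my substitution rather than the answer's method, and the sample Series is assumed from the question.

```python
import pandas as pd

seconds_per_unit = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

# Assumed sample data from the question.
s = pd.Series(["4h 30m 24s", "13w 5d", "11w"])

# One row per "<number><unit>" match, indexed by (original row, match number).
parts = s.str.extractall(r"(?P<value>\d+)(?P<unit>[smhdw])")
parts["seconds"] = parts["value"].astype(int) * parts["unit"].map(seconds_per_unit)

# Collapse the matches back to one total per original row.
total = parts.groupby(level=0)["seconds"].sum()
print(total)
```

Rows that contain no recognisable token produce no matches and would be missing from total, so reindexing against the original index may be needed on real data.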

Related

convert to datetime based on condition

I want to convert my datetime object into seconds
0 49:36.5
1 50:13.7
2 50:35.8
3 50:37.4
4 50:39.3
...
92 1:00:47.8
93 1:01:07.7
94 1:02:15.3
95 1:05:03.0
96 1:05:29.6
Name: Finish, Length: 97, dtype: object
The problem is that the format changes at index 92, which results in an error: ValueError: expected hh:mm:ss format before .
This error is raised when I try to convert the column to seconds:
filt_data["F"] = pd.to_timedelta('00:'+filt_data["Finish"]).dt.total_seconds()
When I do the conversion in two steps it works, but it results in two different columns, which I don't know how to merge, and it doesn't seem very efficient either:
filt_data["F1"] = pd.to_timedelta('00:'+filt_data["Finish"].loc[0:89]).dt.total_seconds()
filt_data["F2"] = pd.to_timedelta('0'+filt_data["Finish"].loc[90:97]).dt.total_seconds()
The above code does not raise any error and gets the job done, but results in two different columns. Any idea how to do this in one step?
Ideally I would like to loop through the column and, based on the format, i.e. "50:39.3" or "1:00:47.8", prepend "00:" or "0" to the string.
I would use str.replace:
pd.to_timedelta(df['Finish'].str.replace(r'^(\d+:\d+\.\d+)', r'0:\1', regex=True))
Or str.count and map:
pd.to_timedelta(df['Finish'].str.count(':').map({1: '0:', 2: ''}).add(df['Finish']))
Output:
0 0 days 00:49:36.500000
1 0 days 00:50:13.700000
2 0 days 00:50:35.800000
3 0 days 00:50:37.400000
4 0 days 00:50:39.300000
92 0 days 01:00:47.800000
93 0 days 01:01:07.700000
94 0 days 01:02:15.300000
95 0 days 01:05:03
96 0 days 01:05:29.600000
Name: Finish, dtype: timedelta64[ns]
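Since the question asks for seconds rather than timedeltas, a short sketch of the last step may help. It reuses the str.count approach and then calls .dt.total_seconds(); the Series here is a hypothetical four-row subset of the Finish column.

```python
import pandas as pd

# Hypothetical subset of the "Finish" column with both formats.
finish = pd.Series(["49:36.5", "50:39.3", "1:00:47.8", "1:05:03.0"])

# Values with a single ':' are mm:ss.f and need an hours part in front.
prefix = finish.str.count(":").map({1: "0:", 2: ""})

# Parse as timedeltas, then convert to float seconds.
seconds = pd.to_timedelta(prefix + finish).dt.total_seconds()
print(seconds)
```

The result is a float column, so the fractional seconds are preserved.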
Given your data:
import pandas as pd
times = [
"49:36.5",
"50:13.7",
"50:35.8",
"50:37.4",
"50:39.3",
"1:00:47.8",
"1:01:07.7",
"1:02:15.3",
"1:05:03.0",
"1:05:29.6",
]
df = pd.DataFrame({'time': times})
df
You can write a function that you apply on each separate entry in the time column:
def format_time(time):
    # drop the fractional seconds
    time = time.split('.')[0]
    time = time.split(':')
    # mm:ss values get a leading hours part
    if len(time) < 3:
        time.insert(0, "0")
    return ":".join(time)
df["formatted_time"] = df.time.apply(format_time)
df
Then you could undertake two steps:
Convert column to datetime
Convert column to the number of seconds since midnight (subtract the date part, otherwise the result counts from whatever date pandas assigns, not from zero)
df["time_datetime"] = pd.to_datetime(df.formatted_time, format="%H:%M:%S")
df["time_seconds"] = (df.time_datetime - df.time_datetime.dt.normalize()) // pd.Timedelta('1s')
df

Using value_counts in pandas with conditions

I have a column with around 20k values. I've used the following function in pandas to display their counts:
weather_data["snowfall"].value_counts()
weather_data is the dataframe and snowfall is the column.
My results are:
0.0 12683
M 7224
T 311
0.2 32
0.1 31
0.5 20
0.3 18
1.0 14
0.4 13
etc.
Is there a way to:
Display the counts of only a single variable or number
Use an if condition to display the counts of only those values which satisfy the condition?
I'll be as clear as possible without having a full example as piRSquared suggested you to provide.
The output of value_counts is a Series, so the values in your original Series can be retrieved from its index. Displaying the count for only one of the values is then just a matter of slicing that Series:
my_value_count = weather_data["snowfall"].value_counts()
my_value_count.loc['0.0']
output:
0.0 12683
If you want to display only for a list of variables:
my_value_count.loc[my_value_count.index.isin(['0.0','0.2','0.1'])]
output:
0.0 12683
0.2 32
0.1 31
As you have M and T in your values, I suspect the other values will be treated as strings and not float. Otherwise you could use:
my_value_count.loc[my_value_count.index < 0.4]
output:
0.0 12683
0.2 32
0.1 31
0.3 18
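If the column really does hold strings because of the M and T markers, one way to apply a numeric condition is to coerce the values first. The sketch below assumes a small hypothetical snowfall Series and uses pd.to_numeric with errors="coerce", which turns the non-numeric markers into NaN so they drop out of the comparison.

```python
import pandas as pd

# Hypothetical sample; "M" and "T" are the non-numeric markers from the question.
snowfall = pd.Series(["0.0", "M", "T", "0.2", "0.0", "0.1", "0.5", "M", "0.0"])

# Count of one specific value.
missing_count = (snowfall == "M").sum()

# Counts of only those values that satisfy a numeric condition.
numeric = pd.to_numeric(snowfall, errors="coerce")  # "M"/"T" become NaN
small_counts = numeric[numeric < 0.4].value_counts()

print(missing_count)
print(small_counts)
```

Filtering before calling value_counts keeps the condition on the data itself, so the index of the result is already numeric.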
Use an if condition to display the counts of only those values which satisfy the condition?
First create a new column based on the condition you want. Then you can use groupby and sum.
For example, suppose you want to count the frequency only where a column has a non-null value. In my case, the condition is that ACTUAL_COMPLETION_DATE is non-null:
dataset['Has_actual_completion_date'] = np.where(dataset['ACTUAL_COMPLETION_DATE'].isnull(), 0, 1)
dataset['Mitigation_Plans_in_progress'] = dataset['Has_actual_completion_date'].groupby(dataset['HAZARD_ID']).transform('sum')

SQL code Interpretation "mod" and "CStr"

I've been given an example of SQL code and I need to understand what this is doing before I convert it to another language.
Can someone explain to me in English what this code does please?
=IIF(Fields!Time.Value\60 < 10, "0" + CStr(Fields!Time.Value\60), CStr(Fields!Time.Value\60))
+ ":" +
IIF(Fields!Time.Value mod 60 < 10, "0" + FormatNumber(Fields!Time.Value mod 60,0), FormatNumber(Fields!Time.Value mod 60,0))
Many Many Thanks
It is taking a time (in seconds) and converting it to the format "mm:ss". For what it is worth, this looks more like VB script than SQL. Anyway, CStr simply converts to a string, and mod gives the remainder of a division, e.g. 16 mod 10 gives you 6, and 26 mod 10 would also give you 6.
The first part uses Fields!Time.Value\60 (integer division) to get the time in whole minutes, then, when this number is less than 10, prepends a 0 to it:
| If minutes less than 10 | Prepend 0 to minutes | else just use minutes |
=IIF( Fields!Time.Value\60 < 10 , "0" + CStr(Fields!Time.Value\60), CStr(Fields!Time.Value\60))
The next part does basically the same with the seconds, but uses mod to get the number of seconds. For example, 97 seconds needs to be broken down to "01:37": 97 \ 60 is used to get the 1, and because this is less than 10, "0" is prepended to it. Then 97 mod 60 is used to get the seconds, which gives 37; since this is not less than 10, nothing is prepended.
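Since the goal is to port the expression to another language, here is a rough Python sketch of the same logic. The function name format_mm_ss is hypothetical, and the sketch assumes the input is a non-negative whole number of seconds.

```python
def format_mm_ss(total_seconds: int) -> str:
    # divmod returns (integer quotient, remainder), i.e. "\ 60" and "mod 60".
    minutes, seconds = divmod(total_seconds, 60)
    # ":02d" pads to at least two digits, like prepending "0" when < 10.
    return f"{minutes:02d}:{seconds:02d}"

print(format_mm_ss(97))
```

As in the original expression, minutes are not wrapped into hours, so inputs of 6000 seconds or more produce a minutes part with three or more digits.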

Pandas shifting uneven timeseries data

I have some irregularly stamped time series data, with timestamps and the observations at every timestamp, in pandas. Irregular basically means that the timestamps are uneven, for instance the gap between two successive timestamps is not even.
For instance the data may look like
Timestamp Property
0 100
1 200
4 300
6 400
6 401
7 500
14 506
24 550
.....
59 700
61 750
64 800
Here the timestamp is, say, seconds elapsed since a chosen origin time. As you can see, we could have data at the same timestamp, 6 secs in this case. The underlying timestamps are strictly different; it is just that second resolution cannot measure the change.
Now I need to shift the timeseries data ahead, say I want to shift the entire data by 60 secs, or a minute. So the target output is
Timestamp Property
0 750
1 800
So the 0 point got matched to the 61 point and the 1 point got matched to the 64 point.
Now I can do this by writing something dirty, but I am looking to use as much as possible any inbuilt pandas feature. If the timeseries were regular, or evenly gapped, I could've just used the shift() function. But the fact that the series is uneven makes it a bit tricky. Any ideas from Pandas experts would be welcome. I feel that this would be a commonly encountered problem. Many thanks!
Edit: added a second, more elegant, way to do it. I don't know what will happen if you had a timestamp at 1 and two timestamps of 61. I think it will choose the first 61 timestamp but not sure.
new_stamps = pd.Series(range(df['Timestamp'].max()+1))
shifted = pd.DataFrame(new_stamps)
shifted.columns = ['Timestamp']
merged = pd.merge(df,shifted,on='Timestamp',how='outer')
merged['Timestamp'] = merged['Timestamp'] - 60
merged = merged.sort_values(by='Timestamp').bfill()
results = pd.merge(df,merged, on = 'Timestamp')
[Original Post]
I can't think of an inbuilt or elegant way to do this. Posting this in case it's more elegant than your "something dirty", which is I guess unlikely. How about:
lookup_dict = {}
def assigner(row):
    lookup_dict[row['Timestamp']] = row['Property']
df.apply(assigner, axis=1)
sorted_keys = sorted(lookup_dict.keys())
df['Property_Shifted'] = None
def get_shifted_property(row, shift_amt):
    # take the first timestamp at or after the shifted time
    for i in sorted_keys:
        if i >= row['Timestamp'] + shift_amt:
            row['Property_Shifted'] = lookup_dict[i]
            break
    return row
df = df.apply(get_shifted_property, shift_amt=60, axis=1)
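For comparison, newer pandas versions offer pd.merge_asof, which performs this kind of "nearest key at or after" lookup directly. The sketch below is not part of either approach above. It assumes the sample rows shown in the question (the elided rows are simply left out), and the column names target and Property_Shifted are introduced here for illustration.

```python
import pandas as pd

# Sample rows from the question; the elided middle rows are omitted.
df = pd.DataFrame({
    "Timestamp": [0, 1, 4, 6, 6, 7, 14, 24, 59, 61, 64],
    "Property": [100, 200, 300, 400, 401, 500, 506, 550, 700, 750, 800],
})

shift_amt = 60

# Left side: every timestamp plus the time it should be matched against.
left = df[["Timestamp"]].assign(target=df["Timestamp"] + shift_amt)
# Right side: the original observations, keyed by the same column name.
right = df.rename(columns={"Timestamp": "target", "Property": "Property_Shifted"})

# direction="forward" picks the first right-hand row with key >= target.
result = pd.merge_asof(left, right, on="target", direction="forward")
print(result)
```

With these defaults an exact hit counts as a match, so timestamp 1 pairs with the observation at 61. Passing allow_exact_matches=False would require a strictly later observation instead, which pairs timestamp 1 with the observation at 64 as in the target output of the question.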

Using textscan() in Octave... How to properly format?

I have a data set on disk that I'm reading line by line, and I would like to convert one of the columns of data into a vector of floats (with a range of 0-23.99999) (for that day).
The data looks something like the following:
2010/01/01,00:00:00.979131, 27.4485, 51.9362, 14.8, 6
2010/01/01,00:00:01.021977, 27.5149, 51.9375, 16.0, 6
2010/01/01,00:00:01.074032, 27.4797, 51.9446, 14.5, 10
2010/01/01,00:00:01.663689, 25.8441,-152.8141, 14.6, 6
2010/01/01,00:00:01.639541, 25.8744,-152.6122, 1.5, 5
2010/01/01,00:00:02.232099, -2.2447, 11.5023, 18.8, 6
2010/01/01,00:00:02.256351, -0.8135, 27.3139, 17.7, 5
2010/01/01,00:00:02.306734, -2.7797, 28.5109, 26.0, 5
2010/01/01,00:00:02.620765, 25.6656,-154.2029, 26.2, 9
2010/01/01,00:00:02.658495, 25.6698,-154.2157, 23.0, 6
2010/01/01,00:00:02.731266, -5.7106, 126.4517, 3.6, 5
2010/01/01,00:00:02.787495, -5.7138, 126.5210, 24.4, 8
2010/01/01,00:00:02.811636, -3.2453, 124.6919, 21.1, 8
column 2 (e.g., 00:00:00.979131) is of interest and I would like to do something like
setenv GNUTERM 'x11';
fid = fopen('myfile.txt', 'r');
m = textscan(fid, '%d%d%d%d/%d%d/%d%d, %d%d:%d%d:%d%f, %f, %f, %f, %d');
mx = m(:, 5); %here, I would expect to grab 14.8, 16.0, etc
my = m(:, 2) / 24.0; %here, all data from timestamp column (00:00:00.979131, for ex)
plot(mx, my);
The issue is that the string I pass to textscan is malformatted for my data.
The formatting of that number is "hrs:minutes:seconds", in military time.
How can I access/convert the values for the vars mx, or my?
Thanks,
jml
The output of textscan is a cell array. If you use the command in your answer:
m = textscan(fid, '%d/%d/%d %d:%d:%f %f %f %f %d', 'delimiter', ',');
Then to get a vertical vector of 14.8, 16.0, 14.5, etc.:
MyNinthField = m{9};
MyNinthField =
14.8000
16.0000
14.5000
14.6000
1.5000
18.8000
17.7000
26.0000
26.2000
23.0000
3.6000
24.4000
21.1000
Then, to get the timestamp (seconds since the beginning of the day):
Hours = double(m{4});
Minutes = double(m{5});
Seconds = m{6};
For Seconds double is not needed, because m{6} is already double. However, m{4} and m{5} are both int32.
To get the time of the day in seconds, all you need is:
TimeOfDayInSeconds = 3600*Hours+60*Minutes+Seconds;
TimeOfDayInSeconds =
0.97913
1.02198
1.07403
1.66369
1.63954
2.23210
2.25635
2.30673
2.62077
2.65849
2.73127
2.78749
2.81164
If you didn't do type conversion from int32 to double, Octave would truncate the values to integers. MATLAB however does not even allow the sum between integer and double arrays.
Hope this helps.
Looks like you can do something like:
m = textscan(fid, '%d/%d/%d %d:%d:%f %f %f %f %d', 'delimiter', ',');
...which works, although it splits the time value that I wanted across several separate variables. Better than nothing. If anyone has suggestions on how to concatenate those variables afterwards, I would love to read more. Thanks!