I have a dataframe timings as follows:
start_ms end_ms
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z
I'm trying to calculate the time difference between the start_ms and end_ms of each row in milliseconds, i.e. I wish to get the result
start_ms end_ms diff
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z 0
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z 10
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z 18
I can convert the timestamps to datetime column by column, but I'm not sure whether the order of the values is retained.
start_ms_time = pd.to_datetime(timings['start_ms'])
end_ms_time = pd.to_datetime(timings['end_ms'])
Is it possible to convert the timestamps to datetime inside timings, and add the time difference column? Do I even need to convert to get the difference? How do I calculate the time difference in milliseconds?
Subtract the columns with Series.sub and then use Series.dt.components — note that .components.milliseconds returns only the milliseconds component (0–999), which works here because every difference is below one second:
start_ms_time = pd.to_datetime(timings['start_ms'])
end_ms_time = pd.to_datetime(timings['end_ms'])
timings['diff'] = end_ms_time.sub(start_ms_time).dt.components.milliseconds
print (timings)
start_ms end_ms diff
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z 0
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z 10
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z 18
Or use Series.dt.total_seconds, multiply by 1000 and cast to integers:
timings['diff'] = end_ms_time.sub(start_ms_time).dt.total_seconds().mul(1000).astype(int)
print (timings)
start_ms end_ms diff
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z 0
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z 10
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z 18
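To answer the in-place part of the question: yes, you can overwrite the string columns inside timings and compute the difference there. A minimal sketch, with the data recreated from the question:
import pandas as pd

timings = pd.DataFrame({
    'start_ms': ['2020-09-01T08:11:19.336Z', '2020-09-01T08:11:20.652Z',
                 '2020-09-01T08:11:20.670Z'],
    'end_ms':   ['2020-09-01T08:11:19.336Z', '2020-09-01T08:11:20.662Z',
                 '2020-09-01T08:11:20.688Z'],
})

# Convert the string columns to datetimes in place
timings['start_ms'] = pd.to_datetime(timings['start_ms'])
timings['end_ms'] = pd.to_datetime(timings['end_ms'])

# Dividing a timedelta by a 1 ms timedelta gives the difference in milliseconds
timings['diff'] = (timings['end_ms'] - timings['start_ms']) / pd.Timedelta(milliseconds=1)
print(timings)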
I want to convert my datetime object into seconds
0 49:36.5
1 50:13.7
2 50:35.8
3 50:37.4
4 50:39.3
...
92 1:00:47.8
93 1:01:07.7
94 1:02:15.3
95 1:05:03.0
96 1:05:29.6
Name: Finish, Length: 97, dtype: object
The problem is that the format changes at index 92, which results in an error:
ValueError: expected hh:mm:ss format before .
This error occurs when I try to convert the column to seconds:
filt_data["F"] = pd.to_timedelta('00:'+filt_data["Finish"]).dt.total_seconds()
When I do the conversion in two steps it works, but it results in two different columns which I don't know how to merge, nor does it seem very efficient:
filt_data["F1"] = pd.to_timedelta('00:'+filt_data["Finish"].loc[0:89]).dt.total_seconds()
filt_data["F2"] = pd.to_timedelta('0'+filt_data["Finish"].loc[90:97]).dt.total_seconds()
The above code does not cause any error and gets the job done, but results in two different columns. Any idea how to do this?
Ideally I would like to loop through the column and, based on the format (i.e. "50:39.3" vs "1:00:47.8"), prepend "00:" or "0" to the value.
I would use str.replace:
pd.to_timedelta(df['Finish'].str.replace(r'^(\d+:\d+\.\d+)', r'0:\1', regex=True))
Or str.count and map:
pd.to_timedelta(df['Finish'].str.count(':').map({1: '0:', 2: ''}).add(df['Finish']))
Output:
0 0 days 00:49:36.500000
1 0 days 00:50:13.700000
2 0 days 00:50:35.800000
3 0 days 00:50:37.400000
4 0 days 00:50:39.300000
...
92 0 days 01:00:47.800000
93 0 days 01:01:07.700000
94 0 days 01:02:15.300000
95 0 days 01:05:03
96 0 days 01:05:29.600000
Name: Finish, dtype: timedelta64[ns]
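Since the goal was seconds, either variant can be chained with .dt.total_seconds(), e.g.:
df['F'] = pd.to_timedelta(
    df['Finish'].str.replace(r'^(\d+:\d+\.\d+)', r'0:\1', regex=True)
).dt.total_seconds()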
Given your data:
import pandas as pd
times = [
"49:36.5",
"50:13.7",
"50:35.8",
"50:37.4",
"50:39.3",
"1:00:47.8",
"1:01:07.7",
"1:02:15.3",
"1:05:03.0",
"1:05:29.6",
]
df = pd.DataFrame({'time': times})
df
You can write a function that you apply to each entry in the time column:
def format_time(time):
    # Drop the fractional seconds; only whole seconds are kept downstream
    time = time.split('.')[0]
    time = time.split(':')
    # Pad a missing hours field so every value becomes h:mm:ss
    if len(time) < 3:
        time.insert(0, "0")
    return ":".join(time)
df["formatted_time"] = df.time.apply(format_time)
df
Then you could undertake two steps:
Convert column to datetime
Convert column to UNIX timestamp (number of seconds since 1970-01-01)
df["time_datetime"] = pd.to_datetime(df.formatted_time, infer_datetime_format=True)
df["time_seconds"] = (df.time_datetime - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
df
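One caveat with this route: pd.to_datetime on a time-only string typically attaches the current date, so the epoch subtraction yields seconds since 1970 rather than the duration itself. Parsing the formatted strings directly as timedeltas sidesteps that:
# Parse h:mm:ss strings as durations and take their length in seconds
df["time_seconds"] = pd.to_timedelta(df.formatted_time).dt.total_seconds()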
I am trying to calculate the mean length of the intervals without sales of a product.
I thought that a good way to get this is:
Count(days without selling) / Count(intervals of consecutive days without selling)
Units Sold
0 1
1 4
2 0
3 0
4 0
5 7
6 0
7 0
8 0
9 0
10 1
11 0
In this example I had:
8 days without selling
3 Intervals of consecutive days without selling
So, 8/3 ≈ 2.67 should be my result.
To count the days with no units sold I am using this:
(x['Units Sold'] == 0).sum()
However, I haven't figured out a good approach to calculate the 'intervals of consecutive days without selling' in an efficient way (considering I will run this on multiple products).
Another approach using nunique
s = df["Units Sold"].eq(0)
d = s.sum()
i = s[s].index.to_series().diff().ne(1).cumsum().nunique()
final = d/i # 2.6666666666666665
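A quick walkthrough of that one-liner on the question's data, assuming the column is named as shown:
import pandas as pd

df = pd.DataFrame({"Units Sold": [1, 4, 0, 0, 0, 7, 0, 0, 0, 0, 1, 0]})

s = df["Units Sold"].eq(0)                           # True on zero-sales days
runs = s[s].index.to_series().diff().ne(1).cumsum()  # label each run of consecutive zero days
print(runs.tolist())             # [1, 1, 1, 2, 2, 2, 2, 3] -> 3 distinct runs
print(s.sum() / runs.nunique())  # 2.6666666666666665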
Using eq, cumsum and diff
First we use eq(0) and sum to count the days on which nothing was sold.
Then we take the cumsum of those zero-day flags and diff it: the difference is 0 exactly on the days that did have sales. Combined with a mask that checks whether the next day is a zero day, each True marks a sales day immediately followed by a zero-sales day, i.e. the start of an interval.
days = x['Units Sold'].eq(0).sum()
intervals = x['Units Sold'].eq(0).cumsum().diff().eq(0)  # True on days with sales
mask = x['Units Sold'].shift(-1).eq(0)                   # True when the next day has no sales
days / (intervals & mask).sum()
Output
2.6666666666666665
You already know how to get the count of zeros, so try this to find the number of consecutive groups of 0:
s = df['Units Sold'].eq(0)
(s & ~s.shift(fill_value=False)).sum()
Out[567]: 3
You can use:
df.eq(0).sum()/((df.eq(0)&df.shift().ne(0)).sum())
Output:
Units Sold 2.666667
dtype: float64
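For the multiple-products case mentioned in the question, the same logic drops into a groupby. A sketch assuming a hypothetical "Product" column alongside "Units Sold":
def mean_zero_run(s):
    # mean length of the runs of consecutive zero-sales days for one product
    zero = s.eq(0)
    starts = (zero & ~zero.shift(fill_value=False)).sum()  # number of runs
    return zero.sum() / starts if starts else 0.0

result = df.groupby("Product")["Units Sold"].apply(mean_zero_run)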
The rolling function in Pandas can only calculate rolling statistics according to row counts or date/time columns. But I want to use a discrete time column for calculating a rolling sum, something like this:
key time value
A 1 10
A 2 20
A 4 30
A 7 10
B 1 15
B 2 30
B 3 15
I want to first group by key, then calculate the rolling sum of value over the nearest 3 time units:
key time value output
A 1 10 10
A 2 20 30 (10+20)
A 4 30 60 (10+20+30)
A 7 10 40 (30+10)
B 1 15 15
B 2 30 45
B 3 15 60
I tried this:
grouped = input.groupby("key", as_index=False)
for name, group in grouped:
    group = group.sort_values("time")
    time = list(group["time"])
    value = list(group["value"])
    # calcRollingStat is a custom function that outputs a list of corresponding results
    out = calcRollingStat(time, value, mode="avg")
    group["output"] = out  # out is a list
But then I don't know how to convert grouped back into a DataFrame. Pandas tells me that there is no reset_index attribute in grouped.
Is my code the best method to do this? How would you tackle this problem?
Thank you!
I believe you can use GroupBy.apply with a custom function:
def f(group):
    group = group.sort_values("time")
    time = list(group["time"])
    value = list(group["value"])
    # calcRollingStat is a custom function that outputs a list of corresponding results
    group["output"] = calcRollingStat(time, value, mode="avg")
    return group

df = input.groupby("key", as_index=False).apply(f)
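If calcRollingStat only implements a time-windowed sum, an alternative (not from the answer above) is to move the discrete time column into a timedelta index, which makes pandas' offset-based rolling available. A sketch using closed="both" to reproduce the [t-3, t] windows implied by the expected output:
import pandas as pd

df = pd.DataFrame({
    "key":   ["A", "A", "A", "A", "B", "B", "B"],
    "time":  [1, 2, 4, 7, 1, 2, 3],
    "value": [10, 20, 30, 10, 15, 30, 15],
})

# A timedelta index lets rolling() take a "3s" offset window per group
df = df.set_index(pd.to_timedelta(df["time"], unit="s"))
df["output"] = df.groupby("key")["value"].transform(
    lambda s: s.rolling("3s", closed="both").sum()
)
print(df.reset_index(drop=True))  # A: 10, 30, 60, 40 / B: 15, 45, 60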
I have a dataframe with ordered times (in seconds) and a column that is either 0 or 1:
time bit
index
0 0.24 0
1 0.245 0
2 0.47 1
3 0.471 1
4 0.479 0
5 0.58 1
... ... ...
I want to select those rows where the time difference is, let's say <0.01 s. But only those differences between rows with bit 1 and bit 0. So in the above example I would only select row 3 and 4 (or any one of them). I thought that I would calculate the diff() of the time column. But I need to somehow select on the 0/1 bit.
Coming from the future to answer this one. You can write a function that finds the indices of the rows meeting the condition and returns those rows together with the row just before each of them:
def filter_(df, threshold=0.01):
    # rows whose time gap to the previous row is below the threshold
    # and whose bit differs from the previous row's bit
    indices = df.index[(df.time.diff() < threshold) & (df.bit.diff().abs() == 1)]
    mask = indices.union(indices - 1)  # keep each match plus its predecessor
    return df.loc[mask]

print(filter_(df, threshold=0.01))
Output:
time bit
3 0.471 1
4 0.479 0
I have a question about SQL Server 2008 R2.
I have the following data (see the linked screenshot, "SQL Query and Output").
I am trying to calculate the difference in time between consecutive records where wartosc = 1 and wartosc = 0. I then want to sum this difference for all records.
For instance in my example:
Row 1 has wartosc = 1 and czas = 2016-07-14 22:01:36 and
Row 2 has wartosc = 0 and czas = 2016-07-14 22:02:06
I would like to find the time difference between Row 1 and Row 2. I would like to do this for all records and sum the resulting time differences.
I already pull the rows where the value changes; now I must subtract each wartosc = 1 time from the following wartosc = 0 time, for example:
(2016-07-14 22:02:06 - 2016-07-14 22:01:36) + (2016-07-14 22:04:56 - 2016-07-14 22:02:11) + (2016-07-14 22:17:01 - 2016-07-14 22:14:56) + ...
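For what it's worth, since the rest of this collection is pandas: the same pair-and-sum can be sketched there, on a hypothetical frame built from the values quoted above (this is a pandas sketch, not an answer to the T-SQL question itself):
import pandas as pd

df = pd.DataFrame({
    "czas": pd.to_datetime([
        "2016-07-14 22:01:36", "2016-07-14 22:02:06",
        "2016-07-14 22:02:11", "2016-07-14 22:04:56",
    ]),
    "wartosc": [1, 0, 1, 0],
})

# Time from each wartosc = 1 row to the following wartosc = 0 row, summed
gaps = df["czas"].diff()[df["wartosc"].eq(0) & df["wartosc"].shift().eq(1)]
print(gaps.sum())  # 0 days 00:03:15 for this fragment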