convert to datetime based on condition - pandas

I want to convert my datetime column into seconds:
0 49:36.5
1 50:13.7
2 50:35.8
3 50:37.4
4 50:39.3
...
92 1:00:47.8
93 1:01:07.7
94 1:02:15.3
95 1:05:03.0
96 1:05:29.6
Name: Finish, Length: 97, dtype: object
The problem is that the format changes at index 92, which results in an error when I try to convert the column to seconds: ValueError: expected hh:mm:ss format before .
filt_data["F"] = pd.to_timedelta('00:'+filt_data["Finish"]).dt.total_seconds()
When I do the conversion in two steps it works, but it results in two different columns which I don't know how to merge, and it doesn't seem very efficient either:
filt_data["F1"] = pd.to_timedelta('00:'+filt_data["Finish"].loc[0:89]).dt.total_seconds()
filt_data["F2"] = pd.to_timedelta('0'+filt_data["Finish"].loc[90:97]).dt.total_seconds()
The above code does not cause any error and gets the job done, but it results in two different columns. Any idea how to do this?
Ideally I would like to loop through the column and, based on the format, i.e. "50:39.3" or "1:00:47.8", prepend "00:" or "0" to the value.

I would use str.replace:
pd.to_timedelta(df['Finish'].str.replace(r'^(\d+:\d+\.\d+)', r'0:\1', regex=True))
Or str.count and map (entries with one colon get '0:' prepended, entries with two colons get an empty prefix):
pd.to_timedelta(df['Finish'].str.count(':').map({1: '0:', 2: ''}).add(df['Finish']))
Output:
0 0 days 00:49:36.500000
1 0 days 00:50:13.700000
2 0 days 00:50:35.800000
3 0 days 00:50:37.400000
4 0 days 00:50:39.300000
...
92 0 days 01:00:47.800000
93 0 days 01:01:07.700000
94 0 days 01:02:15.300000
95 0 days 01:05:03
96 0 days 01:05:29.600000
Name: Finish, dtype: timedelta64[ns]

Given your data:
import pandas as pd
times = [
    "49:36.5",
    "50:13.7",
    "50:35.8",
    "50:37.4",
    "50:39.3",
    "1:00:47.8",
    "1:01:07.7",
    "1:02:15.3",
    "1:05:03.0",
    "1:05:29.6",
]
df = pd.DataFrame({'time': times})
df
You can write a function to apply to each entry in the time column (note that it drops the fractional seconds):
def format_time(time):
    time = time.split('.')[0]  # drop the fractional seconds
    time = time.split(':')
    if len(time) < 3:  # mm:ss entries need a "0" hour prepended
        time.insert(0, "0")
    return ":".join(time)
df["formatted_time"] = df.time.apply(format_time)
df
Then you could undertake two steps:
Convert column to datetime
Convert column to UNIX timestamp (number of seconds since 1970-01-01)
df["time_datetime"] = pd.to_datetime(df.formatted_time, infer_datetime_format=True)
df["time_seconds"] = (df.time_datetime - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
df

Related

Changing a column name and its values at the same time

Pandas help!
I have a specific column like this:
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename the 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks in advance.
-Tom
I managed to change the name of the column, but I could not do both simultaneously.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (litres per 100 km = 235.15 / mpg):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.

'Timestamp' object has no attribute 'dt' pandas

I have a dataset of 100,000 rows and 15 columns in a 10 MB CSV.
The column I am working on is a Date/Time column in string format.
Source code:
import pandas as pd
import datetime as dt
trupl = pd.DataFrame({'Time/Date' : ['12/2/2021 2:09','22/4/2021 21:09','22/6/2021 9:09']})
trupl['Time/Date'] = pd.to_datetime(trupl['Time/Date'])
print(trupl)
Output
Time/Date
0 2021-12-02 02:09:00
1 2021-04-22 21:09:00
2 2021-06-22 09:09:00
What I need to do is a bit confusing, but I'll try to make it simple:
If the time of day is between 12 AM and 8 AM, subtract one day from the Time/Date and put the new timestamp in a new column.
If not, put it in as it is.
Expected output
Time/Date Date_adjusted
0 12/2/2021 2:09 12/1/2021 2:09
1 22/4/2021 21:09 22/4/2021 21:09
2 22/6/2021 9:09 22/6/2021 9:09
I tried the below code :
trupl['Date_adjusted'] = trupl['Time/Date'].map(lambda x:x- dt.timedelta(days=1) if x >= dt.time(0,0,0) and x < dt.time(8,0,0) else x)
I get a TypeError: '>=' not supported between 'Timestamp' and 'datetime.time'
and when applying dt.time to x, I get an error "'Timestamp' object has no attribute 'dt'".
So how can I convert x to a time in order to compare it? Or is there a better workaround?
I searched a lot for a fix but couldn't find a similar case.
Try:
trupl['Date_adjusted'] = trupl['Time/Date'].map(lambda x: x - dt.timedelta(days=1) if (x.hour >= 0 and x.hour < 8) else x)

Is there a way to use cumsum with a threshold to create bins?

Is there a way to use numpy to add numbers in a series up to a threshold, then restart the counter? The intention is to form groups (for a groupby) based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
sumvals = np.frompyfunc(lambda a, b: a + b if a <= 100 else b, 2, 1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=object)  # np.object was removed in NumPy 1.24; use the builtin object
The use-case is to find the average price of every 75 sold amounts of the thing.
Solution #1. Interpreting "The use-case is to find the average price of every 75 sold amounts of the thing" one way gives the solution below. If you want this calculation done the "hard way" rather than with pd.cut, here is an approach that works well, but whose speed and memory use depend on the cumulative sum of the amount column (check it with df['amount'].cumsum()): np.repeat creates one row per unit of amount, so expect roughly 1 second per 10 million of the cumsum. The solution is still reasonable below ~10 million in cumsum (about 1 second) or even 100 million (about 10 seconds):
i = 75
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
g = df.index // i
df = df.groupby(g)['price'].mean()
df.index = (df.index * i).astype(str) + '-' + (df.index * i + 75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong, but keeping it just in case)
I do not believe you are trying to do it this way, which was my initial solution, but I will keep it here just in case, as you haven't included expected output. You can create a new series with cumsum, then use pd.cut and pass bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then, groupby the groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and change it to a string:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286

Calculate time difference in milliseconds using Pandas

I have a dataframe timings as follows:
start_ms end_ms
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z
I'm trying to calculate the time difference between the start_ms and end_ms of each row in milliseconds, i.e. I wish to get the result
start_ms end_ms diff
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z 0
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z 10
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z 18
I can convert the timestamps to datetime column by column, but I'm not sure if the order of the values is retained.
start_ms_time = pd.to_datetime(timings['start_ms'])
end_ms_time = pd.to_datetime(timings['end_ms'])
Is it possible to convert the timestamps to datetime inside timings, and add the time difference column? Do I even need to convert to get the difference? How do I calculate the time difference in milliseconds?
Subtract columns by Series.sub and then use Series.dt.components:
start_ms_time = pd.to_datetime(timings['start_ms'])
end_ms_time = pd.to_datetime(timings['end_ms'])
timings['diff'] = end_ms_time.sub(start_ms_time).dt.components.milliseconds
print (timings)
start_ms end_ms diff
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z 0
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z 10
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z 18
Or use Series.dt.total_seconds, multiply by 1000, and cast to integers (unlike .components.milliseconds above, which returns only the milliseconds component, this variant also works when differences reach a second or more):
timings['diff'] = end_ms_time.sub(start_ms_time).dt.total_seconds().mul(1000).astype(int)
print (timings)
start_ms end_ms diff
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z 0
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z 10
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z 18

pd.datetime is failing to convert to date

I have a data frame with a column 'Date' of string type. Since I want to use 'Date' as the index, I first want to convert it to datetime, so I did:
data['Date'] = pd.to_datetime(data['Date'])
then I did,
data = data.set_index('Date')
but when I tried to do
data = data.loc['01/06/2006':'09/06/2006',]
the slicing was not accomplished; no error was raised, but the slicing didn't occur. I tried with iloc:
data = data.iloc['01/06/2006':'09/06/2006',]
and the error message is the following:
TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [01/06/2006] of <type 'str'>
So I came to the conclusion that pd.to_datetime didn't work, even though no error was raised?
Can anybody clarify what is going on? Thanks in advance.
It seems you need to change the order of the datetime string to YYYY-MM-DD:
data = data.loc['2006-06-01':'2006-06-09']
Sample:
data = pd.DataFrame({'col':range(15)}, index=pd.date_range('2006-06-01','2006-06-15'))
print (data)
col
2006-06-01 0
2006-06-02 1
2006-06-03 2
2006-06-04 3
2006-06-05 4
2006-06-06 5
2006-06-07 6
2006-06-08 7
2006-06-09 8
2006-06-10 9
2006-06-11 10
2006-06-12 11
2006-06-13 12
2006-06-14 13
2006-06-15 14
data = data.loc['2006-06-01':'2006-06-09']
print (data)
col
2006-06-01 0
2006-06-02 1
2006-06-03 2
2006-06-04 3
2006-06-05 4
2006-06-06 5
2006-06-07 6
2006-06-08 7
2006-06-09 8
Since what I want is to create a new DataFrame with specific dates from the original DataFrame, I set the column 'Date' as the index:
data = data.set_index(data['Date'])
and then just create the new DataFrame using loc:
data1 = data.loc['01/06/2006':'09/06/2006']
I am quite new to Python and I thought that I needed to convert the string column 'Date' to datetime, but apparently that is not necessary. Thanks for your help @jezrael.