I have a dataframe (df1) that has employee start and end times formatted as strings:
emp_id|Start|End
001|07:00:00|04:00:00
002|07:30:00|04:30:00
I want to add two hours to the Start and two hours to the End for a subset of employees, not all employees. I do this by taking a slice of the main dataframe into a separate dataframe (df2). I then update the values and need to merge the updated values back into the main dataframe (df1), where I will coerce them back to strings, as a method later in the code expects these values to be strings.
I tried doing this:
df1['Start'] = pd.to_datetime(df1.Start)
df1['End'] = pd.to_datetime(df1.End)
df2 = df1.sample(frac=0.1, replace=False, random_state=1) #takes a random 10% slice
df2['Start'] = df2['Start'] + timedelta(hours=2)
df2['End'] = df2['End'] + timedelta(hours=2)
df1.loc[df1.emp_id.isin(df2.emp_id), ['Start', 'End']] = df2[['Start', 'End']]
df1['Start'] = str(df1['Start'])
df1['End'] = str(df1['End'])
I'm getting a TypeError: addition/subtraction of integers and integer arrays with DateTimeArray is no longer supported. How do I do this in Python3?
You can use .applymap() on the Start and End columns of your selected subset. Hour addition can be done by string extraction and substitution.
Code
df1 = pd.DataFrame({
"emp_id": ['001', '002'],
"Start": ['07:00:00', '07:30:00'],
"End": ['04:00:00', '04:30:00'],
})
# a subset of employee id
set_id = set(['002'])
# locate the subset
mask = df1["emp_id"].isin(set_id)
# apply hour addition
df1.loc[mask, ["Start", "End"]] = df1.loc[mask, ["Start", "End"]].applymap(lambda el: f"{int(el[:2])+2:02}{el[2:]}")
Result
print(df1)
emp_id Start End
0 001 07:00:00 04:00:00
1 002 09:30:00 06:30:00 <- 2 hrs were added
Note: f-strings require python 3.6+. For earlier versions, replace the f-string with
"%02d%s" % (int(el[:2])+2, el[2:])
Note: mind corner cases (times later than 22:00, which would roll past 24 hours) if they exist; a datetime-based sketch that wraps past midnight follows below.
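If you would rather stay on the datetime route from the original question (and want times later than 22:00 to wrap around midnight), here is a minimal sketch, reusing the df1 and mask defined above; pd.to_datetime with an explicit format plus .dt.strftime coerces the result straight back to 'HH:MM:SS' strings:
# Minimal sketch of the datetime route; assumes the same df1 and mask as above.
# strftime wraps past midnight, e.g. 23:30:00 + 2 h -> 01:30:00.
for col in ["Start", "End"]:
    shifted = pd.to_datetime(df1.loc[mask, col], format="%H:%M:%S") + pd.Timedelta(hours=2)
    df1.loc[mask, col] = shifted.dt.strftime("%H:%M:%S")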
Related
I am looking to perform a fast operation on flightradar data to check whether the speed implied by the distance travelled matches the reported speed. I have multiple flights and was told not to run double loops on pandas dataframes. Here is a sample dataframe:
import pandas as pd
from datetime import datetime
from shapely.geometry import Point
from geopy.distance import distance
dates = ['2020-12-26 15:13:01', '2020-12-26 15:13:07','2020-12-26 15:13:19','2020-12-26 15:13:32','2020-12-26 15:13:38']
datetimes = [datetime.fromisoformat(date) for date in dates]
data = {'UTC': datetimes,
'Callsign': ["1", "1","2","2","2"],
'Position':[Point(30.542175,-91.13999200000001), Point(30.546204,-91.14020499999999),Point(30.551443,-91.14417299999999),Point(30.553909,-91.15136699999999),Point(30.554489,-91.155075)]
}
df = pd.DataFrame(data)
What I want to do is add a new column called "dist". This column will be 0 if it is the first element of a new callsign but if not it will be the distance between a point and the previous point.
The resulting df should look like this:
df1 = df
dist = [0,0.27783309075379214,0,0.46131362750613436,0.22464461718704595]
df1['dist'] = dist
What I have tried is to first assign a group index:
df['group_index'] = df.groupby('Callsign').cumcount()
Then groupby
Then try and apply the function:
df['dist'] = df.groupby('Callsign').apply(
    lambda g: 0 if g.group_index == 0
    else distance((g.Position.x, g.Position.y),
                  (g.Position.shift().x, g.Position.shift().y)).miles)
I was hoping this would give me the 0 for the first index of each group and then run the distance function on all others and return a value in miles. However it does not work.
The code errors out for at least one reason: the .x and .y attributes are being called on the whole Series rather than on an individual shapely Point object.
Any ideas on how to fix this would be much appreciated.
1. Sort df by Callsign, then by timestamp
2. Compute distances between adjacent rows using a temporary column of shifted points
3. For the first row of each new Callsign, set the distance to 0
4. Drop the temporary column
df = df.sort_values(by=['Callsign', 'UTC'])
df['Position_prev'] = df['Position'].shift().bfill()
def get_dist(row):
    return distance((row['Position'].x, row['Position'].y),
                    (row['Position_prev'].x, row['Position_prev'].y)).miles
df['dist'] = df.apply(get_dist, axis=1)
# Flag row if callsign is different from previous row callsign
new_callsign_rows = df['Callsign'] != df['Callsign'].shift()
# Zero out the first distance of each callsign group
df.loc[new_callsign_rows, 'dist'] = 0.0
# Drop shifted column
df = df.drop(columns='Position_prev')
print(df)
UTC Callsign Position dist
0 2020-12-26 15:13:01 1 POINT (30.542175 -91.13999200000001) 0.000000
1 2020-12-26 15:13:07 1 POINT (30.546204 -91.14020499999999) 0.277833
2 2020-12-26 15:13:19 2 POINT (30.551443 -91.14417299999999) 0.000000
3 2020-12-26 15:13:32 2 POINT (30.553909 -91.15136699999999) 0.461314
4 2020-12-26 15:13:38 2 POINT (30.554489 -91.155075) 0.224645
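For what it's worth, here is a groupby-based sketch of the same computation (reusing the distance import from the question; the helper name pair_dist is mine, not part of any library). Shifting the points inside each Callsign group avoids the temporary back-filled column and the manual zeroing step:
ordered = df.sort_values(['Callsign', 'UTC'])
# previous point within each callsign; NaN for the first row of each group
prev = ordered.groupby('Callsign')['Position'].shift()

def pair_dist(row):
    if pd.isna(row['prev']):      # first point of a callsign -> distance 0
        return 0.0
    return distance((row['Position'].x, row['Position'].y),
                    (row['prev'].x, row['prev'].y)).miles

df['dist'] = ordered.assign(prev=prev).apply(pair_dist, axis=1)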
I have a dataframe whose Text column contains lists; let us call it df1:
Text
-------
["good", "job", "we", "are", "so", "proud"]
["it", "was", "his", "honor", "as", "well", "as", "guilty"]
And also another dataframe, df2:
Word Value
-------------
good 7.47
proud 8.03
honor 7.66
guilty 2.63
I want to use apply plus a lambda function to create df1['score'], where each value is derived by aggregating the words in that row's list that are found in df2's Word column. Currently, this is my code:
def score(list_word):
    sum = count = mean = sd = 0
    for word in list_word:
        if word in df2['Word']:
            sum = sum + df2.loc[df2['Word'] == word, 'Value'].iloc[0]
            count = count + 1
    if count != 0:
        return sum/count
    else:
        return 0

df['score'] = df.apply(lambda x: score(x['words']), axis=1)
This is what I envision:
Score
-------
7.75 #average of good (7.47) and proud (8.03)
5.145 #average of honor (7.66) and guilty (2.63)
However, it seems x['words'] was not passed as a list object, and I do not know how to modify the score function to match the object type. I tried converting it with the tolist() method, but to no avail. Any help appreciated.
Given your df1 and df2, you can use explode with map. Note that explode requires pandas 0.25+.
# If the lists are stored as strings, bring them back to real lists first:
# import ast
# df1.Text = df1.Text.apply(ast.literal_eval)
s = df1.Text.explode().map(dict(zip(df2.Word, df2.Value))).mean(level=0)
0 7.750
1 5.145
Name: Text, dtype: float64
Update
df1.Text.explode().to_frame('Word').reset_index().merge(df2,how='left').groupby('index').mean()
Value
index
0 7.750
1 5.145
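Note that Series.mean(level=0) was deprecated in pandas 1.3 and removed in 2.0; on newer versions the equivalent groupby form (same df1/df2 as above) is:
# groupby(level=0) replaces the removed mean(level=0) on current pandas
s = (df1.Text.explode()
        .map(dict(zip(df2.Word, df2.Value)))
        .groupby(level=0)
        .mean())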
I'd like to know how to find the longest unbroken sequence of dates (formatted as 2016-11-27) in a publish_date column (dates are not the index, though I suppose they could be).
There are a number of stack overflow questions which are similar, but AFAICT all proposed answers return the size of the longest sequence, which is not what I'm after.
I want to know e.g. that the stretch from 2017-01-01 to 2017-06-01 had no missing dates and was the longest such streak.
Here is an example of how you can do this:
import pandas as pd
import datetime
# initialize data
data = {'a': [1,2,3,4,5,6,7],
'date': ['2017-01-01', '2017-01-03', '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-09', '2017-01-31']}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
# create a mask: 0 where a date directly follows the previous date, 1 where a new run starts
df['mask'] = 1
df.loc[df['date'] - datetime.timedelta(days=1) == df['date'].shift(), 'mask'] = 0
# cumulative sum turns the mask into run ids - each consecutive sequence gets its own number
df['mask'] = df['mask'].cumsum()
# find the id of the longest run and select its dates
res = df.loc[df['mask'] == df['mask'].value_counts().idxmax(), 'date']
# extract min and max dates if you need
min_date = res.min()
max_date = res.max()
# print result
print('min_date: {}'.format(min_date))
print('max_date: {}'.format(max_date))
print('result:')
print(res)
The result will be:
min_date: 2017-01-05 00:00:00
max_date: 2017-01-07 00:00:00
result:
2 2017-01-05
3 2017-01-06
4 2017-01-07
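If you want this packaged as something reusable, here is a small sketch of the same idea (the helper name longest_streak is mine, not from the answer) that returns the start and end of the longest run of consecutive days:
import pandas as pd

def longest_streak(dates):
    # return (start, end) of the longest run of consecutive calendar days
    s = pd.to_datetime(pd.Series(dates)).sort_values().drop_duplicates()
    # a new run starts whenever the gap to the previous date is not exactly one day
    run_id = (s.diff() != pd.Timedelta(days=1)).cumsum()
    runs = s.groupby(run_id).agg(['min', 'max', 'count'])
    best = runs.loc[runs['count'].idxmax()]
    return best['min'], best['max']

print(longest_streak(df['date']))   # (Timestamp('2017-01-05'), Timestamp('2017-01-07'))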
I have a dataset of measured values and their corresponding timestamps in the format hh:mm:ss, where hh can be > 24 h.
For machine learning tasks, the data need to be interpolated, since there are multiple measured values with different timestamps.
For resampling and interpolation, I figured out that the dtype of the index should be datetime.
For further data-processing and machine learning tasks, I would need the timedelta format again.
Here is some code:
Res_cont = Res_cont.set_index('t_a')  # t_a is the column of timestamps for the measured variable a
# Then I need to switch to datetime format for resampling and interpolation,
# otherwise the resampled times are not like 00:15:00 but like 00:15:16, for example
Res_cont.index = pd.to_datetime(Res_cont.index)
# first upsample to seconds, then interpolate linearly, and finally downsample to 15-min steps
Res_cont = Res_cont.resample('s').interpolate(method='linear').resample('15T').asfreq().dropna()
Res_cont.index = pd.to_timedelta(Res_cont.index)  # here is where the error occurred
Unfortunately, I get the following error message:
FutureWarning: Passing datetime64-dtype data to TimedeltaIndex is
deprecated, will raise a TypeError in a future version Res_cont =
pd.to_timedelta(Res_cont.index)
So obviously there is a problem with the last line of my code. I would like to know how to change it to prevent a TypeError in future versions. Unfortunately, I don't have any idea how to fix it.
Maybe you can help?
EDIT: Here is some arbitrary sample data:
t_a = ['00:00:26', '00:16:16', '00:25:31', '00:36:14', '25:45:44']
a = [0, 1.3, 2.4, 3.8, 4.9]
Res_cont = pd.Series(data = a, index = t_a)
You can use DatetimeIndex.strftime to convert the output datetimes to HH:MM:SS format:
t_a = ['00:00:26', '00:16:16', '00:25:31', '00:36:14', '00:45:44']
a = [0, 1, 2, 3, 4]
Res_cont = pd.DataFrame({'t_a':t_a,'a':a})
print (Res_cont)
t_a a
0 00:00:26 0
1 00:16:16 1
2 00:25:31 2
3 00:36:14 3
4 00:45:44 4
Res_cont = Res_cont.set_index('t_a')
Res_cont.index = pd.to_datetime(Res_cont.index)
Res_cont=Res_cont.resample('s').interpolate(method='linear').resample('15T').asfreq().dropna()
Res_cont.index = pd.to_timedelta(Res_cont.index.strftime('%H:%M:%S'))
print (Res_cont)
a
00:15:00 0.920000
00:30:00 2.418351
00:45:00 3.922807
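Note that to_datetime cannot parse hours greater than 23, which is presumably why the '25:45:44' row was replaced in the sample above. If the raw timestamps really can exceed 24 h, a possible alternative sketch (assuming your pandas version supports resampling on a TimedeltaIndex, which it has for a while) is to parse them with pd.to_timedelta up front, so no round trip through datetime is needed:
import pandas as pd

# Sketch for timestamps whose hh part may exceed 24 (e.g. '25:45:44');
# pd.to_timedelta parses such strings directly, and the result already
# has the timedelta dtype the later processing steps need.
t_a = ['00:00:26', '00:16:16', '00:25:31', '00:36:14', '25:45:44']
a = [0, 1.3, 2.4, 3.8, 4.9]

Res_cont = pd.DataFrame({'a': a}, index=pd.to_timedelta(t_a))
Res_cont = (Res_cont.resample('s').interpolate(method='linear')
                    .resample('15T').asfreq().dropna())
print(Res_cont.head())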
I come from an Excel background but I love pandas and it has truly made me more efficient. Unfortunately, I probably carry over some bad habits from Excel. I have three large files (between 2 million and 13 million rows each) which contain data on interactions which could be tied together, unfortunately, there is no unique key connecting the files. I am literally concatenating (Excel formula) 3 fields into one new column on all three files.
Three columns exist on each file, which I combine together (the other fields would be, for example, the reason for interaction on one file, the score on another file, and some other data on the third file, which I would like to tie back to a certain AgentID):
Date | CustomerID | AgentID
I edit my date format to be uniform on each file:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Date'] = df['Date'].apply(lambda x: x.date().strftime('%Y-%m-%d'))
Then I create a unique column (well, as unique as I can get it... sometimes the same customer interacts with the same agent on the same date, but this should be quite rare):
df['Unique'] = df['Date'].astype(str) + df['CustomerID'].astype(str) + df['AgentID'].astype(str)
I do the same steps for df2 and then:
combined = pd.merge(df, df2, how = 'left', on = 'Unique')
I typically send that to a new csv in case something crashes, gzip it, then read it again and do the same process again with the third file.
final = pd.merge(combined, df2, how = 'left', on = 'Unique')
As you can see, this takes time. I have to format the dates on each and then turn them into text, create an object column which adds to the filesize, and (due to the raw data issues themselves) drop duplicates so I don't accidentally inflate numbers. Is there a more efficient workflow for me to follow?
Instead of using on = 'Unique':
combined = pd.merge(df, df2, how = 'left', on = 'Unique')
you can pass a list of columns to the on keyword parameter:
combined = pd.merge(df, df2, how='left', on=['Date', 'CustomerID', 'AgentID'])
Pandas will correctly merge rows based on the triplet of values from the 'Date', 'CustomerID', 'AgentID' columns. This is safer (see below) and easier than building the Unique column.
For example,
import pandas as pd
import numpy as np
np.random.seed(2015)
df = pd.DataFrame({'Date': pd.to_datetime(['2000-1-1','2000-1-1','2000-1-2']),
'CustomerID':[1,1,2],
'AgentID':[10,10,11]})
df2 = df.copy()
df3 = df.copy()
L = len(df)
df['ABC'] = np.random.choice(list('ABC'), L)
df2['DEF'] = np.random.choice(list('DEF'), L)
df3['GHI'] = np.random.choice(list('GHI'), L)
df2 = df2.iloc[[0,2]]
combined = df
for x in [df2, df3]:
    combined = pd.merge(combined, x, how='left', on=['Date', 'CustomerID', 'AgentID'])
yields
In [200]: combined
Out[200]:
AgentID CustomerID Date ABC DEF GHI
0 10 1 2000-1-1 C F H
1 10 1 2000-1-1 C F G
2 10 1 2000-1-1 A F H
3 10 1 2000-1-1 A F G
4 11 2 2000-1-2 A F I
A cautionary note:
Adding the CustomerID to the AgentID to create a Unique ID could be problematic
-- particularly if neither has a fixed-width format.
For example, if CustomerID = '12' and AgentID = '34', then (ignoring the date, which causes no problem since it has a fixed width) Unique would be
'1234'. But if CustomerID = '1' and AgentID = '234', then Unique would
again equal '1234'. So the Unique IDs may be mixing entirely different
customer/agent pairs.
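If you do still need a single key column (for example to write an intermediate CSV), one simple sketch that avoids the collision is to put a delimiter between the parts (assuming the delimiter cannot occur inside the IDs):
# Delimiter-based key; '|' is assumed not to appear in CustomerID or AgentID.
df['Unique'] = (df['Date'].astype(str) + '|'
                + df['CustomerID'].astype(str) + '|'
                + df['AgentID'].astype(str))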
PS. It is a good idea to parse the date strings into date-like objects:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
Note that if you use
combined = pd.merge(combined, x, how='left', on=['Date','CustomerID', 'AgentID'])
it is not necessary to convert any of the columns back to strings.