Pandas: Date difference loop between columns with similiar names (ACD and ECD) - pandas

I'm working in Jupyter and have a large number of columns, many of them dates. I want to create a loop that will return a new column with the date difference between two similarly-named columns.
For example:
df['Site Visit ACD']
df['Site Visit ECD']
df['Sold ACD (Loc A)']
df['Sold ECD (Loc A)']
The new column will have a column df['Site Visit Cycle Time'] = date difference between ACD and ECD. Generally, it will always be the column that contains "ACD" minus the column that contains "ECD". How can I write this?
Any help appreciated!

The following code will do the following:
Find columns that are similar (over 90 ratio fuzz using fuzzywuzzy package)
Perform the date comparison (or time)
Avoid the same computation to be performed on both sides
get the name 'Site Visit' if the column is called more or less like that
get the name 'difference between 'column 1' and 'column 2' if it is called differently
I hope it helps.
import pandas as pd
from fuzzywuzzy import fuzz
name = pd.read_excel('Book1.xlsx', sheet_name='name')
unique = []
for i in name.columns:
for j in name.columns:
if i != j and fuzz.ratio(i, j) > 90 and i+j not in unique:
if 'Site Visit' in i:
name['Site Visit'] = name[i] - name[j]
else:
name['difference between '+i+' and '+j] = name[i] - name[j]
unique.append(j+i)
unique.append(i+j)
print(name)

Generally, it will always be the column that contains "ACD" minus the column that contains "ECD".
This answer assumes the column titles are not noisy, i.e. they only differ in "ACD" / "ECD" and are exactly the same apart from that (upper/lower case included). Also assuming that there always is a matching column. This code doesn't check if it overwrites the column it writes the date difference to.
This approach works in linear time, as we iterate over the set of columns once and directly access the matching column by name.
test.csv
Site Visit ECD,Site Visit ACD,Sold ECD (Loc A),Sold ACD (Loc A)
2018-06-01,2018-06-04,2018-07-05,2018-07-06
2017-02-22,2017-03-02,2017-02-27,2017-03-02
Code
import pandas as pd
df = pd.read_csv("test.csv", delimiter=",")
for col_name_acd in df.columns:
# Skip columns that don't have "ACD" in their name
if "ACD" not in col_name_acd: continue
col_name_ecd = col_name_acd.replace("ACD", "ECD")
# we assume there is always a matching "ECD" column
assert col_name_ecd in df.columns
col_name_diff = col_name_acd.replace("ACD", "Cycle Time")
df[col_name_diff] = df[col_name_acd].astype('datetime64[ns]') - df[col_name_ecd].astype('datetime64[ns]')
print(df.head())
Output
Site Visit ECD Site Visit ACD Sold ECD (Loc A) Sold ACD (Loc A) \
0 2018-06-01 2018-06-04 2018-07-05 2018-07-06
1 2017-02-22 2017-03-02 2017-02-27 2017-03-02
Site Visit Cycle Time Sold Cycle Time (Loc A)
0 3 days 1 days
1 8 days 3 days

Related

Numpy, Pandas Exercise

"[it just needs to be done using numpy and pandas.]"
Your task:
You are asked to write a function that applies ”slack time remaining” (STR) sequencing rule to a given collection of jobs. Although this rule has not been covered in class, application is very similar to critical ratio. You need to calculate STR value for all jobs and schedule the one with the lowest STR. Continue this until all jobs are scheduled. The STR values are calculated as follows:
STR = [Time Until Due Date] − [Processing Time]
If you have more than 1 job with the lowest STR, break ties with Earliest Due Date
(EDD) rule. If due dates are also the same, schedule the one arrived earlier (that means the one in the upper rows of the table.)
Your function will accept a single parameter as pandas dataframe:
Function Parameter:
df jobs: A pandas dataframe whose indexes are the names of the jobs. Jobs
are assumed to be arrived in the same day in the same order given in the dataframe. There
will be two data columns in the dataframe:
ˆ ”Processing Time”: Processing time required for the job
ˆ ”Due Date”: Time between arrival of the job and the due date of the job.
Output: Your function should return a list containing the correct sequence according to STR rule.
Example inputs and expected outputs:
Example Input Data:
Job Processing Time Due Date
A 2 7
B 8 16
C 4 4
D 10 17
E 5 15
F 12 18
Expected Output: [’C’, ’A’, ’F’, ’D’, ’B’, ’E’]
Assuming your input is a DataFrame - your function would be:
def str_list(df):
df = df.set_index('Job')
return (df['Due Date'] - df['Processing Time']).sort_values().index.tolist()

How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table?

I think I have a problem with time calculation.
I want to run this code on a DataFrame of 320 000 lines, 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
for j in index_data:
if data["clubid"][i] == data["clubid"][j]:
if data["win_bool"][i] == 1:
if (data["startdate"][i] >= data["startdate"][j]) & (
data["win_bool"][j] == 1
):
NW_tot[i] += 1
else:
if (data["startdate"][i] >= data["startdate"][j]) & (
data["win_bool"][j] == 0
):
NL_tot[i] += 1
The objective is to determine the number of wins and the number of losses from a given match taking into account the previous match, this for every clubid.
The problem is, I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame ( data[0:1000] ) I got a result in 13 seconds. This is why I think it's a time calculation problem.
I also tried to first use a groupby("clubid"), then do my for loop into every group but I drowned myself.
Something else that bothers me, I have at least 2 lines with the exact same date/hour, because I have at least two identical dates for 1 match. Because of this I can't put the date in index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd
dat = pd.DataFrame({
"win_bool":[0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
"clubid": [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
"date" : [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
"othercol":["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")

Pandas dataframe: grouping by unique identifier, checking conditions, and applying 1/0 to new column if condition is met/not met

I have a large dataset pertaining customer churn, where every customer has an unique identifier (encoded key). The dataset is a timeseries, where every customer has one row for every month they have been a customer, so both the date and customer-identifier column naturally contains duplicates. What I am trying to do is to add a new column (called 'churn') and set the column to 0 or 1 based on if it is that specific customer's last month as a customer or not.
I have tried numerous methods to do this, but each and every one fails, either do to tracebacks or they just don't work as intended. It should be noted that I am very new to both python and pandas, so please explain things like I'm five (lol).
I have tried using pandas groupby to group rows by the unique customer keys, and then checking conditions:
df2 = df2.groupby('customerid').assign(churn = [1 if date==max(date) else 0 for date in df2['date']])
which gives tracebacks because dataframegroupby object has no attribute assign.
I have also tried the following:
df2.sort_values(['date']).groupby('customerid').loc[df['date'] == max('date'), 'churn'] = 1
df2.sort_values(['date']).groupby('customerid').loc[df['date'] != max('date'), 'churn'] = 0
which gives a similar traceback, but due to the attribute loc
I have also tried using numpy methods, like the following:
df2['churn'] = df2.groupby(['customerid']).np.where(df2['date'] == max('date'), 1, 0)
which again gives tracebacks due to the dataframegroupby
and:
df2['churn'] = np.where((df2['date']==df2['date'].max()), 1, df2['churn'])
which does not give tracebacks, but does not work as intended, i.e. it applies 1 to the churn column for the max date for all rows, instead of the max date for the specific customerid - which in retrospect is completely understandable since customerid is not specified anywhere.
Any help/tips would be appreciated!
IIUC use GroupBy.transform with max for return maximal values per groups and compare with date column, last set 1,0 values by mask:
mask = df2['date'].eq(df2.groupby('customerid')['date'].transform('max'))
df2['churn'] = np.where(mask, 1, 0)
df2['churn'] = mask.astype(int)

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random id's from the user_id column have been selected, these id's can be repeated as "fvg" is repeated for tenant_id=2. I would like to have a threshold of not more than ten id's. This data is just a sample and has only 10 id's to start with, so generally any number much less than the total number of user_id's. This case say 1 less than total user_id's that belong to a tenant.
i tried first figuring out how to select random subset of varying length with
df.sample
new_column = df.user_id.sample(n=np.random.randint(1, 10)))
I am kinda lost after this, assigning it to my df results in Nan's, probably because they are of variable lengths. Please help.
Thanks.
per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10))))
it doesn't really matter what column you use for the apply since the variable is not used in the computation

Mapping column values to a combination of another csv file's information

I have a dataset that indicates date & time in 5-digit format: ddd + hm
ddd part starts from 2009 Jan 1. Since the data was collected only from then to 2-years period, its [min, max] would be [1, 365 x 2 = 730].
Data is observed in 30-min interval, making 24 hrs per day period to lengthen to 48 at max. So [min, max] for hm at [1, 48].
Following is the excerpt of daycode.csv file that contains ddd part of the daycode, matching date & hm part of the daycode, matching time.
And I think I agreed to not showing the dataset which is from ISSDA. So..I will just describe that the daycode in the File1.txt file reads like '63317'.
This link gave me a glimpse of how to approach this problem, and I was in the middle of putting up this code together..which of course won't work at this point.
consume = pd.read_csv("data/File1.txt", sep= ' ', encoding = "utf-8", names =['meter', 'daycode', 'val'])
df1= pd.read_csv("data/daycode.csv", encoding = "cp1252", names =['code', 'print'])
test = consume[consume['meter']==1048]
test['daycode'] = test['daycode'].map(df1.set_index('code')['print'])
plt.plot(test['daycode'], test['val'], '.')
plt.title('test of meter 1048')
plt.xlabel('daycode')
plt.ylabel('energy consumption [kWh]')
plt.show()
Not all units(thousands) have been observed at full length but 730 x 48 is a large combination to lay out on excel by hand. Tbh, not an elegant solution but I tried by dragging - it doesn't quite get it.
If I could read the first 3 digits of the column values and match with another file's column, 2 last digits with another column, then combine.. is there a way?
For the last 2 lines you can just do something like this
df['first_3_digits'] = df['col1'].map(lambda x: str(x)[:3])
df['last_2_digits'] = df['col1'].map(lambda x: str(x)[-2:])
for joining 2 dataframes
df3 = df.merge(df2,left_on=['first_3_digits','last_2_digits'],right_on=['col1_df2','col2_df2'],how='left')