Make a for loop for a dataframe to subtract dates and put it in a variable - pandas

I have a dataframe that looks like this, with a lot of products:
Product    Start Date
00001      2021/08/10
00002      2021/01/10
I want to make a loop that goes from product to product, subtracting three months from the date and putting the result in a variable, something like this:
date[] = ''
for dataframe in i:
    date['3monthsbefore'] = i['start date'] - 3 months
    date['3monthsafter'] = i['start date'] + 3 months
    date['product'] = i['product']
    "Another process with those variables"
And then concatenate all of this data into a dataframe. I'm a little bit lost; I want to do this because I need to use those variables in another process. Is it possible to do this?

Using pandas, you usually don't need to loop over your DataFrame. In this case, you can get the 3 months before/after for all rows pretty simply using pd.DateOffset:
df["Start Date"] = pd.to_datetime(df["Start Date"])
df["3monthsbefore"] = df["Start Date"] - pd.DateOffset(months=3)
df["3monthsafter"] = df["Start Date"] + pd.DateOffset(months=3)
This gives:
Product Start Date 3monthsbefore 3monthsafter
0 00001 2021-08-10 2021-05-10 2021-11-10
1 00002 2021-01-10 2020-10-10 2021-04-10
Data:
df = pd.DataFrame({"Product": ["00001", "00002"], "Start Date": ["2021/08/10", "2021/01/10"]})
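If you then need to hand each row's values to another process, you can iterate over the computed columns afterwards. A minimal sketch, where print() stands in for whatever "another process" actually does:
import pandas as pd

df = pd.DataFrame({"Product": ["00001", "00002"], "Start Date": ["2021/08/10", "2021/01/10"]})
df["Start Date"] = pd.to_datetime(df["Start Date"])
df["3monthsbefore"] = df["Start Date"] - pd.DateOffset(months=3)
df["3monthsafter"] = df["Start Date"] + pd.DateOffset(months=3)

# print() stands in for the downstream process that consumes these values
for product, before, after in zip(df["Product"], df["3monthsbefore"], df["3monthsafter"]):
    print(product, before.date(), after.date())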

Related

Pandas: create a graph from Date and Time while they are in different columns

My data looks like this:
Creation Day Time St1 Time St2
0 28.01.2022 14:18:00 15:12:00
1 28.01.2022 14:35:00 16:01:00
2 29.01.2022 00:07:00 03:04:00
3 30.01.2022 17:03:00 22:12:00
It represents parts being at a given station. What I now need is something that counts how many rows have the same day and hour, e.g. how many parts were at the same station in a given hour. Here, 2 were at Station 1 on the 28th in the 14-15 timespan. In the end I want a bar graph that shows production speed, and later in the project I want to highlight parts that haven't moved for >2 hrs.
Is it practical to create a datetime object for every station (I have 5 in total)? Or is there a much simpler way to do this?
FYI, I import this data from an Excel sheet.
I found the solution. As they are just strings, I can simply concatenate them and parse the result with pd.to_datetime().
Example:
df["Time St1"] = pd.to_datetime(
df["Creation Day"] + ' ' + df["Time St1"],
infer_datetime_format=False, format='%d.%m.%Y %H:%M:%S'
)
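From there, the per-hour count the question asks about is a small groupby. A minimal sketch, assuming df["Time St1"] has been parsed as above:
# floor each timestamp to the hour and count rows per hour at Station 1
counts = df["Time St1"].dt.floor("h").value_counts().sort_index()
print(counts)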

Pandas: cumulative sum over 1 index but not the other 3

I have a dataframe with 4 variables, DIVISION, QTR, MODEL_SCORE, MONTH, with the sum of variable X aggregated by those 4.
I would like to effectively partition the data by DIVISION, QTR, and MODEL_SCORE and keep a running total ordered by the MONTH field, smallest to largest. The idea is that it would reset whenever it reaches a new permutation of the other 3 columns.
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I'm trying
df['cumsum'] = df.groupby(level=3)['X'].cumsum()
having tried every number I can think of in the level argument. It seems to work every way other than the one I want.
EDIT: I know the below isn't formatted ideally, but basically, as long as the only variable changing is MONTH, the cumulative sum should continue; a change in any other variable should cause it to reset.
DIVISION QTR MODEL MONTHS X  CUMSUM
A        1   1     1      10 10
A        1   1     2      20 30
A        1   2     1      5  5
I'm sorry for all the trouble; I believe the answer was way simpler than I was making it out to be. After
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I was supposed to reset the index, since I did not want a MultiIndex, and this appears to have worked:
df = df.reset_index()
df['cumsum'] = df.groupby(['DIVISION','MODEL','QTR'])['X'].cumsum()
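A self-contained sketch of that fix, rebuilt from the sample rows above:
import pandas as pd

df = pd.DataFrame({"DIVISION": ["A", "A", "A"],
                   "QTR": [1, 1, 1],
                   "MODEL": [1, 1, 2],
                   "MONTHS": [1, 2, 1],
                   "X": [10, 20, 5]})
df = df.groupby(['DIVISION', 'MODEL', 'QTR', 'MONTHS'])['X'].sum()
df = df.reset_index()  # flatten the MultiIndex back into columns
# the running total now resets whenever DIVISION/MODEL/QTR changes
df['cumsum'] = df.groupby(['DIVISION', 'MODEL', 'QTR'])['X'].cumsum()
print(df)  # cumsum: 10, 30, 5 as in the example above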

Pandas group by date and get count while removing duplicates

I have a data frame that looks like this:
maid date hour count
0 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14 13 2
1 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14 15 1
2 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-13 23 14
3 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-14 0 1
4 104010f8-5f57-4f7c-8ad9-5fc3ec0f9f39 2021-08-11 14 2
5 11947b4a-ccf8-48dc-a6a3-925836b3c520 2021-08-13 7 1
I am trying to get a count of maids for each date, in such a way that if a maid is included on day 1, I don't want to include it on any subsequent day. For example, 0589b8a3-9d33-4db4-b94a-834cc8f46106 is present on both the 13th and the 14th. I want to include that maid in the count for the 13th but not for the 14th, as it is already counted on the 13th.
I have written the following code and it works for small data frames:
import pandas as pd

df = pd.read_csv('/home/ubuntu/uniqueSiteId.csv')
umaids = []
tdf = []
df['date'] = pd.to_datetime(df.date)
df = df.sort_values('date')
df = df[['maid', 'date']]
df = df.drop_duplicates(['maid', 'date'])
dts = df['date'].unique()
for dt in dts:
    if not umaids:
        # first date: keep every maid
        df1 = df[df['date'] == dt]
        umaids.extend(df1['maid'].unique())
        fdf = df1.values.tolist()
    else:
        # later dates: keep only maids not seen before
        dfs = df[df['date'] == dt]
        df2 = dfs[~dfs['maid'].isin(umaids)]
        umaids.extend(df2['maid'].unique())
        tdf.append(df2.values.tolist())
ftdf = [item for t in tdf for item in t]
ndf = pd.DataFrame(fdf + ftdf, columns=['maid', 'date'])
print(ndf)
Since I have thousands of data frames, and my data frame is most often more than a million rows, the above takes a long time to run. Is there a better way to do this?
The expected output is this:
maid date
0 104010f8-5f57-4f7c-8ad9-5fc3ec0f9f39 2021-08-11
1 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-13
2 11947b4a-ccf8-48dc-a6a3-925836b3c520 2021-08-13
3 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14
As per the discussion in the comments, the solution is quite simple: sort the dataframe by date and then drop duplicates by maid only. This keeps the first occurrence of each maid, which also happens to be the first occurrence in time, since we sorted by date. Then do the groupby as usual.
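In code, a minimal sketch of that, assuming df is the frame shown in the question:
df['date'] = pd.to_datetime(df['date'])
# keep each maid's earliest date only
ndf = df.sort_values('date').drop_duplicates('maid')[['maid', 'date']].reset_index(drop=True)
print(ndf)
# count of newly seen maids per date
print(ndf.groupby('date').size())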

Pandas: Date difference loop between columns with similar names (ACD and ECD)

I'm working in Jupyter and have a large number of columns, many of them dates. I want to create a loop that will return a new column with the date difference between two similarly-named columns.
For example:
df['Site Visit ACD']
df['Site Visit ECD']
df['Sold ACD (Loc A)']
df['Sold ECD (Loc A)']
The new column, e.g. df['Site Visit Cycle Time'], would hold the date difference between the two. Generally, it will always be the column that contains "ACD" minus the column that contains "ECD". How can I write this?
Any help appreciated!
The following code will:
Find columns that are similar (fuzz ratio over 90, using the fuzzywuzzy package)
Perform the date (or time) comparison
Avoid performing the same computation twice for a given pair of columns
Name the result 'Site Visit' if the column is called more or less like that
Name the result 'difference between <column 1> and <column 2>' otherwise
I hope it helps.
import pandas as pd
from fuzzywuzzy import fuzz

name = pd.read_excel('Book1.xlsx', sheet_name='name')
unique = []
for i in name.columns:
    for j in name.columns:
        if i != j and fuzz.ratio(i, j) > 90 and i+j not in unique:
            if 'Site Visit' in i:
                name['Site Visit'] = name[i] - name[j]
            else:
                name['difference between '+i+' and '+j] = name[i] - name[j]
            unique.append(j+i)
            unique.append(i+j)
print(name)
Generally, it will always be the column that contains "ACD" minus the column that contains "ECD".
This answer assumes the column titles are not noisy, i.e. they only differ in "ACD" / "ECD" and are exactly the same apart from that (upper/lower case included). Also assuming that there always is a matching column. This code doesn't check if it overwrites the column it writes the date difference to.
This approach works in linear time, as we iterate over the set of columns once and directly access the matching column by name.
test.csv
Site Visit ECD,Site Visit ACD,Sold ECD (Loc A),Sold ACD (Loc A)
2018-06-01,2018-06-04,2018-07-05,2018-07-06
2017-02-22,2017-03-02,2017-02-27,2017-03-02
Code
import pandas as pd
df = pd.read_csv("test.csv", delimiter=",")
for col_name_acd in df.columns:
    # skip columns that don't have "ACD" in their name
    if "ACD" not in col_name_acd:
        continue
    col_name_ecd = col_name_acd.replace("ACD", "ECD")
    # we assume there is always a matching "ECD" column
    assert col_name_ecd in df.columns
    col_name_diff = col_name_acd.replace("ACD", "Cycle Time")
    df[col_name_diff] = df[col_name_acd].astype('datetime64[ns]') - df[col_name_ecd].astype('datetime64[ns]')
print(df.head())
Output
Site Visit ECD Site Visit ACD Sold ECD (Loc A) Sold ACD (Loc A) \
0 2018-06-01 2018-06-04 2018-07-05 2018-07-06
1 2017-02-22 2017-03-02 2017-02-27 2017-03-02
Site Visit Cycle Time Sold Cycle Time (Loc A)
0 3 days 1 days
1 8 days 3 days

Calculating Weekly Returns from Daily Time Series of Prices

I want to calculate weekly returns of a mutual fund from a time series of daily prices. My data looks like this:
A B C D E
DATE WEEK W.DAY MF.PRICE WEEKLY RETURN
02/01/12 1 1 2,7587
03/01/12 1 2 2,7667
04/01/12 1 3 2,7892
05/01/12 1 4 2,7666
06/01/12 1 5 2,7391 -0,007
09/01/12 2 1 2,7288
10/01/12 2 2 2,6707
11/01/12 2 3 2,7044
12/01/12 2 4 2,7183
13/01/12 2 5 2,7619 0,012
16/01/12 3 1 2,7470
17/01/12 3 2 2,7878
18/01/12 3 3 2,8156
19/01/12 3 4 2,8310
20/01/12 3 5 2,8760 0,047
The dates are in dd/mm/yy format and "," is the decimal separator. The return would be computed with this formula: (Price for last weekday - Price for first weekday)/(Price for first weekday). For example, the return for the first week is (2,7391 - 2,7587)/2,7587 = -0,007 and for the second (2,7619 - 2,7288)/2,7288 = 0,012.
The problem is that the list goes on for a year, and some weeks have fewer than five working days due to holidays or other reasons, so I can't simply copy and paste the formula above. I added the two extra columns for week number and weekday using the WEEKNUM and WEEKDAY functions, thinking they might help. I want to automate this with a formula or with VBA, hoping to get a table like this:
WEEK RETURN
1 -0,007
2 0,012
3 0,047
.
.
.
As I said, some weeks have fewer than five weekdays, and some start with weekday 2 or end with weekday 3, etc., due to holidays or other reasons. So I'm looking for a way to tell Excel to "find the prices that correspond to the max and min weekday of each week and apply the formula (Price for last weekday - Price for first weekday)/(Price for first weekday)".
Sorry for the long post; I tried to be as clear as possible and would appreciate any help! (I have 5 separate worksheets for consecutive years, each with daily prices of 20 mutual funds.)
To do it in one formula:
=(INDEX(D:D,AGGREGATE(15,6,ROW($D$2:$D$16)/(($C$2:$C$16=AGGREGATE(14,6,$C$2:$C$16/($B$2:$B$16=G2),1))*($B$2:$B$16=G2)),1))-INDEX(D:D,MATCH(G2,B:B,0)))/INDEX(D:D,MATCH(G2,B:B,0))
You may need to change all the , to ; per your local settings.
I would solve it using some lookup formulas to get the values for each week and then do a simple calculation for each week.
Resulting table:
H  I           J           K          L         M
   first       last        first val  last val  return
1  02.01.2012  06.01.2012  2,7587     2,7391    -0,007
2  09.01.2012  13.01.2012  2,7288     2,7619    0,012
3  16.01.2012  20.01.2012  2,747      2,876     0,047
Formula in column I:
=MINIFS($A:$A;$B:$B;$H2)
Formula in column J:
=MAXIFS($A:$A;$B:$B;$H2)
Formula in column K:
=VLOOKUP($I2;$A:$D;4;FALSE)
Formula in column L:
=VLOOKUP($J2;$A:$D;4;FALSE)
Formula in column M:
=(L2-K2)/K2
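As an aside, if this data ever ends up in pandas (as in the questions above), the same per-week calculation is a short groupby. A hedged sketch using the week-1 and week-2 prices from the sample, assuming rows are already in date order:
import pandas as pd

df = pd.DataFrame({
    "WEEK": [1]*5 + [2]*5,
    "PRICE": [2.7587, 2.7667, 2.7892, 2.7666, 2.7391,
              2.7288, 2.6707, 2.7044, 2.7183, 2.7619],
})
# (last price - first price) / first price within each week
returns = df.groupby("WEEK")["PRICE"].agg(lambda p: (p.iloc[-1] - p.iloc[0]) / p.iloc[0])
print(returns.round(3))  # WEEK 1: -0.007, WEEK 2: 0.012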