Pandas - Take value n months before

I am working with datetime data. Is there any way to get the value from n months before?
For example, the data looks like:
import numpy as np
import pandas as pd

dft = pd.DataFrame(
    np.random.randn(100, 1),
    columns=["A"],
    index=pd.date_range("20130101", periods=100, freq="M"),
)
dft
Then:
For every July, take the value of December of the previous year and carry it through to June of the next year.
For the remaining months (August of this year through June of next year), take the previous month's value.
For example, the values from Jul-2000 through Jun-2001 will all be the same, equal to the value of Dec-1999.
What I've been trying to do is:
dft['B'] = np.where(dft.index.month == 7,
                    dft['A'].shift(7, freq='M'),
                    dft['A'].shift(1, freq='M'))
However, the result is simply a copy of column A, and I don't know why. But when I tried a single line of code:
dft['C'] = dft['A'].shift(7, freq='M')
everything is shifted as expected. I don't know what the issue is here.

The issue is index alignment. The shifts you performed act on the index (because of freq='M'), but numpy.where converts the Series to plain arrays and loses the index.
Use pandas' where or mask instead; everything will remain a Series and the index will be preserved:
dft['B'] = (dft['A'].shift(1, freq='M')
            .mask(dft.index.month == 7, dft['A'].shift(7, freq='M'))
           )
output:
A B
2013-01-31 -2.202668 NaN
2013-02-28 0.878792 -2.202668
2013-03-31 -0.982540 0.878792
2013-04-30 0.119029 -0.982540
2013-05-31 -0.119644 0.119029
2013-06-30 -1.038124 -0.119644
2013-07-31 0.177794 -1.038124
2013-08-31 0.206593 -2.202668 <- correct
2013-09-30 0.188426 0.206593
2013-10-31 0.764086 0.188426
... ... ...
2020-12-31 1.382249 -1.413214
2021-01-31 -0.303696 1.382249
2021-02-28 -1.622287 -0.303696
2021-03-31 -0.763898 -1.622287
2021-04-30 0.420844 -0.763898
[100 rows x 2 columns]
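To see the alignment difference in isolation, here is a minimal sketch with made-up data (the three-value series s is hypothetical, not from the question):
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0],
              index=pd.date_range("2013-01-31", periods=3, freq="M"))
shifted = s.shift(1, freq="M")      # same values, index moved one month later

# numpy.where drops the index and combines the arrays positionally,
# so each position just gets its own value back -- a copy of s:
print(np.where(s > 1, shifted, s))  # [1. 2. 3.]

# Series.mask aligns `shifted` on the date index before combining,
# so Feb and Mar correctly receive the previous month's value:
print(s.mask(s > 1, shifted))       # values 1.0, 1.0, 2.0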

Related

pandas merge not working left_on, right_on method

I'm trying to change a column heading but pandas won't let me. It has changed one of the columns I asked for, but not the other.
new_df2 = new_df2.rename(columns={'value': 'Euro to Dollar rate','date':'exchange rate date'})
output:
exchange rate date value
0 2020-12-31 1.2216
1 2020-12-31 1.2216
2 2020-12-31 1.2216
3 2020-12-31 1.2216
4 NaT NaN
The goal is to rename the 'value' column to 'Euro to Dollar rate'.
print(new_df2.columns.tolist()) returns: ['exchange rate date', ' value']
This is how I spotted the space character in ' value'. Removed it, and the rename works.
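If the stray whitespace might come from the data source, one defensive option (a sketch reusing the question's new_df2) is to strip all column labels before renaming:
# strip leading/trailing whitespace from every column label, then rename
new_df2.columns = new_df2.columns.str.strip()
new_df2 = new_df2.rename(columns={'value': 'Euro to Dollar rate',
                                  'date': 'exchange rate date'})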

Make a for loop for a dataframe to subtract dates and put it in a variable

I have a dataframe that looks like this, with a lot of products:
Product  Start Date
00001    2021/08/10
00002    2021/01/10
I want to make a loop that goes product by product, subtracting three months from the date and putting the result in a variable, something like this:
date = {}
for i in dataframe:
    date['3monthsbefore'] = i['start date'] - 3 months
    date['3monthsafter'] = i['start date'] + 3 months
    date['product'] = i['product']
    # "Another process with those variables"
And then concatenate all of this data in a dataframe. I'm a little bit lost.
I want to do this because I need to use those variables in another process, so is it possible to do this?
Using pandas, you usually don't need to loop over your DataFrame. In this case, you can get the 3 months before/after for all rows pretty simply using pd.DateOffset:
df["Start Date"] = pd.to_datetime(df["Start Date"])
df["3monthsbefore"] = df["Start Date"] - pd.DateOffset(months=3)
df["3monthsafter"] = df["Start Date"] + pd.DateOffset(months=3)
This gives:
Product Start Date 3monthsbefore 3monthsafter
0 00001 2021-08-10 2021-05-10 2021-11-10
1 00002 2021-01-10 2020-10-10 2021-04-10
Data:
df = pd.DataFrame({"Product": ["00001", "00002"], "Start Date": ["2021/08/10", "2021/01/10"]})

Pandas group by date and get count while removing duplicates

I have a data frame that looks like this:
maid date hour count
0 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14 13 2
1 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14 15 1
2 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-13 23 14
3 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-14 0 1
4 104010f8-5f57-4f7c-8ad9-5fc3ec0f9f39 2021-08-11 14 2
5 11947b4a-ccf8-48dc-a6a3-925836b3c520 2021-08-13 7 1
I am trying to get a count of maids for each date, in such a way that if a maid is included on day 1, I don't want to include it on any of the subsequent days. For example, 0589b8a3-9d33-4db4-b94a-834cc8f46106 is present on both the 13th and the 14th. I want to include the maid in the count for the 13th but not the 14th, as it is already counted on the 13th.
I have written the following code and it works for small data frames:
import pandas as pd

df = pd.read_csv('/home/ubuntu/uniqueSiteId.csv')
umaids = []
tdf = []
df['date'] = pd.to_datetime(df.date)
df = df.sort_values('date')
df = df[['maid', 'date']]
df = df.drop_duplicates(['maid', 'date'])
dts = df['date'].unique()
for dt in dts:
    if not umaids:
        df1 = df[df['date'] == dt]
        k = df1['maid'].unique()
        umaids.extend(k)
        dff = df1
        fdf = df1.values.tolist()
    elif umaids:
        dfs = df[df['date'] == dt]
        df2 = dfs[~dfs['maid'].isin(umaids)]
        umaids.extend(df2['maid'].unique())
        sdf = df2.values.tolist()
        tdf.append(sdf)
ftdf = [item for t in tdf for item in t]
ndf = fdf + ftdf
ndf = pd.DataFrame(ndf, columns=['maid', 'date'])
print(ndf)
Since I have thousands of data frames, and each is often more than a million rows, the above takes a long time to run. Is there a better way to do this?
The expected output is this:
maid date
0 104010f8-5f57-4f7c-8ad9-5fc3ec0f9f39 2021-08-11
1 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-13
2 11947b4a-ccf8-48dc-a6a3-925836b3c520 2021-08-13
3 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14
As per the discussion in the comments, the solution is quite simple: sort the dataframe by date and then drop duplicates by maid only. This keeps the first occurrence of each maid, which is also the first occurrence in time since we sorted by date. Then do the groupby as usual.
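A minimal sketch of that approach, assuming df is the frame from the question (the names first_seen and counts are placeholders):
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
first_seen = (df.sort_values('date')
                .drop_duplicates('maid'))       # first date each maid appears
ndf = first_seen[['maid', 'date']].reset_index(drop=True)
# per-date count of newly seen maids
counts = ndf.groupby('date')['maid'].count()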

How to compute the difference in monthly income for the same id

The dataframe below shows the monthly revenue of two shops (shop_id=11, shop_id=15) during the period of a few years:
data = {'shop_id': [11, 15, 15, 15, 11, 11],
        'month':   [1, 1, 2, 3, 2, 3],
        'year':    [2011, 2015, 2015, 2015, 2014, 2014],
        'revenue': [11000, 5000, 4500, 5500, 10000, 8000]}
df = pd.DataFrame(data)
df = df[['shop_id', 'month', 'year', 'revenue']]
display(df)
You can notice that shop_id=11 has only one entry in 2011 (January) and shop_id=15 has a few entries in 2015 (January, February, March). Nevertheless, it's interesting to note that the first shop has a few more entries in 2014:
I'm trying to optimize a custom function (used along with .apply()) that creates a new feature called diff_revenue: this feature shows the change in revenue from the previous month, for each shop.
I would like to offer some explanation of how some of the values in diff_revenue were generated:
The value in the first cell is 0 (red) because there is no previous information for shop_id=11;
The 2nd cell is also 0 (orange) for the same reason: there is no previous information for shop_id=15;
The 3rd cell is 500 (green), because the change from this shop's last entry (January 2015) to the current cell's revenue (February 2015) is 500 Trumps;
The 5th cell is 1000 (dark blue), because the change from this shop's last entry (January 2011) to the current cell's revenue (February 2014) was 1000 Trumps.
I'm no expert in Pandas and was wondering if the Pandas gods knew a better way. The DataFrame I have to work with is quite large (over 1M observations) and my current approach is too slow. I'm looking for a faster alternative, or maybe something more readable.
You more or less want to use Series.diff on the 'revenue' column, but you need to do a few additional things:
Sort to ensure your DataFrame is in chronological order (you can undo this later)
Perform a groupby on 'shop_id' to do group-level operations
Take the absolute value, since you don't want to distinguish between positive and negative changes
In terms of code:
# sort the values so they're in order when we perform a groupby
df = df.sort_values(by=['year', 'month'])
# perform a groupby on 'shop_id' and get the row-wise difference within each group
df['diff_revenue'] = df.groupby('shop_id')['revenue'].diff()
# fill NA as zero (no previous info), take absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].fillna(0).abs().astype('int')
# revert to original order
df = df.sort_index()
The resulting output:
shop_id month year revenue diff_revenue
0 11 1 2011 11000 0
1 15 1 2015 5000 0
2 15 2 2015 4500 500
3 15 3 2015 5500 1000
4 11 2 2014 10000 1000
5 11 3 2014 8000 2000
Edit
A slightly less straightforward solution, but maybe a bit more performant:
# sort the values so they're chronological order by shop_id
df = df.sort_values(by=['shop_id', 'year', 'month'])
# take the row-wise difference ignoring changes in shop_id
df['diff_revenue'] = df['revenue'].diff()
# zero out locations where shop_id changes (no previous info)
df.loc[df['shop_id'] != df['shop_id'].shift(), 'diff_revenue'] = 0
# Take the absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].abs().astype('int')
# revert to original order
df = df.sort_index()

find closest match within a vector to fill missing values using dplyr

A dummy dataset is:
data <- data.frame(
  group = c(1, 1, 1, 1, 1, 2),
  dates = as.Date(c("2005-01-01", "2006-05-01", "2007-05-01",
                    "2004-08-01", "2005-03-01", "2010-02-01")),
  value = c(10, 20, NA, 40, NA, 5)
)
For each group, the missing values need to be filled with the non-missing value corresponding to the nearest date within the same group. In case of a tie, pick either.
I am using dplyr. which.closest from the birk package needs a vector and a value; how can I look it up within a vector without writing loops? Even an SQL solution will do.
Any pointers to the solution?
Maybe something like: value = value[match(which.closest(dates, THISdate) & !is.na(value))]
Not sure how to specify THISdate.
Edit: The expected value vector should look like:
value = c(10,20,20,40,10,5)
Using knn1 (nearest neighbor) from the class package (which ships with R, so you don't need to install it) and dplyr, define an na.knn1 function that replaces each NA value in x with the non-NA x value having the closest time. Note that knn1 returns a factor, hence the as.numeric(as.character(...)) conversion.
library(class)

na.knn1 <- function(x, time) {
  is_na <- is.na(x)
  if (sum(is_na) == 0 || all(is_na)) return(x)
  train <- matrix(time[!is_na])
  test <- matrix(time[is_na])
  cl <- x[!is_na]
  x[is_na] <- as.numeric(as.character(knn1(train, test, cl)))
  x
}
data %>% mutate(value = na.knn1(value, dates))
giving:
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
Add an appropriate group_by if the intention was to do this by group.
You can try using sapply to find the closest values, since the x argument of which.closest only takes a single value.
First create a vect in which the dates whose values are missing are replaced with NA, and use it within the which.closest function:
library(birk)

vect <- replace(data$dates, which(is.na(data$value)), NA)
transform(data, value = value[sapply(dates, which.closest, vec = vect)])
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
If which.closest accepted a vector, there would be no need for sapply; but that is not the case.
Using the dplyr package:
library(birk)
library(dplyr)
data %>%
  mutate(vect = `is.na<-`(dates, is.na(value)),
         value = value[sapply(dates, which.closest, vect)]) %>%
  select(-vect)