I want to insert a new column in pandas based on some sort of grouping condition - pandas

I want to add a column for duration which subtracts the min date from the max date for each company (CompanyName).

Use:
m1 = df.groupby('CompanyName')['Date'].transform('max')
m2 = df.groupby('CompanyName')['Date'].transform('min')
df['duration'] = (m1 - m2).dt.days
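The same result can also be reached with a single transform; here is a minimal sketch on made-up data (only the CompanyName/Date column names are taken from the question):

import pandas as pd

# minimal illustration on made-up data: per-company span in days in one transform
df = pd.DataFrame({
    'CompanyName': ['A', 'A', 'B', 'B'],
    'Date': pd.to_datetime(['2021-01-01', '2021-03-01', '2021-02-10', '2021-02-15']),
})
df['duration'] = (
    df.groupby('CompanyName')['Date'].transform(lambda s: s.max() - s.min()).dt.days
)
# company A -> 59 days on both of its rows, company B -> 5 days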

Related

Trying to mask a dataset based on multiple conditions

I am trying to mask a dataset I have based on two parameters:
1. Mask any station that has a repeated value more than once per hour.
   a) I want the count to reset once the clock hits a new hour.
2. Mask the data whenever the previous datapoint is lower than the current datapoint, if it is within the hour and the station names are the same.
The mask I applied to it is this:
mask = (
    (df['station'] == df['station'].shift(1))
    & (df['precip'] >= df['precip'].shift(1))
    & (abs(df['valid'] - df['valid'].shift(1)) < pd.Timedelta('1 hour'))
)
df.loc[mask, 'to_remove'] = True
However, it is not working properly and gives me a df that looks like this.
I want a dataframe that looks like this:
Basically you want to mask two things, the first being a duplicate value per station and hour. This can be found by grouping by station and the hour of valid, plus the precip column. On this groupby you can count the number of occurrences and check whether it is more than one:
df.groupby(
    [df.station, df['valid'].apply(lambda x: x.hour), df.precip]  # group by the columns
).precip.transform('count') > 1  # count the values and check if more than 1
For the second one, it is not clear to me whether you want the reset once the clock hits a new hour (as mentioned in the first part of the mask). If that is also the case, you need to group by station and hour, and check values using shift (as you tried):
df.groupby(
    [df.station, df['valid'].apply(lambda x: x.hour)]  # group by station and hour
).precip.apply(lambda x: x < x.shift(1))  # value lower than previous
If this is not the case, as suggested by your expected output, you only group by station:
df.groupby(df.station).precip.apply(
    lambda x: (x < x.shift(1))  # value is less than previous
    & (abs(df['valid'] - df['valid'].shift(1)) < pd.Timedelta('1 hour'))  # within hour
)
Combining these two masks will let you drop the right rows:
df['sameValueWithinHour'] = df.groupby(
    [df.station, df['valid'].apply(lambda x: x.hour), df.precip]
).precip.transform('count') > 1
df['previousValuelowerWithinHour'] = df.groupby(df.station).precip.transform(
    lambda x: (x < x.shift(1))
    & (abs(df['valid'] - df['valid'].shift(1)) < pd.Timedelta('1 hour'))
)
df['to_remove'] = df.sameValueWithinHour | df.previousValuelowerWithinHour
df.loc[~df.to_remove, ['station', 'valid', 'precip']]
station valid precip
2 btv 2022-02-23 00:55:00 0.5
4 btv 2022-02-23 01:12:00 0.3
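An equivalent formulation uses per-station groupby shifts, so nothing inside a lambda has to refer back to the full df. This is only a hedged sketch of the same two masks (an alternative phrasing, not the code above), reusing the post's station/valid/precip column names:

import pandas as pd

g = df.groupby('station')

# mask 1: the same precip value occurs more than once for a station within an hour
dup_per_hour = (
    df.groupby([df['station'], df['valid'].dt.hour, df['precip']])['precip']
      .transform('size') > 1
)

# mask 2: value lower than the previous one for the same station, within an hour
lower_within_hour = (
    (df['precip'] < g['precip'].shift(1))
    & ((df['valid'] - g['valid'].shift(1)).abs() < pd.Timedelta('1 hour'))
)

df['to_remove'] = dup_per_hour | lower_within_hour
df.loc[~df['to_remove'], ['station', 'valid', 'precip']]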

Filter out items in pandas dataframe by index number (in one column)

I have a column in my dataframe that goes, index-wise, from 900 to 1200.
Now I want to extract those index values automatically. Does anyone know how I could do that? I would like to have either the difference (1200 - 900) or the range [900, 1200].
If I understand correctly, use:
out = (df.index.min(), df.index.max())
dif = df.index.max() - df.index.min()
# 1200 not included
r = range(df.index.min(), df.index.max())
# 1200 included
r1 = range(df.index.min(), df.index.max() + 1)
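If the goal after recovering the bounds is to actually pull those rows, label-based slicing works directly; a small sketch (note that .loc slicing is inclusive on both ends):

# sketch: select the rows in the 900-1200 index window once the bounds are known
subset = df.loc[900:1200]                  # includes both 900 and 1200
width = df.index.max() - df.index.min()    # the difference, e.g. 1200 - 900 = 300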

Generate time series dates after a certain date in each group in Pyspark dataframe

I have this dataframe -
data = [(0,1,1,201505,3),
(1,1,1,201506,5),
(2,1,1,201507,7),
(3,1,1,201508,2),
(4,2,2,201750,3),
(5,2,2,201751,0),
(6,2,2,201752,1),
(7,2,2,201753,1)
]
cols = ['id','item','store','week','sales']
data_df = spark.createDataFrame(data=data,schema=cols)
display(data_df)
What I want is this -
data_new = [(0,1,1,201505,3,0),
(1,1,1,201506,5,0),
(2,1,1,201507,7,0),
(3,1,1,201508,2,0),
(4,1,1,201509,0,0),
(5,1,1,201510,0,0),
(6,1,1,201511,0,0),
(7,1,1,201512,0,0),
(8,2,2,201750,3,0),
(9,2,2,201751,0,0),
(10,2,2,201752,1,0),
(11,2,2,201753,1,0),
(12,2,2,201801,0,0),
(13,2,2,201802,0,0),
(14,2,2,201803,0,0),
(15,2,2,201804,0,0)]
cols_new = ['id','item','store','week','sales','flag',]
data_df_new = spark.createDataFrame(data=data_new,schema=cols_new)
display(data_df_new)
So basically, I want 8 (this could also be 6 or 10) weeks of data for each item-store combination. Wherever the 52/53 weeks of the year end, the weeks need to roll over into the next year, as shown in the sample. I need this in PySpark, thanks in advance!
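One hedged way to approach this in PySpark (a sketch under the assumptions below, not a tested solution): build an N-week calendar per item/store starting at each group's first week with a small UDF, then left-join the observed sales back in and fill the gaps with 0. The weeks_in_year helper is an assumption: the sample implies 2017 has 53 weeks, which is not the ISO rule, so replace it with whatever calendar actually governs your week numbers. N_WEEKS and week_sequence are names introduced here only for illustration.

from pyspark.sql import functions as F, Window
from pyspark.sql.types import ArrayType, IntegerType

N_WEEKS = 8  # pad every item/store group to this many weeks

def weeks_in_year(year):
    # Assumption: the sample output implies 2017 has 53 weeks; swap in the
    # calendar that really defines 52- vs 53-week years for your data.
    return 53 if year == 2017 else 52

@F.udf(ArrayType(IntegerType()))
def week_sequence(start_week):
    # build N_WEEKS consecutive yyyyww values, rolling over at year end
    weeks, wk = [], start_week
    for _ in range(N_WEEKS):
        weeks.append(wk)
        year, w = divmod(wk, 100)
        wk = (year + 1) * 100 + 1 if w >= weeks_in_year(year) else wk + 1
    return weeks

# one calendar row per item/store and week, starting at each group's first week
calendar = (
    data_df.groupBy('item', 'store')
           .agg(F.min('week').alias('start_week'))
           .withColumn('week', F.explode(week_sequence('start_week')))
           .drop('start_week')
)

# join the observed sales back in; missing weeks get sales 0, and flag is 0 everywhere
data_df_new = (
    calendar.join(data_df.select('item', 'store', 'week', 'sales'),
                  on=['item', 'store', 'week'], how='left')
            .fillna(0, subset=['sales'])
            .withColumn('flag', F.lit(0))
            .withColumn('id', F.row_number().over(
                Window.orderBy('item', 'store', 'week')) - 1)
            .select('id', 'item', 'store', 'week', 'sales', 'flag')
)
display(data_df_new)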

How to get the year where the max and min value occurs in pandas

Please take a look at the picture for df.
I have this dataframe. I found the mean, max, and min values that occurred from 2008-2012 using:
mean = df.iloc[1:6, 1].mean()
print(f"Mean for dividends from 2008-2012 is {mean}")
column = df.iloc[1:6, 1]
max_value = column.max()
min_value = column.min()
How am I supposed to get the year where the max and min occurred from 2008-2012?
You can get the row number using argmin and argmax:
min_value_row = column.argmin()
max_value_row = column.argmax()
Then you can use it to get the date (assuming column 0 is the date):
date_min = df.iloc[column.argmin(),0]
date_max = df.iloc[column.argmax(),0]
In your case:
column_dividends_selected = df.Dividends.iloc[1:6]
min_value = column_dividends_selected.min()
max_value = column_dividends_selected.max()
min_value_row = column_dividends_selected.argmin()
max_value_row = column_dividends_selected.argmax()
column_date_selected = df.Year.iloc[1:6]
date_min = column_date_selected.iloc[min_value_row]
date_max = column_date_selected.iloc[max_value_row]
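An equivalent, slightly shorter route is idxmin/idxmax, which return the index label of the extreme value rather than its position; a hedged sketch reusing the same Year/Dividends names:

# sketch: look the years up via the index labels of the extreme values
column_dividends_selected = df.Dividends.iloc[1:6]
date_min = df.loc[column_dividends_selected.idxmin(), 'Year']
date_max = df.loc[column_dividends_selected.idxmax(), 'Year']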

Pandas apply cumprod to specific index

I have a dataframe consisting of accountId, date, the return value of that account on that date, and the inflation rate on that date.
The date column shows how long an account has been in the system; for example, accountId 1 entered the system in 2016-01 and left in 2019-11.
formula:
df["Inflation"] = ((1+ df["Inflation"]).cumprod() - 1) * 100
I want to apply this formula to all of the accounts, but here is the problem.
When the dataframe contains only one account it's easy to apply the formula, but when I create a dataframe of all accounts (as indicated in the image) I can't apply the formula naively, because every account has a different date interval: some enter the system in 2016, some in 2017.
You can think of it like this: suppose I had one dataframe per account, for example df1 for account1, df2 for account2, and so on. I want to apply the formula to each dataframe individually, and finally merge all of them into one dataframe containing all accounts.
df["Inflation2"] = ((1+df.groupby(["AccountId","Inflation"])).cumprod()-1) * 100
I tried this code but it gives me an error: "unsupported operand type(s) for +: 'int' and 'DataFrameGroupBy'".
Thanks in advance...
I solved it as follows:
df["Inflation"] = df.groupby(["AccountId"]).Inflation.apply(lambda x: (x + 1).cumprod()-1) * 100