How to get the year where the max and min value occurs in pandas - python-3.8

Please take a look at the picture for the df.
I have this DataFrame. I found the mean, max, and min values that occurred from 2008-2012 using:
mean = df.iloc[1:6, 1].mean()
print(f"Mean for dividends from 2008-2012 is {mean}")
column = df.iloc[1:6, 1]
max_value = column.max()
min_value = column.min()
How am I supposed to get the year where the max and min occurred from 2008-2012?

You can get the row number using argmin and argmax:
min_value_row = column.argmin()
max_value_row = column.argmax()
Then you can use it to get the date (expecting column 0 to be the date). Note that argmin and argmax return positions within the slice, which starts at row 1 of the full frame, so add that offset back when indexing df:
date_min = df.iloc[min_value_row + 1, 0]  # +1 because the slice df.iloc[1:6, 1] starts at row 1
date_max = df.iloc[max_value_row + 1, 0]
In your case:
column_dividends_selected = df.Dividends.iloc[1:6]
min_value = column_dividends_selected.min()
max_value = column_dividends_selected.max()
min_value_row = column_dividends_selected.argmin()
max_value_row = column_dividends_selected.argmax()
column_date_selected = df.Year.iloc[1:6]  # slice Year the same way so positions line up
date_min = column_date_selected.iloc[min_value_row]
date_max = column_date_selected.iloc[max_value_row]
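Equivalently, since the iloc slice keeps the original index labels, idxmin/idxmax map straight back to the full frame, so the year can be read off in one step (a sketch assuming the columns are named Year and Dividends as above):
date_min = df.loc[column_dividends_selected.idxmin(), 'Year']
date_max = df.loc[column_dividends_selected.idxmax(), 'Year']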

Related

Trying to mask a dataset based on multiple conditions

I am trying to mask a dataset I have based on two conditions:
1. Mask any station that has a repeated value more than once per hour; I want the count to reset once the clock hits a new hour.
2. Mask the data whenever the previous datapoint is lower than the current datapoint, if it is within the hour and the station names are the same.
The mask I applied to it is this:
mask = (
    (df['station'] == df['station'].shift(1))
    & (df['precip'] >= df['precip'].shift(1))
    & (abs(df['valid'] - df['valid'].shift(1)) < pd.Timedelta('1 hour'))
)
df.loc[mask, 'to_remove'] = True
However, it is not working properly, giving me a df that looks like this.
I want a dataframe that looks like this:
Basically you want to mask two things, the first being a duplicate value per station & hour. This can be found by grouping by station and the hour of valid, plus the precip column. On this groupby you can count the number of occurrences and check whether it is more than one:
df.groupby(
    [df.station, df['valid'].apply(lambda x: x.hour), df.precip]  # group by the columns
).precip.transform('count') > 1  # count the values per group and check if more than 1
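As a quick check on a toy frame (hypothetical data shaped like the question's; df['valid'].dt.hour is a shortcut for the apply above):
import pandas as pd

df = pd.DataFrame({
    'station': ['btv'] * 5,
    'valid': pd.to_datetime(['2022-02-23 00:15', '2022-02-23 00:35',
                             '2022-02-23 00:55', '2022-02-23 01:05',
                             '2022-02-23 01:12']),
    'precip': [0.2, 0.2, 0.5, 0.5, 0.3],
})

# only the two 0.2 readings repeat within the same hour
dup_per_hour = df.groupby(
    [df.station, df['valid'].dt.hour, df.precip]
).precip.transform('count') > 1
print(dup_per_hour.tolist())  # [True, True, False, False, False]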
For the second condition, it is not clear to me whether you also want it to reset once the clock hits a new hour (as mentioned in the first part of the question). If that is the case, you need to group by station and hour, and check values using shift (as you tried):
df.groupby(
    [df.station, df['valid'].apply(lambda x: x.hour)]  # group by station and hour
).precip.apply(lambda x: x < x.shift(1))  # value lower than previous
If this is not the case, as suggested by your expected output, you only group by station:
df.groupby(df.station).precip.apply(
    lambda x: (x < x.shift(1)) &  # value is less than previous
    (abs(df['valid'] - df['valid'].shift(1)) < pd.Timedelta('1 hour'))  # within the hour
)
Combining these two masks will let you drop the right rows:
df['sameValueWithinHour'] = df.groupby(
    [df.station, df['valid'].apply(lambda x: x.hour), df.precip]
).precip.transform('count') > 1
df['previousValuelowerWithinHour'] = df.groupby(df.station).precip.transform(
    lambda x: (x < x.shift(1)) & (abs(df['valid'] - df['valid'].shift(1)) < pd.Timedelta('1 hour'))
)
df['to_remove'] = df.sameValueWithinHour | df.previousValuelowerWithinHour
df.loc[~df.to_remove, ['station', 'valid', 'precip']]
  station               valid  precip
2     btv 2022-02-23 00:55:00     0.5
4     btv 2022-02-23 01:12:00     0.3

Filter out items in pandas dataframe by index number (in one column)

I have a column in my dataframe whose index goes from 900 to 1200.
Now I want to extract those index values automatically. Does anyone know how I could do that? I would like either the difference (1200 - 900) or the range [900, 1200].
IIUC use:
out = (df.index.min(), df.index.max())
dif = df.index.max() - df.index.min()
# 1200 missing (range excludes the stop value)
r = range(df.index.min(), df.index.max())
# include 1200
r1 = range(df.index.min(), df.index.max() + 1)
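A quick check on a toy frame (hypothetical data, indexed 900 to 1200):
import pandas as pd

df = pd.DataFrame({'a': 0}, index=range(900, 1201))

out = (df.index.min(), df.index.max())          # (900, 1200)
dif = df.index.max() - df.index.min()           # 300
r1 = range(df.index.min(), df.index.max() + 1)  # 900, 901, ..., 1200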

Generate time series dates after a certain date in each group in a PySpark dataframe

I have this dataframe -
data = [(0,1,1,201505,3),
        (1,1,1,201506,5),
        (2,1,1,201507,7),
        (3,1,1,201508,2),
        (4,2,2,201750,3),
        (5,2,2,201751,0),
        (6,2,2,201752,1),
        (7,2,2,201753,1)]
cols = ['id','item','store','week','sales']
data_df = spark.createDataFrame(data=data,schema=cols)
display(data_df)
What I want is this -
data_new = [(0,1,1,201505,3,0),
            (1,1,1,201506,5,0),
            (2,1,1,201507,7,0),
            (3,1,1,201508,2,0),
            (4,1,1,201509,0,0),
            (5,1,1,201510,0,0),
            (6,1,1,201511,0,0),
            (7,1,1,201512,0,0),
            (8,2,2,201750,3,0),
            (9,2,2,201751,0,0),
            (10,2,2,201752,1,0),
            (11,2,2,201753,1,0),
            (12,2,2,201801,0,0),
            (13,2,2,201802,0,0),
            (14,2,2,201803,0,0),
            (15,2,2,201804,0,0)]
cols_new = ['id','item','store','week','sales','flag']
data_df_new = spark.createDataFrame(data=data_new,schema=cols_new)
display(data_df_new)
So basically, I want 8 (this could also be 6 or 10) weeks of data for each item-store group. Wherever the 52/53 weeks of the year end, the sequence should roll over into the next year's weeks, as shown in the sample. I need this in PySpark, thanks in advance!
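No answer was posted for this one. A minimal sketch of one possible approach: pad each item-store group out to N_WEEKS rows by generating the missing future weeks and exploding them. The weeks_in_year helper is an assumption (yyyyww keys that roughly follow ISO week numbering); swap in your own calendar if the data uses a different week convention.
from datetime import date

from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

N_WEEKS = 8  # total weeks wanted per item-store group (could be 6 or 10)

def weeks_in_year(y):
    # ISO convention: Dec 28 always falls in the last ISO week of its year
    return date(y, 12, 28).isocalendar()[1]

def add_week(wk):
    # roll a yyyyww key forward one week, wrapping into the next year at year end
    y, w = divmod(wk, 100)
    return y * 100 + w + 1 if w < weeks_in_year(y) else (y + 1) * 100 + 1

@F.udf(T.ArrayType(T.IntegerType()))
def future_weeks(last_week, n):
    # the weeks needed to pad a group that has n rows out to N_WEEKS rows
    out, wk = [], last_week
    for _ in range(max(0, N_WEEKS - n)):
        wk = add_week(wk)
        out.append(wk)
    return out

last = data_df.groupBy('item', 'store').agg(
    F.max('week').alias('last_week'), F.count('*').alias('n'))

pad = (last
       .withColumn('week', F.explode(future_weeks('last_week', 'n')))
       .withColumn('sales', F.lit(0))
       .select('item', 'store', 'week', 'sales'))

data_df_new = (data_df.select('item', 'store', 'week', 'sales')
               .unionByName(pad)
               .withColumn('flag', F.lit(0))
               .withColumn('id', F.row_number().over(
                   Window.orderBy('item', 'store', 'week')) - 1)
               .select('id', 'item', 'store', 'week', 'sales', 'flag'))
On the sample data this pads item 1 with weeks 201509-201512 and item 2 with 201801-201804, matching the expected output above.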

Pandas: Creating an index vs. a year ago for multiple items

I can create an index vs the previous year when I have just one item, but I'm trying to figure out how to do this when I have multiple items. Here is my data set:
rng = pd.date_range('1/1/2011', periods=3, freq='Y')
rng = np.repeat(rng,3)
country = ["USA","Brazil","Japan"]*3
df = pd.DataFrame({'Country':country,'date':rng,'value':range(20,29)})
If I only had one item/country, I could do something like this:
df['pct_iya'] = 100*(df['value'].pct_change()+1)
I'm trying to get this to work with multiple items. Here is the expected result:
Maybe this could work with a groupby, but my attempt did not work...
df['pct_iya2'] = df.groupby(['Country','date'])['value'].pct_change()
Answer: Use a groupby excluding date, then add one to the percent change (e.g. +15 percent goes from 0.15 to 1.15), then multiply by 100.
df['pct_iya'] = 100*(df.groupby(['Country'])['value'].pct_change()+1)
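As a quick check, the sample frame above then looks like this (values follow directly from the range(20, 29) data; USA 2012 is 115.0, i.e. 23 vs 20, the +15 percent example):
print(df)
#   Country       date  value     pct_iya
# 0     USA 2011-12-31     20         NaN
# 1  Brazil 2011-12-31     21         NaN
# 2   Japan 2011-12-31     22         NaN
# 3     USA 2012-12-31     23  115.000000
# 4  Brazil 2012-12-31     24  114.285714
# 5   Japan 2012-12-31     25  113.636364
# 6     USA 2013-12-31     26  113.043478
# 7  Brazil 2013-12-31     27  112.500000
# 8   Japan 2013-12-31     28  112.000000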

I want to insert a new column in pandas followed with some sort of grouping condition

I want to add a duration column which subtracts the min date from the max date for each company (CompanyName).
Use:
m1 = df.groupby('CompanyName')['Date'].transform('max')
m2 = df.groupby('CompanyName')['Date'].transform('min')
df['duration'] = (m1 - m2).dt.days
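Equivalently, a single transform per group works too (a sketch assuming the Date column is already datetime64):
df['duration'] = (
    df.groupby('CompanyName')['Date']
      .transform(lambda s: s.max() - s.min())
      .dt.days
)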