I am trying to mask a dataset I have based on two parameters:
1. Mask any station that has a repeated value more than once per hour.
a) I want the count to reset once the clock hits a new hour.
2. Mask the data whenever the previous datapoint is lower than the current datapoint, if it is within the hour and the station names are the same.
The mask I applied is this:
mask = (
    (df['station'] == df['station'].shift(1))
    & (df['precip'] >= df['precip'].shift(1))
    & (abs(df['valid'] - df['valid'].shift(1)) < pd.Timedelta('1 hour'))
)
df.loc[mask, 'to_remove'] = True
However, it is not working properly and gives me a df that looks like this:
I want a dataframe that looks like this:
Basically you want to mask two things, the first being a duplicate value per station & hour. This can be found by grouping by station, the hour of valid, and the precip column. On this groupby you can count the number of occurrences and check whether it is more than one:
df.groupby(
    [df.station, df['valid'].dt.hour, df.precip]  # group by station, hour and value
)['valid'].transform('count') > 1  # count the occurrences and check if more than 1
For the second one, it is not clear to me whether you also want the check to reset once the clock hits a new hour (as mentioned in the first part of the mask). If this is the case, you need to group by station and hour, and compare values using shift (as you tried):
df.groupby(
    [df.station, df['valid'].dt.hour]  # group by station and hour
).precip.apply(lambda x: x < x.shift(1))  # value lower than previous
If this is not the case, as your expected output suggests, you only group by station:
df.groupby('station', group_keys=False)[['precip', 'valid']].apply(
    lambda g: (g['precip'] < g['precip'].shift(1))  # value is less than previous
            & (abs(g['valid'] - g['valid'].shift(1)) < pd.Timedelta('1 hour'))  # within the hour
)
Combining these two masks will let you drop the right rows:
df['sameValueWithinHour'] = (
    df.groupby([df.station, df['valid'].dt.hour, df.precip])['valid']
      .transform('count') > 1
)
df['previousValuelowerWithinHour'] = df.groupby('station', group_keys=False)[['precip', 'valid']].apply(
    lambda g: (g['precip'] < g['precip'].shift(1))
            & (abs(g['valid'] - g['valid'].shift(1)) < pd.Timedelta('1 hour'))
)
df['to_remove'] = df.sameValueWithinHour | df.previousValuelowerWithinHour
df.loc[~df.to_remove, ['station', 'valid', 'precip']]
station valid precip
2 btv 2022-02-23 00:55:00 0.5
4 btv 2022-02-23 01:12:00 0.3
I have a dataframe whose index goes from 900 to 1200.
Now I want to extract those index values automatically. Does anyone know how I could do that? I would like either the difference (1200 - 900) or the range [900, 1200].
IIUC use:
out = (df.index.min(), df.index.max())  # the range as a tuple: (900, 1200)
dif = df.index.max() - df.index.min()   # the difference: 300

# `stop` is exclusive, so 1200 is missing
r = range(df.index.min(), df.index.max())

# include 1200
r1 = range(df.index.min(), df.index.max() + 1)
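As a quick sanity check, a toy frame with that index (the column values are dummies) behaves as described:

```python
import pandas as pd

# toy frame with an integer index from 900 to 1200 (values are dummies)
df = pd.DataFrame({'v': 0}, index=range(900, 1201))

out = (df.index.min(), df.index.max())          # (900, 1200)
dif = df.index.max() - df.index.min()           # 300
r = range(df.index.min(), df.index.max())       # stops before 1200
r1 = range(df.index.min(), df.index.max() + 1)  # includes 1200
```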
I have this dataframe -
data = [(0,1,1,201505,3),
(1,1,1,201506,5),
(2,1,1,201507,7),
(3,1,1,201508,2),
(4,2,2,201750,3),
(5,2,2,201751,0),
(6,2,2,201752,1),
(7,2,2,201753,1)
]
cols = ['id','item','store','week','sales']
data_df = spark.createDataFrame(data=data,schema=cols)
display(data_df)
What I want is this -
data_new = [(0,1,1,201505,3,0),
(1,1,1,201506,5,0),
(2,1,1,201507,7,0),
(3,1,1,201508,2,0),
(4,1,1,201509,0,0),
(5,1,1,201510,0,0),
(6,1,1,201511,0,0),
(7,1,1,201512,0,0),
(8,2,2,201750,3,0),
(9,2,2,201751,0,0),
(10,2,2,201752,1,0),
(11,2,2,201753,1,0),
(12,2,2,201801,0,0),
(13,2,2,201802,0,0),
(14,2,2,201803,0,0),
(15,2,2,201804,0,0)]
cols_new = ['id','item','store','week','sales','flag']
data_df_new = spark.createDataFrame(data=data_new,schema=cols_new)
display(data_df_new)
So basically, I want 8 weeks of data (this could also be 6 or 10) for each item-store groupby combination. Wherever the 52/53 weeks of the year end, I need the weeks to continue into the next year, as I have shown in the sample. I need this in PySpark, thanks in advance!
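The fiddly part of this is generating week labels that roll over the year boundary (e.g. 201753 → 201801). A minimal sketch of that label arithmetic in plain Python, assuming YYYYWW-style labels and that unlisted years have 52 weeks (`next_week` and `extend_weeks` are hypothetical helpers, not from the question's code):

```python
def next_week(yyyyww, weeks_per_year=None):
    """Return the YYYYWW label that follows yyyyww.

    weeks_per_year optionally maps a year to 52 or 53; unlisted years
    are assumed to have 52 weeks. (Hypothetical helper for illustration.)
    """
    weeks_per_year = weeks_per_year or {}
    year, week = divmod(yyyyww, 100)
    if week >= weeks_per_year.get(year, 52):
        return (year + 1) * 100 + 1  # roll over to week 1 of the next year
    return year * 100 + week + 1

def extend_weeks(last_week, n, weeks_per_year=None):
    """Generate the n week labels that follow last_week."""
    weeks = []
    for _ in range(n):
        last_week = next_week(last_week, weeks_per_year)
        weeks.append(last_week)
    return weeks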
I can create an index vs the previous year when I have just one item, but I'm trying to figure out how to do this when I have multiple items. Here is my data set:
rng = pd.date_range('1/1/2011', periods=3, freq='Y')
rng = np.repeat(rng,3)
country = ["USA","Brazil","Japan"]*3
df = pd.DataFrame({'Country':country,'date':rng,'value':range(20,29)})
If I only had one item/country I can do something like this:
df['pct_iya'] = 100*(df['value'].pct_change()+1)
I'm trying to get this to work with multiple items. Here is the expected result:
Maybe this could work with a groupby, but my attempt did not work...
df['pct_iya2'] = df.groupby(['Country','date'])['value'].pct_change()
Answer: Use a groupby (excluding date), then add one to the percent change (e.g. +15 percent goes from 0.15 to 1.15), then multiply by 100.
df['pct_iya'] = 100*(df.groupby(['Country'])['value'].pct_change()+1)
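As a quick check, here is the same computation on a rebuild of the sample that uses a plain year column instead of a date range (the values 20–28 come from the question's `range(20, 29)`):

```python
import pandas as pd

# same data as the question, with a plain year column
df = pd.DataFrame({
    'Country': ['USA', 'Brazil', 'Japan'] * 3,
    'year': [2011] * 3 + [2012] * 3 + [2013] * 3,
    'value': range(20, 29),
})

# percent index vs a year ago, computed per country
df['pct_iya'] = 100 * (df.groupby('Country')['value'].pct_change() + 1)
```

For example, USA goes 20 → 23, so its 2012 row gets 100 * (23 / 20) = 115, while each country's first year is NaN because there is no prior year to compare against.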