I'm trying to perform specific operations based on the age of data in days within a dataframe. What I am looking for is something like the following:
import pandas as pd
if 10days < (pd.Timestamp.now() - pd.Timestamp(2019, 3, 20)):
    print('The data is older than 10 days')
Is there something I can replace "10days" with or some other way I can perform operations based on the difference between two Timestamp values?
What you're looking for is pd.Timedelta('10D'), pd.Timedelta(10, unit='D') (or unit='days' or unit='day'), or pd.Timedelta(days=10). For example,
In [37]: pd.Timedelta(days=10) < pd.Timestamp.now() - pd.Timestamp(2019, 3, 20)
Out[37]: False
In [38]: pd.Timedelta(days=5) < pd.Timestamp.now() - pd.Timestamp(2019, 3, 20)
Out[38]: True
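As a minimal sketch of applying this inside a DataFrame (the column name date here is just an assumption), you can build a boolean mask from the same comparison:
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2019-03-20', '2019-03-29']),
                   'value': [1, 2]})

# keep only the rows whose data is older than 10 days
older_than_10_days = df[pd.Timestamp.now() - df['date'] > pd.Timedelta(days=10)]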
I have a dataframe date_dataframe in PySpark with monthly frequency:
date_dataframe
from_date, to_date
2021-01-01, 2022-01-01
2021-02-01, 2022-02-01
2021-03-01, 2022-03-01
Using this dataframe, I want to filter another dataframe that has millions of records (daily frequency), group them by id, and aggregate to calculate the average.
data_df
id,p_date,value
1, 2021-03-25, 10
1, 2021-03-26, 5
1, 2021-03-27, 7
2, 2021-03-25, 5
2, 2021-03-26, 7
2, 2021-03-27, 8
3, 2021-03-25, 20
3, 2021-03-26, 23
3, 2021-03-27, 17
.
.
.
10, 2022-03-25, 5
12, 2022-03-25, 6
I want to use date_dataframe to query (filter) data_df,
then group the filtered dataframe by id,
and finally aggregate to calculate the average value.
I have tried the below code to do this.
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

SeriesAppend = []
for row in date_dataframe.collect():  # iterate the (small) set of date ranges on the driver
    df_new = (data_df
              .filter((data_df.p_date >= row["from_date"]) & (data_df.p_date < row["to_date"]))
              .groupBy("id")
              .agg(F.min('p_date'), F.max('p_date'), F.avg('value')))
    SeriesAppend.append(df_new)

df_series = reduce(DataFrame.unionAll, SeriesAppend)
Is there a more optimized way to do this in PySpark without using a for loop?
Also, date_dataframe is nothing but the start of the month as the start date, and the end date is the start date + 1 year. I am okay with using a different format for date_dataframe.
You can use the SQL function sequence to expand your ranges into actual date rows, then use a join to complete the work. Here I renamed the column to_date to end_date, since to_date is also a SQL function name and I didn't want to deal with the hassle.
from pyspark.sql.functions import explode, expr

df_sequence = date_dataframe.select(
    explode(
        expr("sequence(to_date(from_date), to_date(end_date), interval 1 day)")
    ).alias('day')
)

df_sequence.join(data_df, data_df.p_date == df_sequence.day, "left") \
    .groupby( blah, blah blah...
This should parallelize the work instead of using a for loop.
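For completeness, here is one way the elided groupBy/aggregation could look. This is a sketch based on the question's goal (group by id and average value), not part of the original answer; it uses an inner join so only days that actually appear in data_df contribute:
from pyspark.sql import functions as F

result = (df_sequence
          .join(data_df, data_df.p_date == df_sequence.day, "inner")
          .groupBy("id")
          .agg(F.min("p_date"), F.max("p_date"), F.avg("value")))
result.show()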
Using the .resample() method yields a DataFrame with a DatetimeIndex and a frequency.
Does anyone have an idea how to iterate through the values of that DatetimeIndex?
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.random.randint(0, 10, 100),
    index=pd.date_range('20220101', periods=100),
    columns=['a'],
)
df.resample('M').mean()
If you iterate, you get individual entries of the form Timestamp('2022-11-XX…', freq='M'), but I did not manage to get the date only.
df.resample('M').mean().index[0]
Timestamp('2022-01-31 00:00:00', freq='M')
I am aiming at feeding all the dates into a list, for instance.
Thanks for your help!
You can convert each entry in the index into a datetime.date object using .date and then into a list using .tolist(), as below:
>>> df.resample('M').mean().index.date.tolist()
[datetime.date(2022, 1, 31), datetime.date(2022, 2, 28), datetime.date(2022, 3, 31), datetime.date(2022, 4, 30)]
You can also truncate the timestamp as follows (reference solution)
>>> df.resample('M').mean().index.values.astype('<M8[D]')
array(['2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30'],
dtype='datetime64[D]')
This solution seems to work fine both for dates and periods:
I = [k.strftime('%Y-%m') for k in df.resample('M').groups]
I have a dataframe which shall be grouped, and then several functions shall be applied to each group. Normally, I would do this with groupby().agg() (cf. Apply multiple functions to multiple groupby columns), but the functions I'm interested in do not take a single column as input but multiple columns.
I learned that, when I have one function that has multiple columns as input, I need apply (cf. Pandas DataFrame aggregate function using multiple columns).
But what do I need, when I have multiple functions that have multiple columns as input?
import pandas as pd
df = pd.DataFrame({'x':[2, 3, -10, -10], 'y':[10, 13, 20, 30], 'id':['a', 'a', 'b', 'b']})
def mindist(data):  # of course these functions are more complicated in reality
    return min(data['y'] - data['x'])

def maxdist(data):
    return max(data['y'] - data['x'])
I would expect something like df.groupby('id').apply([mindist, maxdist])
    min  max
id
a     8   10
b    30   40
(achieved with pd.DataFrame({'mindist': df.groupby('id').apply(mindist), 'maxdist': df.groupby('id').apply(maxdist)}) - which obviously isn't very handy if I have a dozen functions to apply to the grouped dataframe). Initially I thought this OP had the same question, but he seems to be fine with aggregate, meaning his functions take only one column as input.
For this specific issue, how about a groupby after taking the difference?
(df['y'] - df['x']).groupby(df['id']).agg(['min', 'max'])
More generically, you could probably do something like
df.groupby('id').apply(lambda x:pd.Series({'min':mindist(x),'max':maxdist(x)}))
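Either version should reproduce the expected output from the question, i.e. something like:
    min  max
id
a     8   10
b    30   40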
IIUC you want to use several functions within the same group. In this case you should return a pd.Series. In the following toy example I want to
sum columns A and B then calculate the mean
sum columns C and D then calculate the std
import pandas as pd
df = pd.util.testing.makeDataFrame().head(10)
df["key"] = ["key1"] * 5 + ["key2"] * 5
def fun(x):
    m = (x["A"] + x["B"]).mean()
    s = (x["C"] + x["D"]).std()
    return pd.Series({"meanAB": m, "stdCD": s})
df.groupby("key").apply(fun)
Update
Which in your case becomes:
import pandas as pd

df = pd.DataFrame({'x': [2, 3, -10, -10],
                   'y': [10, 13, 20, 30],
                   'id': ['a', 'a', 'b', 'b']})

def mindist(data):  # of course these functions are more complicated in reality
    return min(data['y'] - data['x'])

def maxdist(data):
    return max(data['y'] - data['x'])

def fun(data):
    return pd.Series({"maxdist": maxdist(data),
                      "mindist": mindist(data)})

df.groupby('id').apply(fun)
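Running this should give, for each group, the maximum and minimum of y - x, i.e. something like:
    maxdist  mindist
id
a        10        8
b        40       30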
I am trying to filter my data down to only those rows in the bottom decile of the data for any given date. Thus, I need to group by the date first to get the sub-universe of data, and then filter that same sub-universe down to only those values falling in the bottom decile. I then need to aggregate all of the different dates back together to make one large dataframe.
For example, I want to take the following df:
df = pd.DataFrame([['2017-01-01', 1], ['2017-01-01', 5], ['2017-01-01', 10], ['2018-01-01', 5], ['2018-01-01', 10]], columns=['date', 'value'])
and keep only those rows where the value is in the bottom decile for that date (below 1.8 and 5.5, respectively):
          date  value
0  '2017-01-01'      1
1  '2018-01-01'      5
I can get a series of the bottom-decile values using df.groupby(['date'])['value'].quantile(.1), but this would then require me to iterate through the entire df and compare each value to the quantile value in the series, which I'm trying to avoid due to performance issues.
Something like this?
df.groupby('date').value.apply(lambda x: x[x < x.quantile(.1)]).reset_index(1,drop = True).reset_index()
         date  value
0  2017-01-01      1
1  2018-01-01      5
Edit:
df.loc[df['value'] < df.groupby('date').value.transform(lambda x: x.quantile(.1))]
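The transform-based version keeps the original index, so on the example df it should return something like:
         date  value
0  2017-01-01      1
3  2018-01-01      5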
I want to compute the time difference between times in a DatetimeIndex
import pandas as pd
p = pd.DatetimeIndex(['1985-11-14', '1985-11-28', '1985-12-14', '1985-12-28'], dtype='datetime64[ns]')
I can compute the time difference of two times:
p[1] - p[0]
gives
Timedelta('14 days 00:00:00')
But p[1:] - p[:-1] doesn't work and gives
DatetimeIndex(['1985-12-28'], dtype='datetime64[ns]', freq=None)
and a future warning:
FutureWarning: using '-' to provide set differences with datetimelike Indexes is deprecated, use .difference()
Any thoughts on how I can (easily) compute the time difference between values in a DatetimeIndex? And why does it work for one value, but not for the entire DatetimeIndex?
Convert the DatetimeIndex to a Series using to_series() and then call diff to calculate inter-row differences:
In [5]:
p.to_series().diff()
Out[5]:
1985-11-14 NaT
1985-11-28 14 days
1985-12-14 16 days
1985-12-28 14 days
dtype: timedelta64[ns]
As to why it failed: the - operator here attempts to perform a set difference between your two index ranges, whereas what you want is to subtract the values of one range from the other, which is what diff does.
When you did p[1] - p[0], the - performed a scalar subtraction, but when you apply it to an index, pandas thinks you're performing a set operation.
The - operator is working; it's just not doing what you expect. In the second situation it gives the set difference of the two datetime indexes, that is, the values that are in p[1:] but not in p[:-1].
There may be a better solution, but it would work to perform the operation element-wise:
[e - k for e,k in zip(p[1:], p[:-1])]
I used None to fill the first difference value, but I'm sure you can figure out how you would like to deal with that case.
>>> [None] + [p[n] - p[n-1] for n in range(1, len(p))]
[None,
Timedelta('14 days 00:00:00'),
Timedelta('16 days 00:00:00'),
Timedelta('14 days 00:00:00')]
BTW, to just get the day difference:
[None] + [(p[n] - p[n-1]).days for n in range(1, len(p))]
[None, 14, 16, 14]
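As a small aside (not part of the original answers), a vectorized alternative is to subtract the underlying NumPy arrays, which sidesteps the Index set semantics entirely; a sketch assuming NumPy is imported as np:
import numpy as np
import pandas as pd

# subtracting the raw datetime64 arrays gives a timedelta64 array, no set logic involved
deltas = p.values[1:] - p.values[:-1]
pd.TimedeltaIndex(deltas)
# TimedeltaIndex(['14 days', '16 days', '14 days'], ...)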