PySpark filter using another dataframe having to and from date and group by using ids - dataframe

I have a dataframe date_dataframe in PySpark with monthly frequency:
date_dataframe
from_date, to_date
2021-01-01, 2022-01-01
2021-02-01, 2022-02-01
2021-03-01, 2022-03-01
Using this dataframe, I want to filter another dataframe that has millions of records (daily frequency), group it by id, and aggregate to calculate the average value.
data_df
id,p_date,value
1, 2021-03-25, 10
1, 2021-03-26, 5
1, 2021-03-27, 7
2, 2021-03-25, 5
2, 2021-03-26, 7
2, 2021-03-27, 8
3, 2021-03-25, 20
3, 2021-03-26, 23
3, 2021-03-27, 17
.
.
.
10, 2022-03-25, 5
12, 2022-03-25, 6
I want to use date_dataframe to filter data_df, group the filtered dataframe by id, and finally aggregate to calculate the average value.
I have tried the below code to do this.
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

SeriesAppend = []
for row in date_dataframe.collect():
    df_new = (data_df
              .filter((data_df.p_date >= row["from_date"]) & (data_df.p_date < row["to_date"]))
              .groupBy("id")
              .agg(F.min("p_date"), F.max("p_date"), F.avg("value")))
    SeriesAppend.append(df_new)
df_series = reduce(DataFrame.unionAll, SeriesAppend)
Is there a more optimized way to do this in PySpark without using a for loop?
Also, date_dataframe simply contains the start of each month as the start date, and the end date is the start date plus 1 year. I am okay with using a different format for date_dataframe.

You can use the SQL function sequence to expand your ranges into actual date rows, then use a join to complete the work. Here I renamed the column to_date to end_date, since to_date is also a SQL function name and I didn't want to deal with the conflict.
from pyspark.sql import functions as F
from pyspark.sql.functions import explode, expr

df_sequence = date_dataframe.select(
    explode(
        expr("sequence(to_date(from_date), to_date(end_date), interval 1 day)")
    ).alias("day")
)
df_sequence.join(data_df, data_df.p_date == df_sequence.day, "left") \
    .groupBy("id") \
    .agg(F.avg("value"))
This should parallelize the work instead of using a for loop.
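Filling in the group-by, a fuller sketch of the same idea might look like the following. The column name window_start, the inner join, and the per-(window, id) grouping are my assumptions, not part of the original answer: because every month starts a new one-year window, the windows overlap, so keeping each window's start date in the grouping key keeps their averages separate.
from pyspark.sql import functions as F

# Expand each (from_date, end_date) window into one row per day, keeping the
# window's start date so that overlapping one-year windows stay separate.
df_sequence = date_dataframe.select(
    F.col("from_date").alias("window_start"),
    F.explode(
        F.expr("sequence(to_date(from_date), to_date(end_date), interval 1 day)")
    ).alias("day"),
)

# Join the daily rows to the expanded calendar and aggregate per window and id.
result = (
    df_sequence
    .join(data_df, F.to_date(data_df.p_date) == df_sequence.day, "inner")
    .groupBy("window_start", "id")
    .agg(F.min("p_date"), F.max("p_date"), F.avg("value"))
)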

Related

How can I delete a group of rows if they don't satisfy a condition?

I have a dataframe with stock option information. I want to filter this dataframe so that there are exactly 8 options per date. The problem is that some dates have only 6 or 7 options, and I want to write code that deletes such groups of options entirely.
Take this small dataframe as an example:
import numpy as np
import pandas as pd
dates = ['2013-01-01','2013-01-01','2013-01-01','2013-01-02','2013-01-02','2013-01-03','2013-01-03','2013-01-03']
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=list('ABCD'))
In this particular case I want to drop the rows indexed with the date '2013-01-02', since I only want dates that have 3 consecutive rows.
First, group by the index and count:
odf = df.groupby(df.index).count()
Then filter the counts and get the resulting index:
idx = odf[odf['A'] == 3].index
Finally, select by that index:
df.loc[idx]
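An equivalent alternative, assuming the dates are the index as above, is pandas' groupby filter, which keeps only the groups satisfying a condition:
# Keep only those index dates that appear exactly 3 times
df_filtered = df.groupby(level=0).filter(lambda g: len(g) == 3)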

Create datetime from columns in a DataFrame

I have a DataFrame with these columns:
year month day gender births
I'd like to create a new column of type "Date" based on the columns year, month and day, formatted as "yyyy-mm-dd".
I'm just beginning in Python and I just can't figure out how to proceed...
Assuming you are using pandas to create your dataframe, you can try:
>>> import pandas as pd
>>> df = pd.DataFrame({'year':[2015,2016],'month':[2,3],'day':[4,5],'gender':['m','f'],'births':[0,2]})
>>> df['dates'] = pd.to_datetime(df.iloc[:,0:3])
>>> df
year month day gender births dates
0 2015 2 4 m 0 2015-02-04
1 2016 3 5 f 2 2016-03-05
Taken from the example here and the slicing (iloc use) "Selection" section of "10 minutes to pandas" here.
You can use .assign
For example:
df2 = df.assign(ColumnDate = df.Column1.astype(str) + '-' + df.Column2.astype(str) + '-' + df.Column3.astype(str))
It is simple and it is much faster than lambda if you have tonnes of data.
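Applied to the question's year/month/day columns, a minimal sketch of this approach could look like the code below; wrapping the concatenation in pd.to_datetime (an extra step beyond the snippet above) makes the new column an actual datetime rather than a plain string.
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5],
                   'gender': ['m', 'f'], 'births': [0, 2]})

# Build "year-month-day" strings from the three columns, then parse them into datetimes
df = df.assign(date=pd.to_datetime(
    df.year.astype(str) + '-' + df.month.astype(str) + '-' + df.day.astype(str)
))
print(df)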

pandas group by week

I have the following test dataframe:
date user answer
0 2018-08-19 19:08:19 pga yes
1 2018-08-19 19:09:27 pga no
2 2018-08-19 19:10:45 lry no
3 2018-09-07 19:12:31 lry yes
4 2018-09-19 19:13:07 pga yes
5 2018-10-22 19:13:20 lry no
I am using the following code to group by week:
test.groupby(pd.Grouper(freq='W'))
I'm getting an error that Grouper is only valid with DatetimeIndex, however I'm unfamiliar on how to structure this in order to group by week.
You probably have the date column as a string.
In order to use it in a Grouper with a frequency, start by converting this column to datetime:
df['date'] = pd.to_datetime(df['date'])
Then, since date is an "ordinary" data column (not the index), use the key='date' parameter together with a frequency.
To sum up, below you have a working example:
import pandas as pd
d = [['2018-08-19 19:08:19', 'pga', 'yes'],
['2018-08-19 19:09:27', 'pga', 'no'],
['2018-08-19 19:10:45', 'lry', 'no'],
['2018-09-07 19:12:31', 'lry', 'yes'],
['2018-09-19 19:13:07', 'pga', 'yes'],
['2018-10-22 19:13:20', 'lry', 'no']]
df = pd.DataFrame(data=d, columns=['date', 'user', 'answer'])
df['date'] = pd.to_datetime(df['date'])
gr = df.groupby(pd.Grouper(key='date',freq='W'))
for name, group in gr:
    print(' ', name)
    if len(group) > 0:
        print(group)
Note that the group key (name) is the ending date of a week, so the dates of group members are earlier than or equal to the date printed above.
You can change this by passing label='left' to Grouper.
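If you want a weekly aggregate instead of iterating over the groups, a short sketch might look like this; counting answers per user per week is just an illustrative choice, since the question doesn't say what to aggregate.
# Count answers per user per week, labelling each week by its start date
weekly = (df.groupby([pd.Grouper(key='date', freq='W', label='left'), 'user'])
            .size()
            .reset_index(name='n_answers'))
print(weekly)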

ArcPy & Python - Get Latest TWO dates, grouped by Value

I've looked around for the last week for an answer but have only seen partial answers. Being new to Python, I really could use some assistance.
I have two fields in a table [number] and [date]. The date format is date and time, so: 07/09/2018 3:30:30 PM. The [number] field is just an integer, but each row may have the same number.
I have tried a few options to gain access to the LATEST date, and I can get these using Pandas:
myarray = arcpy.da.FeatureClassToNumPyArray (fc, ['number', 'date'])
mydf = pd.DataFrame(myarray)
date_index = mydf.groupby(['number'])['date'].transform(max)==mydf['date']
However, I need the latest TWO dates. I've moved on to trying an "IF" statement, because I feel arcpy.da.UpdateCursor is better suited to look through the records, group by NUMBER, return the rows with the latest TWO dates, and update another field.
The end result would look like the following table, grouped by number with the latest two dates (as examples):
Number : Date
1 7/29/2018 4:30:44 PM
1 7/30/2018 5:55:34 PM
2 8/2/2018 5:45:23 PM
2 8/3/2018 6:34:32 PM
Try this.
import pandas as pd
import numpy as np
# Some data.
data = pd.DataFrame({'number': np.random.randint(3, size = 15), 'date': pd.date_range('2018-01-01', '2018-01-15')})
# Look at the data.
data
Which gives some sample data. In that sample, we'd expect the output to show number 0 with the 5th and the 9th, 1 with the 14th and the 15th, and 2 with the 6th and the 12th.
Then we group by number, grab the last two rows, and set and sort the index.
# Group and label the index.
last_2 = data.groupby('number').tail(2).set_index('number').sort_index()
last_2
Which gives us what we expect.
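One caveat, assuming the real feature class is not already ordered by date: tail(2) simply takes the last two rows of each group in their current order, so it is safer to sort by date first.
# Sort by date so tail(2) really returns the two most recent rows per number
last_2 = (data.sort_values('date')
              .groupby('number')
              .tail(2)
              .set_index('number')
              .sort_index())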

Pandas groupby on one column and then filter based on quantile value of another column

I am trying to filter my data down to only those rows in the bottom decile for any given date. I therefore need to group by date first to get the sub-universe of data, then filter that sub-universe down to only the values falling in the bottom decile, and finally aggregate all of the different dates back together into one large dataframe.
For example, I want to take the following df:
df = pd.DataFrame([['2017-01-01', 1], ['2017-01-01', 5], ['2017-01-01', 10], ['2018-01-01', 5], ['2018-01-01', 10]], columns=['date', 'value'])
and keep only those rows where the value is in the bottom decile for that date (below 1.8 and 5.5, respectively):
date value
0 '2017-01-01' 1
1 '2018-01-01' 5
I can get a series of the bottom decile values using df.groupby(['date'])['value'].quantile(.1), but this would then require me to iterate through the entire df and compare each value to the quantile value in the series, which I'm trying to avoid due to performance issues.
Something like this?
df.groupby('date').value.apply(lambda x: x[x < x.quantile(.1)]).reset_index(level=1, drop=True).reset_index()
date value
0 2017-01-01 1
1 2018-01-01 5
Edit:
df.loc[df['value'] < df.groupby('date').value.transform(lambda x: x.quantile(.1))]
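Putting the pieces together with the sample frame from the question, a minimal runnable sketch of the transform approach:
import pandas as pd

df = pd.DataFrame([['2017-01-01', 1], ['2017-01-01', 5], ['2017-01-01', 10],
                   ['2018-01-01', 5], ['2018-01-01', 10]],
                  columns=['date', 'value'])

# Per-date 10th percentile, broadcast back onto every row and used as a filter
bottom = df.loc[df['value'] < df.groupby('date')['value'].transform(lambda x: x.quantile(.1))]
print(bottom)
#          date  value
# 0  2017-01-01      1
# 3  2018-01-01      5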