Running regressions iteratively for subsets of pySpark dataframes - partitioning by DF columns or mapPartitions?

I have sales data by store and product_category for every week in the following format.
STORE|PRODUCT_CAT|WK_ENDING|<PREDICTOR_VARIABLES>|TARGET_VARIABLE
S1|P1|2016-01-01|..|....
S1|P1|2016-01-08|..|....
S1|P1|2016-01-15|..|....
S1|P2|2016-01-01|..|....
S1|P2|2016-01-08|..|....
S1|P2|2016-01-15|..|....
S2|P1|2016-01-01|..|....
S2|P1|2016-01-08|..|....
S2|P1|2016-01-15|..|....
S2|P2|2016-01-01|..|....
S2|P2|2016-01-08|..|....
S2|P2|2016-01-15|..|....
...
...
As you can see, it has multiple weekly records for every store-product combination.
There are about 200 different stores and ~50 different product categories, i.e. roughly 200 x 50 = 10,000 store-product combinations. For each combination we have about 4-5 years of data, i.e. roughly 250 records.
The requirement is to run a separate regression model for each store-product combination. That means running thousands of regressions, each on a very small dataset. What is the best way to go about this?
Options tried / thought about -
1. Usual "FOR" loops -
Extracted the unique store-category combinations and then, for each store and each category (nested for loops), filtered the data from the above DF and fit the model.
The process runs for about 10-12 stores and then throws memory errors. Note that the above DF is persisted.
I have seen with other similar computations that pySpark struggles with for loops that repeatedly reference the same DF from inside the loop.
Following is the code snippet -
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

main_df.persist()  # master dataframe containing all the data described above

for store in store_lst:
    # <some calculations like filtering the master dataframe by store etc.>
    main_df_by_store = main_df.filter(main_df['store_id'] == str(store))
    for cat in cat_lst:
        # Filter down to a single store-category combination and assemble the feature vector
        assembler = VectorAssembler(inputCols=['peer_pos_sales'], outputCol='features')
        traindata = main_df_by_store.filter(main_df_by_store['rbt_category'] == str(cat))
        output = assembler.transform(traindata)
        modelfit = output.drop('peer_pos_sales').withColumnRenamed('vacant_pos_sales', 'label')
        # Fit and score one regression per combination
        lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
        lrModel = lr.fit(modelfit)
        result = lrModel.transform(modelfit)
Can we create a Window function partitioned by Store and Category, and then apply a UDF to run the regressions?
However, it appears that Window functions only work with built-in aggregate/analytic functions, not UDFs. Is that correct?
How to handle this? Looping is killing the server.
This needs to be done in pySpark only.
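One way to avoid both the loop and the window-function limitation is a grouped-map pandas UDF: groupBy(...).applyInPandas hands each store-category slice to a plain Python function as a pandas DataFrame, so each small regression is fitted independently on the executors. A minimal sketch, assuming Spark 3.x with pyarrow and scikit-learn available on the workers, and reusing the column names from the snippet above; scikit-learn's ElasticNet could replace LinearRegression to mirror regParam/elasticNetParam:
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives all rows for one store-category combination as a pandas DataFrame.
    model = LinearRegression().fit(pdf[['peer_pos_sales']], pdf['vacant_pos_sales'])
    return pd.DataFrame({
        'store_id': [pdf['store_id'].iloc[0]],
        'rbt_category': [pdf['rbt_category'].iloc[0]],
        'intercept': [float(model.intercept_)],
        'coef': [float(model.coef_[0])],
    })

results = (
    main_df
    .groupBy('store_id', 'rbt_category')
    .applyInPandas(
        fit_group,
        schema='store_id string, rbt_category string, intercept double, coef double',
    )
)
results.show()
On Spark 2.3/2.4 the equivalent is a @pandas_udf(..., PandasUDFType.GROUPED_MAP) function passed to groupBy(...).apply().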

Related

How to take a sample from a dask dataframe that keeps all the products ordered by a certain number of customers?

I tried loading my csv file using pd.read_csv. It has 33 million records and takes too much time to load and query.
I have data for 200k customers.
Loading is quick with a dask dataframe, but queries still take a long time.
This is the code I have written for sampling:
df_s = df.sample(frac=300000/33819106, replace=False, random_state=10)
This works fine, but the customers have ordered many products. How do I make the sample include all the products of the sampled customers, i.e. how do I sample based on customer id?
Load your data into a dataframe and then sample from it. Output to a new .csv that is easier to read from.
df = pd.read_csv('customers.csv')
df = df.sample(frac=.2) # 20% of the rows will be sampled.
df.to_csv('sample_customers.csv') # Create an easier to work with .csv
Generally, the format of a question on here is:
1. Description of the problem
2. Desired outcome
3. What you've tried
4. Minimum reproducible example
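For the actual ask in the question above, sampling by customer id so that all of a sampled customer's products are kept, a minimal pandas sketch; the same pattern works with a dask dataframe, and 'customer_id' is a hypothetical column name:
import pandas as pd

df = pd.read_csv('customers.csv')

# Sample customer ids rather than rows, so every product a sampled customer ordered is kept.
customer_ids = df['customer_id'].drop_duplicates()
sampled_ids = customer_ids.sample(frac=0.1, random_state=10)  # choose the fraction you need

df_s = df[df['customer_id'].isin(sampled_ids)]
df_s.to_csv('sample_customers.csv', index=False)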

Calculate counts with pandas and display/filter in Power BI

I have a dataset of user action logs from a web page. Because the whole dataset is huge, I want to use pandas to calculate the counts first and then visualize the count data in Power BI. The important thing is that I can still filter the data.
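A minimal sketch of the pre-aggregation step, assuming a CSV source and hypothetical column names ('user_id', 'action', 'timestamp'); keeping those columns in the grouped output is what preserves the ability to filter in Power BI:
import pandas as pd

# Hypothetical file and column names.
logs = pd.read_csv('user_actions.csv', parse_dates=['timestamp'])

# Aggregate to daily counts per user and action; the grouping columns stay available
# as filters/slicers once the file is imported into Power BI.
counts = (
    logs.groupby(['user_id', 'action', pd.Grouper(key='timestamp', freq='D')])
        .size()
        .reset_index(name='action_count')
)

counts.to_csv('action_counts.csv', index=False)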

Stata Create panel dataset with two dataframes, no common variable

I am creating a city-by-day panel from scratch, but I'm having trouble balancing and filling in the data. Every city needs to have an observation for every day between 01jan2000 and 31dec2019; my variable of interest is a dummy variable recording whether or not an event took place in that city on that day.
My original dataset only recorded observations where event == 1, and I managed to fill in time gaps using tsfill, but I can't figure out how to balance the data or extend it to cover 01jan2000 through 31dec2019. I need every date and city because the panel will eventually be merged with data that uses that sample period.
My current approach is to create a balanced, filled-in panel and then merge in the event data by the date it took place. I have one Stata dataset containing the 7,305 dates and another containing the 273 cityids I'm observing. Is it possible to generate a new dataset that combines these two so that all 273 cities are observed every day? Essentially there would be 273 x 7,305 observations and no variables of interest yet.
Any help figuring out how to solve the unbalanced-panel issue using either of these approaches is hugely appreciated.

Moving files from one parquet partition to another

I have a very large amount of data in my S3 bucket, partitioned by two columns, MODULE and DATE,
such that the file structure of my parquet dataset is:
s3://my_bucket/path/file.parquet/MODULE='XYZ'/DATE=2020-01-01
I have 7 MODULEs and DATE ranges from 2020-01-01 to 2020-09-01.
I found a discrepancy in the data and need to correct the MODULE entries for one of the modules. Basically, I need to change all data for a particular index number belonging to MODULE XYZ to MODULE ABC.
I can do this in pyspark by loading the dataframe and doing something like:
df = df.withColumn('MODULE', when(col('index') == 34, 'ABC').otherwise(col('MODULE')))
But how do I repartition it so that only the entries that changed get moved to the ABC MODULE partition? If I do something like:
df.write.mode('append').partitionBy('MODULE', 'DATE').parquet("s3://my_bucket/path/file.parquet")
I would be appending those rows alongside the erroneous MODULE data. Plus, I have almost a year's worth of data and don't want to repartition the entire dataset, as it would take a very long time.
Is there a way to do this?
If I understand correctly, you have data in the MODULE=XYZ partition that should be moved to MODULE=ABC.
First, identify the impacted files.
from pyspark.sql import functions as F

# Collect the paths of the parquet files that contain the affected rows
file_list = [
    row[0]
    for row in df.where(F.col("index") == 34)
        .select(F.input_file_name())
        .distinct()
        .collect()
]
Then, create a dataframe based only on these files and use it to fix both MODULE partitions.
# basePath lets Spark recover the MODULE and DATE partition columns from the file paths
df = spark.read.option("basePath", "s3://my_bucket/path/file.parquet").parquet(*file_list).withColumn(
    "MODULE", F.when(F.col("index") == 34, "ABC").otherwise(F.col("MODULE"))
)

df.write.parquet(
    "s3://my_bucket/path/file.parquet", mode="append", partitionBy=["MODULE", "DATE"]
)
At this point, ABC should be OK (you just added the missing data), but XYZ is still wrong because the same rows are now duplicated there. To recover XYZ, you just need to delete the files in file_list.
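A minimal sketch of that delete step using boto3 (an assumption; any S3 client or the AWS CLI would do), assuming the paths returned by input_file_name() are s3:// URIs — depending on the Hadoop connector they may come back as s3a://, which parses the same way:
import boto3
from urllib.parse import urlparse

s3 = boto3.client("s3")

# file_list holds the impacted files found above, e.g.
# s3://my_bucket/path/file.parquet/MODULE=XYZ/DATE=2020-01-01/part-....parquet
for path in file_list:
    parsed = urlparse(path)
    s3.delete_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))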
IIUC, you can do this by filtering the data for that particular index and then saving it with DATE as the partition column.
from pyspark.sql.functions import col, when

df = df.withColumn('MODULE', when(col('index') == 34, 'ABC').otherwise(col('MODULE')))
df = df.filter(col('index') == 34)
df.write.mode('overwrite').partitionBy('DATE').parquet("s3://my_bucket/path/ABC/")
This way you only end up rewriting the changed module, i.e. ABC.

Multi-dimensional dataframe or multiple 2D dataframes

A colleague wrote some code to create a price lookup table for products whose prices change throughout the year. He also stores other information like the name of the season, when it starts, when it ends, etc. His code takes nine minutes to run on a beefy machine.
His approach is the traditional SQL loop-over-records algorithm. I wanted to see if I could do better using matrices, so I wrote a price table (of only prices) using Pandas. My code runs in 21 seconds on a MacBook Air. Cool.
My next step is to add other attributes like the name of the season, when it starts, when it ends, etc. It's my understanding that I shouldn't store objects in my dataframes because that will reduce speed, is bad practice, etc.
I think I have two options:
1. For each new piece of data, add another dimension, so the shape of my data would go from (product x days) to (product x days x season_name x season_start x season_end).
2. Create a new dataframe for each attribute and jump back and forth between them as necessary.
My goal is to use pandas for very quick lookups and calculations of data.
Or is there a better, more pandas-ish way to do this?
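A minimal sketch of option 2 (separate 2D dataframes), with hypothetical product, day, and season values: keep the price matrix purely numeric for fast lookups, and park the season metadata in a small side table keyed by a season id.
import numpy as np
import pandas as pd

# Numeric-only price matrix: products x days, so lookups stay fast.
days = pd.date_range("2024-01-01", periods=366, freq="D")
prices = pd.DataFrame(
    np.random.rand(3, len(days)),
    index=["prod_a", "prod_b", "prod_c"],
    columns=days,
)

# Season metadata lives in its own small table, keyed by a season id.
seasons = pd.DataFrame({
    "season_id": [1, 2],
    "season_name": ["winter", "summer"],
    "season_start": pd.to_datetime(["2024-01-01", "2024-06-01"]),
    "season_end": pd.to_datetime(["2024-03-31", "2024-08-31"]),
}).set_index("season_id")

# A day -> season_id mapping bridges the two without putting objects in the price matrix.
day_to_season = pd.Series(np.where(days.month.isin([6, 7, 8]), 2, 1), index=days)

# Lookup: the price of prod_a on a given day, plus that day's season metadata.
day = pd.Timestamp("2024-07-15")
price = prices.loc["prod_a", day]
season_info = seasons.loc[day_to_season[day]]
print(price, season_info["season_name"])
This keeps object dtypes out of the frequently-hit price matrix while leaving the season attributes a single .loc lookup away.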