Multi-dimensional dataframe or multiple 2D dataframes - pandas

A colleague wrote some code to create a price lookup table for products where the prices change throughout the year. He also stores other information like the name of the season, when it starts, ends, etc. His code takes nine minutes to run on a beefy machine.
His approach is the traditional SQL-style loop over records. I wanted to see if I could do better using matrices, so I wrote a price table (of only prices) using pandas. My code runs in 21 seconds on a MacBook Air. Cool.
My next step is to add in other attributes like name of the season, when it starts, ends, etc. It's my understanding that I shouldn't store objects in my dataframes because that will reduce speed, is Bad Practice, etc.
I think I have two options:
1. For each new piece of data, add another dimension, so the shape of my dataframe would go from (product X days) to (product X days X season_name X season_start X season_end).
2. Create a new dataframe for each attribute and jump back and forth between them as necessary.
My goal is to use pandas for very quick lookups and calculations of data.
Or is there a better, more pandas-ish way to do this?
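For what it's worth, here is a minimal sketch of the second option, assuming the seasons are contiguous and do not overlap. The product names, dates, prices and the `season_by_day` helper are all made up for illustration; none of them come from the original code. The idea is to keep the 2D price table as it is, keep the season attributes in their own small table, and connect the two through the shared date axis.

```python
import pandas as pd

# Hypothetical price table: one row per product, one column per day.
days = pd.date_range("2023-01-01", "2023-12-31", freq="D")
prices = pd.DataFrame(10.0, index=["widget", "gadget"], columns=days)
summer_days = days[(days >= "2023-05-01") & (days <= "2023-08-31")]
prices.loc["widget", summer_days] = 12.0

# Season attributes live in their own small table, one row per season.
seasons = pd.DataFrame(
    {
        "season_name": ["winter", "summer", "autumn"],
        "season_start": pd.to_datetime(["2023-01-01", "2023-05-01", "2023-09-01"]),
        "season_end": pd.to_datetime(["2023-04-30", "2023-08-31", "2023-12-31"]),
    }
)

# Label every day with its season by forward-filling from each season's start.
# This relies on the assumption that seasons are contiguous and sorted by start.
season_by_day = (
    seasons.set_index("season_start")["season_name"].reindex(days, method="ffill")
)

# Fast scalar lookups against either table share the same date key.
day = pd.Timestamp("2023-06-15")
price_today = prices.at["widget", day]
season_today = season_by_day.at[day]

# Vectorised calculation across both tables: mean price per season and product.
mean_by_season = prices.T.groupby(season_by_day).mean()
```

Since both tables are keyed by date, nothing has to be stored as Python objects inside the price matrix, and the season start and end dates stay available in `seasons` when they are needed.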

Related

Stata Create panel dataset with two dataframes, no common variable

I am creating a city-by-day panel from scratch, but I'm having trouble balancing and filling in the data. Every city needs to have an observation every day between 01jan2000 and 31dec2019; my variable of interest is a dummy variable recording whether or not an event took place on that day in that city.
My original dataset only recorded observations if event == 1, and I managed to fill in time gaps using tsfill, but I can't figure out how to balance the data or extend it to start on 01jan2000 and end on 31dec2019. I need every date and city because eventually it will be merged with data that uses that sample period.
My current approach is to create a balanced and filled-in panel and then merge the event data using the date it took place. I have a Stata dataset containing the 7,305 dates, and another containing the 273 cityids I'm observing. Is it possible to generate a new dataset that combines these two so all 273 cities are observed every day? Essentially there will be 273 x 7,305 observations and no variables of interest.
Any help figuring out how to solve the unbalanced issue using either of these approaches is hugely appreciated.

Need to divide a Dataframe in various tables using multiple categories and date time

This is my first time asking a question here, so if I'm doing something wrong please guide me to the right place. I have a big, clean dataset of shape (29000+, 24). The thing is that I have to calculate the churn rate based on 4 different categorical columns, and I'm given just one column that contains the subs for a given period. I have a date column too. My idea for calculating the churn is to do
churn_rate = (Sub_start_period - Sub_end_period) / Sub_start_period * 100
The Problem
I don't know how to group the data using these 4 different categorical variables. Also, if I manage to do so I would end up with more than 200 different tables, so I don't believe this would be a good approach.
My goal is to be able to predict the churn rate using the information in the table, but first I need to determine the churn rate based on these variables. The churn is not given; it has to be calculated, and I can't think of a way of working through this.
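One way to avoid 200 separate tables is a single groupby that yields one row per combination of categories. The sketch below is only an illustration: the column names (`region`, `plan`, `date`, `subs`) and the numbers are invented, it uses two categorical columns instead of four to stay short, and it assumes the first and last rows of each group by date are the start and end of the period.

```python
import pandas as pd

# Hypothetical data: subscriber counts per segment at the start and end of a period.
df = pd.DataFrame(
    {
        "region": ["N", "N", "S", "S"],
        "plan": ["basic", "basic", "pro", "pro"],
        "date": pd.to_datetime(
            ["2023-01-01", "2023-01-31", "2023-01-01", "2023-01-31"]
        ),
        "subs": [200, 180, 100, 95],
    }
)

# The real data would list all four categorical columns here.
group_cols = ["region", "plan"]

# Sort by date so "first" and "last" mean start-of-period and end-of-period.
ordered = df.sort_values("date")
summary = ordered.groupby(group_cols)["subs"].agg(sub_start="first", sub_end="last")

# Same formula as in the question, applied to every group at once.
summary["churn_rate"] = (
    (summary["sub_start"] - summary["sub_end"]) / summary["sub_start"] * 100
)
```

The result is one table indexed by the category combinations, so `summary.reset_index()` could serve as the training table for a later prediction step.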

Running regressions iteratively for subsets of pySpark dataframes - partitioning by DF columns or mapPartitions?

I have sales data by store and product_category for every week in the following format.
STORE|PRODUCT_CAT|WK_ENDING|<PREDICTOR_VARIABLES>|TARGET_VARIABLE
S1|P1|2016-01-01|..|....
S1|P1|2016-01-08|..|....
S1|P1|2016-01-15|..|....
S1|P2|2016-01-01|..|....
S1|P2|2016-01-08|..|....
S1|P2|2016-01-15|..|....
S2|P1|2016-01-01|..|....
S2|P1|2016-01-08|..|....
S2|P1|2016-01-15|..|....
S2|P2|2016-01-01|..|....
S2|P2|2016-01-08|..|....
S2|P2|2016-01-15|..|....
...
...
As you can see it has multiple records by week for every Store - Product combination.
There could be about 200 different stores and ~50 different product categories, i.e. we would have ~200 x ~50 = ~10,000 different store-product combinations. For every such combination we will have data for about 4-5 years, i.e. about 250 records.
The requirement is that we run a separate regression model for each of the store-product combinations. That means we need to run thousands of regressions, but on very small datasets. What is the way to go about this?
Options tried / thought about -
1. Usual "FOR" loops -
Extracted the unique Store-category combinations and then for each store and for each cat (nested for loop), filtered the data from the above DF and ran the models.
The process runs for about 10-12 stores and then throws memory errors. Note that the above DF is persisted.
I have seen in other similar computations that pySpark cannot handle for loops when they have to reference the same DF from inside the loop.
Following is the code snippet -
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

main_df.persist()  # This is the master dataframe, containing all the above data, persisted

for store in store_lst:
    # <some calculations like filtering the master dataframe by store etc.>
    main_df_by_store = main_df.filter(main_df['store_id'] == str(store))
    for cat in cat_lst:
        assembler = VectorAssembler(inputCols=['peer_pos_sales'], outputCol='features')
        traindata = main_df_by_store.filter(main_df_by_store['rbt_category'] == str(cat))
        output = assembler.transform(traindata)
        modelfit = output.drop('peer_pos_sales').withColumnRenamed('vacant_pos_sales', 'label')
        lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
        lrModel = lr.fit(modelfit)
        result = lrModel.transform(modelfit)
Can we create a Window Function, partitioned by Store, Category and then apply a UDF to run the regressions?
However, it appears that we can only use built-in functions with Window functions, and not UDFs. Is that correct?
How to handle this? Looping is killing the server.
This needs to be done in pySpark only.
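A possible direction, sketched under several assumptions: since each store/category group is tiny (about 250 rows), the fit itself can be an ordinary pandas function that Spark calls once per group. The function below uses plain least squares via `numpy.polyfit` as a stand-in for the elastic-net `LinearRegression` in the snippet above, so the coefficients will differ from the regularised model. The `fit_group` name and the sample data are hypothetical. The block runs on pandas alone; the Spark hookup is shown as a comment because it needs a live session (`groupBy(...).applyInPandas` is available in PySpark 3.0 and later).

```python
import numpy as np
import pandas as pd


def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit vacant_pos_sales ~ slope * peer_pos_sales + intercept for one group."""
    slope, intercept = np.polyfit(pdf["peer_pos_sales"], pdf["vacant_pos_sales"], deg=1)
    return pd.DataFrame(
        {
            "store_id": [pdf["store_id"].iloc[0]],
            "rbt_category": [pdf["rbt_category"].iloc[0]],
            "slope": [float(slope)],
            "intercept": [float(intercept)],
        }
    )


# Tiny made-up sample standing in for main_df.
sample = pd.DataFrame(
    {
        "store_id": ["S1"] * 3 + ["S2"] * 3,
        "rbt_category": ["P1"] * 3 + ["P1"] * 3,
        "peer_pos_sales": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
        "vacant_pos_sales": [2.0, 4.0, 6.0, 4.0, 7.0, 10.0],
    }
)

# Local check of the per-group function with plain pandas.
results = pd.concat(
    [fit_group(g) for _, g in sample.groupby(["store_id", "rbt_category"])],
    ignore_index=True,
)

# On the cluster, the same function could be handed to Spark, one call per group:
# main_df.groupBy("store_id", "rbt_category").applyInPandas(
#     fit_group,
#     schema="store_id string, rbt_category string, slope double, intercept double",
# )
```

This keeps the work inside pySpark while avoiding the driver-side nested loop, since Spark distributes the groups across executors instead of filtering the persisted DF thousands of times.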

Can you graph a time-series with continuously changing entries?

How do I graphically represent a time-series where entries of this graph change over time?
For example I have a database of cities and their corresponding average temperature per day. I want a graphical representation of the ten hottest cities per day, and how they change over time. Do new cities appear on this list? Which cities drop out of this list?
Normally 6/10 of these cities will always be on the “top ten hottest” list, but sometimes a particular entry may spike up and join the top ten list. Is there a way to analyze the top ten list and compare it over time?
I’m having trouble thinking of a way to graph this because of the varying entries.
Your x-axis is day, but what's on the y-axis? Temperature? If so, you can have a different series (may be called something else depending on your charting package) for each city, and just add points to the series when it is one of the top ten for that day. This may require you to do some pre-processing on your data, in order to figure out which set of cities makes the top-ten list over your time frame.
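The pre-processing step could look roughly like the sketch below. The readings, the city names and `TOP_N = 2` are made up to keep the example small (the question would use 10). It builds the top-N list per day, records which cities enter or drop out from one day to the next, and collects one series per city containing only the days on which that city made the list.

```python
from collections import defaultdict

# Hypothetical readings: (day, city, average temperature).
readings = [
    ("2024-07-01", "Phoenix", 41.0),
    ("2024-07-01", "Dubai", 40.0),
    ("2024-07-01", "Cairo", 38.0),
    ("2024-07-02", "Phoenix", 40.0),
    ("2024-07-02", "Dubai", 39.0),
    ("2024-07-02", "Cairo", 42.0),
]
TOP_N = 2

# Group readings by day, then keep the N hottest cities for each day.
by_day = defaultdict(list)
for day, city, temp in readings:
    by_day[day].append((temp, city))

top_by_day = {}
for day in sorted(by_day):
    ranked = sorted(by_day[day], reverse=True)[:TOP_N]
    top_by_day[day] = [city for _, city in ranked]

# Which cities entered or dropped out compared with the previous day?
changes = {}
ordered_days = sorted(top_by_day)
for prev, cur in zip(ordered_days, ordered_days[1:]):
    before, after = set(top_by_day[prev]), set(top_by_day[cur])
    changes[cur] = {
        "entered": sorted(after - before),
        "dropped": sorted(before - after),
    }

# One series per city; a point exists only for days the city was in the top N.
temps = {(day, city): temp for day, city, temp in readings}
series = defaultdict(list)
for day, cities in top_by_day.items():
    for city in cities:
        series[city].append((day, temps[(day, city)]))
```

Each entry of `series` can then be handed to the charting package as its own line, and `changes` answers the "which cities appear or drop out" part directly.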
In one of our widget implementations we have a setting for the time chart to display only top-n series, ranked by avg or last value. This was done to remove clutter by hiding too many unimportant series from the chart.
In your case why not show a bar for each period where the bar would contain top-N series for the period and a grey area for the remainder?

track sales for week/month and find the best sellers

Let's say I have a website that sells widgets. I would like to do something similar to a tag cloud tracking best sellers. However, due to constantly acquiring and selling new widgets, I would like the sales to decay on a weekly time scale.
I'm having problems puzzling out how to store and manipulate this data and have it decay properly over time, so that something that was an ultra hot item 2 months ago but has since tapered off doesn't show on top of the list over the current best sellers. What would be the logic and database design for this?
Part 1: You have to have tables storing the data that you want to report on. Date/time sold is obviously key. If you need to work in decay factors, that raises the question: for how long is the data good and/or relevant? At what point in time has the "value" of the data decayed so much that you no longer care about it? When this point is reached for any given entry in the database, what do you do--keep it there but ensure it gets factored out of all subsequent computations? Or do you archive it--copy it to a "history" table and delete it from your main "sales" table? This is relevant, as it has to be factored into your decay formula (as well as your capacity planning, annual reporting requirements, and who knows what all else.)
Part 2: How much thought has been given to the decay formula that you want to use? There's no end of detail you can work into this. Options and factors to wade through include but are not limited to:
Simple age-based. Everything sold since the cutoff date counts as 1; everything older counts as 0. Sum and you're done.
What's the cutoff date? Precisely 14 days ago, to the minute? Midnight as of two Saturdays ago from (now)?
Does the cutoff date depend on the item that was sold? If some items are hot but some are not, does that affect things? What if you want to emphasize some things (the expensive/hard to sell ones) over others (the fluff you'd sell anyway)?
Simple age-based decays are trivial, but can be insufficient. Time to go nuclear.
Perhaps you want some kind of half-life, Dr. Freeman?
Everything sold is "worth" X, where the value of X is either always the same or varies on the item sold. And the value of X can decay over time.
Perhaps the value of X decreases by one-half every week. Or every day. Or every month. Or (again) it may vary depending on the item.
If you do half-lives, the value of X may never reach zero, and you're stuck tracking it forever (which is why I wrote "part 1" first). At some point, you probably need some kind of cut-off, some point after which you just don't care. X has decreased to one-tenth the initial value? Three months have passed? Either/or but the "range" depends on the inherent value of the item?
My real point here is that how you calculate your decay rate is far more important than how you store it in the database. So long as the data's there that the formula needs to do its calculations, you should be good. And if you only need the last month's data to do this, you should perhaps move everything older to some kind of archive table.
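To make the half-life idea concrete, here is a minimal sketch. The one-week half-life, the 90-day cutoff, the item names and the sales data are all assumptions picked for illustration. Each sale is worth 1 when it happens and loses half its value every `HALF_LIFE_DAYS`; anything older than `CUTOFF_DAYS` is ignored, which is the point where the row could move to an archive table.

```python
from datetime import datetime, timedelta

HALF_LIFE_DAYS = 7.0  # assumed: a sale loses half its weight every week
CUTOFF_DAYS = 90      # assumed: sales older than this no longer count at all


def decayed_score(sale_times, now, half_life_days=HALF_LIFE_DAYS, cutoff_days=CUTOFF_DAYS):
    """Sum of per-sale weights, each halved once per elapsed half-life."""
    score = 0.0
    for sold_at in sale_times:
        age_days = (now - sold_at).total_seconds() / 86400
        if 0 <= age_days <= cutoff_days:
            score += 0.5 ** (age_days / half_life_days)
    return score


now = datetime(2024, 3, 1)
sales = {
    # 50 sales two months ago, nothing since.
    "old_hit": [now - timedelta(days=60)] * 50,
    # 3 sales spread over the last two weeks.
    "current": [now - timedelta(days=d) for d in (0, 7, 14)],
}

# Rank items by decayed score, best seller first.
ranking = sorted(
    sales, key=lambda item: decayed_score(sales[item], now), reverse=True
)
```

With these made-up numbers the three recent sales outrank the fifty old ones, which is the behaviour the question asks for. The only columns the formula needs from the database are the item and the date/time sold.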
You could just count the sales for the last month/week/whatever, and sort your items according to that.
If you want, you can always add the total amount of sold items into your formula.
You might have a table which contains the definitions of the scoring criteria (most sales, most this, most that, etc.), then for a given period, store in another table the attribution of points for each criterion defined in the criteria table. Obviously, a historical table will be used to store the score for each seller for a given period or promotion, call it whatever you want.
Does it help a little?