Pandas UDF Facebook Prophet / multiple parameters - pandas

I'm trying to scale multiple models with Facebook Prophet and Pandas UDF on spark.
Everything works fine but I'd like to refine the models by giving different parameters to the function.
The function is grouped on the ID column of my dataset which is a combination of country and product.
I would like the function to apply country specific holiday to the model, added to a general seasonality dataframe which I use to for example to remove COVID19 impact on the data. Eventually I would like to change any other parameter (e.g. different type of growth) depending on the ID value.
Thank you for your kind help.

The way I think I solved it is by adding another column in the training dataset and then point to the first value of that column for each respective model ID.
So for example if the data has daily data points for the different IDs if the IDs is related to the US country, the new column points to this value for country level seasonality.
day, id, value, country
4/1, US-Item1, 10, US
4/1, IT-Item1, 5, IT
4/1, US-Item2, 15, US

Related

Stata Create panel dataset with two dataframes, no common variable

I am creating a city-by-day panel from scratch, but I'm having trouble balancing and filling in the data. Every city needs to have an observation every day between 01jan2000 and 31dec2019, my variable of interest is a dummy variable recording whether or not an event took place on that day in that city.
My original dataset only recorded observations if event == 1, and I managed to fill in time gaps using tsfill, but I can't figure out how to balance the data or extend it to start on 01jan2000 and 31dec2019. I need every date and city because eventually it will be merged with data that uses that sample period.
My current approach is to create a balanced & filled in panel and then merge the event data using the date it took place. I have a stata df containing the 7,305 dates, and another containing the 273 cityid's I'm observing. Is it possible to generate a new df that combines these two so all 273 cities are observed every day? essentially there will be 273 x 7,304 observations, no variables of interest.
Any help figuring out how to solve the unbalanced issue using either of these approaches is hugely appreciated.

How to populate all possible combination of values in columns, using Spark/normal SQL

I have a scenario, where my original dataset looks like below
Data:
Country,Commodity,Year,Type,Amount
US,Vegetable,2010,Harvested,2.44
US,Vegetable,2010,Yield,15.8
US,Vegetable,2010,Production,6.48
US,Vegetable,2011,Harvested,6
US,Vegetable,2011,Yield,18
US,Vegetable,2011,Production,3
Argentina,Vegetable,2010,Harvested,15.2
Argentina,Vegetable,2010,Yield,40.5
Argentina,Vegetable,2010,Production,2.66
Argentina,Vegetable,2011,Harvested,15.2
Argentina,Vegetable,2011,Yield,40.5
Argentina,Vegetable,2011,Production,2.66
Bhutan,Vegetable,2010,Harvested,7
Bhutan,Vegetable,2010,Yield,35
Bhutan,Vegetable,2010,Production,5
Bhutan,Vegetable,2011,Harvested,2
Bhutan,Vegetable,2011,Yield,6
Bhutan,Vegetable,2011,Production,3
Image of the above csv:
Now there is a very small country lookup table which has all possible countries the source data can come with, listed. PFB:
I want to have the output data's number of columns always fixed (this is to ensure the reporting/visualization tool doesn't get dynamic number columns with every day's new source data ingestions depending on the varying distinct number of countries present).
So, I've to somehow join the source data with the country_lookup csv and populate all those columns with default value as F. Every country column would be binary with T or F being the possible values.
The original dataset from the above has to be converted into below:
Data (I've kept the Amount field unsolved for column Type having Derived Yield as is, rather than calculating them below for a better understanding and for you to match with the formulae):
Country,Commodity,Year,Type,Amount,US,Argentina,Bhutan,India,Nepal,Bangladesh
US,Vegetable,2010,Harvested,2.44,T,F,F,F,F,F
US,Vegetable,2010,Yield,15.8,T,F,F,F,F,F
US,Vegetable,2010,Production,6.48,T,F,F,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+15.2)/(6.48+2.66),T,T,F,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
US,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
US,Vegetable,2011,Harvested,6,T,F,F,F,F,F
US,Vegetable,2011,Yield,18,T,F,F,F,F,F
US,Vegetable,2011,Production,3,T,F,F,F,F,F
US,Vegetable,2011,Derived Yield,(6+10)/(3+9),T,T,F,F,F,F
US,Vegetable,2011,Derived Yield,(6+2)/(3+3),T,F,T,F,F,F
US,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Argentina,Vegetable,2010,Harvested,15.2,F,T,F,F,F,F
Argentina,Vegetable,2010,Yield,40.5,F,T,F,F,F,F
Argentina,Vegetable,2010,Production,2.66,F,T,F,F,F,F
Argentina,Vegetable,2010,Derived Yield,(2.44+15.2)/(6.48+2.66),T,T,F,F,F,F
Argentina,Vegetable,2010,Derived Yield,(15.2+7)/(2.66+5),F,T,T,F,F,F
Argentina,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
Argentina,Vegetable,2011,Harvested,10,F,T,F,F,F,F
Argentina,Vegetable,2011,Yield,90,F,T,F,F,F,F
Argentina,Vegetable,2011,Production,9,F,T,F,F,F,F
Argentina,Vegetable,2011,Derived Yield,(6+10)/(3+9),T,T,F,F,F,F
Argentina,Vegetable,2011,Derived Yield,(10+2)/(9+3),F,T,T,F,F,F
Argentina,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
Bhutan,Vegetable,2010,Harvested,7,F,F,T,F,F,F
Bhutan,Vegetable,2010,Yield,35,F,F,T,F,F,F
Bhutan,Vegetable,2010,Production,5,F,F,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(15.2+7)/(2.66+5),F,T,T,F,F,F
Bhutan,Vegetable,2010,Derived Yield,(2.44+15.2+7)/(6.48+2.66+5),T,T,T,F,F,F
Bhutan,Vegetable,2011,Harvested,2,F,F,T,F,F,F
Bhutan,Vegetable,2011,Yield,6,F,F,T,F,F,F
Bhutan,Vegetable,2011,Production,3,F,F,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(2.44+7)/(6.48+5),T,F,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(10+2)/(9+3),F,T,T,F,F,F
Bhutan,Vegetable,2011,Derived Yield,(6+10+2)/(3+9+3),T,T,T,F,F,F
The image of the above expected output data for a structured look at it:
Part 1 -
Part 2 -
Formulae for populating Amount Field for Derived Type:
Derived Amount = Sum of Harvested of all countries with T (True) grouped by Year and Commodity columns divided by Sum of Production of all countries with T (True)grouped by Year and Commodity columns.
So, the target is to have a combination of all the countries from source and calculate the sum of respective Harvested and Production values which then has to be divided. The commodity can be more than one in the actual scenario for any given country, but that should not bother as the summation of amount happens on grouped commodity and year.
Note: The users in the frontend can select any combination of countries. The sole purpose of doing it in the backend rather than dynamically doing it in the frontend is because AWS QuickSight (our visualisation tool), even though can populate sum on selected column filters but doesn't yet support calculation on those derived summed fields. Hence, the entire calculation of all combination of countries has to be pre-populated (very naive approach) in order to make it available in report on dynamic users selection of countries.
Also if you've any better approach (than the above naive approach mentioned in note) to solve this problem, you are most welcome to guide me. I've also posted a question on the same problem without writing my expected approach for experts to show me the path on how we can solve this kind of a problem better than this naive approach. If you want to help solve it with some other technique, you're most welcome, here is the link to that question.
Any help shall be greatly acknowledged.

Extracting multiple values from the same column, X, given their shared value in column Y

Before everything, I would like to say thanks for your help! I am currently working on data provided by UK rail; the data I have is currently contained in a relational database on PostgreSQL. My ultimate goal is to extract origin-destination pairs that are served by the railway network.
Right now, entries are stored in rows that each contain one stop/location in a train's journey. For instance, if a train travels from D to G via E and F, then D will be contained in a row and marked as an origin_location, E and F as an intermediate_location and G as a terminating_location. Each of them will also be marked with additional details such as train ID, time of the departure/passage/arrival and date of travel.
Eventually, from these entries I hope to write a query that gives me all possible origin-destination pairs. So for instance, continuing the example above, I would like the query to produce a table that gives me rows as such: D-E; D-F; D-G; E-F; E-G; F-G. The linkages between these locations are based on them sharing a common train identifier that is in another column, "Column Y".
I am at the very start of my research, but I don't expect you to help me find a specific code as a solution. Rather, I just hope you could point me in the direction as to which functions I could read up into using, and we'll see where it goes from there.
Someone has tried to work with the same dataset, albeit building this function with a slightly different purpose - that is, to get the full timetable of all trains running on a particular day (with each train departure/intermediate/terminating location as a row). The function is detailed here: https://github.com/jhumphry/ukraildata_etl/blob/master/sql/mca_get_full_timetable.sql
demo:db<>fiddle
You need a self-join on an ordered stop id (hope you have such a column). You have to join every journey with itself but only with the following stops:
SELECT
s1.journey_id,
s1.stop_name,
s2.stop_name
FROM
schedule s1
JOIN schedule s2 ON s1.journey_id = s2.journey_id AND s1.stop_id < s2.stop_id

Qlikview: Total of calculated metric based on calculated dimension

I started working on Qlikview a week back and I am working on this dashboard.
I have a particular requirement which I am not able to achieve:
So, I have a calculated dimension "Categories" added in my script which based on certain conditions tags each name as SLEEPERS,STARS,WEAKLINKS etc.
Now, I have flagged the names based on certain condition which works fine.
The issue is, I want the sum of those flags on the level of calculated dimension CATEGORIES(SLEEPERS, STARS..etc) and my month field.
I am not able to achieve it because, the flag itself is a calculated field so sum of calculated field doesn't work. I tried using aggr, function but it returns zero for all rows. I am not sure why. in the aggr function I use the sum(Aggr(Flag,MONTH,Categories))
Can someone suggest a work around for this? I have attached the screenshot of the report for better understanding of the requirement

Two Dimensional Diagram with aggr function

I'm having a very curious Problem in QlikView.
I have a number of readouts from a Database which show certain amounts of time in a different state.
In that table there are 49 variables that describe the state, there are 7 levels of i.e the SOC and seven states of the Temperature.
i.e the one of the fields could be named: SOC1_T1 or SOC2_T1 and so on...
So what i get is a table full of readouts in which every i have an specific id for the object, the state of the variables and an age. There are multiple entries per Object.
What i want to do is to plot a two dimensional diagram over all the states so i get SOC over Temperatur Histogram(Average of the maximum (or newest) value of every object).
I tried creating to Dynamic (or syntethic) Dimensions (ValueLoop(1,7) and ValueLoop(1,8).
In the formulas i reffered to them with
=If(ValueLoop(1,7) = 1 and ValueLoop(1,8) = 1,
(avg(aggr(FirstSortedValue (SOC1_T1
, -age), id)) * 100))
and created 49 Formulas with each state variable output.
Problem now is:
It only shows the first entry. I can replace the whole expression in the if condition with a specific number (100) and get a result. I also plotted the inner expression into a Listbox and checked wheter the result is not null.
As soon as I delete the aggr function and just take the AVG over everything (which is not what i want). Everything works fine. When i turn back to aggr, only the first one is shown.
Doesnt help by the way when i delete one of the dimensions, this doesnt work one dimensional either.
Any ideas or workarounds?
Greetings
Julian