Forecasting with multiple features using ARIMA - data-science

I have a problem where I want to forecast one value, but I have multiple time series features, and the target is one of those features. I have tried ARIMA with a single time series. Is there a way to forecast with multiple input time series features and a single time series as the forecasted output?

You can add exogenous features in ARIMA. That's what the "X" means in "SARIMAX". I don't know what programming language you are using, but you can do that in both Python/statsmodels and R. See https://otexts.com/fpp3/regarima.html
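For example, in Python/statsmodels this might look roughly like the sketch below (the column names, the (1, 1, 1) order and the 12-step horizon are placeholders, not recommendations):

# Minimal SARIMAX sketch, assuming a pandas DataFrame `df` whose column "y" is the
# target and whose columns "x1" and "x2" are the extra (exogenous) time series.
from statsmodels.tsa.statespace.sarimax import SARIMAX

train, test = df.iloc[:-12], df.iloc[-12:]      # hold out the last 12 observations

model = SARIMAX(
    train["y"],
    exog=train[["x1", "x2"]],    # the other time series enter as exogenous regressors
    order=(1, 1, 1),             # (p, d, q), picked arbitrarily here for illustration
)
fit = model.fit(disp=False)

# forecasting requires values of the exogenous series over the forecast horizon
forecast = fit.forecast(steps=12, exog=test[["x1", "x2"]])

Note that to forecast genuinely unseen periods you also need (or have to forecast) the future values of the exogenous features themselves.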

Related

Insert zeros instead of interpolating, ARIMA_PLUS BigQuery

I want to do ARIMA_PLUS forecasting on a series of sale records. The problem is that sale records only contain sales. When doing the forecast we need to insert, for every product, the "non-sales", which are essentially rows with the import column set to zero for every day the product has not been sold. We have two options here:
Fill the database with those zero-rows (uses a lot of space)
When doing the forecasting with ARIMA_PLUS in BigQuery, tell the model to fill with zeros instead of interpolating (interpolation being the default and seemingly only option).
I want to follow the second option, yet I don't see how. Here you can see a screenshot of the documentation (Google's info about interpolation).
The first option could be carried out with a merge; nevertheless, I would prefer to discard it since it increases the size of the sales table.
I have scanned the documentation and haven't seen any solution.
You need to provide an input dataset covering the missing values with the right method for your use case.
In other words, the SQL query must solve the interpolation so that the input for the model already contains the expected data.
You can, for example, create a query that adds a linear interpolation for your use case.
So, the first approach you mentioned can be solved using that input SQL (rather than adding the data to the source table), and the second approach is not valid in BigQuery, as far as I know.
Here you have an example: https://justrocketscience.com/post/interpolation_sql/
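If it helps, here is a rough Python sketch of that idea using the google-cloud-bigquery client (the table, column names and date range are hypothetical; "amount" stands in for the question's "import" column):

# Build a full day-by-product grid and fill missing days with zero before the
# data reaches ARIMA_PLUS. Assumes a table mydataset.sales(sale_date, product_id, amount).
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  d AS sale_date,
  p.product_id,
  IFNULL(s.amount, 0) AS amount
FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2020-01-01', DATE '2022-12-31')) AS d
CROSS JOIN (SELECT DISTINCT product_id FROM mydataset.sales) AS p
LEFT JOIN mydataset.sales AS s
  ON s.sale_date = d AND s.product_id = p.product_id
"""

df = client.query(query).to_dataframe()

The same SELECT can also serve directly as the training input of a CREATE MODEL ... OPTIONS (model_type = 'ARIMA_PLUS', ...) statement, so the zero rows never have to be stored in the source table.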

Stata: Create panel dataset from two dataframes with no common variable

I am creating a city-by-day panel from scratch, but I'm having trouble balancing and filling in the data. Every city needs to have an observation every day between 01jan2000 and 31dec2019; my variable of interest is a dummy variable recording whether or not an event took place on that day in that city.
My original dataset only recorded observations if event == 1, and I managed to fill in time gaps using tsfill, but I can't figure out how to balance the data or extend it to cover 01jan2000 through 31dec2019. I need every date and city because eventually it will be merged with data that uses that sample period.
My current approach is to create a balanced and filled-in panel and then merge the event data using the date it took place. I have a Stata dataframe containing the 7,305 dates, and another containing the 273 cityids I'm observing. Is it possible to generate a new dataframe that combines these two so all 273 cities are observed every day? Essentially there would be 273 x 7,305 observations, with no variables of interest.
Any help figuring out how to solve the unbalanced issue using either of these approaches is hugely appreciated.
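For what it's worth, the target shape is just the cross product of the two dataframes. A quick pandas illustration of that shape (the question itself is about Stata; the city ids here are made up):

import pandas as pd

dates = pd.DataFrame({"date": pd.date_range("2000-01-01", "2019-12-31", freq="D")})  # 7,305 days
cities = pd.DataFrame({"cityid": range(1, 274)})                                     # 273 hypothetical ids

panel = cities.merge(dates, how="cross")       # every city paired with every date
# panel.shape == (273 * 7305, 2)

# the event data can then be left-merged on (cityid, date) and missing events set to 0:
# panel = panel.merge(events, on=["cityid", "date"], how="left").fillna({"event": 0})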

Commodity Term Structure Data from Bloomberg

I am looking for a way to download daily WTI crude oil (CL) term structure data from Bloomberg in the Excel Add-in.
My goal is to create a table that looks like this: for each futures contract, I want the last price and the days to maturity.
Date    | CL1                       | CL2                       | ...
1.1.12  | PX_Last, FUT_ACT_DAYS_EXP | PX_Last, FUT_ACT_DAYS_EXP | ...
I have tried the FUT_ACT_DAYS_EXP field, but it is not available for historical price series. Is there maybe a different way to get the term structure data with days to maturity from Bloomberg?
Many thanks!
The generic contract CL1 Comdty has a historical field, FUT_CUR_GEN_TICKER. So you can pull back a time series of PX_LAST and FUT_CUR_GEN_TICKER.
Then you can feed these underlying contract tickers into a BDP call for LAST_TRADEABLE_DT and subtract the PX_LAST date. You can hide the intermediate columns if they are not needed.
And the final result, with the intermediate column D hidden:
NB. I'm using array functions here (note the # symbols in the formulae), rather than hard-coding ranges. That makes it more flexible if you want to change the history range. The multiple BDP calls are unnecessary, but the Bloomberg add-in may be caching them in any case. If performance is an issue, you can use the UNIQUE() function to get a list of the underlying contract names into a lookup table.
It's been a few years since I looked at this, but you are using generic tickers, whereas FUT_ACT_DAYS_EXP would most likely expect an actual contract, e.g. CLU1. There is a field that you can use to convert generic to actual as a time series, i.e. with BDH, but you would have to check that on FLDS. Once you have that, you can use BDH to pull in price and days to expiry. Bear in mind that with one BDH per date this would be a very inefficient approach with regard to data limits.
Alternatively, ask HELP HELP and they should give you a solid answer. Unfortunately I don't have access to the terminal anymore, but I used to be very involved in building such analytics.
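If you end up doing this outside Excel, a rough Python sketch of the same generic-to-underlying workflow with the xbbg wrapper might look like the following (untested here; it assumes a working terminal connection, and that FUT_CUR_GEN_TICKER returns the underlying ticker without the "Comdty" suffix):

from xbbg import blp

# 1. history of the generic: last price plus the underlying contract behind CL1
hist = blp.bdh(
    tickers="CL1 Comdty",
    flds=["PX_LAST", "FUT_CUR_GEN_TICKER"],
    start_date="2012-01-01",
    end_date="2012-12-31",
)

# 2. expiry date for each distinct underlying contract seen in the history
contracts = hist[("CL1 Comdty", "FUT_CUR_GEN_TICKER")].dropna().unique()
expiries = blp.bdp(tickers=[c + " Comdty" for c in contracts], flds="LAST_TRADEABLE_DT")

# 3. days to maturity = LAST_TRADEABLE_DT minus the observation date, obtained by
#    mapping each row's underlying ticker to its expiry and subtracting the index date.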

Need to divide a DataFrame into various tables using multiple categories and datetime

This is my first time asking a question here, so if I'm doing something wrong please guide me to the right place. I have a big and clean dataset (29,000+ rows, 24 columns). The thing is that I have to calculate the churn rate based on 4 different categorical columns, and I'm given just 1 column that contains the subs for a given period. I have a date column too. My idea for calculating the churn is to do
churn_rate = (Sub_start_period - Sub_end_period) / Sub_start_period * 100
The Problem
I don't know how to group the data using these 4 different categorical variables. Also, if I manage to do so, I would end up with more than 200 different tables, so I don't believe this would be a good approach.
My goal is to be able to predict the churn rate using the information in the table, but I should also be able to determine the churn rate based on these variables. The churn is not given; it has to be calculated, so I'm having problems here, as I can't think of a way of working through this.
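One way to avoid ending up with 200+ separate tables is to group by all four categorical columns at once (optionally together with a time period) and compute the churn per group in a single table. A minimal sketch, with made-up column names ("date", "subs", "cat1".."cat4"):

import pandas as pd

cats = ["cat1", "cat2", "cat3", "cat4"]          # the four categorical columns

def churn_rate(group):
    g = group.sort_values("date")
    start, end = g["subs"].iloc[0], g["subs"].iloc[-1]
    return (start - end) / start * 100

# one churn figure per category combination and per month, all in one frame
churn = (
    df.groupby(cats + [pd.Grouper(key="date", freq="M")])
      .apply(churn_rate)
      .rename("churn_rate")
      .reset_index()
)

The resulting churn_rate column can then serve as the target for whatever model is used to predict churn from the other features.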

Multi-dimensional dataframe or multiple 2D dataframes

A colleague wrote some code to create a price lookup table for products where the prices change throughout the year. He also stores other information like the name of the season, when it starts, ends, etc. His code takes nine minutes to run on a beefy machine.
His approach is the traditional SQL loop-over-records algorithm. I wanted to see if I could do better using matrices, so I wrote a price table (of only prices) using pandas. My code runs in 21 seconds on a MacBook Air. Cool.
My next step is to add in other attributes like name of the season, when it starts, ends, etc. It's my understanding that I shouldn't store objects in my dataframes because that will reduce speed, is Bad Practice, etc.
I think I have two options:
1. For each new piece of data, add another dimension, so the shape of my dataframe would go from (product x days) to (product x days x season_name x season_start x season_end), or
2. Create a new dataframe for each attribute and jump back and forth between them as necessary.
My goal is to use pandas for very quick lookups and calculations of data.
Or is there a better, more pandas-ish way to do this?
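A middle ground that usually keeps pandas fast is to leave the hot numeric table alone and put the season attributes in a second, small lookup frame keyed by date, rather than adding dimensions. A sketch with made-up data and names:

import numpy as np
import pandas as pd

days = pd.date_range("2024-01-01", "2024-12-31", freq="D")
products = ["A", "B", "C"]

# numeric-only (product x day) matrix used for the fast lookups and calculations
prices = pd.DataFrame(np.random.rand(len(products), len(days)), index=products, columns=days)

# small metadata frame, one row per day, carrying the season attributes
# (more columns such as season_start / season_end can be added the same way)
seasons = pd.DataFrame(
    {"season_name": np.where(days.month.isin([6, 7, 8]), "summer", "off-peak")},
    index=days,
)

day = pd.Timestamp("2024-07-15")
price = prices.loc["A", day]                 # stays a pure numeric lookup
season = seasons.loc[day, "season_name"]     # metadata fetched only when needed

Because both frames share the date key, joining them when needed is a cheap merge, and the price matrix never has to hold strings or other objects.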