Stata: create a panel dataset from two dataframes with no common variable

I am creating a city-by-day panel from scratch, but I'm having trouble balancing and filling in the data. Every city needs to have an observation every day between 01jan2000 and 31dec2019. My variable of interest is a dummy variable recording whether or not an event took place on that day in that city.
My original dataset only recorded observations if event == 1. I managed to fill in time gaps using tsfill, but I can't figure out how to balance the data or extend it to start on 01jan2000 and end on 31dec2019. I need every date and city because eventually it will be merged with data that uses that sample period.
My current approach is to create a balanced and filled-in panel and then merge the event data using the date it took place. I have a Stata df containing the 7,305 dates, and another containing the 273 cityids I'm observing. Is it possible to generate a new df that combines these two so all 273 cities are observed every day? Essentially there will be 273 x 7,305 = 1,994,265 observations and no variables of interest.
Any help figuring out how to solve the unbalanced issue using either of these approaches is hugely appreciated.
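For illustration, the "combine every city with every date" step is a cross join. Within Stata, the `cross` command is documented as forming every pairwise combination of two datasets, so it may be the direct route. The sketch below shows the same idea in pandas rather than Stata; the placeholder city ids and the three event rows are made up, and only the date range and the counts come from the question.

```python
import pandas as pd

# Stand-ins for the two existing datasets (city ids are placeholders).
dates = pd.DataFrame({"date": pd.date_range("2000-01-01", "2019-12-31", freq="D")})
cities = pd.DataFrame({"cityid": range(1, 274)})  # 273 cities

# Cross join: pair every city with every date (needs pandas >= 1.2).
panel = cities.merge(dates, how="cross")

# Hypothetical event records: only days with event == 1 were recorded.
events = pd.DataFrame({
    "cityid": [1, 1, 200],
    "date": pd.to_datetime(["2000-01-01", "2005-06-15", "2019-12-31"]),
    "event": 1,
})

# Left-merge the events onto the balanced panel; unmatched days become 0.
panel = panel.merge(events, on=["cityid", "date"], how="left")
panel["event"] = panel["event"].fillna(0).astype(int)
```

Building the balanced skeleton first and merging the events afterwards means the sample period does not depend on when the first or last event happened.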

Related

How to resample and extend datetime data in this pandas DataFrame?

I have a DataFrame like the one in the photo (https://i.stack.imgur.com/5u3WR.png), and I would like to have, for each grid point, the same time series (repeated over and over again), namely:
t_index_np = np.arange('2013-01-01', '2022-12-31 23:00:00', dtype='datetime64[h]')
The frequency is hourly.
You have to take into account that for many grid points there is only one associated date.
What I have tried so far is a for loop with resample and pd.merge, but that doesn't work for the points that have only one date. The Total Power column must be forward-filled.
Thanks in advance!
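One possible way to handle grid points with a single date is to reindex each grid point onto the full hourly index instead of resampling it. This is a minimal sketch, not a tested answer for the real data: the column names `grid_point` and `time` and the three sample rows are assumptions, and the index here includes the final 23:00 hour (the `np.arange` call above excludes its stop value).

```python
import pandas as pd

# Full hourly index for the target period.
t_index = pd.date_range("2013-01-01", "2022-12-31 23:00:00", freq="h", name="time")

# Hypothetical sample: grid point "A" has two readings, "B" has only one.
df = pd.DataFrame({
    "grid_point": ["A", "A", "B"],
    "time": [t_index[0], t_index[48], t_index[24]],
    "Total Power": [1.0, 2.0, 5.0],
})

# Reindex each grid point onto the full index, then forward-fill.
# Hours before a grid point's first reading stay NaN.
full = pd.concat(
    {
        gp: g.set_index("time")["Total Power"].reindex(t_index).ffill()
        for gp, g in df.groupby("grid_point")
    },
    names=["grid_point", "time"],
).reset_index()
```

Because `reindex` does not need a frequency to be inferred from the existing rows, a group with one observation is treated the same way as any other.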

Need to divide a Dataframe in various tables using multiple categories and date time

This is my first time asking a question here, so if I'm doing something wrong please guide me to the right place. I have a big and clean dataset with shape (29000+, 24). I have to calculate the churn rate based on 4 different categorical columns, and I'm given just 1 column that contains the subs for a given period. I have a date column too. My idea for calculating the churn is
churn_rate= (Sub_start_period-Sub_end_period)/Sub_start_period*100
The Problem
I don't know how to group the data using these 4 different categorical variables. Also, if I manage to do so I would end up with more than 200 different tables, so I don't believe this would be a good approach.
My goal is to be able to predict the churn rate using the information in the table, but first I need to determine the churn rate based on these variables. The churn is not given; it has to be calculated, and I can't think of a way of working through this.
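As a rough sketch of the grouping step, a single `groupby` over the categorical columns yields one result table with a row per combination, rather than hundreds of separate tables. The column names (`region`, `plan`, `date`, `subs`) and the data are hypothetical, only two categorical columns are shown instead of four, and taking the first and last `subs` value per group as the start and end of the period is an assumption about how the periods are defined.

```python
import pandas as pd

# Hypothetical data: subscriber counts per category combination and date.
df = pd.DataFrame({
    "region": ["N", "N", "N", "S", "S", "S"],
    "plan": ["basic"] * 3 + ["pro"] * 3,
    "date": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01"] * 2),
    "subs": [100, 90, 80, 200, 210, 190],
})

group_cols = ["region", "plan"]  # extend to all four categorical columns

# Sort by date so "first" and "last" mean start and end of the period.
df = df.sort_values(group_cols + ["date"])
churn = (
    df.groupby(group_cols)["subs"]
      .agg(sub_start="first", sub_end="last")
      .reset_index()
)

# churn_rate = (Sub_start_period - Sub_end_period) / Sub_start_period * 100
churn["churn_rate"] = (churn["sub_start"] - churn["sub_end"]) / churn["sub_start"] * 100
```

The resulting `churn` frame keeps the categorical columns alongside the computed rate, so it can be used directly as a target table for a later prediction step.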

Multi-dimensional dataframe or multiple 2D dataframes

A colleague wrote some code to create a price lookup table for products where the prices change throughout the year. He also stores other information like the name of the season, when it starts, ends, etc. His code takes nine minutes to run on a beefy machine.
His approach is the traditional SQL loop-over-records algorithm. I wanted to see if I could do better using matrices, so I wrote a price table (of only prices) using Pandas. My code runs in 21 seconds on a Macbook Air. Cool.
My next step is to add in other attributes like name of the season, when it starts, ends, etc. It's my understanding that I shouldn't store objects in my dataframes because that will reduce speed, is Bad Practice, etc.
I think I have two options: 1. for each new piece of data add another dimension so the shape of my dataframe would go from (product X days) to (product X days X season_name X season_start X season_end) or 2. I would just create a new dataframe for each attribute and jump back and forth between them as necessary.
My goal is to use pandas for very quick lookups and calculations of data.
Or is there a better more pandas-ish way to do this?

Qlikview line chart with multiple expressions over time period dimension

I am new to Qlikview and after several failed attempts I have to ask for some guidance regarding charts in Qlikview. I want to create a line chart which will have:
One dimension – time period of one month broken down by days in it
One expression – Number of created tasks per day
Second expression – Number of closed tasks per day
Third expression – Number of open tasks per day
This is a very basic example and I couldn’t find a solution for it, and to be honest I think I don’t understand how I should set up my time period dimension and expressions. Each time I try to introduce more than one expression things go south. Maybe it's because I have multiple dates or my dimension is wrong.
Here is my simple data:
http://pastebin.com/Lv0CFQPm
I have been reading about helper tables like Master Calendar or “Date Island” but I couldn’t grasp it. I have tried to follow the guide from here: https://community.qlik.com/docs/DOC-8642 but that only worked for one date (for me at least).
How should I set up the dimension and expressions on my chart, so I can count the ID field if Created Date matches one from the dimension and Status is appropriate?
I have the personal edition so I am unable to open qvw files from other authors.
Thank you in advance, kind regards!
My solution to this would be to change from a single line per Call with associated dates to a concatenated list of Call Events with a single date each. i.e. each Call will have a creation event and a resolution event. This is how I achieve that. (I turned your data into a spreadsheet but the concept is the same for any data source.)
Calls:
LOAD Type,
Id,
Priority,
'New' as Status,
date(floor(Created)) as [Date],
time(Created) as [Time]
FROM
[Calls.xlsx]
(ooxml, embedded labels, table is Sheet1) where Created>0;
LOAD Type,
Id,
Priority,
Status,
date(floor(Resolved)) as [Date],
time(Resolved) as [Time]
FROM
[Calls.xlsx]
(ooxml, embedded labels, table is Sheet1) where Resolved>0;
There are three key concepts here. The first is allowing QlikView's auto-concatenate to do its job by making the field names of both load statements exactly the same, including capitalisation. The second is splitting the timestamp into a Date and a Time. This allows you to have a dimension of Date only and group the events for the day. (In big data sets the resource saving is also significant.) The third is creating the dummy 'New' status for each call on the day of its creation date.
With just this data and these expressions
Created = count(if(Status='New',Id))
Resolved = count(if(Status='Resolved',Id))
and then
Created-Resolved
for Open, with full accumulation ticked (to give you a running total rather than a daily total, which might go negative and look odd), you could draw this graph.
For extra completeness you could add this to the code section to fill up your dates and create the Master Calendar you spoke of. There are many other ways of achieving this.
MINMAX:
load floor(num(min([Date]))) as MINTRANS,
floor(num(max([Date]))) as MAXTRANS
Resident Calls;
let zDateMin=FieldValue('MINTRANS',1);
let zDateMax=FieldValue('MAXTRANS',1);
//complete calendar
Dates:
LOAD
Date($(zDateMin) + IterNo() - 1, '$(DateFormat)') as [Date]
AUTOGENERATE 1
WHILE $(zDateMin)+IterNo()-1<= $(zDateMax);
Then you could draw this chart. Don't forget to turn off Suppress Zero Values on the Presentation tab.
But my suggestion would be to use a combo chart rather than a line chart, so that the calls per day are shown as discrete buckets (bars) and the running total of Open calls is a line.
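The event-list idea above is not specific to QlikView. As a rough cross-check of the logic only, here is the same reshaping sketched in pandas: one row per call becomes one row per event, events are counted per day, missing days are filled in, and the Open line is the running total of Created minus Resolved. The field names follow the load script; the three sample calls are made up.

```python
import pandas as pd

# Hypothetical calls: one row per call, call 2 is still unresolved.
calls = pd.DataFrame({
    "Id": [1, 2, 3],
    "Created": pd.to_datetime(["2016-01-01 09:00", "2016-01-01 14:30", "2016-01-04 10:00"]),
    "Resolved": pd.to_datetime(["2016-01-02 11:00", None, "2016-01-04 16:00"]),
})

# One event row per creation ('New') and per resolution ('Resolved').
created = calls[["Id", "Created"]].rename(columns={"Created": "ts"}).assign(Status="New")
resolved = (
    calls[["Id", "Resolved"]]
    .rename(columns={"Resolved": "ts"})
    .dropna(subset=["ts"])
    .assign(Status="Resolved")
)
events = pd.concat([created, resolved], ignore_index=True)
events["Date"] = events["ts"].dt.normalize()  # drop the time part

# Count events per day and status, then fill calendar gaps with zeros.
daily = (
    events.groupby(["Date", "Status"])["Id"].count()
    .unstack(fill_value=0)
    .reindex(columns=["New", "Resolved"], fill_value=0)
)
calendar = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily = daily.reindex(calendar, fill_value=0)

# Running total of open calls (the "full accumulation" line).
daily["Open"] = (daily["New"] - daily["Resolved"]).cumsum()
```

With this sample, 2016-01-03 has no events at all, which is the case the calendar fill is there to cover.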

Two Dimensional Diagram with aggr function

I'm having a very curious problem in QlikView.
I have a number of readouts from a database which show the amounts of time spent in different states.
In that table there are 49 variables that describe the state: there are seven levels of the SOC and seven states of the temperature.
For example, one of the fields could be named SOC1_T1 or SOC2_T1 and so on...
So what I get is a table full of readouts in which every entry has a specific id for the object, the state variables and an age. There are multiple entries per object.
What I want to do is plot a two-dimensional diagram over all the states, so I get an SOC-over-temperature histogram (average of the maximum (or newest) value of every object).
I tried creating two dynamic (or synthetic) dimensions (ValueLoop(1,7) and ValueLoop(1,8)).
In the formulas I referred to them with
=If(ValueLoop(1,7) = 1 and ValueLoop(1,8) = 1,
(avg(aggr(FirstSortedValue (SOC1_T1
, -age), id)) * 100))
and created 49 formulas, one for each state variable.
The problem now is:
It only shows the first entry. I can replace the whole expression in the if condition with a specific number (100) and get a result. I also plotted the inner expression in a listbox and checked whether the result is not null.
As soon as I delete the aggr function and just take the AVG over everything (which is not what I want), everything works fine. When I switch back to aggr, only the first one is shown.
By the way, it doesn't help when I delete one of the dimensions; it doesn't work one-dimensionally either.
Any ideas or workarounds?
Greetings
Julian