Quicksight Calculated field: sum of average? - data-visualization

The dataset I have is currently like so:
country
itemid
device
num_purchases
total_views_per_country_and_day
day
USA
ABC
iPhone11
2
900
2022-06-15
USA
ABC
iPhoneX
5
900
2022-06-15
USA
DEF
iPhoneX
8
900
2022-06-15
UK
ABC
iPhone11
10
350
2022-06-15
UK
DEF
iPhone11
20
350
2022-06-15
total_views_per_country_and_day is already pre-calculated to be the sum grouped by country and day. That is why for each country-day pair, the number is the same.
I have a Quicksight analysis with a filter for day.
The first thing I want is to have a table on my dashboard that shows the number of total views for each country.
However, if I were to do it with the dataset just like that, the table would sum everything:
country
total_views
USA
900+900+900=2700
UK
350+350=700
So what I did was, create a calculated field which is the average of total_views. Which worked---but only if my day filter on dashboard was for ONE day.
When filtered for day = 2022-06-15: correct
country
avg(total_views)
USA
2700/3=900
UK
700/2=350
But let's say we have data from 2022-06-16 as well, the averaging method doesn't work, because it will average based on the entire dataset. So, example dataset with two days:
country
itemid
device
num_purchases
total_views_per_country_and_day
day
USA
ABC
iPhone11
2
900
2022-06-15
USA
ABC
iPhoneX
5
900
2022-06-15
USA
DEF
iPhoneX
8
900
2022-06-15
UK
ABC
iPhone11
10
350
2022-06-15
UK
DEF
iPhone11
20
350
2022-06-15
USA
ABC
iPhone11
2
1000
2022-06-16
USA
ABC
iPhoneX
5
1000
2022-06-16
UK
ABC
iPhone11
10
500
2022-06-16
UK
DEF
iPhone11
20
500
2022-06-16
Desired Table Visualization:
country
total_views
USA
900 + 1000 = 1900
UK
350 + 500 = 850
USA calculation: (900 * 3)/3 + (1000 * 2) /2 = 900 + 1000
UK calculation: (350 * 2) /2 + (500 * 2) /2 = 350 + 500
Basically---a sum of averages.
However, instead it is calculated like:
country
avg(total_views)
USA
[(900 * 3) + (1000*2)] / 5 = 940
UK
[(350 * 2) + (500 * 2)] / 4 = 425
I want to be able to use this calculation later on as well to calculate num_purchases / total_views. So ideally I would want it to be a calculated field. Is there a formula that can do this?
I also tried, instead of calculated field, just aggregating total_views by average instead of sum in the analysis -- exact same issue, but I could actually keep a running total if I include day in the table visualization. E.G.
country
day
running total of avg(total_views)
USA
2022-06-15
900
USA
2022-06-16
900+1000=1900
UK
2022-06-15
350
UK
2022-06-16
350+500=850
So you can see that the total (2nd and 4th row) is my desired value. However this is not exactly what I want.. I don't want to have to add the day into the table to get it right.
I've tried avgOver with day as a partition, that also requires you to have day in the table visualization.

sum({total_views_per_country_and_day}) / distinct_count( {day})
Basically your average is calculated as sum of metric divided by number of unique days. The above should help.

Related

How to find the right budget value into the activities in Tableau (Relationship)

I have related two custom sql query in Tableau (via making relationship)
The outcome of the queries looks like :
Q1 : (It shows the starting time of the valid budget.If a user has multiple rows in this table, it means his/her budget has been updated with new amount)
id_user
budgete_start_date
budget_amount
1234
06-11-2021
120
1234
06-07-2022
200
56789
06-01-2022
1200
56789
06-07-2022
2000
643
05-05-2022
30
Q2 :(It shows the budget usage)
id_user
budgete_usage_date
amount_usage
1234
01-12-2021
50
1234
05-08-2022
100
56789
10-02-2022
60
56789
07-08-2022
500
643
05-07-2022
17
I need to find a way to create the following view to know what was the valid budget at
budgete_usage_date.
id_user
budgete_usage_date
amount_usage
valid budget
1234
01-12-2021
50
120
1234
05-08-2022
100
200
56789
10-02-2022
60
1200
56789
07-08-2022
500
2000
643
05-07-2022
17
30
How can I do that with calculated field in Tableau (with db made by relationship)?
If that's not possible, how can I do that directly in query? (changing the relationship to a single query)

How to groupby external time series data together

I have been trying to groupby some state data together. This is how my data looks like for example, with Date as the index and the rest is features:
Date
Population
Num_Men
Num_Women
State
Region
2020-01-01
500
300
200
NY
North
2020-02-01
800
500
300
GL
Middle
2020-02-01
1000
400
600
""
Middle
2020-02-01
200
50
150
nan
Middle
2020-02-01
600
400
200
NY
North
I know how to group the NY states ones, but if I want to group the ones with state values: GL, "", and nan together. I'm not sure how to do that.
I was looking for the final result to look like:
Date
Population
Num_Men
Num_Women
State
Region
2020-01-01
500
300
200
NY
North
2020-02-01
2000
950
1050
GL
Middle
2020-02-01
600
400
200
NY
North
I did something like this: df.groupby(df.index, {'State': ["GL", "", np.nan]}, but that didn't work. Any help would be appreciated! Thanks!
Let us do replace then groupby with sum and first
df.State = df.State.replace({"''":np.nan,'nan':np.nan})
out = df.groupby(['Region','Date'],as_index=False).\
agg({'Population':'sum',
'Num_Men':'sum',
'Num_Women':'sum',
'State':'first'})
Out[99]:
Region Date Population Num_Men Num_Women State
0 Middle 2020-02-01 2000 950 1050 GL
1 North 2020-01-01 500 300 200 NY
2 North 2020-02-01 600 400 200 NY

How to assign equal revenue weight to every location of a company in a table? Google Big Query

I am working on a problem where I have the following table:
+----------+ | +------+ | +------------+
company_id | country | total revenue
1 Russia 1200
2 Croatia 1200
2 Italy 1200
3 USA 1200
3 UK 1200
3 Italy 1200
There are 3 companies in this table, but company '2' and company '3' have offices in 2 and 3 countries respectively. All companies pay 1200 per month, and because company 2 has 2 offices it shows as if they paid 1200 per month 2 times, and because company 3 has 3 offices it shows as if it paid 1200 per month 3 times. Instead, I would like revenue to be equally distributed based on how many times company_id appears in the table. company_id will only appear more than once for every additional country in which a company is based.
Assuming each company always pays 1,200 per month, my desired output is:
+----------+ | +------+ | +------------+
company_id | country | total revenue
1 Russia 1200
2 Croatia 600
2 Italy 600
3 USA 400
3 UK 400
3 Italy 400
Being new to SQL, I was thinking this can maybe be done through CASE WHEN statement, but I only learned to use CASE WHEN when I want to output a string depending on a condition. Here, I am trying to assign equal revenue weight to each company's country, depending on in how many countries a company is based in.
Thank you in advance for you help!
Below is for BigQuery Standard SQL
#standardSQL
SELECT company_id, country,
total_revenue / (COUNT(1) OVER(PARTITION BY company_id)) AS total_revenue
FROM `project.dataset.table`
If to apply to sample data from your question - output is
Row company_id country total_revenue
1 1 Russia 1200.0
2 2 Croatia 600.0
3 2 Italy 600.0
4 3 USA 400.0
5 3 UK 400.0
6 3 Italy 400.0

Normalize time variable for recurrent LSTM Neural Network using Keras

I am using Keras to create an LSTM neural-network that can predict the concentration in the blood of a certain drug. I have a dataset with time stamps on which a drug dosage was administered and when the concentration in the blood was measured. These dosage and measurement time stamps are disjoint. Furthermore several other variables are measured at all time steps (both dosage and measurements). These variables are the input for my model along with the dosages (0 when no dosage was given at time t). The observed concentration in the blood is the response variable.
I have normalized all input features using the MinMaxScaler().
Q1:
Now I am wondering, do I need to normalize the time variable that corresponds with all rows as well and give it as input to the model? Or can I leave this variable out since the time steps are equally spaced?
The data looks like:
PatientID Time Dosage DosageRate DrugConcentration
1 0 100 12 NA
1 309 100 12 NA
1 650 100 12 NA
1 1030 100 12 NA
1 1320 NA NA 12
1 1405 100 12 NA
1 1812 90 8 NA
1 2078 90 8 NA
1 2400 NA NA 8
2 0 120 13.5 NA
2 800 120 13.5 NA
2 920 NA NA 16
2 1515 120 13.5 NA
2 1832 120 13.5 NA
2 2378 120 13.5 NA
2 2600 120 13.5 NA
2 3000 120 13.5 NA
2 4400 NA NA 2
As you can see, the time between two consecutive dosages and measurements differs for a patient and between patients, which makes the problem difficult.
Q2:
One approach I can think of is aggregating on measurements intervals and taking the average dosage and SD between two measurements. Then we only predict on time stamps of which we know the observed drug concentration. Would this work, or would we lose to much information?
Q3
A second approach I could think of is create new data points, so that all intervals between dosages are the same and set the dosage and dosage rate at those time points to zero. The disadvantage is then, that we can only calculate the error on the time stamps on which we know the observed drug concentration. How should we tackle this?

Access the previous row in select

I have a scenario as below
--source data
departuredttm flight_source flight_destination available_seats
13-07-2016 04:00:00 A B 200
13-07-2016 08:00:00 A B 320
13-07-2016 08:20:00 A B 20
I have a lookup table which tell how many total passengers are there for this source and destinatin whose flights are delayed and needs to adjusted in available seats in source data.lookup table is like this.
--lookup table for passenger_from_delayed_flights
flight_source flight_destination passengers
A B 500
now I have to adjust these 500 passengers in available seats as in source data
---output
DepartureDttm flight_source flight_destination AVAILABLE_SEATS PASSENGERS_TO_ADJUST PASSENGER_LEFT
13-07-2016 04:00:00 A B 200 500 300
13-07-2016 08:00:00 A B 320 300 20
13-07-2016 08:20:00 A B 20 20 0
initially passenger to adjust is 500 where we have 200 seats , next 320 seats are available and we have to adjust 300 (500-200) passengers.
Please help
Thanks
Your expected result is probably wrong, the 2nd flight already has enough seats, so PASSENGER_LEFT should be -20 (or 0).
This is a calculation based on a running total:
passengers - SUM(available_seats)
OVER (ORDER BY departuredttm
ROWS UNBOUNDED PRECEDING) AS PASSENGER_LEFT
available_seats + PASSENGER_LEFT AS PASSENGERS_TO_ADJUST