How to add query results to a table in SQL?

I am trying to add the results of a query to an existing table, dependent on the values of an existing column. For example, using the table below:

Store  Sales  Weekday
10     11000  Weekday
11      5000  Weekday
12      8000  Weekday
10     19000  Weekend
11     20000  Weekend
12      5000  Weekend
I want the average sales per store for the 'Weekday' rows, which I can get using the following:
SELECT AVG(Sales) AS weekday_avg, store
FROM store_sales
WHERE Weekday = 'Weekday'
GROUP BY store;
But then I'd like to add these results to the same table and store them in a column named 'weekday_avg'.
I've tried the following, and while I don't get an error, the column doesn't have any values added:
ALTER TABLE store_sales ADD COLUMN weekday_avg numeric;
UPDATE store_sales SET weekday_avg = (
    SELECT AVG(Sales) AS weekday_avg
    FROM store_sales
    WHERE Weekday = 'Weekday'
    GROUP BY store
);
I know this probably isn't best database practice, but I'm working with what has been provided, and all I need is to end up with a table with columns for the averages per store / weekday type that I can export into R for further analysis.
Many thanks in advance!

As mentioned in the other answer, don't store calculated values in a table.
But if you want to know how it can be done, one of the options is to use a correlated subquery, as follows:
UPDATE store_sales s SET weekday_avg = (
    SELECT AVG(ss.Sales)
    FROM store_sales ss
    WHERE ss.Weekday = s.Weekday
      AND ss.store = s.store
);

Don't store the values. They are easily calculated on the fly:
select ss.*,
       avg(sales) over (partition by store, weekday) as weekday_avg_sales
from store_sales ss;
Perhaps if you have a really big table, you might want to store the summary values. But even so, I would recommend a second table for the summaries and joining to it; with indexes, that would be much more efficient.
Note that calculating the data on the fly means that it is always up to date when existing data is updated or new data is inserted.
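If you do go the summary-table route, a minimal sketch might look like this (the table and index names are illustrative, not from the question):

CREATE TABLE store_sales_summary AS
SELECT store, Weekday, AVG(Sales) AS avg_sales
FROM store_sales
GROUP BY store, Weekday;

CREATE INDEX store_sales_summary_idx ON store_sales_summary (store, Weekday);

-- join the summary back to the detail rows instead of storing the average per row
SELECT ss.*, sm.avg_sales
FROM store_sales ss
JOIN store_sales_summary sm
  ON sm.store = ss.store AND sm.Weekday = ss.Weekday;

The summary table then needs refreshing whenever the detail data changes, which is exactly why the on-the-fly calculation above is usually preferable.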

Related

SQL: Apply an aggregate result per day using window functions

Consider a time-series table that contains three fields time of type timestamptz, balance of type numeric, and is_spent_column of type text.
The following query generates a valid result for the last day of the given interval.
SELECT
    MAX(DATE_TRUNC('DAY', time)) AS last_day,
    SUM(balance) FILTER (WHERE is_spent_column IS NULL) AS value_at_last_day
FROM tbl;

Output:
2010-07-12   18681.800775017498741407984000
However, I need an equivalent query based on window functions that reports the total value of the balance column for all the days up to and including the given date.
Here is what I've tried so far, but without any valid result:
SELECT
    DATE_TRUNC('DAY', time) AS daily,
    SUM(SUM(balance) FILTER (WHERE is_spent_column IS NULL)) OVER (ORDER BY DATE_TRUNC('DAY', time)) AS total_value_per_day
FROM tbl
GROUP BY 1
ORDER BY 1 DESC;

Output:
2010-07-12   16050.496339044977568391974000
2010-07-11   13103.159119670350269890284000
2010-07-10   12594.525752964512456914454000
2010-07-09   12380.159588711091681327014000
2010-07-08   12178.119542536668113577014000
2010-07-07   11995.943973804127033140014000
EDIT:
Here is a sample dataset:
LINK REMOVED
The running total can be computed by applying the first query above to the entire dataset up to and including the desired day. For example, for day 2009-01-31 the result is 97.13522530000000000000, and for day 2009-01-15, when we filter time as time < '2009-01-16 00:00:00', it returns 24.446144000000000000.
What I need is an alternative query that computes the running total for each day in a single query.
EDIT 2:
Thank you all so very much for your participation and support.
The reason for the differences in the queries' result sets was in the preceding ETL pipelines. Sorry for my ignorance!
Below I've provided a sample schema to test the queries.
https://www.db-fiddle.com/f/veUiRauLs23s3WUfXQu3WE/2
Now both queries given above and the query given in the answer below return the same result.
Consider calculating the running total via a window function after aggregating the data to the day level. Since you aggregate with a single condition, the FILTER condition can be converted to a plain WHERE:
SELECT daily,
       SUM(total_balance) OVER (ORDER BY daily) AS total_value_per_day
FROM (
    SELECT
        DATE_TRUNC('DAY', time) AS daily,
        SUM(balance) AS total_balance
    FROM tbl
    WHERE is_spent_column IS NULL
    GROUP BY 1
) AS daily_agg
ORDER BY daily;

SQL: Dynamic Join Based on Row Value

Context:
I am working with a complicated schema and have many CTEs and joins to get to this point. This is a watered-down version, with completely different source data and an example to illustrate my point (data anonymity). Hopefully it provides enough of a snapshot.
Data Overview:
I have a service which generates a production forecast looking ahead 30 days. The forecast is generated for each facility, for each shift (morning/afternoon). Each forecast produced covers all shifts (morning/afternoon/evening) so they share a common generation_id but different forecast_profile_key.
What I am trying to do: I want to find the SUM of the forecast error for a given forecast generation constrained by a dynamic date range based on whether the date is a weekday or weekend. The SUM must be grouped only on similar IDs.
Basically, the temp table provides one record per facility per date per shift with the forecast error. I want to SUM the historical error dynamically for a facility/shift/date based on whether the date is a weekday/weekend, and only SUM the error where the IDs match up (hope that makes sense!).
Specifics: I want to find the SUM grouped by 'week_part_grouping', 'forecast_profile_key', 'forecast_profile' and 'forecast_generation_id'. The part I am struggling with is that I only want to SUM the error dynamically based on date: (a) if the date is a weekday, I want to SUM the error from up to the 5 recent-most days in a 7 day look back period, or (b) if the date is a weekend, I want to SUM the error from up to the 3 recent-most days in a 16 day look back period.
Ideally, I'd end up with an extra column for 'total_forecast_error_in_lookback_range'.
Specific examples:
For 'facility_a', '2020-11-22' is a weekend. The lookback range is 16 days, so any date between '2020-11-21' and '2020-11-05' is eligible. The 3 recent-most dates would be '2020-11-21', '2020-11-15' and '2020-11-14'. Therefore, the sum of error would be 2000+3250+1050.
For 'facility_a', '2020-11-20' is a weekday. The lookback range is 7 days, so any date between '2020-11-19' and '2020-11-13' is eligible. That would work out to be '2020-11-19' through '2020-11-16', plus '2020-11-13'.
For 'facility_b', notice there is a change in the 'forecast_generation_id'. So, the error for '2020-11-20' would only be 4565.
What I have tried: I'll confess to not being quite sure how to break down this portion. I did consider a CASE statement on the week_part but then got into a nested mess. I considered using a RANK windowed function, but I didn't make much progress as I was unsure how to implement the dynamic lookback component. I also thought about doing some LISTAGG to get all the dates and doing a REGEXP wildcard lookup, but that would be very slow.
I am seeking pointers on how to achieve this in SQL. I don't know if I am missing something from my toolkit here to break this down into something I can implement.
DROP TABLE IF EXISTS seventh__error_calc;
create temporary table seventh__error_calc
(
facility_name varchar,
shift varchar,
date_actuals date,
week_part_grouping varchar,
forecast_profile_key varchar,
forecast_profile_id varchar,
forecast_generation_id varchar,
count_dates_in_forecast bigint,
forecast_error bigint
);
Insert into seventh__error_calc
VALUES
('facility_a','morning','2020-11-22','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1000'),
('facility_a','morning','2020-11-21','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2000'),
('facility_a','morning','2020-11-20','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3000'),
('facility_a','morning','2020-11-19','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2500'),
('facility_a','morning','2020-11-18','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1200'),
('facility_a','morning','2020-11-17','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','5000'),
('facility_a','morning','2020-11-16','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','4400'),
('facility_a','morning','2020-11-15','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3250'),
('facility_a','morning','2020-11-14','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1050'),
('facility_a','morning','2020-11-13','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-12','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-11','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-10','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-09','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-08','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_b','morning','2020-11-22','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3400'),
('facility_b','morning','2020-11-21','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','2800'),
('facility_b','morning','2020-11-20','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3687'),
('facility_b','morning','2020-11-19','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','4565'),
('facility_b','morning','2020-11-18','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1262'),
('facility_b','morning','2020-11-17','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','8765'),
('facility_b','morning','2020-11-16','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','5678'),
('facility_b','morning','2020-11-15','weekend','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','2893'),
('facility_b','morning','2020-11-14','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1928'),
('facility_b','morning','2020-11-13','weekday','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','4736')
;
SELECT *
FROM seventh__error_calc
This achieved what I was trying to do. There were two learning points here:
1. Self joins. I've never used one before, but I can now see why they are powerful!
2. Using a CASE statement in the WHERE clause.
Hope this might help someone else some day!
select facility_name,
       forecast_profile_key,
       forecast_profile_id,
       shift,
       date_actuals,
       week_part_grouping,
       forecast_generation_id,
       sum(forecast_error) as forecast_err_calc
from (
    -- self join: for each row (a), collect every row (b) sharing the same
    -- facility/profile/week-part/generation, ranking b's dates most recent first
    select rank() over (partition by forecast_profile_id, forecast_profile_key, facility_name, a.date_actuals
                        order by b.date_actuals desc) as rnk,
           a.facility_name, a.forecast_profile_key, a.forecast_profile_id, a.shift,
           a.date_actuals, a.week_part_grouping, a.forecast_generation_id, b.forecast_error
    from seventh__error_calc a
    join seventh__error_calc b
        using (facility_name, forecast_profile_key, forecast_profile_id, week_part_grouping, forecast_generation_id)
    -- CASE in the WHERE clause picks the lookback window per week-part type
    where case when a.week_part_grouping = 'weekend' then b.date_actuals between a.date_actuals - 16 and a.date_actuals
               when a.week_part_grouping = 'weekday' then b.date_actuals between a.date_actuals - 7 and a.date_actuals
          end
) src
-- keep the 3 recent-most weekend dates, or the 5 recent-most weekday dates
where case when week_part_grouping = 'weekend' then rnk < 4
           when week_part_grouping = 'weekday' then rnk < 6
      end
-- GROUP BY added: the outer query aggregates, so the non-aggregated columns need it
group by 1, 2, 3, 4, 5, 6, 7;

SELECT MIN from a subset of data obtained through GROUP BY

There is a database in place with hourly timeseries data, where every row in the DB represents one hour. Example:
TIMESERIES TABLE
id   date_and_time       entry_category
1    2017/01/20 12:00    type_1
2    2017/01/20 13:00    type_1
3    2017/01/20 12:00    type_2
4    2017/01/20 12:00    type_3
First I used the GROUP BY statement to find the latest date and time for each type of entry category:
SELECT MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category;
However now, I want to find the date and time which is the LEAST RECENT among the datetimes I obtained with the query listed above. I will need to use SELECT MIN(date_and_time) somehow, but how do I tell SQL that I want to treat the output of my previous query as a "new table" to apply a new SELECT query on? The output of my total query should be a single value; in the case of the sample displayed above, date_and_time = 2017/01/20 12:00.
I've tried using aliases, but they don't seem to do the trick; they only rename existing columns or tables (or I'm misusing them...). There are many questions out there that try to list the MAX or MIN for a particular group (e.g. https://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ or Select max value of each group), which is what I have already achieved, but now I want to work on this list of obtained datetimes. My database structure is very simple, but I lack the knowledge to string these queries together.
Thanks, cheers!
You can use your first query as a subquery; that is essentially what you are describing: using the first query's output as the input for the second query. This gets you the one-row output of the MIN date, as required.
SELECT MIN(date_and_time)
FROM (
    SELECT MAX(date_and_time) AS date_and_time, entry_category
    FROM timeseries_table
    GROUP BY entry_category
) a;
Is this what you want?
SELECT TOP (1) WITH TIES MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC;
This returns ties. If you do not want ties, then include an additional sort key:
SELECT TOP (1) MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC, entry_category;
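TOP is SQL Server syntax. If you happen to be on a database that uses LIMIT instead (an assumption about your environment, since the question doesn't name the product), the same idea might look like:

SELECT MAX(date_and_time) AS max_dt, entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY max_dt ASC, entry_category
LIMIT 1;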

Calculate stdev over a variable range in SQL Server

Table format is as follows:
Date ID subID value
-----------------------------
7/1/1996 100 1 .0543
7/1/1996 100 2 .0023
7/1/1996 200 1 -.0410
8/1/1996 100 1 -.0230
8/1/1996 200 1 .0121
I'd like to apply STDEV to the value column where date falls within a specified range, grouping on the ID column.
Desired output would look something like this:
DateRange  ID   std_v
1          100  .0232
2          100  .0323
1          200  .0423
One idea I've had that works, but is clunky, involves creating an additional column (which I've called 'partition') to identify a group of values over which STDEV is taken (by using the OVER clause with PARTITION BY applied to the 'partition' and 'ID' columns).
Creating the partition variable involves a prior CASE statement, where a given record is assigned a partition based on its date falling within a given range, i.e.:
...
, partition = CASE
      WHEN date BETWEEN '7/1/1996' AND '10/1/1996' THEN 1
      WHEN date BETWEEN '10/1/1996' AND '1/1/1997' THEN 2
...
Ideally, I'd be able to apply STDEV with the OVER function, partitioning on the ID column and on variable date ranges (say, the trailing 3 months for a given reference date). Once this works for the 3-month period described above, I'd like to make the date range variable, creating an additional '#dateRange' variable at the start of the program so it can run for 2, 3, 6, etc. month ranges.
I ended up coming upon a solution to my question.
You can join the original table to a second table consisting of a unique list of the dates in the first table, applying a BETWEEN clause to specify the desired range.
Sample query below.
Initial table (#excessRet), with columns:
Date, ID, subID, value
Second table (#dates), a unique list of the dates in the first, with column:
Date
select d.date, er.id, STDEV(er.value) as std_v
from #dates d
inner join #excessRet er
    on er.date between DATEADD(m, -36, d.date) and d.date
group by d.date, er.id
order by er.id, d.date;
To achieve the desired next step referenced above (making the range variable), simply create a variable at the outset and replace "36" with it.
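For example, a minimal T-SQL sketch of that last step, reusing the temp tables above (@monthsBack is an illustrative name, not from the original):

DECLARE @monthsBack int = 3;  -- set to 2, 3, 6, etc.

SELECT d.date, er.id, STDEV(er.value) AS std_v
FROM #dates d
INNER JOIN #excessRet er
    ON er.date BETWEEN DATEADD(m, -@monthsBack, d.date) AND d.date
GROUP BY d.date, er.id
ORDER BY er.id, d.date;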

How to pull the last N number of dates in SAS SQL

I am dealing with a large dataset (30 million rows) and I need to pull the most recent three dates, each of which may have an indeterminate number of rows attached to it: 03MAR2016 might have 2 rows, 27FEB2016 might have ten, and 25FEB2016 might have 3. How do I say "select everything that falls within the last X number of values in this set, regardless of how many rows there are"?
As you cannot sort in an in-line view/subquery, you will have to split your SQL statement into two parts:
1. Sort the dates DESCENDING and get the distinct values
2. Join back to the original data and limit to the first 3
But as stated before, SQL is not good at this kind of operation.
DATA input_data ;
INPUT date value ;
CARDS ;
20160101 1
20160101 2
20160101 3
20160102 1
20160103 1
20160104 1
20160105 1
20160105 2
20160105 3
;
proc sql _method;
    /* step 1: distinct dates, most recent first */
    create table DATE_ID as
    select distinct DATE
    from input_data
    order by DATE desc;

    /* step 2: keep the first 3 dates and join back to the detail rows */
    create table output_data as
    select data.*
    from (select *
          from DATE_ID
          where monotonic() <= 3  /* monotonic(): undocumented SAS row counter */
         ) id
    inner join input_data data
        on id.DATE = data.DATE
    ;
quit;
You need to break this down into two tasks:
1. Determine which dates are the last three dates
2. Pull all rows from those dates
Both are possible in SQL, though the first is much easier using other methods (SAS's SQL isn't very good at getting the "first X things").
I would suggest using something like PROC FREQ or PROC TABULATE to generate the list of dates (just a PROC FREQ on the date variable); really, any proc you're comfortable with would work, even PROC SORT (though that's probably less efficient). Once you have that table, limit it to the three highest observations, then use it in a SQL step to join to the main table and filter to those three dates. You could also use other options, like creating a custom format or hash tables, whatever works for you. 30 million rows isn't so many that a SQL join should be a problem, though, I'd think. A sketch of the PROC SORT variant follows.
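For illustration, a minimal sketch of that PROC SORT variant, assuming a source dataset named have with a variable date (names are placeholders, not from the original post):

/* unique dates, most recent first */
proc sort data=have(keep=date) out=dates nodupkey;
    by descending date;
run;

/* keep the three most recent dates */
data last3;
    set dates(obs=3);
run;

/* join back to pull every row falling on those dates */
proc sql;
    create table want as
    select h.*
    from have h
    inner join last3 l
        on h.date = l.date;
quit;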