Work out variance of groups of rows in SQL

I'm looking to work out a variance value per month for a table of data, with each month containing three rows to be accounted for. I'm struggling to think of a way of doing this without 'looping' which, as far as I'm aware, isn't supported in SQL.
Here is an example table of what I mean:
+======================+=======+
| timestamp            | value |
+======================+=======+
| 2020-01-04T10:58:24Z | 10    | # January (Sum of vals = 110)
+----------------------+-------+
| 2020-01-14T10:58:21Z | 68    |
+----------------------+-------+
| 2020-01-29T10:58:12Z | 32    |
+----------------------+-------+
| 2020-02-04T10:58:13Z | 19    | # February (Sum of vals = 112)
+----------------------+-------+
| 2020-02-14T10:58:19Z | 5     |
+----------------------+-------+
| 2020-02-24T10:58:11Z | 88    |
+----------------------+-------+
| 2020-03-04T10:58:11Z | 72    | # March (Sum of vals = 184)
+----------------------+-------+
| 2020-03-15T10:58:10Z | 90    |
+----------------------+-------+
| 2020-03-29T10:58:16Z | 22    |
+----------------------+-------+
| ....                 | ....  |
+======================+=======+
I need to build a query which can combine all 3 values from each item in each month, then work out the variance of the combined value across months. Hopefully this makes sense? So in this case, I would need to work out the variance between January (110), February (112) and March (184).
Does anyone have any suggestions as to how I could accomplish this? I'm using PostgreSQL, but need a vanilla SQL solution :/
Thanks!

Are you looking for aggregation by month and then a variance calculation? If so:
select variance(sum_vals)
from (select date_trunc('month', "timestamp") as mon, sum(value) as sum_vals
      from t
      group by date_trunc('month', "timestamp")
     ) m;
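One thing to be aware of: in PostgreSQL, variance() computes the sample variance (it is an alias for var_samp()). If the population variance of the monthly sums is what you want, var_pop() is the standard alternative; a minimal sketch using the same aggregation:
-- same monthly aggregation, choosing explicitly between sample and population variance
select var_samp(sum_vals) as sample_variance,
       var_pop(sum_vals)  as population_variance
from (select date_trunc('month', "timestamp") as mon, sum(value) as sum_vals
      from t
      group by date_trunc('month', "timestamp")
     ) m;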

Related

Filter on date relative to today but the dates are in separate fields

I have a table where the date parts are in separate fields and I am struggling to put a filter on it (pulling all the data returns so much that it basically times out).
How can I write a SQL query to pull the data for only the past 7 days?
| eventinfo | entity | year | month | day |
|------------|-------------------------|------|-------|-----|
| source=abc | abc=030,abd=203219,.... | 2022 | 08 | 07 |
| source=abc | abc=030,abd=203219,.... | 2022 | 08 | 05 |
| source=abc | abc=030,abd=203219,.... | 2022 | 07 | 33 |
Many thanks in advance.
You can concatenate your columns, convert the result to a date, and then apply the filter.
-- Oracle database
select *
from event
where to_date( year||'-'||month||'-'||day,'YYYY-MM-DD') >= trunc(sysdate) - 7;
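If you are on PostgreSQL rather than Oracle, the same idea can be written with make_date; a sketch, assuming the year/month/day columns are integers or can be cast to integers:
-- PostgreSQL sketch of the same filter
select *
from event
where make_date(year::int, month::int, day::int) >= current_date - 7;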

Group by hourly interval

I'm new to SQL and I'm running into problems trying to build an hourly report on a database that supports HiveQL.
Here's my dataset
|NAME| CHECKIN_HOUR |CHECKOUT_HOUR|
|----|--------------|-------------|
| A | 00 | 00 |
| B | 00 | 01 |
| C | 00 | 02 |
| D | 00 | null |
| E | 01 | 02 |
| F | 01 | null |
And I would like to get an hourly summary report that looks like this:
|TIME| CHECKIN_NUMBER |CHECKOUT_NUMBER|STAY_NUMBER|
|----|----------------|---------------|-----------|
| 00 | 4 | 1 | 3 |
| 01 | 2 | 1 | 4 |
| 02 | 0 | 2 | 2 |
stay_number means the number of people who haven't checked out by the end of that hour, e.g. the 2 in the last row means that by the end of 2am there are two people (D and F) who haven't checked out yet. So basically I'm trying to get a check-in, check-out and stay summary for each hour.
I've no idea how to compute an hourly interval table, since simply grouping by check-in or check-out hour doesn't give the expected result. All the date fields are originally stored as Unix timestamps, so feel free to use date functions on them.
Any instructions and help would be greatly appreciated, thanks!
Here is one method that unpivots the data and uses cumulative sums:
select hh,
       sum(ins) as checkins, sum(outs) as checkouts,
       sum(sum(ins)) over (order by hh) - sum(sum(outs)) over (order by hh) as stays
from ((select checkin_hour as hh, count(*) as ins, 0 as outs
       from t
       group by checkin_hour
      ) union all
      (select checkout_hour, 0 as ins, count(*) as outs
       from t
       where checkout_hour is not null
       group by checkout_hour
      )
     ) c
group by hh
order by hh;
The idea is to count the check-ins and check-outs in each hour and then accumulate a running total of each. The difference between the two running totals is the number of stays.
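Since the underlying columns are Unix timestamps, the hour buckets used above can be derived directly in Hive. A small sketch, where checkin_ts and checkout_ts are hypothetical names for the raw epoch-second columns:
-- Hive sketch: derive two-digit hour buckets from epoch-second columns
-- (checkin_ts / checkout_ts are assumed column names, not from the question)
select name,
       date_format(from_unixtime(checkin_ts), 'HH')  as checkin_hour,
       date_format(from_unixtime(checkout_ts), 'HH') as checkout_hour
from t;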

Converting pandas data frame logic to pyspark data frame based logic

Given a data frame with 4 columns (group, start_date, available_stock, used_stock), I basically have to figure out how long the stock will last for a given group and date. Let's say we have a data frame with the following data:
+----------+------------+-----------------+------------+
| group    | start_date | available stock | used_stock |
+----------+------------+-----------------+------------+
| group 1  | 01/12/2019 | 100             | 80         |
| group 1  | 08/12/2019 | 60              | 10         |
| group 1  | 15/12/2019 | 60              | 10         |
| group 1  | 22/12/2019 | 150             | 200        |
| group 2  | 15/12/2019 | 80              | 90         |
| group 2  | 22/12/2019 | 150             | 30         |
| group 3  | 22/12/2019 | 50              | 50         |
+----------+------------+-----------------+------------+
Steps:
Sort each group by start_date, so we get something like the above data set.
Per group, starting from the smallest date, check whether used_stock is greater than or equal to the available stock. If it is, the end date is the same as the start_date.
If the above condition is false, add the next date's used_stock to the current used_stock value. Continue until used_stock is greater than or equal to available_stock, at which point the end date is the start_date of the last added used_stock row.
If no such value is found, the end date is null.
After applying the above steps to every row, we should get something like:
+----------+------------+-----------------+------------+------------+
| group    | start_date | available stock | used_stock | end_date   |
+----------+------------+-----------------+------------+------------+
| group 1  | 01/12/2019 | 100             | 80         | 15/12/2019 |
| group 1  | 08/12/2019 | 60              | 10         | 22/12/2019 |
| group 1  | 15/12/2019 | 60              | 10         | 22/12/2019 |
| group 1  | 22/12/2019 | 150             | 200        | 22/12/2019 |
| group 2  | 15/12/2019 | 80              | 90         | 15/12/2019 |
| group 2  | 22/12/2019 | 150             | 30         | null       |
| group 3  | 22/12/2019 | 50              | 50         | 22/12/2019 |
+----------+------------+-----------------+------------+------------+
The above logic was prebuilt in pandas, then tweaked and applied in the Spark application as a grouped-map pandas UDF. I want to move away from the pandas_udf approach to a pure Spark data frame based approach, to check whether there will be any performance improvements. I would appreciate any help with this, or any improvements to the given logic that would reduce the overall execution time.
With Spark 2.4+, you can use the Spark SQL builtin function aggregate:
aggregate(array_argument, zero_expression, merge, finish)
and implement the logic in the merge and finish expressions; see below for an example:
from pyspark.sql.functions import collect_list, struct, to_date, expr
from pyspark.sql import Window

w1 = Window.partitionBy('group').orderBy('start_date').rowsBetween(0, Window.unboundedFollowing)

# SQL expression to calculate end_date using the aggregate function:
end_date_expr = """
    aggregate(
        /* argument */
        data,
        /* zero expression: initialize and set the aggregator's datatype,
           which is 'struct<end_date:date,total:double>' */
        (date(NULL) as end_date, double(0) as total),
        /* merge: use acc.total to save the accumulated sum of used_stock;
           this works similarly to Python's reduce function */
        (acc, y) ->
            IF(acc.total >= `available stock`,
               (acc.end_date as end_date, acc.total as total),
               (y.start_date as end_date, acc.total + y.used_stock as total)),
        /* finish: post-processing, retrieving only end_date */
        z -> IF(z.total >= `available stock`, z.end_date, NULL)
    )
"""

df.withColumn('start_date', to_date('start_date', 'dd/MM/yyyy')) \
  .withColumn('data', collect_list(struct('start_date', 'used_stock')).over(w1)) \
  .withColumn('end_date', expr(end_date_expr)) \
  .select("group", "start_date", "`available stock`", "used_stock", "end_date") \
  .show(truncate=False)
+-------+----------+---------------+----------+----------+
|group  |start_date|available stock|used_stock|end_date  |
+-------+----------+---------------+----------+----------+
|group 1|2019-12-01|100            |80        |2019-12-15|
|group 1|2019-12-08|60             |10        |2019-12-22|
|group 1|2019-12-15|60             |10        |2019-12-22|
|group 1|2019-12-22|150            |200       |2019-12-22|
|group 2|2019-12-15|80             |90        |2019-12-15|
|group 2|2019-12-22|150            |30        |null      |
|group 3|2019-12-22|50             |50        |2019-12-22|
+-------+----------+---------------+----------+----------+
Note: this could be less efficient if many of the groups contain a large list of rows (e.g. 1000+), while most rows only need to scan a limited number of following rows (e.g. fewer than 20) to find the first one satisfying the condition. In such a case, you might set up two Window specs and do the calculation in two rounds:
from pyspark.sql.functions import collect_list, struct, to_date, col, when, expr

# 1st scan: look at up to the N following rows, which should cover the majority of
# rows whose end_date satisfies the condition
N = 20
w2 = Window.partitionBy('group').orderBy('start_date').rowsBetween(0, N)

# 2nd scan covers the full partition length, but only for rows whose end_date is still NULL
w1 = Window.partitionBy('group').orderBy('start_date').rowsBetween(0, Window.unboundedFollowing)

df.withColumn('start_date', to_date('start_date', 'dd/MM/yyyy')) \
  .withColumn('data', collect_list(struct('start_date', 'used_stock')).over(w2)) \
  .withColumn('end_date', expr(end_date_expr)) \
  .withColumn('data',
              when(col('end_date').isNull(), collect_list(struct('start_date', 'used_stock')).over(w1))) \
  .selectExpr(
      "group",
      "start_date",
      "`available stock`",
      "used_stock",
      "IF(end_date is NULL, {0}, end_date) AS end_date".format(end_date_expr)
  ).show(truncate=False)

SQL sum 12 weeks of data based on first sold date across different items

The database has thousands of individual items, each with multiple first sold dates and sales results by week. I need a total sum for each product's first 12 weeks of sales.
A SUM(CASE ...) was used for previous individual queries where we knew the start date. That is too manual with thousands of products to review, though, and we are looking for a smarter way to speed this up.
Can I build on this so that the sum finds the minimum first shop date and then sums the next 12 weeks of results? If so, how do I structure it, or is there a better way?
Columns in the database I will need to reference, with sample data:
| PROD_ID  | WEEK_ID | STORE_ID | FIRST_SHOP_DATE | ITM_VALUE |
|----------|---------|----------|-----------------|-----------|
| 12345543 | 201607  | 10000001 | 201542          | 24,356    |
| 12345543 | 201607  | 10000002 | 201544          | 27,356    |
| 12345543 | 201608  | 10000001 | 201542          | 24,356    |
| 12345543 | 201608  | 10000002 | 201544          | 27,356    |
| 32655644 | 201607  | 10000001 | 201412          | 103,245   |
| 32655644 | 201607  | 10000002 | 201420          | 123,458   |
| 32655644 | 201608  | 10000001 | 201412          | 154,867   |
| 32655644 | 201608  | 10000002 | 201420          | 127,865   |
You can do something like this:
select itemid, sum(sales)
from (select t.*, min(shopdate) over (partition by itemid) as first_shopdate
      from t
     ) t
where shopdate < first_shopdate + interval '84' day
group by itemid;
You don't specify the database, so this uses ANSI standard syntax. The date operations (in particular) vary by database.
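For reference, mapped onto the question's column names, that idea could look like the sketch below. It is only a sketch: it assumes the table is called sales (a made-up name), that WEEK_ID and FIRST_SHOP_DATE share the same YYYYWW numbering, and it counts the first 12 distinct weeks that actually have sales rows.
-- Sketch using the question's columns; "sales" is an assumed table name
select prod_id,
       sum(itm_value) as first_12_weeks_value
from (select s.*,
             dense_rank() over (partition by prod_id order by week_id) as week_seq
      from sales s
      where s.week_id >= (select min(s2.first_shop_date)
                          from sales s2
                          where s2.prod_id = s.prod_id)
     ) t
where week_seq <= 12
group by prod_id;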
Hi Kirsty, try like this -
select a.Item, sum(sales) as total
from tableName a join
     (select Item, min(FirstSoldDate) as FirstSoldDate from tableName group by Item) b
  on a.Item = b.Item
where a.FirstSoldDate between b.FirstSoldDate and dateadd(day, 84, b.FirstSoldDate)
group by a.Item
Thanks :)

Finding the max value within the last 22 months, or within any 10 hour window in those 22 months, in Microsoft SQL Server

I'd like to find the max value within the last 22 months OR the max value within any 10 hour window of those last 22 months.
I'm doing this in Microsoft SQL Server.
Essentially, I'm looking to retrieve a value that has sustained a high for at least 10 hours before I consider it my max; if that value is larger than the max of the last 22 months, it becomes the new max, otherwise I would use the max of the last 22 months.
Here's what I think it should look like pseudo code:
if (time > 10 hours) AND (value = max) OR (18 > time > 0) AND (value = max)
then output = value
The SQL code that I've tried:
SELECT TOP 90 PERCENT
DATEADD(s,time,'19700101') as time_22month
,GETDATE() as date_22month
,b.tagname as tag_22month
,value as value_22month
,maximum as max_22month
FROM
db..hour a
INNER JOIN
db..tag b
ON
a.tagid = b.tagid
WHERE
b.tagname like '%T500.1234%'
AND
(GETDATE() - DATEADD(s, time, '19700101') < 670)
ORDER BY
max_22month DESC
SELECT
DATEADD(s,time,'19700101') as time_10hour
,GETDATE() as date_10hour
,b.tagname as tag_10hour
,value as value_10hour
,maximum as max_10hour
FROM
db..hour a
INNER JOIN
db..tag b
ON
a.tagid = b.tagid
WHERE
b.tagname like '%T500.1234%'
AND
(GETDATE() - DATEADD(s, time, '19700101') < 0.42)
ORDER BY
max_10hour DESC
Output right now is the following:
+-------------------------+----------------------------+-------------+---------------+---------------+
| time_22month            | date_22month               | tag_22month | value_22month | max_22month   |
+-------------------------+----------------------------+-------------+---------------+---------------+
| 2016-03-08 06:00:00.000 | 2017-04-10 10:07:57:32.783 | T500.1234   | 1567.88546416 | 2445.56419848 |
| 2016-03-08 07:00:00.000 | 2017-04-10 10:07:57:32.783 | T500.1234   | 1499.88546416 | 2434.47673719 |
+-------------------------+----------------------------+-------------+---------------+---------------+
+-------------------------+----------------------------+------------+---------------+---------------+
| time_10hour             | date_10hour                | tag_10hour | value_10hour  | max_10hour    |
+-------------------------+----------------------------+------------+---------------+---------------+
| 2017-04-10 00:00:00.000 | 2017-04-10 10:07:57:32.783 | T500.1234  | 8763.42572454 | 8759.64548912 |
| 2017-04-10 01:00:00.000 | 2017-04-10 10:07:57:32.783 | T500.1234  | 8001.64578943 | 8001.64578943 |
+-------------------------+----------------------------+------------+---------------+---------------+
So I'm a little confused on how I should be comparing these max values, especially when the 10 hour window needs to be rolling (incrementing every hour). Any help is appreciated.
The output should be the greater of the two values, so perhaps a new table column would hold that output, preceded by two columns showing the highest 22-month value and the highest 10-hour-window value.
+--------+-------------+------------+------+
| Month  | 22Month_Max | 10Hour_Max | Max  |
+--------+-------------+------------+------+
| July   | 5478        | 5999       | 5999 |
| August | 4991        | 3523       | 4991 |
+--------+-------------+------------+------+
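One way to read this: per month, report the overall max of the last 22 months next to the highest value that held for a full 10-hour window (the max over rolling 10-hour minimums), then take the greater of the two. The sketch below follows that interpretation only, reusing the table and column names from the queries above; it is not a drop-in answer.
-- Sketch (one interpretation): 22-month max vs. highest value sustained over a
-- rolling 10-hour window, reported per month; names follow the queries above
with hourly as (
    select dateadd(s, a.[time], '19700101') as reading_time,
           a.value
    from db..hour a
    inner join db..tag b on a.tagid = b.tagid
    where b.tagname like '%T500.1234%'
      and dateadd(s, a.[time], '19700101') >= dateadd(month, -22, getdate())
),
sustained as (
    select reading_time,
           value,
           min(value) over (order by reading_time
                            rows between 9 preceding and current row) as sustained_10h
    from hourly
)
select datefromparts(year(reading_time), month(reading_time), 1) as report_month,
       max(max(value)) over () as max_22month,
       max(sustained_10h) as max_10hour,
       case when max(sustained_10h) > max(max(value)) over ()
            then max(sustained_10h)
            else max(max(value)) over ()
       end as overall_max
from sustained
group by datefromparts(year(reading_time), month(reading_time), 1);
Under this reading the per-month 10-hour figure can never exceed the 22-month max (it is a minimum taken inside the same range), so overall_max mostly mirrors max_22month; the two CTEs are mainly there to show how to get the rolling 10-hour window and the overall max side by side.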