Average on the most recent date for which data is available - SQL

I have a table (on BigQuery) that looks like the following:
Date        Type  Score
2021-01-04  A     5
2021-01-04  A     4
2021-01-04  A     5
2021-01-02  A     1
2021-01-02  A     1
2021-01-01  A     3
2021-01-04  B     NULL
2021-01-04  B     NULL
2021-01-02  B     NULL
2021-01-02  B     NULL
2021-01-01  B     2
2021-01-01  B     5
2021-01-04  C     NULL
2021-01-04  C     4
2021-01-04  C     NULL
2021-01-01  C     1
2021-01-01  C     2
2021-01-01  C     3
What I would like to get is the average score for each type but the average should be taken only on the most recent date for which at least one score is available for the type. From the example above, the aim is to obtain the following table:
Type  AVG Score
A     (5+4+5)/3
B     (2+5)/2
C     (4)/1
I need a solution that could be adapted if I want the average score, not for each type, but for each combination of two columns (type/color), still on the most recent date for which at least one score is available for the combination.

An alternative solution, which you can try, is given below. It uses BigQuery's struct syntax to compare each (type, date1) pair against the latest date per type that has at least one non-NULL score:
SELECT type, AVG(score)
FROM mytable
WHERE score IS NOT NULL
  AND (type, date1) IN (
        -- latest scored date per type; if date1 is not already a DATE,
        -- cast it consistently on both sides of the comparison
        SELECT (type, MAX(date1))
        FROM mytable
        WHERE score IS NOT NULL
        GROUP BY type
      )
GROUP BY type

This answers the original question.
One method uses window functions:
select type, avg(score)
from (select t.*,
             dense_rank() over (partition by type order by date desc) as seqnum
      from t
      where score is not null
     ) t
where seqnum = 1
group by type;
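
If you also want the version grouped by a combination of two columns, as asked at the end of the question, the same approach extends by adding the second column to both the partition and the grouping. A sketch, assuming a color column exists alongside type:
select type, color, avg(score)
from (select t.*,
             dense_rank() over (partition by type, color order by date desc) as seqnum
      from t
      where score is not null
     ) t
where seqnum = 1
group by type, color;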

Computing window functions for multiple dates

I have a table sales consisting of user ids, the products those users have purchased, and the date of purchase:
date        user_id  product
2021-01-01  1        apple
2021-01-02  1        orange
2021-01-02  2        apple
2021-01-02  3        apple
2021-01-03  3        orange
2021-01-04  4        apple
If I wanted to see product counts based on every user's most recent purchase, I would do something like this:
WITH latest_sales AS (
  SELECT
    date
    , user_id
    , product
    , row_number() OVER (PARTITION BY user_id ORDER BY date DESC) AS rn
  FROM
    sales
)
SELECT
  product
  , count(1) AS count
FROM
  latest_sales
WHERE
  rn = 1
GROUP BY
  product
Producing:
product  count
apple    2
orange   2
However, this will only produce results for the most recent date. If I looked at this on 2021-01-02, the results would be:
product  count
apple    2
orange   1
How could I code this so I could see counts of the most recent products purchased by user, but for multiple dates?
So the output would be something like this:
date        product  count
2021-01-01  apple    1
2021-01-01  orange   0
2021-01-02  apple    2
2021-01-02  orange   1
2021-01-03  apple    1
2021-01-03  orange   2
2021-01-04  apple    2
2021-01-04  orange   2
Appreciate any help on this.
I'm afraid the window function row_number() with the PARTITION BY user_id clause is not relevant in your case, because it only focuses on the user_id of the current row, whereas you want a consolidated view across all the users.
I don't have a better idea than doing a self-join on the sales table:
WITH list AS (
  SELECT DISTINCT ON (s2.date, user_id)
         s2.date
       , product
  FROM sales AS s1
  INNER JOIN (SELECT DISTINCT date FROM sales) AS s2
          ON s1.date <= s2.date
  ORDER BY s2.date, user_id, s1.date DESC
)
SELECT date, product, count(*)
FROM list
GROUP BY date, product
ORDER BY date
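
Note that DISTINCT ON is PostgreSQL-specific. A sketch of the same self-join idea rewritten with row_number(), which should also work on BigQuery and most other engines (like the query above, it will not emit the zero-count rows):
WITH latest AS (
  SELECT s2.date AS snapshot_date
       , s1.user_id
       , s1.product
       , row_number() OVER (PARTITION BY s2.date, s1.user_id ORDER BY s1.date DESC) AS rn
  FROM sales AS s1
  INNER JOIN (SELECT DISTINCT date FROM sales) AS s2
          ON s1.date <= s2.date
)
SELECT snapshot_date AS date, product, count(*) AS count
FROM latest
WHERE rn = 1
GROUP BY snapshot_date, product
ORDER BY snapshot_date;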

Specific grouping elements in SQL Server

I've got a problem with my SQL task and haven't found an answer yet.
I've got a table with this sample data:
ID  Value  Date
1   1      2020-01-01
1   2      2020-03-02
1   1      2020-03-21
1   1      2020-04-14
1   3      2020-05-01
1   1      2020-08-09
1   1      2020-09-12
1   1      2020-10-12
1   3      2020-12-04
All I want to get is:
ID  Value  Date
1   1      2020-01-01
1   2      2020-03-02
1   1      2020-03-21
1   3      2020-05-01
1   1      2020-08-09
1   3      2020-12-04
Some kind of changing-value history, but only where the value actually changed: when the value on a new record is the same as the previous one, keep the record with the min date.
I tried grouping and row_number, but got no positive results. Any ideas how to do that?
One way to articulate your logic is to say that you want to retain a record when the previous record, as ordered by the date (within a given ID), has a different value than the current record.
WITH cte AS (
  SELECT *, LAG(Value) OVER (PARTITION BY ID ORDER BY Date) AS LagValue
  FROM yourTable
)
SELECT ID, Value, Date
FROM cte
WHERE LagValue <> Value OR LagValue IS NULL
ORDER BY Date;
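To see why this works, here is the intermediate CTE computed by hand from the sample data; the rows where LagValue equals Value are exactly the ones the outer WHERE filters out:
ID  Value  Date        LagValue
1   1      2020-01-01  NULL        kept
1   2      2020-03-02  1           kept
1   1      2020-03-21  2           kept
1   1      2020-04-14  1           filtered out
1   3      2020-05-01  1           kept
1   1      2020-08-09  3           kept
1   1      2020-09-12  1           filtered out
1   1      2020-10-12  1           filtered out
1   3      2020-12-04  1           kept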

Calculate record count across tables for one VERSION_DATE in the same dataset

There are at least two tables, shown below, that live in the same dataset.
The only field they share is VERSION_DATE.
I want to calculate the total record count across them when VERSION_DATE equals one specific date.
For example 2021-01-01.
TABLE A
VERSION_DATE  STUDENT_SCORE
2021-01-01    88
2021-01-01    98
2021-01-02    38
2021-01-02    48
2021-01-02    100
TABLE B
VERSION_DATE  CITY_SCORE  NAME
2021-01-01    45          A
2021-01-01    72          B
2021-01-01    53          C
2021-01-01    83          D
2021-01-02    16          A
Expected Result:
VERSION_DATE  COUNT
2021-01-01    6
Just use UNION ALL, plus a GROUP BY on the outer query:
select version_date, count(*)
from (
  select version_date from table_a where version_date = '2021-01-01'
  union all
  select version_date from table_b where version_date = '2021-01-01'
)
group by version_date
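If you later want the counts for every VERSION_DATE rather than a single one, a sketch of the same UNION ALL approach with the filters dropped (table_a and table_b stand for your actual table names, as above):
select version_date, count(*)
from (
  select version_date from table_a
  union all
  select version_date from table_b
)
group by version_date
order by version_date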
Try the query below, but note that it really depends on which other tables in your dataset match the table* pattern. If all of them have a VERSION_DATE field, you should be OK; if not, the outcome depends on the most recently created or updated table with the same name pattern, which must have that field, otherwise the query below will fail.
select VERSION_DATE, count(1) as `COUNT`
from `yourdataset.table*`
where _table_suffix in ('a', 'b')
group by VERSION_DATE

Netezza add new field for first record value of the day in SQL

I'm trying to add new columns of first values of the day for location and weight.
For instance, the original data format is:
id   dttm             location  weight
--------------------------------------
1    1/1/20 11:10:00  A         40
1    1/1/20 19:07:00  B         41.1
2    1/1/20 08:01:00  B         73.2
2    1/1/20 21:00:00  B         73.2
2    1/2/20 10:03:00  C         74
I want each id to have only one record per day, such as:
id   dttm             location  weight
--------------------------------------
1    1/1/20 11:10:00  A         40
2    1/1/20 08:01:00  B         73.2
2    1/2/20 10:03:00  C         74
I have other columns in my data set that are derived from location and weight, so I don't think I can just filter for the 'first' records of the day. Is it possible to write a query that recognizes the first record of the day for those two columns and creates new columns with those values?
You can use row_number():
select t.*
from (select t.*,
             row_number() over (partition by id, dttm::date order by dttm) as seqnum
      from t
     ) t
where seqnum = 1;
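
If the goal is instead to keep every row and add the first-of-day values as new columns (since other columns are derived from them), a sketch using first_value(), assuming your Netezza version supports it:
select t.*,
       first_value(location) over (partition by id, dttm::date order by dttm) as day_first_location,
       first_value(weight) over (partition by id, dttm::date order by dttm) as day_first_weight
from t;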

Select min/max from group defined by one column as subgroup of another - SQL, HPVertica

I'm trying to find the min and max date within a subgroup of another group. Here's example 'data'
ID   Type  Date
1    A     7/1/2015
1    B     1/1/2015
1    A     8/5/2014
22   B     3/1/2015
22   B     9/1/2014
333  A     8/1/2015
333  B     4/1/2015
333  B     3/29/2014
333  B     2/28/2013
333  C     1/1/2013
What I'd like to identify is - within an ID, what is the min/max Date for each block of similar Type? So for ID # 333 I want the below info:
A: min & max = 8/1/2015
B: min = 2/28/2013, max = 4/1/2015
C: min & max = 1/1/2013
I'm having trouble figuring out how to identify only uninterrupted groupings of Type within a grouping of ID. For ID #1, I need to keep the two 'A' Types with separate min/max dates because they were split by a Type 'B', so I can't just pull the min date of all Type A's for ID #1, it has to be two separate instances.
What I've tried is something like the below two lines, but neither of these accurately captures the case mentioned above for ID #1 where Type B interrupts Type A.
Max(Date) OVER (Partition By ID, Type)
or this:
Row_Number() OVER (Partition By ID, Type ORDER BY Date DESC), then selecting row #1 for the max date, and the same with Date ASC and row #1 for the min date.
Thank you for any insight you can provide!
If I understand right, you want the min/max values for an id/type grouped using a descending date sort, but the catch is that you want them based on clusters within the id by time.
What you can do is use CONDITIONAL_CHANGE_EVENT to tag the rows on change of type, then use that in your GROUP BY on a standard min/max aggregation.
This would be the intermediate step towards getting to what you want:
select ID, Type, Date,
       CONDITIONAL_CHANGE_EVENT(Type) OVER (PARTITION BY ID ORDER BY Date desc) cce
from mytable
group by ID, Type, Date
order by ID, Date desc, Type
ID   Type  Date                 cce
1    A     2015-07-01 00:00:00  0
1    B     2015-01-01 00:00:00  1
1    A     2014-08-05 00:00:00  2
22   B     2015-03-01 00:00:00  0
22   B     2014-09-01 00:00:00  0
333  A     2015-08-01 00:00:00  0
333  B     2015-04-01 00:00:00  1
333  B     2014-03-29 00:00:00  1
333  B     2013-02-28 00:00:00  1
333  C     2013-01-01 00:00:00  2
Once you have the rows tagged by CCE, you can run a standard aggregate on top, grouping on cce, to get the min/max you are looking for. You can play with the ORDER BY at the bottom; this ordering seems to make the most sense to me.
select id, type, min(date), max(date)
from (
      select ID, Type, Date,
             CONDITIONAL_CHANGE_EVENT(Type) OVER (PARTITION BY ID ORDER BY Date desc) cce
      from mytable
      group by ID, Type, Date
     ) x
group by id, type, cce
order by id, 3 desc, 4 desc;
id   type  min                  max
1    A     2015-07-01 00:00:00  2015-07-01 00:00:00
1    B     2015-01-01 00:00:00  2015-01-01 00:00:00
1    A     2014-08-05 00:00:00  2014-08-05 00:00:00
22   B     2014-09-01 00:00:00  2015-03-01 00:00:00
333  A     2015-08-01 00:00:00  2015-08-01 00:00:00
333  B     2013-02-28 00:00:00  2015-04-01 00:00:00
333  C     2013-01-01 00:00:00  2013-01-01 00:00:00
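
For engines that lack CONDITIONAL_CHANGE_EVENT, the same islands can be rebuilt with lag() and a running sum; a sketch of an equivalent, more portable query:
select id, type, min(date) as min_date, max(date) as max_date
from (
      select id, type, date,
             -- running count of type changes builds the island id
             sum(chg) over (partition by id order by date desc
                            rows unbounded preceding) as grp
      from (
            select id, type, date,
                   -- flag rows whose type differs from the previous row (date descending)
                   case when lag(type) over (partition by id order by date desc) = type
                        then 0 else 1 end as chg
            from mytable
           ) x
     ) y
group by id, type, grp
order by id, 3 desc, 4 desc;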