Need to find Average of top 3 records grouped by ID in SQL - sql

I have a postgres table with customer ID's, dates, and integers. I need to find the average of the top 3 records for each customer ID that have dates within the last year. I can do it with a single ID using the SQL below (id is the customer ID, weekending is the date, and maxattached is the integer).
One caveat: the maximum values are per month, meaning we're only looking at the highest value in a given month to create our dataset, thus why we're extracting month from the date.
SELECT
id,
round(avg(max),0)
FROM
(
select
id,
extract(month from weekending) as month,
extract(year from weekending) as year,
max(maxattached) as max
FROM
myTable
WHERE
weekending >= now() - interval '1 year' AND
id=110070 group by id,month,year
ORDER BY
max desc limit 3
) AS t
GROUP BY id;
How can I expand this query to include all ID's and a single averaged number for each one?
Here is some sample data:
ID | MaxAttached | Weekending
110070 | 5 | 2011-11-10
110070 | 6 | 2011-11-17
110071 | 4 | 2011-11-10
110071 | 7 | 2011-11-17
110070 | 3 | 2011-12-01
110071 | 8 | 2011-12-01
110070 | 5 | 2012-01-01
110071 | 9 | 2012-01-01
So, for this sample table, I would expect to receive the following results:
ID | MaxAttached
110070 | 5
110071 | 8
This averages the highest value in a given month for each ID (6,3,5 for 110070 and 7,8,9 for 110071)
Note: postgres version 8.1.15

First - get the max(maxattached) for every customer and month:
SELECT id,
max(maxattached) as max_att
FROM myTable
WHERE weekending >= now() - interval '1 year'
GROUP BY id, date_trunc('month',weekending);
Next - for every customer rank all his values:
SELECT id,
max_att,
row_number() OVER (PARTITION BY id ORDER BY max_att DESC) as max_att_rank
FROM <previous select here>;
Next - get the top 3 for every customer:
SELECT id,
max_att
FROM <previous select here>
WHERE max_att_rank <= 3;
Next - get the avg of the values for every customer:
SELECT id,
avg(max_att) as avg_att
FROM <previous select here>
GROUP BY id;
Next - just put all the queries together and rewrite/simplify them for your case.
UPDATE: Here is an SQLFiddle with your test data and the queries: SQLFiddle.
UPDATE2: Here is the query, that will work on 8.1 :
SELECT customer_id,
(SELECT round(avg(max_att),0)
FROM (SELECT max(maxattached) as max_att
FROM table1
WHERE weekending >= now() - interval '2 year'
AND id = ct.customer_id
GROUP BY date_trunc('month',weekending)
ORDER BY max_att DESC
LIMIT 3) sub
) as avg_att
FROM customer_table ct;
The idea - to take your initial query and run it for every customer (customer_table - table with all unique id for customers).
Here is SQLFiddle with this query: SQLFiddle.
Only tested on version 8.3 (8.1 is too old to be on SQLFiddle).

8.3 version
8.3 is the oldest version I've got access to, so I can't guarantee it'll work in 8.1
I'm using a temporary table to work out the best three records.
CREATE TABLE temp_highest_per_month as
select
id,
extract(month from weekending) as month,
extract(year from weekending) as year,
max(maxattached) as max_in_month,
0 as priority
FROM
myTable
WHERE
weekending >= now() - interval '1 year'
group by id,month,year;
UPDATE temp_highest_per_month t
SET priority =
(select count(*) from temp_highest_per_month t2
where t2.id = t.id and
(t.max_in_month < t2.max_in_month or
(t.max_in_month= t2.max_in_month and
t.year * 12 + t.month > t2.year * 12 + t.month)));
select id,round(avg(max_in_month),0)
from temp_highest_per_month
where priority <= 3
group by id;
The year & month are included in the working out the priority so that if two months have the same maximum, they'll still be included in the numbering correctly.
9.1 version
Similar to Igor's answer, but I used the With clause to split the steps.
with highest_per_month as
( select
id,
extract(month from weekending) as month,
extract(year from weekending) as year,
max(maxattached) as max_in_month
FROM
myTable
WHERE
weekending >= now() - interval '1 year'
group by id,month,year),
prioritised as
( select id, month, year, max_in_month,
row_number() over (partition by id, month, year
order by max_in_month desc)
as priority
from highest_per_month
)
select id, round(avg(max_in_month),0)
from prioritised
where priority <= 3
group by id;

Related

Calculating Aggregates on subset of data based on condition

I have a DB as follows:
| company | timestamp | value |
| ------- | ---------- | ----- |
| google | 2020-09-01 | 5 |
| google | 2020-08-01 | 4 |
| amazon | 2020-09-02 | 3 |
I'd like to calculate the average value for each company within the last year if there are >= 20 datapoints. If there are less than 20 datapoints then I'd like the average during the entire time duration. I know I can do two separate queries and get the averages for each scenario. The question I suppose is how do I merge them back in a single table based on the criteria I have.
select company, avg(value) from my_db GROUP BY company;
select company, avg(value) from my_db
where timestamp > (CURRENT_DATE - INTERVAL '12 months')
GROUP BY company;
WITH last_year AS (
SELECT company, avg(value), 'year' AS range -- optional tag
FROM tbl
WHERE timestamp >= now() - interval '1 year'
GROUP BY 1
HAVING count(*) >= 20 -- 20+ rows in range
)
SELECT company, avg(value), 'all' AS range
FROM tbl
WHERE NOT EXISTS (SELECT FROM last_year WHERE company = t.company)
GROUP BY 1
UNION ALL TABLE last_year;
db<>fiddle here
An index on (timestamp) will only be used if your table is big and holds many years.
If most companies have 20+ rows in range, an index on (company) will be used for the 2nd SELECT to retrieve the few outliers.
Use conditional aggregation:
select company,
case
when sum(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end) >= 20 then
avg(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end)
else avg(value)
end
from my_db
group by company
If by 20 datapoints you mean 20 rows in the last 12 months for each company, then:
select company,
case
when count(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end) >= 20 then
avg(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end)
else avg(value)
end
from my_db
group by company
You can use window functions to provide the information for filtering:
select company, avg(value),
(count(*) = cnt_this_year) as only_this_year
from (select t.*,
count(*) filter (where date_trunc('year', datecol) = date_trunc('year', now()) over (partition by company) as cnt_this_year
from t
) t
where cnt_this_year >= 20 and date_trunc('year', datecol) = date_trunc('year', now()) or
cnt_this_year < 20
group by company;
The third column specifies if all the rows are from this year. By filtering in the where clause, it is simple to add other calculations as well (such as min(), max(), and so on).

Get last known record per month in BigQuery

Account balance collection, that shows the account balance of a customer at a given day:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 1 | -300 | 2019-10-11 |
| 1 | -200 | 2019-10-10 |
| 1 | 0 | 2019-10-09 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Notice, that customer #2 had no updates to his account balance in October.
I want to get the last account balance per customer per month. If there has been no account balance update for a customer in a given month, the last known account balance should be transferred to the current month. The result should look like that:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 2 | 200 | 2019-10-10 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Since the account balance of customer #2 was not updated in October but in September, we create a copy of the row from September changing the date to October. Any ideas how to achieve this in BigQuery?
Below is for BigQuery Standard SQL
#standardSQL
WITH customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
if to apply to sample data from your question - as it is in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 customer_id, -500 value, DATE '2019-10-12' timestamp UNION ALL
SELECT 1, -300, '2019-10-11' UNION ALL
SELECT 1, -200, '2019-10-10' UNION ALL
SELECT 2, 200, '2019-09-10' UNION ALL
SELECT 2, 100, '2019-08-11' UNION ALL
SELECT 2, 50, '2019-07-12' UNION ALL
SELECT 1, 600, '2019-09-02'
), customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
-- ORDER BY month DESC, customer_id
result is
Row customer_id value timestamp
1 1 -500 2019-10-12
2 2 200 2019-10-10
3 1 600 2019-09-02
4 2 200 2019-09-10
5 1 null null
6 2 100 2019-08-11
7 1 null null
8 2 50 2019-07-12
The following query should mostly answer your question by creating a 'month-end' record for each customer for every month and getting the most recent balance:
with
-- Generate a set of months
month_begins as (
select dt from unnest(generate_date_array('2019-01-01','2019-12-01', interval 1 month)) dt
),
-- Get the month ends
month_ends as (
select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date from month_begins
),
-- Cross Join and group so we get 1 customer record for every month to account for
-- situations where customer doesn't change balance in a month
user_month_ends as (
select
customer_id,
month_end_date
from `project.dataset.table`
cross join month_ends
group by 1,2
),
-- Fan out so for each month end, you get all balances prior to month end for each customer
values_prior_to_month_end as (
select
customer_id,
value,
timestamp,
month_end_date
from `project.dataset.table`
inner join user_month_ends using(customer_id)
where timestamp <= month_end_date
),
-- Order by most recent balance before month end, even if it was more than 1+ months ago
ordered as (
select
*,
row_number() over (partition by customer_id, month_end_date order by timestamp desc) as my_row
from values_prior_to_month_end
),
-- Finally, select only the most recent record for each customer per month
final as (
select
* except(my_row)
from ordered
where my_row = 1
)
select * from final
order by customer_id, month_end_date desc
A few caveats:
I did not order results to match your desired result set, and I also kept a month-end date to illustrate the concept. You can easily change the ordering and exclude unneeded fields.
In the month_begins CTE, I set a range of months into the future, so your result set will contain the most recent balance of 'future months'. To make this a bit prettier, consider changing '2019-12-01' to 'current_date()' and your query will always return to the end of the current month.
Your timestamp field looks to be dates, so I used date logic, but you should be able to apply the same principles to use timestamp logic if your underlying fields are actual timestamps.
In your result set, I'm not sure why your 2nd row (customer 2) would have a timestamp of '2019-10-10', that seems arbitrary as customer 2 has no 2nd balance record.
I purposefully split the logic into several CTEs so I could comment on each step easier, you could definitely perform several steps in the same code block for a more condensed query.

SQL query needed - Counting 365 days backwards

I have searched the forum many times but couldn't find a solution for my situation. I am working with an Oracle database.
I have a table with all Order Numbers and Customer Numbers by Day. It looks like this:
Day | Customer Nbr | Order Nbr
2018-01-05 | 25687459 | 256
2018-01-09 | 36478592 | 398
2018-03-07 | 25687459 | 1547
and so on....
Now I need a SQL Query which gives me a table by day and Customer Nbr and counts the number of unique Order Numbers within the last 365 days starting from column 1.
For the example above the resulting table should look like:
Day | Customer Nbr | Order Cnt
2019-01-01 | 25687459 | 2
2019-01-02 | 25687459 | 2
...
2019-03-01 | 25687459 | 1
One method is to generate values for all days of interest for each customer and then use a correlated subquery:
with dates as (
select date '2019-01-01' + rownum as dte from dual
connect by date '2019-01-01' + rownum < sysdate
)
select d.dte, t.customer_nbr,
(select count(*)
from t t2
where t2.customer_nbr = t.customer_nbr and
t2.day <= t.dte and
t2.date > t.dte - 365
) as order_cnt
from dates d cross join
(select distinct customer_nbr from t) ;
Edit:
I've just seen you clarify the question, which I've interpreted to mean:
For every day in the last year, show how many orders there were for each customer between that date, and 1 year previously. Working on an answer now...
Updated Answer:
For each customer, we count the number of records between the order day, and 365 days before it...
WITH yourTable AS
(
SELECT SYSDATE - 1 Day, 'Alex' CustomerNbr FROM DUAL
UNION ALL
SELECT SYSDATE - 2, 'Alex' FROM DUAL
UNION ALL
SELECT SYSDATE - 366, 'Alex'FROM DUAL
UNION ALL
SELECT SYSDATE - 400, 'Alex'FROM DUAL
UNION ALL
SELECT SYSDATE - 500, 'Alex'FROM DUAL
UNION ALL
SELECT SYSDATE - 1, 'Joe'FROM DUAL
UNION ALL
SELECT SYSDATE - 300, 'Chris'FROM DUAL
UNION ALL
SELECT SYSDATE - 1, 'Chris'FROM DUAL
)
SELECT Day, CustomerNbr, OrdersLast365Days
FROM yourTable t
OUTER APPLY
(
SELECT COUNT(1) OrdersLast365Days
FROM yourTable t2
WHERE t.CustomerNbr = t2.CustomerNbr
AND TRUNC(t2.Day) >= TRUNC(t.Day) - 364
AND TRUNC(t2.Day) <= TRUNC(t.Day)
)
ORDER BY t.Day DESC, t.CustomerNbr;
If you want to report on just the days you have orders for, then a simple WHERE clause should be enough:
SELECT Day, CustomerNbr, COUNT(1) OrderCount
FROM <yourTable>
WHERE TRUNC(DAY) >= TRUNC(SYSDATE -364)
GROUP BY Day, CustomerNbr
ORDER BY Day Desc;
If you want to report on every day, you'll need to generate them first. This can be done by a recursive CTE, which you then join to your table:
WITH last365Days AS
(
SELECT TRUNC (SYSDATE - ROWNUM + 1) dt
FROM DUAL CONNECT BY ROWNUM < 365
)
SELECT d.Day, COALESCE(t.CustomerNbr, 'None') CustomerNbr, SUM(CASE WHEN t.CustomerNbr IS NULL THEN 0 ELSE 1 END) OrderCount
FROM last365Days d
LEFT OUTER JOIN <yourTable> t
ON d.Day = TRUNC(t.Day)
GROUP BY d.Day, t.CustomerNbr
ORDER BY d.Day Desc;
I would probably have done it with and analytic function. In your windowing clause, you can specify a number of rows before, or a range. In this case I will use a range.
This will give you, For Each customer for each day the number of orders during one rolling year before the date displayed
WITH DATES AS (
SELECT * FROM
(SELECT TRUNC(SYSDATE)-(LEVEL-1) AS DAY FROM DUAL CONNECT BY TRUNC(SYSDATE)-(LEVEL-1) >= ( SELECT MIN(TRUNC(DAY)) FROM MY_TABLE ))
CROSS JOIN
(SELECT DISTINCT CUST_ID FROM MY_TABLE))
SELECT DISTINCT
DATES.DAY,
DATES.CUST_ID,
COUNT(ORDER_ID) OVER (PARTITION BY DATES.CUST_ID ORDER BY DATES.DAY RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND INTERVAL '1' SECOND PRECEDING)
FROM
DATES
LEFT JOIN
MY_TABLE
ON DATES.DAY=TRUNC(MY_TABLE.DAY) AND DATES.CUST_ID=MY_TABLE.CUST_ID
ORDER BY DATES.CUST_ID,DATES.DAY;

Combining multiple scalar bigquery queries into a single query to generate one table

I have a BiqQuery query that basically takes a date as a parameter and calculates the number of active users our app had near that date.
Right now, if I want to make a graph over a year of active users, I have to run the query 12 times (once per month) and collate the results manually, which is error-prone and time consuming.
Is there a way to make a single bigquery query that runs the subquery 12 times and puts the results on 12 different rows?
For example, if my query is
SELECT COUNT(*) FROM MyTable WHERE activityTime < date '2017-01-01'
How can I get a table like
| Date | Count |
|------------|---------|
| 2017-01-01 | 50000 |
| 2017-02-01 | 40000 |
| 2017-03-01 | 30000 |
| 2017-04-01 | 20000 |
| 2017-05-01 | 10000 |
Supposing that you have a column called date and one called user_id and you want to calculate distinct users on a monthly basis, you can run a query such as:
#standardSQL
SELECT
DATE_TRUNC(date, MONTH) AS month,
COUNT(DISTINCT user_id) AS distinct_users
FROM YourTable
GROUP BY month
ORDER BY month ASC;
(Here you can replace YourTable with the subquery that you want to run). As a self-contained example:
#standardSQL
WITH YourTable AS (
SELECT DATE '2017-06-25' AS date, 10 AS user_id UNION ALL
SELECT DATE '2017-05-04', 11 UNION ALL
SELECT DATE '2017-06-20', 10 UNION ALL
SELECT DATE '2017-04-01', 11 UNION ALL
SELECT DATE '2017-06-02', 12 UNION ALL
SELECT DATE '2017-04-13', 10
)
SELECT
DATE_TRUNC(date, MONTH) AS month,
COUNT(DISTINCT user_id) AS distinct_users
FROM YourTable
GROUP BY month
ORDER BY month ASC;
Elliot taught me UNION ALL and it seemed to do the trick:
SELECT COUNT(*) FROM MyTable WHERE activityTime < date '2017-01-01'
UNION ALL
SELECT COUNT(*) FROM MyTable WHERE activityTime < date '2017-02-01'
UNION ALL
SELECT COUNT(*) FROM MyTable WHERE activityTime < date '2017-03-01'
Maybe there's a nicer way to parameterize the dates in the WHERE clause, but this did the trick for me.

Querying for an ID that has the most number of reads

Suppose I have a table like the one below:
+----+-----------+
| ID | TIME |
+----+-----------+
| 1 | 12-MAR-15 |
| 2 | 23-APR-14 |
| 2 | 01-DEC-14 |
| 1 | 01-DEC-15 |
| 3 | 05-NOV-15 |
+----+-----------+
What I want to do is for each year ( the year is defined as DATE), list the ID that has the highest count in that year. So for example, ID 1 occurs the most in 2015, ID 2 occurs the most in 2014, etc.
What I have for a query is:
SELECT EXTRACT(year from time) "YEAR", COUNT(ID) "ID"
FROM table
GROUP BY EXTRACT(year from time)
ORDER BY COUNT(ID) DESC;
But this query just counts how many times a year occurs, how do I fix it to highest count of an ID in that year?
Output:
+------+----+
| YEAR | ID |
+------+----+
| 2015 | 2 |
| 2012 | 2 |
+------+----+
Expected Output:
+------+----+
| YEAR | ID |
+------+----+
| 2015 | 1 |
| 2014 | 2 |
+------+----+
Starting with your sample query, the first change is simply to group by the ID as well as by the year.
SELECT EXTRACT(year from time) "YEAR" , id, COUNT(*) "TOTAL"
FROM table
GROUP BY EXTRACT(year from time), id
ORDER BY EXTRACT(year from time) DESC, COUNT(*) DESC
With that, you could find the rows you want by visual inspection (the first row for each year is the ID with the most rows).
To have the query just return the rows with the highest totals, there are several different ways to do it. You need to consider what you want to do if there are ties - do you want to see all IDs tied for highest in a year, or just an arbitrary one?
Here is one approach - if there is a tie, this should return just the lowest of the tied IDs:
WITH groups AS (
SELECT EXTRACT(year from time) "YEAR" , id, COUNT(*) "TOTAL"
FROM table
GROUP BY EXTRACT(year from time), id
)
SELECT year, MIN(id) KEEP (DENSE_RANK FIRST ORDER BY total DESC)
FROM groups
GROUP BY year
ORDER BY year DESC
You need to count per id and then apply a RANK on that count:
SELECT *
FROM
(
SELECT EXTRACT(year from time) "YEAR" , ID, COUNT(*) AS cnt
, RANK() OVER (PARTITION BY "YEAR" ORDER BY COUNT(*) DESC) AS rnk
FROM table
GROUP BY EXTRACT(year from time), ID
) dt
WHERE rnk = 1
If this return multiple rows with the same high count per year and you want just one of them randomly, you can switch to a ROW_NUMBER.
This should do what you're after, I think:
with sample_data as (select 1 id, to_date('12/03/2015', 'dd/mm/yyyy') time from dual union all
select 2 id, to_date('23/04/2014', 'dd/mm/yyyy') time from dual union all
select 2 id, to_date('01/12/2014', 'dd/mm/yyyy') time from dual union all
select 1 id, to_date('01/12/2015', 'dd/mm/yyyy') time from dual union all
select 3 id, to_date('05/11/2015', 'dd/mm/yyyy') time from dual)
-- End of creating a subquery to mimick a table called "sample_data" containing your input data.
-- See SQL below:
select yr,
id most_frequent_id,
cnt_id_yr cnt_of_most_freq_id
from (select to_char(time, 'yyyy') yr,
id,
count(*) cnt_id_yr,
dense_rank() over (partition by to_char(time, 'yyyy') order by count(*) desc) dr
from sample_data
group by to_char(time, 'yyyy'),
id)
where dr = 1;
YR MOST_FREQUENT_ID CNT_OF_MOST_FREQ_ID
---- ---------------- -------------------
2014 2 2
2015 1 2