Find the average of ids in a month - SQL

I can calculate the number of ids in a month, sum it up over 12 months, and get the average using this code:
select id, to_char(event_month, 'yyyy') event_year, sum(cnt) overall_count, avg(cnt) average_count
from (
    select id, trunc(event_date, 'month') event_month, count(*) cnt
    from daily
    where event_date >= date '2019-01-01' and event_date < date '2019-01-31'
    group by id, trunc(event_date, 'month')
) t
group by id, to_char(event_month, 'yyyy')
The results look something like this:
ID| YEAR | OVER_ALL_COUNT| AVG
1| 2019 | 712 | 59.33
2| 2019 | 20936849 | 161185684.6
3| 2019 | 14255773 | 2177532.2
However, I want to modify this to get the overall id counts for a month instead, and the average of the id counts per month. Desired result is:
ID| MONTH | OVER_ALL_COUNT| AVG
1| Jan | 152 | 10.3
2| Jan | 15000 | 1611
3| Jan | 14255 | 2177
1| Feb | 4300 | 113
2| Feb | 9700 | 782
3| Feb | 1900 | 97
where January has 152 id counts overall for id=1, and the average id count per day is 10.3. For id=2, the January count is 15000 and the average id=2 count for January is 1611.

You just need to change the truncation in your subquery to truncate by day instead of by month, then truncate the outer query by month instead of by year (the date range is widened to the full year so that months beyond January show up):
select id, to_char(event_day, 'Mon') event_month, sum(cnt) overall_count, avg(cnt) average_count
from (
    select id, trunc(event_date) event_day, count(*) cnt
    from daily
    where event_date >= date '2019-01-01' and event_date < date '2020-01-01'
    group by id, trunc(event_date)
) t
group by id, to_char(event_day, 'Mon')
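To sanity-check this two-level aggregation, here is a sketch of the same monthly-total / daily-average logic translated to SQLite and run from Python. The table name and sample rows are made up, and `strftime`/`date` stand in for Oracle's `to_char`/`trunc`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily (id INTEGER, event_date TEXT);
-- id 1: 3 events on Jan 1, 1 event on Jan 2; id 2: 2 events on Feb 5
INSERT INTO daily VALUES
  (1, '2019-01-01'), (1, '2019-01-01'), (1, '2019-01-01'),
  (1, '2019-01-02'),
  (2, '2019-02-05'), (2, '2019-02-05');
""")

rows = conn.execute("""
SELECT id,
       strftime('%m', event_day) AS event_month,
       SUM(cnt)                  AS overall_count,
       AVG(cnt)                  AS average_count
FROM (
  SELECT id, date(event_date) AS event_day, COUNT(*) AS cnt
  FROM daily
  WHERE event_date >= '2019-01-01' AND event_date < '2020-01-01'
  GROUP BY id, date(event_date)
) t
GROUP BY id, strftime('%m', event_day)
ORDER BY id, event_month
""").fetchall()
print(rows)  # [(1, '01', 4, 2.0), (2, '02', 2, 2.0)]
```

Note the average here is per day *with events*, which matches the answer's semantics (days with no rows don't pull the average down).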

This answers the original version of the question.
You can use last_day():
select id, to_char(event_month, 'yyyy') as event_year, sum(cnt) as overall_count,
       avg(cnt) as average_count,
       extract(day from last_day(min(event_month))) as days_in_month,
       sum(cnt) / extract(day from last_day(min(event_month))) as avg_per_day_in_month
from (select id, trunc(event_date, 'month') as event_month, count(*) as cnt
      from daily
      where event_date >= date '2019-01-01' and
            event_date < date '2020-01-01'
      group by id, trunc(event_date, 'month')
     ) t
group by id, to_char(event_month, 'yyyy')
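The last_day() trick can be checked in SQLite, which has no last_day() but can compute it with the date modifiers 'start of month', '+1 month', '-1 day'. A sketch with a made-up table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily (id INTEGER, event_date TEXT);
INSERT INTO daily VALUES
  (1, '2019-01-01'), (1, '2019-01-10'), (1, '2019-01-31');
""")

rows = conn.execute("""
SELECT id,
       strftime('%Y-%m', event_date) AS month,
       COUNT(*)                      AS overall_count,
       -- last_day() equivalent: jump to the 1st, add a month, step back a day
       CAST(strftime('%d', date(MIN(event_date), 'start of month',
                                '+1 month', '-1 day')) AS INTEGER) AS days_in_month,
       COUNT(*) * 1.0 /
       CAST(strftime('%d', date(MIN(event_date), 'start of month',
                                '+1 month', '-1 day')) AS INTEGER) AS avg_per_day
FROM daily
GROUP BY id, strftime('%Y-%m', event_date)
""").fetchall()
print(rows)  # 3 events over the 31 days of January
```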

Related

How to lag over month and over year at the same time?

I am attempting to find percentage change per each month per company.
Table
Year Mon company_id Revenue
2018 2018-06-01 42 2000
2018 2018-07-01 42 3000
2019 2019-06-01 42 4000
2019 2019-07-01 42 9000
I attempted this and failed.
select *, lag(Revenue) over(partition by company_id order by Year) from table
I'm working to get the below result (the table has multiple company_ids).
Year Mon company_id Revenue percentage change
2018 2018-06-01 42 2000
2018 2018-07-01 42 3000
2019 2019-06-01 42 4000 100
2019 2019-07-01 42 9000 200
Woof!
You need to tease out the month so that you can partition by it:
WITH subq AS (
SELECT year, mon, company_id, revenue,
lag(revenue) OVER (PARTITION BY company_id, extract(month FROM mon)
ORDER BY year) AS prev_revenue
FROM "table"
)
SELECT year, mon, company_id, revenue,
(revenue - prev_revenue) * 100.0 / prev_revenue AS percent_change
FROM subq;
This assumes that the date stored in mon only serves to identify the month, and that its year component (which can differ from the year column) is irrelevant.
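The partition-by-month idea can be verified in SQLite (3.25+ for window functions), with `strftime('%m', mon)` playing the role of `extract(month FROM mon)` and the sample data taken from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revenue (year INTEGER, mon TEXT, company_id INTEGER, revenue INTEGER);
INSERT INTO revenue VALUES
  (2018, '2018-06-01', 42, 2000),
  (2018, '2018-07-01', 42, 3000),
  (2019, '2019-06-01', 42, 4000),
  (2019, '2019-07-01', 42, 9000);
""")

rows = conn.execute("""
WITH subq AS (
  SELECT year, mon, company_id, revenue,
         -- partition by company AND calendar month, order by year
         LAG(revenue) OVER (PARTITION BY company_id, strftime('%m', mon)
                            ORDER BY year) AS prev_revenue
  FROM revenue
)
SELECT year, mon, company_id, revenue,
       (revenue - prev_revenue) * 100.0 / prev_revenue AS percent_change
FROM subq
ORDER BY mon
""").fetchall()
print(rows)
```

June 2019 is compared against June 2018 (100% change) and July 2019 against July 2018 (200% change), as desired.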
You may use a window function to get exactly the row offset one year in the past.
max(revenue) over (partition by company_id order by mon range between '1 year' PRECEDING and '1 year' PRECEDING ) revenue_lag
This will work even if you have missing data for some year (you will not match a revenue that is two or more years old).
Unfortunately this does not work for me in PostgreSQL 14.6 with LAG (not sure if by design), so I'm using MAX.
Example
with tab as (
select * from (values
(2018,date'2018-06-01', 42, 2000),
(2018,date'2018-07-01', 42, 3000),
(2018,date'2018-07-15', 42, 3020),
(2019,date'2019-06-01', 42, 4000),
(2019,date'2019-07-01', 42, 9000)
) tab(year, mon, company_id, revenue)
)
select tab.*,
max(revenue) over (partition by company_id order by mon range between '1 year' PRECEDING and '1 year' PRECEDING ) revenue_lag,
lag(revenue) over (partition by company_id order by mon range between '1 year' PRECEDING and '1 year' PRECEDING ) lag
from tab
;
year|mon |company_id|revenue|revenue_lag|lag |
----+----------+----------+-------+-----------+----+
2018|2018-06-01| 42| 2000| | |
2018|2018-07-01| 42| 3000| |2000|
2018|2018-07-15| 42| 3020| |3000|
2019|2019-06-01| 42| 4000| 2000|3020|
2019|2019-07-01| 42| 9000| 3000|4000|
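SQLite has no RANGE frames with interval offsets, but the same "exactly one year earlier" match can be sketched as a self-join on `date(mon, '-1 year')`. Using the sample rows above, this reproduces the revenue_lag column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tab (year INTEGER, mon TEXT, company_id INTEGER, revenue INTEGER);
INSERT INTO tab VALUES
  (2018, '2018-06-01', 42, 2000),
  (2018, '2018-07-01', 42, 3000),
  (2018, '2018-07-15', 42, 3020),
  (2019, '2019-06-01', 42, 4000),
  (2019, '2019-07-01', 42, 9000);
""")

rows = conn.execute("""
SELECT t.year, t.mon, t.company_id, t.revenue, prev.revenue AS revenue_lag
FROM tab t
LEFT JOIN tab prev
  ON prev.company_id = t.company_id
 -- match only the row dated exactly one year before this one
 AND prev.mon = date(t.mon, '-1 year')
ORDER BY t.mon
""").fetchall()
print(rows)
```

As in the Postgres output, '2018-07-15' gets no lag value because no row exists exactly one year earlier.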

How to SQL Query Last Day of the Month that has transaction?

My set of data are as follows:
+------+-------+-----+--------+
| Year | Month | Day | Amount |
+------+-------+-----+--------+
| 2019 | 01 | 01 | 10 |
| 2019 | 01 | 15 | 30 |
| 2019 | 01 | 29 | 40 |
| 2019 | 02 | 02 | 50 |
| 2019 | 02 | 22 | 60 |
| 2019 | 03 | 11 | 70 |
| 2019 | 03 | 31 | 80 |
+------+-------+-----+--------+
I just want to see the last record day of each month that has a transaction.
My preferred result shown should look like this:
+------+-------+--------+
| Year | Month | Amount |
+------+-------+--------+
| 2019 | 01 | 40 |
| 2019 | 02 | 60 |
| 2019 | 03 | 80 |
+------+-------+--------+
For each combination of Year and Month, you want to get the maximum Day and Amount values:
SELECT Year, Month, max(Day) as Day, max(Amount) as Amount
FROM t
GROUP BY Year, Month;
Note that every column appearing in the SELECT clause but not in the GROUP BY clause must be aggregated (with max here).
That is, assuming Amount corresponds to the total per day, which is what your example suggests.
If your table contains more than one Amount per day, then you also need to sum up the amounts per day. I'd use something like:
SELECT Year, Month, max(Day) as Day, max(Amount) as Amount
FROM (
SELECT Year, Month, Day, sum(Amount) as Amount
FROM t
GROUP BY Year, Month, Day
) as tmp
GROUP BY Year, Month;
Or:
SELECT Year, Month, Day, sum(Amount) as Amount
FROM (
SELECT *, rank() over(partition by Year, Month order by Day desc) as r
FROM t
) as tmp
WHERE r = 1
GROUP BY Year, Month, Day;
Mind that you want to use rank() and not row_number() in here, as you need to give the same rank identifier to ties (same day).
Of course if you want you can wrap any of the queries above with:
SELECT Year, Month, Amount
FROM (<query>) as q;
to get rid of the day column.
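The rank()-based variant above can be verified in SQLite; the table name `t` and the sample rows come from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (Year INTEGER, Month INTEGER, Day INTEGER, Amount INTEGER);
INSERT INTO t VALUES
  (2019, 1, 1, 10), (2019, 1, 15, 30), (2019, 1, 29, 40),
  (2019, 2, 2, 50), (2019, 2, 22, 60),
  (2019, 3, 11, 70), (2019, 3, 31, 80);
""")

rows = conn.execute("""
SELECT Year, Month, Day, SUM(Amount) AS Amount
FROM (
  -- rank() keeps ties on the same last day, summed by the outer GROUP BY
  SELECT *, RANK() OVER (PARTITION BY Year, Month ORDER BY Day DESC) AS r
  FROM t
) tmp
WHERE r = 1
GROUP BY Year, Month, Day
ORDER BY Year, Month
""").fetchall()
print(rows)  # [(2019, 1, 29, 40), (2019, 2, 22, 60), (2019, 3, 31, 80)]
```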
You can use ROW_NUMBER() to achieve this:
SELECT [Year], [Month], [Amount]
FROM (
    SELECT ROW_NUMBER() OVER(PARTITION BY [Year], [Month] ORDER BY [Day] DESC) rn, *
    FROM table) t
WHERE rn = 1
One method is a correlated subquery:
select t.*
from t
where t.day = (select max(t2.day)
from t t2
where t2.year = t.year and t2.month = t.month
);
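The correlated-subquery approach is easy to check in SQLite with the question's sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (Year INTEGER, Month INTEGER, Day INTEGER, Amount INTEGER);
INSERT INTO t VALUES
  (2019, 1, 1, 10), (2019, 1, 15, 30), (2019, 1, 29, 40),
  (2019, 2, 2, 50), (2019, 2, 22, 60),
  (2019, 3, 11, 70), (2019, 3, 31, 80);
""")

rows = conn.execute("""
SELECT t.*
FROM t
WHERE t.Day = (SELECT MAX(t2.Day)
               FROM t t2
               WHERE t2.Year = t.Year AND t2.Month = t.Month)
ORDER BY t.Year, t.Month
""").fetchall()
print(rows)  # [(2019, 1, 29, 40), (2019, 2, 22, 60), (2019, 3, 31, 80)]
```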
Another common approach uses row_number():
select t.*
from (select t.*,
row_number() over (partition by year, month order by day desc) as seqnum
from t
) t
where seqnum = 1;
You can use QUALIFY as below:
SELECT ROW_NUMBER() OVER(PARTITION BY YEAR, MONTH ORDER BY DAY DESC) as rn, *
FROM yourtable
QUALIFY rn = 1
Using the ROW_NUMBER() function:
with cte_order as (
    select year, month, amount,
           row_number() over(partition by year, month order by day desc) as identity_row
    from test1
)
select year, month, amount from cte_order where identity_row = 1;
Using the QUALIFY clause:
select year, month, amount, day,
       row_number() over(partition by year, month order by day desc) as identity_row
from test1
qualify identity_row = 1;
select t.year, t.month, t.amount
from t
join (select year, month, max(day) as day
      from t
      group by year, month) m
  on t.year = m.year and t.month = m.month and t.day = m.day;

Get detail days between two dates (mysql query)

I have data like this:
id | start_date | end_date
----------------------------
1 | 16-09-2019 | 22-12-2019
I want to get the following results:
id | month | year | days
------------------------
1 | 09 | 2019 | 15
1 | 10 | 2019 | 31
1 | 11 | 2019 | 30
1 | 12 | 2019 | 22
Is there a way to get that result?
This is what you want to do:
SELECT id,
       EXTRACT(MONTH FROM start_date) as month,
       EXTRACT(YEAR FROM start_date) as year,
       DATEDIFF(end_date, start_date) as days
FROM tbl
You can use the MONTH(), YEAR() and DATEDIFF() functions:
SELECT id,
       MONTH(start_date) as month,
       YEAR(start_date) as year,
       DATEDIFF(end_date, start_date) as days
FROM table_name
One way is to create a Calendar table and use that.
select month,year, count(*)
from Calendar
where db_date between '2019-09-16'
and '2019-12-22'
group by month,year
You can also use a recursive CTE to achieve the same.
You can use a recursive CTE and aggregation:
with recursive cte as (
select id, start_date, end_date
from t
union all
select id, start_date + interval 1 day, end_date
from cte
where start_date < end_date
)
select id, year(start_date), month(start_date), count(*) as days
from cte
group by id, year(start_date), month(start_date);
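The recursive-CTE approach carries over to SQLite almost verbatim; `date(start_date, '+1 day')` replaces MySQL's `start_date + interval 1 day`, and ISO dates stand in for the question's DD-MM-YYYY format. A runnable sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER, start_date TEXT, end_date TEXT);
INSERT INTO t VALUES (1, '2019-09-16', '2019-12-22');
""")

rows = conn.execute("""
WITH RECURSIVE cte AS (
  SELECT id, start_date, end_date FROM t
  UNION ALL
  -- expand the range into one row per day
  SELECT id, date(start_date, '+1 day'), end_date
  FROM cte
  WHERE start_date < end_date
)
SELECT id, strftime('%m', start_date) AS month,
       strftime('%Y', start_date) AS year, COUNT(*) AS days
FROM cte
GROUP BY id, year, month
ORDER BY year, month
""").fetchall()
print(rows)  # 15 days in Sep, 31 in Oct, 30 in Nov, 22 in Dec
```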

Get last known record per month in BigQuery

Account balance collection, that shows the account balance of a customer at a given day:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 1 | -300 | 2019-10-11 |
| 1 | -200 | 2019-10-10 |
| 1 | 0 | 2019-10-09 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Notice, that customer #2 had no updates to his account balance in October.
I want to get the last account balance per customer per month. If there has been no account balance update for a customer in a given month, the last known account balance should be transferred to the current month. The result should look like that:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 2 | 200 | 2019-10-10 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Since the account balance of customer #2 was not updated in October but in September, we create a copy of the row from September changing the date to October. Any ideas how to achieve this in BigQuery?
Below is for BigQuery Standard SQL
#standardSQL
WITH customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
If applied to the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 customer_id, -500 value, DATE '2019-10-12' timestamp UNION ALL
SELECT 1, -300, '2019-10-11' UNION ALL
SELECT 1, -200, '2019-10-10' UNION ALL
SELECT 2, 200, '2019-09-10' UNION ALL
SELECT 2, 100, '2019-08-11' UNION ALL
SELECT 2, 50, '2019-07-12' UNION ALL
SELECT 1, 600, '2019-09-02'
), customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
-- ORDER BY month DESC, customer_id
The result is:
Row customer_id value timestamp
1 1 -500 2019-10-12
2 2 200 2019-10-10
3 1 600 2019-09-02
4 2 200 2019-09-10
5 1 null null
6 2 100 2019-08-11
7 1 null null
8 2 50 2019-07-12
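Stripped of the BigQuery-specific pieces, the fill-forward idea above (take the last balance in each month, else carry the previous month's forward) can be sketched in plain Python; the month list is hard-coded to the sample's range:

```python
from datetime import date

# Sample balances from the question: (customer_id, value, timestamp)
balances = [
    (1, -500, date(2019, 10, 12)),
    (1, -300, date(2019, 10, 11)),
    (1, -200, date(2019, 10, 10)),
    (1, 0,    date(2019, 10, 9)),
    (2, 200,  date(2019, 9, 10)),
    (1, 600,  date(2019, 9, 2)),
]

# Step 1: last balance per (customer, year, month) -- the ARRAY_AGG ... LIMIT 1 part.
last_in_month = {}
for cust, value, ts in balances:
    key = (cust, ts.year, ts.month)
    if key not in last_in_month or ts > last_in_month[key][0]:
        last_in_month[key] = (ts, value)

# Step 2: walk every month in range for every customer, carrying the balance
# forward when a month has no update -- the months x customers grid plus the fill.
months = [(2019, 9), (2019, 10)]  # hard-coded to the sample's range
result = []
for cust in sorted({c for c, _, _ in balances}):
    carried = None
    for y, m in months:
        carried = last_in_month.get((cust, y, m), carried)
        if carried is not None:
            result.append((cust, y, m, carried[1]))

print(sorted(result))
# [(1, 2019, 9, 600), (1, 2019, 10, -500), (2, 2019, 9, 200), (2, 2019, 10, 200)]
```

Customer 2's September balance of 200 is carried into October, matching the desired output.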
The following query should mostly answer your question by creating a 'month-end' record for each customer for every month and getting the most recent balance:
with
-- Generate a set of months
month_begins as (
select dt from unnest(generate_date_array('2019-01-01','2019-12-01', interval 1 month)) dt
),
-- Get the month ends
month_ends as (
select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date from month_begins
),
-- Cross Join and group so we get 1 customer record for every month to account for
-- situations where customer doesn't change balance in a month
user_month_ends as (
select
customer_id,
month_end_date
from `project.dataset.table`
cross join month_ends
group by 1,2
),
-- Fan out so for each month end, you get all balances prior to month end for each customer
values_prior_to_month_end as (
select
customer_id,
value,
timestamp,
month_end_date
from `project.dataset.table`
inner join user_month_ends using(customer_id)
where timestamp <= month_end_date
),
-- Order by most recent balance before month end, even if it was more than 1+ months ago
ordered as (
select
*,
row_number() over (partition by customer_id, month_end_date order by timestamp desc) as my_row
from values_prior_to_month_end
),
-- Finally, select only the most recent record for each customer per month
final as (
select
* except(my_row)
from ordered
where my_row = 1
)
select * from final
order by customer_id, month_end_date desc
A few caveats:
I did not order results to match your desired result set, and I also kept a month-end date to illustrate the concept. You can easily change the ordering and exclude unneeded fields.
In the month_begins CTE, I set a range of months into the future, so your result set will contain the most recent balance for 'future months'. To make this a bit prettier, consider changing '2019-12-01' to current_date() so your query always runs up through the end of the current month.
Your timestamp field looks to be dates, so I used date logic, but you should be able to apply the same principles to use timestamp logic if your underlying fields are actual timestamps.
In your result set, I'm not sure why your 2nd row (customer 2) would have a timestamp of '2019-10-10', that seems arbitrary as customer 2 has no 2nd balance record.
I purposefully split the logic into several CTEs so I could comment on each step easier, you could definitely perform several steps in the same code block for a more condensed query.
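The month-end / most-recent-balance approach translates to SQLite as well; here is a condensed sketch over the question's sample data, using a recursive CTE for the month ends since SQLite has no generate_date_array:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE balances (customer_id INTEGER, value INTEGER, ts TEXT);
INSERT INTO balances VALUES
  (1, -500, '2019-10-12'), (1, -300, '2019-10-11'),
  (1, -200, '2019-10-10'), (1, 0, '2019-10-09'),
  (2, 200, '2019-09-10'), (1, 600, '2019-09-02');
""")

rows = conn.execute("""
WITH RECURSIVE month_ends(month_end) AS (
  -- generate the last day of each month in the sample's range
  SELECT date('2019-09-01', '+1 month', '-1 day')
  UNION ALL
  SELECT date(month_end, '+1 day', '+1 month', '-1 day')
  FROM month_ends WHERE month_end < '2019-10-31'
),
user_month_ends AS (
  SELECT DISTINCT b.customer_id, m.month_end
  FROM balances b CROSS JOIN month_ends m
),
ordered AS (
  -- fan out: all balances at or before each month end, newest first
  SELECT u.customer_id, u.month_end, b.value,
         ROW_NUMBER() OVER (PARTITION BY u.customer_id, u.month_end
                            ORDER BY b.ts DESC) AS rn
  FROM user_month_ends u
  JOIN balances b
    ON b.customer_id = u.customer_id AND b.ts <= u.month_end
)
SELECT customer_id, month_end, value
FROM ordered WHERE rn = 1
ORDER BY customer_id, month_end
""").fetchall()
print(rows)
```

Customer 2 shows 200 at both month ends even though October has no update, which is exactly the carry-forward behavior asked for.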

Querying for an ID that has the most number of reads

Suppose I have a table like the one below:
+----+-----------+
| ID | TIME |
+----+-----------+
| 1 | 12-MAR-15 |
| 2 | 23-APR-14 |
| 2 | 01-DEC-14 |
| 1 | 01-DEC-15 |
| 3 | 05-NOV-15 |
+----+-----------+
What I want to do is, for each year (extracted from the TIME column, which is a DATE), list the ID that has the highest count in that year. So for example, ID 1 occurs the most in 2015, ID 2 occurs the most in 2014, etc.
What I have for a query is:
SELECT EXTRACT(year from time) "YEAR", COUNT(ID) "ID"
FROM table
GROUP BY EXTRACT(year from time)
ORDER BY COUNT(ID) DESC;
But this query just counts how many rows each year has; how do I change it to return the ID with the highest count in each year?
Output:
+------+----+
| YEAR | ID |
+------+----+
| 2015 | 2 |
| 2012 | 2 |
+------+----+
Expected Output:
+------+----+
| YEAR | ID |
+------+----+
| 2015 | 1 |
| 2014 | 2 |
+------+----+
Starting with your sample query, the first change is simply to group by the ID as well as by the year.
SELECT EXTRACT(year from time) "YEAR" , id, COUNT(*) "TOTAL"
FROM table
GROUP BY EXTRACT(year from time), id
ORDER BY EXTRACT(year from time) DESC, COUNT(*) DESC
With that, you could find the rows you want by visual inspection (the first row for each year is the ID with the most rows).
To have the query just return the rows with the highest totals, there are several different ways to do it. You need to consider what you want to do if there are ties - do you want to see all IDs tied for highest in a year, or just an arbitrary one?
Here is one approach - if there is a tie, this should return just the lowest of the tied IDs:
WITH groups AS (
SELECT EXTRACT(year from time) "YEAR" , id, COUNT(*) "TOTAL"
FROM table
GROUP BY EXTRACT(year from time), id
)
SELECT year, MIN(id) KEEP (DENSE_RANK FIRST ORDER BY total DESC)
FROM groups
GROUP BY year
ORDER BY year DESC
You need to count per id and then apply a RANK on that count:
SELECT *
FROM
(
    SELECT EXTRACT(year from time) "YEAR", ID, COUNT(*) AS cnt,
           RANK() OVER (PARTITION BY EXTRACT(year from time) ORDER BY COUNT(*) DESC) AS rnk
    FROM table
    GROUP BY EXTRACT(year from time), ID
) dt
WHERE rnk = 1
If this returns multiple rows with the same high count per year and you want just one of them arbitrarily, you can switch to ROW_NUMBER.
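The count-then-rank pattern can be checked in SQLite with the question's sample data (`strftime('%Y', ...)` replaces EXTRACT, and the counting is layered in a CTE before ranking):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reads (id INTEGER, time TEXT);
INSERT INTO reads VALUES
  (1, '2015-03-12'), (2, '2014-04-23'), (2, '2014-12-01'),
  (1, '2015-12-01'), (3, '2015-11-05');
""")

rows = conn.execute("""
WITH counts AS (
  SELECT strftime('%Y', time) AS year, id, COUNT(*) AS cnt
  FROM reads
  GROUP BY year, id
)
SELECT year, id, cnt
FROM (
  SELECT year, id, cnt,
         RANK() OVER (PARTITION BY year ORDER BY cnt DESC) AS rnk
  FROM counts
)
WHERE rnk = 1
ORDER BY year
""").fetchall()
print(rows)  # [('2014', 2, 2), ('2015', 1, 2)]
```

This matches the expected output: ID 2 tops 2014 and ID 1 tops 2015.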
This should do what you're after, I think:
with sample_data as (select 1 id, to_date('12/03/2015', 'dd/mm/yyyy') time from dual union all
select 2 id, to_date('23/04/2014', 'dd/mm/yyyy') time from dual union all
select 2 id, to_date('01/12/2014', 'dd/mm/yyyy') time from dual union all
select 1 id, to_date('01/12/2015', 'dd/mm/yyyy') time from dual union all
select 3 id, to_date('05/11/2015', 'dd/mm/yyyy') time from dual)
-- End of creating a subquery to mimick a table called "sample_data" containing your input data.
-- See SQL below:
select yr,
id most_frequent_id,
cnt_id_yr cnt_of_most_freq_id
from (select to_char(time, 'yyyy') yr,
id,
count(*) cnt_id_yr,
dense_rank() over (partition by to_char(time, 'yyyy') order by count(*) desc) dr
from sample_data
group by to_char(time, 'yyyy'),
id)
where dr = 1;
YR MOST_FREQUENT_ID CNT_OF_MOST_FREQ_ID
---- ---------------- -------------------
2014 2 2
2015 1 2