Hi, I have a table that looks like the following:
grouping_column | value | date_modified
--------------: | ----: | :------------
1 | 5 | 2020-10-15
1 | 10 | 2020-10-20
2 | 3 | 2020-10-20
1 | 11 | 2020-11-30
1 | 11 | 2020-12-10
1 | 5 | 2020-12-15
How could I make a query that returns the following results?
grouping_column | last_value_of_month | month
--------------: | ------------------: | :-------
1 | 10 | OCT 2020
1 | 11 | NOV 2020
1 | 5 | DEC 2020
1 | 5 | JAN 2021
2 | 3 | OCT 2020
2 | 3 | NOV 2020
2 | 3 | DEC 2020
2 | 3 | JAN 2021
In other words, it should return the last value of each group for every month, from the first entry until the current month. I can work it out if the missing months are not filled in, but I don't know how to fill them.
NOTE: this question was asked in January 2021, just for context.
First, generate all the months based on the oldest date in the table:
with months as (
select ddate + interval '1 month' as end_date,
to_char(ddate, 'MON YYYY') as month
from generate_series(
date_trunc(
'month',
(select min(date_modified) from table1)
),
now(),
interval '1 month'
) as gs(ddate)
)
Join that back to your data table, and use distinct on to limit the result to one record per (grouping_column, month):
select distinct on (t.grouping_column, m.end_date)
t.grouping_column, t.value as last_value_of_month, m.month
from months m
join table1 t
on t.date_modified < m.end_date
order by t.grouping_column, m.end_date, t.date_modified desc;
Result:
grouping_column | last_value_of_month | month
--------------: | ------------------: | :-------
1 | 10 | OCT 2020
1 | 11 | NOV 2020
1 | 5 | DEC 2020
1 | 5 | JAN 2021
2 | 3 | OCT 2020
2 | 3 | NOV 2020
2 | 3 | DEC 2020
2 | 3 | JAN 2021
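To make the logic of the two steps easier to follow, here is a minimal sketch of the same idea in plain Python rather than SQL: generate every month from the oldest date up to "today", then for each (group, month) pick the most recent row modified before the month ends. The helper names (`next_month`, `last_value_per_month`) and the fixed `today` are assumptions for illustration, and months are labelled `YYYY-MM` instead of `MON YYYY` to avoid locale-dependent formatting.

```python
from datetime import date

# Sample rows from the question: (grouping_column, value, date_modified)
rows = [
    (1, 5, date(2020, 10, 15)),
    (1, 10, date(2020, 10, 20)),
    (2, 3, date(2020, 10, 20)),
    (1, 11, date(2020, 11, 30)),
    (1, 11, date(2020, 12, 10)),
    (1, 5, date(2020, 12, 15)),
]


def next_month(d):
    """Return the first day of the month following d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def last_value_per_month(rows, today):
    """For each group and each month up to today, keep the latest value seen so far."""
    start = min(r[2] for r in rows).replace(day=1)
    result = []
    for group in sorted({r[0] for r in rows}):
        month = start
        while month <= today:
            end = next_month(month)
            # Same condition as the join: date_modified < end of month
            candidates = [r for r in rows if r[0] == group and r[2] < end]
            if candidates:
                latest = max(candidates, key=lambda r: r[2])
                result.append((group, latest[1], month.strftime("%Y-%m")))
            month = end
    return result


# "today" is pinned to January 2021 to match the context of the question
report = last_value_per_month(rows, today=date(2021, 1, 15))
for line in report:
    print(line)
```

Picking the latest candidate plays the role of `distinct on ... order by date_modified desc`: a month with no new rows simply reuses the last earlier row.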
I have the following example table "invoices".
+----------------+-------------+--------+
| invoice_number | date | status |
+----------------+-------------+--------+
| 1 | 2 apr 2021 | 1 |
+----------------+-------------+--------+
| 2 | 9 apr 2021 | 0 |
+----------------+-------------+--------+
| 3 | 9 apr 2021 | 1 |
+----------------+-------------+--------+
| 4 | 9 apr 2021 | 1 |
+----------------+-------------+--------+
| 5 | 16 apr 2021 | 1 |
+----------------+-------------+--------+
| 6 | 16 apr 2021 | 0 |
+----------------+-------------+--------+
| 7 | 16 apr 2021 | 0 |
+----------------+-------------+--------+
| 8 | 16 apr 2021 | 0 |
+----------------+-------------+--------+
| 9 | 16 apr 2021 | 1 |
+----------------+-------------+--------+
(in status, 1 is for paid, 0 for unpaid)
and from it I'm trying to get the following:
the number of invoices per week (each date is a week so a group per date works)
the percentage of these invoices that were paid (grouped by date)
I was trying to use a window function to organize better (since I have several more fields, this is just simplified for the example)
and I was doing
select date,
count(invoice_number) over (partition by date) as NumberOfInvoices,
(sum(status)/count(status) over (partition by date))*100 as percentagePaid
from invoices
But of course this is not working, and I'm also getting every row of the table in the result instead of one row per date.
Should I stop trying to use over (partition by ...) here, or am I just applying it incorrectly for what I need?
the percentage of these invoices that were paid (grouped by date)
This would simply be aggregation:
select date, avg(status * 1.0) as paid_ratio
from invoices i
group by date;
If you wanted this per row, then you would use window functions:
select i.*,
avg(i.status * 1.0) over (partition by i.date) as paid_ratio
from invoices i;
Note the * 1.0. SQL Server does integer division -- and averages -- on integers. status looks like an integer, so the * 1.0 converts it to a number with decimal places.
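As a rough, self-contained check of the aggregation above, here is a sketch that loads the sample invoices into an in-memory SQLite database from Python. SQLite is only a stand-in here (the answer targets SQL Server), and the ISO date strings are an assumption. SQLite also truncates when dividing two integers, which the `int_ratio` column shows; note that SQLite's `avg()` already returns a float, so the `* 1.0` is kept only to mirror the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (invoice_number INTEGER, date TEXT, status INTEGER)"
)
# Sample data from the question (dates rewritten as ISO strings)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        (1, "2021-04-02", 1),
        (2, "2021-04-09", 0),
        (3, "2021-04-09", 1),
        (4, "2021-04-09", 1),
        (5, "2021-04-16", 1),
        (6, "2021-04-16", 0),
        (7, "2021-04-16", 0),
        (8, "2021-04-16", 0),
        (9, "2021-04-16", 1),
    ],
)

summary = conn.execute(
    """
    SELECT date,
           COUNT(*)                AS number_of_invoices,
           SUM(status) / COUNT(*)  AS int_ratio,       -- integer division truncates
           AVG(status * 1.0) * 100 AS percentage_paid  -- decimal arithmetic
    FROM invoices
    GROUP BY date
    ORDER BY date
    """
).fetchall()

for row in summary:
    print(row)
```

One row comes back per date, which is what the plain `group by` gives you and the window-function version does not.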
I have these two tables (times and sales):
times
TIME_ID | DAY_NAME | DAY_NUMBER_IN_WEEK | CALENDAR_MONTH_NAME | CALENDAR_MONTH_ID
1998-01-10| Monday | 1 | January | 1684
1998-01-10| Tuesday | 2 | January | 1684
1998-01-10| Wednesday | 3 | January | 1684
...
1998-01-11| Monday | 1 | February | 1685
1998-01-11| Tuesday | 2 | January | 1685
1998-01-11| Wednesday | 3 | January | 1685
sales
PROD_ID | TIME_ID | AMOUNT_SOLD
13 | 1998-01-10 | 1232
13 | 1998-01-11 | 1233
14 | 1998-01-11 | 1233
I need to make columns for every day in week (Monday, Tuesday, Wednesday...) and SUM of AMOUNT_SOLD for each PROD_ID for each day for each month.
SELECT SUM(times.day_number_in_week), times.calendar_month_name, times.day_name, times.calendar_year
FROM sales
INNER JOIN times ON times.time_id = sales.time_id
GROUP BY times.calendar_month_number, times.calendar_month_name, times.day_name, times.calendar_year
Output:
5988 March Wednesday 1998
9408 April Thursday 1998
7532 June Sunday 1998
9220 July Thursday 1998
7490 July Sunday 1998
12540 August Saturday 1998
But this sums all Wednesdays across all years. I need the sum of AMOUNT_SOLD per day of the week (Monday, Tuesday, Wednesday...) within each single month.
Can you help me?
You can do conditional aggregation:
SELECT
t.calendar_year,
t.calendar_month_name,
SUM(case when t.day_number_in_week = 1 then s.amount_sold else 0 end) amount_sold_mon,
SUM(case when t.day_number_in_week = 2 then s.amount_sold else 0 end) amount_sold_tue,
SUM(case when t.day_number_in_week = 3 then s.amount_sold else 0 end) amount_sold_wed,
...
FROM sales s
INNER JOIN times t ON t.time_id = s.time_id
GROUP BY t.calendar_year, t.calendar_month_number, t.calendar_month_name
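Here is a small runnable sketch of conditional aggregation, using an in-memory SQLite database from Python. The rows are made up for illustration (they are not the asker's data), only Monday and Tuesday columns are shown, and `prod_id` is added to the grouping on the assumption that one row per product per month is wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE times (time_id TEXT PRIMARY KEY, day_number_in_week INTEGER,
                        calendar_month_number INTEGER, calendar_year INTEGER);
    CREATE TABLE sales (prod_id INTEGER, time_id TEXT, amount_sold INTEGER);
    -- Hypothetical sample rows: two Mondays and a Tuesday in January, one Monday in February
    INSERT INTO times VALUES
      ('1998-01-05', 1, 1, 1998), ('1998-01-06', 2, 1, 1998),
      ('1998-01-12', 1, 1, 1998), ('1998-02-02', 1, 2, 1998);
    INSERT INTO sales VALUES
      (13, '1998-01-05', 100), (13, '1998-01-12', 50),
      (14, '1998-01-06', 70),  (13, '1998-02-02', 30);
    """
)

pivot = conn.execute(
    """
    SELECT s.prod_id, t.calendar_year, t.calendar_month_number,
           SUM(CASE WHEN t.day_number_in_week = 1 THEN s.amount_sold ELSE 0 END) AS amount_sold_mon,
           SUM(CASE WHEN t.day_number_in_week = 2 THEN s.amount_sold ELSE 0 END) AS amount_sold_tue
    FROM sales s
    INNER JOIN times t ON t.time_id = s.time_id
    GROUP BY s.prod_id, t.calendar_year, t.calendar_month_number
    ORDER BY s.prod_id, t.calendar_year, t.calendar_month_number
    """
).fetchall()

for row in pivot:
    print(row)
```

Each `CASE` keeps the amount only when the row falls on the matching weekday, so every weekday ends up in its own column while the `GROUP BY` decides what one output row means.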
I need to count the number of products that existed in inventory by date. In the database however, a product is only recorded when it was viewed by a consumer.
For example consider this basic table structure:
date | productId | views
July 1 | A | 8
July 2 | A | 6
July 2 | B | 4
July 3 | A | 2
July 4 | A | 8
July 4 | B | 6
July 4 | C | 4
July 5 | C | 2
July 10 | A | 17
Using the following query, I attempt to determine the amount of products in inventory on a given date.
select date, count(distinct productId) as Inventory, sum(views) as views
from (
select date, productId, count(*) as views
from SomeTable
group by date, productID
order by date asc, productID asc
)
group by date
This is the output
date | Inventory | views
July 1 | 1 | 8
July 2 | 2 | 10
July 3 | 1 | 2
July 4 | 3 | 18
July 5 | 1 | 2
July 10 | 1 | 17
My output is not an accurate reflection of how many products were in inventory due to missing rows.
The correct understanding of inventory is as follows:
- Product A was present in inventory from July 1 - July 10.
- Product B was present in inventory from July 2 - July 4.
- Product C was in inventory from July 4 - July 5.
The correct SQL output should be:
date | Inventory | views
July 1 | 1 | 8
July 2 | 2 | 10
July 3 | 2 | 2
July 4 | 3 | 18
July 5 | 2 | 2
July 6 | 1 | 0
July 7 | 1 | 0
July 8 | 1 | 0
July 9 | 1 | 0
July 10 | 1 | 17
If you are following along, let me confirm that I am comfortable defining "in inventory" as the date difference between the first & last view.
I have followed the following faulty process:
First I created a table which was the cartesian product of every productID & every date.
with Dates as (
select date
from SomeTable
group by date
),
Products as (
select productId
from SomeTable
group by productId
)
select Dates.date, Products.productId
from Dates cross join Products
Then I attempted to do a right outer join to reduce this to just the missing records:
with Records as (
select date, productId, count(*) as views
from SomeTable
group by date, productId
),
Cartesian as (
{See query above}
)
Select Cartesian.date, Cartesian.productId, 0 as views #for upcoming union
from Cartesian right outer join Records
on Cartesian.date = Records.date
where Records.productId is null
Then, with the missing rows in hand, I union them back onto the Records.
In doing so, I create a new problem: extra rows.
date | productId | views
July 1 | A | 8
July 1 | B | 0
July 1 | C | 0
July 2 | A | 6
July 2 | B | 4
July 2 | C | 0
July 3 | A | 2
July 3 | B | 0
July 3 | C | 0
July 4 | A | 8
July 4 | B | 6
July 4 | C | 4
July 5 | A | 0
July 5 | B | 0
July 5 | C | 2
July 6 | A | 0
July 6 | B | 0
July 6 | C | 0
July 7 | A | 0
July 7 | B | 0
July 7 | C | 0
July 8 | A | 0
July 8 | B | 0
July 8 | C | 0
July 9 | A | 0
July 9 | B | 0
July 9 | C | 0
July 10 | A | 17
July 10 | B | 0
July 10 | C | 0
And when I run my simple query
select date, count(distinct productId) as Inventory, sum(views) as views
on that table I get the wrong output again:
date | Inventory | views
July 1 | 3 | 8
July 2 | 3 | 10
July 3 | 3 | 2
July 4 | 3 | 18
July 5 | 3 | 2
July 6 | 3 | 0
July 7 | 3 | 0
July 8 | 3 | 0
July 9 | 3 | 0
July 10 | 3 | 17
My next thought would be to iterate through each productId, determine its first & last date, then union that with the Cartesian table on the condition that Cartesian.date falls between the first & last date for each specific product.
There's got to be an easier way to do this. Thanks.
Below is for BigQuery Standard SQL
#standardSQL
WITH dates AS (
SELECT day FROM (
SELECT MIN(day) min_day, MAX(day) max_day
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day
), ranges AS (
SELECT productId, MIN(day) min_day, MAX(day) max_day
FROM `project.dataset.table` t
GROUP BY productId
)
SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views
FROM dates d, ranges r
LEFT JOIN `project.dataset.table` USING(day, productId)
WHERE day BETWEEN min_day AND max_day
GROUP BY day
Applied to the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT DATE '2019-07-01' day, 'A' productId, 8 views UNION ALL
SELECT '2019-07-02', 'A', 6 UNION ALL
SELECT '2019-07-02', 'B', 4 UNION ALL
SELECT '2019-07-03', 'A', 2 UNION ALL
SELECT '2019-07-04', 'A', 8 UNION ALL
SELECT '2019-07-04', 'B', 6 UNION ALL
SELECT '2019-07-04', 'C', 4 UNION ALL
SELECT '2019-07-05', 'C', 2 UNION ALL
SELECT '2019-07-10', 'A', 17
), dates AS (
SELECT day FROM (
SELECT MIN(day) min_day, MAX(day) max_day
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day
), ranges AS (
SELECT productId, MIN(day) min_day, MAX(day) max_day
FROM `project.dataset.table` t
GROUP BY productId
)
SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views
FROM dates d, ranges r
LEFT JOIN `project.dataset.table` USING(day, productId)
WHERE day BETWEEN min_day AND max_day
GROUP BY day
-- ORDER BY day
the result is:
Row day Inventory views
1 2019-07-01 1 8
2 2019-07-02 2 10
3 2019-07-03 2 2
4 2019-07-04 3 18
5 2019-07-05 2 2
6 2019-07-06 1 0
7 2019-07-07 1 0
8 2019-07-08 1 0
9 2019-07-09 1 0
10 2019-07-10 1 17
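For readers who want to trace the logic outside BigQuery, here is a minimal plain-Python sketch of the same approach: compute each product's first and last day, walk every calendar day in the overall range, and count the products whose range covers that day. The function name `inventory_by_day` and the year 2019 are assumptions for illustration.

```python
from datetime import date, timedelta

# Sample data from the question: (day, productId, views)
records = [
    (date(2019, 7, 1), "A", 8),
    (date(2019, 7, 2), "A", 6),
    (date(2019, 7, 2), "B", 4),
    (date(2019, 7, 3), "A", 2),
    (date(2019, 7, 4), "A", 8),
    (date(2019, 7, 4), "B", 6),
    (date(2019, 7, 4), "C", 4),
    (date(2019, 7, 5), "C", 2),
    (date(2019, 7, 10), "A", 17),
]


def inventory_by_day(records):
    """Return (day, products_in_inventory, total_views) for every day in range."""
    first, last, totals = {}, {}, {}
    for day, product, views in records:
        # Per-product range, like the `ranges` CTE
        first[product] = min(first.get(product, day), day)
        last[product] = max(last.get(product, day), day)
        totals[day] = totals.get(day, 0) + views

    day, end = min(first.values()), max(last.values())
    out = []
    while day <= end:  # every calendar day, like GENERATE_DATE_ARRAY
        in_stock = sum(1 for p in first if first[p] <= day <= last[p])
        out.append((day, in_stock, totals.get(day, 0)))
        day += timedelta(days=1)
    return out


result = inventory_by_day(records)
for row in result:
    print(row)
```

The `first[p] <= day <= last[p]` test corresponds to `WHERE day BETWEEN min_day AND max_day`, and `totals.get(day, 0)` to `IFNULL(views, 0)` after the left join.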
So I have a table like this (simplified):
id| Country_dsc | Year | Month | Quantity | Value |
1 | Armenia | 2019 | 2 | 4 | 2 |
2 | Armenia | 2019 | 3 | 6 | 4 |
3 | Armenia | 2018 | 1 | 6 | 5 |
4 | Armenia | 2018 | 2 | 3 | 3 |
5 | Armenia | 2018 | 3 | 7 | 5 |
And I would like to have a result like this:
Name | YTD_Quantity_Y | YTD_Quantity_LY | YTD_Value_Y | YTD_Value_LY |
Armenia | 10 | 16 | 6 | 13 |
with YTD_Quantity_Y being the sum of all quantities for 2019 and YTD_Quantity_LY the sum of all quantities for 2018, from the beginning of the year until the current month (in this example, March). Same logic for the Value.
So what I tried was:
SELECT t1.Country_Dsc as Name,
SUM(t1.Quantity) as YTD_Quantity_Y, -- January, February, March 2019
SUM(t2.Quantity) as YTD_Quantity_LY, -- January, February, March 2018
SUM(t1.Value) as YTD_Value_Y, -- January, February, March 2019
SUM(t2.Value) as YTD_Value_LY -- January, February, March 2018
FROM example_table t1
LEFT JOIN example_table t2 on t1.Country_Dsc = t2.Country_Dsc
AND t2.Year = 2018
AND t1.Month = t2.Month
WHERE t1.Year = 2019
and t1.Month <= 3 -- in this case I want all data from January to March for 2019 and 2018
GROUP BY t1.Country_Dsc
The problem is that since 2019 have no record for January, I don't get the quantity of January 2018 in YTD_Quantity_LY.
If I start from 2018 and join on 2019 it works but sometimes I have the case where it's for 2018 that I don't have the record for a month so it'll not show for 2019 (YTD_Quantity_Y).
Is it possible to get the result I want without using a separate query for each year?
Try below query:
declare @tbl table (id int, Country_dsc varchar(10), [Year] int, [Month] int, Quantity int, [Value] int );
insert into @tbl values
(1 , 'Armenia' , 2019 , 2 , 4 , 2 ),
(2 , 'Armenia' , 2019 , 3 , 6 , 4 ),
(3 , 'Armenia' , 2018 , 1 , 6 , 5 ),
(4 , 'Armenia' , 2018 , 2 , 3 , 3 ),
(5 , 'Armenia' , 2018 , 3 , 7 , 5 );
select Country_dsc [Name],
sum(case when year = 2019 then quantity else 0 end) YTD_Quantity_Y ,
sum(case when year = 2018 then quantity else 0 end) YTD_Quantity_LY ,
sum(case when year = 2019 then Value else 0 end) YTD_Value_Y ,
sum(case when year = 2018 then Value else 0 end) YTD_Value_LY
from @tbl
group by Country_dsc
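A quick way to check the single-pass approach against the expected result is the sketch below, which runs equivalent SQL on an in-memory SQLite database from Python. SQLite is a stand-in for SQL Server here, and the `WHERE ... month <= 3` filter is an addition of mine to express the "until the current month" requirement from the question; the sample data happens not to need it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_table "
    "(id INTEGER, country_dsc TEXT, year INTEGER, month INTEGER, quantity INTEGER, value INTEGER)"
)
# Sample data from the question
conn.executemany(
    "INSERT INTO example_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Armenia", 2019, 2, 4, 2),
        (2, "Armenia", 2019, 3, 6, 4),
        (3, "Armenia", 2018, 1, 6, 5),
        (4, "Armenia", 2018, 2, 3, 3),
        (5, "Armenia", 2018, 3, 7, 5),
    ],
)

ytd = conn.execute(
    """
    SELECT country_dsc AS name,
           SUM(CASE WHEN year = 2019 THEN quantity ELSE 0 END) AS ytd_quantity_y,
           SUM(CASE WHEN year = 2018 THEN quantity ELSE 0 END) AS ytd_quantity_ly,
           SUM(CASE WHEN year = 2019 THEN value ELSE 0 END)    AS ytd_value_y,
           SUM(CASE WHEN year = 2018 THEN value ELSE 0 END)    AS ytd_value_ly
    FROM example_table
    WHERE year IN (2018, 2019)
      AND month <= 3          -- assumed cut-off: current month is March
    GROUP BY country_dsc
    """
).fetchall()

print(ytd)
```

Because no self-join is involved, a month missing in one year cannot hide the other year's rows, which was the problem with the join-based attempt.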
I'm trying to write a script which returns a list of months with the number of days in the month. It references this table
CREATE TABLE generic.time_series_only (measurementdatetime TIMESTAMP WITHOUT TIME ZONE NOT NULL)
which is just a chronological time series. It is very useful when joining tables of data that have gaps in different places while you want an unbroken time series as your output; maybe there's a smarter way to do that, but I haven't found it yet.
SELECT date_part('year'::text, time_series_only.measurementdatetime) AS
measyear,
date_part('month'::text, time_series_only.measurementdatetime) AS
measmonth,
date_trunc('month'::text, time_series_only.measurementdatetime) +
'1 mon'::interval - date_trunc('month'::text,
time_series_only.measurementdatetime) AS days_in_month
FROM generic.time_series_only
GROUP BY date_part('year'::text, time_series_only.measurementdatetime),
date_part('month'::text, time_series_only.measurementdatetime)
ORDER BY date_part('year'::text, time_series_only.measurementdatetime),
date_part('month'::text, time_series_only.measurementdatetime);
But I get this error:
ERROR: column "time_series_only.measurementdatetime" must appear in the GROUP BY clause or be used in an aggregate function
I can't put this column in the GROUP BY clause because then I'd get a result for every single entry in the time_series_only table, and I can't figure a way to get the same result using an aggregate function? Any suggestions very welcome :-)
Why not use generate_series? For example:
vao=# with pre as (select generate_series('2016-01-01','2017-03-31','1 day'::interval) g) select distinct
extract('year' from g), extract('month' from g), count(1) over (partition by date_trunc('month',g)) from pre order by 1,2;
date_part | date_part | count
-----------+-----------+-------
2016 | 1 | 31
2016 | 2 | 29
2016 | 3 | 31
2016 | 4 | 30
2016 | 5 | 31
2016 | 6 | 30
2016 | 7 | 31
2016 | 8 | 31
2016 | 9 | 30
2016 | 10 | 31
2016 | 11 | 30
2016 | 12 | 31
2017 | 1 | 31
2017 | 2 | 28
2017 | 3 | 31
(15 rows)
Use distinct on a pair (year, month). You can replace the time_series_only table with the function generate_series() , e.g.:
select distinct on (date_part('year', d), date_part('month', d))
date_part('year', d) as year,
date_part('month', d) as month,
date_part('day', d) as days_in_month
from
generate_series('2016-01-01'::date, '2016-12-31'::date, '1d'::interval) d
order by 1, 2, 3 desc;
year | month | days_in_month
------+-------+---------------
2016 | 1 | 31
2016 | 2 | 29
2016 | 3 | 31
2016 | 4 | 30
2016 | 5 | 31
2016 | 6 | 30
2016 | 7 | 31
2016 | 8 | 31
2016 | 9 | 30
2016 | 10 | 31
2016 | 11 | 30
2016 | 12 | 31
(12 rows)
This one has better performance since it generates only the last day for each month and consequently does not need aggregation:
select
date_part('year', d) as year,
date_part('month', d) as month,
date_part('day', d) as days_in_month
from
generate_series('2016-01-01'::date, '2016-12-01', '1 month') gs(gsd)
cross join lateral
(select gsd + interval '1 month - 1 day') d(d)
order by 1, 2;
year | month | days_in_month
------+-------+---------------
2016 | 1 | 31
2016 | 2 | 29
2016 | 3 | 31
2016 | 4 | 30
2016 | 5 | 31
2016 | 6 | 30
2016 | 7 | 31
2016 | 8 | 31
2016 | 9 | 30
2016 | 10 | 31
2016 | 11 | 30
2016 | 12 | 31
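The "first day of the next month minus one day" trick used above is easy to verify outside the database. Here is a minimal Python sketch of it; the function name `days_in_month` is just an illustrative choice (the standard library's `calendar.monthrange` would give the same number directly).

```python
from datetime import date, timedelta


def days_in_month(year, month):
    """Day number of the last day of the month: next month's first day minus one day."""
    next_first = date(year + (month == 12), month % 12 + 1, 1)
    return (next_first - timedelta(days=1)).day


# One row per month of 2016, mirroring the (year, month, days_in_month) output
table = [(2016, m, days_in_month(2016, m)) for m in range(1, 13)]
for row in table:
    print(row)
```

Only one date per month is ever computed, which is the same reason the SQL version needs neither `distinct on` nor aggregation.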
Another variation, using a CTE for a bit more readability, IMHO (this example generates the months and day counts for the next three full months following the calendar month of current_date):
WITH series AS (
SELECT generate_series (
date_trunc ('month', date_trunc('day', now()) + interval '1 month'),
date_trunc('day', now() + interval '4 months'), '1d'::interval
) AS day ) SELECT DISTINCT ON (date_part('year', series.day), date_part('month', series.day))
date_part('year', series.day) as year,
date_part('month', series.day) as month,
date_part('day', series.day) as days_in_month
FROM series
ORDER BY 1, 2, 3 desc LIMIT 3;
year | month | days_in_month
------+-------+---------------
2021 | 1 | 31
2021 | 2 | 28
2021 | 3 | 31