SQL select query using Joins with aggregate counts

I have a table with the following fields:
tickets: id, createddate, resolutiondate
A sample set of data has:
jira=# select * from tickets;
id | createddate | resolutiondate
---------+-------------+----------------
ticket1 | 2020-09-21 | 2020-10-01
ticket2 | 2020-09-22 | 2020-09-23
ticket3 | 2020-10-01 |
ticket4 | 2020-10-01 | 2020-10-04
ticket5 | 2020-10-01 |
ticket6 | 2020-10-01 | 2020-10-07
(6 rows)
I would like to create a query which reports:
Week: Issues Created: Issues Resolved
I can do the two separate queries:
# select date_trunc('week', createddate) week, count(id) created
from tickets
group by week
order by week desc
;
week | created
------------------------+---------
2020-09-28 00:00:00+00 | 4
2020-09-21 00:00:00+00 | 2
(2 rows)
# select date_trunc('week', resolutiondate) week, count(id) resolved
from tickets
where resolutiondate is not NULL
group by week
order by week desc
;
week | resolved
------------------------+----------
2020-10-05 00:00:00+00 | 1
2020-09-28 00:00:00+00 | 2
2020-09-21 00:00:00+00 | 1
(3 rows)
However, I cannot figure out how (with a join, union, sub-query, ...?) to combine these queries into a single query with the appropriate aggregations.
I'm doing this in Postgres - any pointers would be appreciated.

Performing a union before aggregating the values may work here, e.g.
select week,
count(id_created) as created,
count(id_resolved) as resolved
from (
select date_trunc('week', resolutiondate) week, NULL as id_created, id as id_resolved from tickets where resolutiondate is not NULL UNION ALL
select date_trunc('week', createddate) week, id as id_created, NULL as id_resolved from tickets
) t
group by week
order by week desc
Let me know if this works for you.
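If it helps to see the union-then-aggregate idea run end-to-end, here is a self-contained sketch. SQLite (via Python's sqlite3) stands in for Postgres, so date_trunc('week', ...) is approximated with SQLite date modifiers that compute the Monday of each week:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tickets (id TEXT, createddate TEXT, resolutiondate TEXT);
INSERT INTO tickets VALUES
  ('ticket1','2020-09-21','2020-10-01'),
  ('ticket2','2020-09-22','2020-09-23'),
  ('ticket3','2020-10-01',NULL),
  ('ticket4','2020-10-01','2020-10-04'),
  ('ticket5','2020-10-01',NULL),
  ('ticket6','2020-10-01','2020-10-07');
""")

# date(d, 'weekday 0', '-6 days') = Monday of d's week, an approximation of
# Postgres date_trunc('week', d); COUNT() skips the NULLs from the other branch.
rows = con.execute("""
SELECT week,
       COUNT(id_created)  AS created,
       COUNT(id_resolved) AS resolved
FROM (
  SELECT date(resolutiondate, 'weekday 0', '-6 days') AS week,
         NULL AS id_created, id AS id_resolved
  FROM tickets WHERE resolutiondate IS NOT NULL
  UNION ALL
  SELECT date(createddate, 'weekday 0', '-6 days') AS week,
         id AS id_created, NULL AS id_resolved
  FROM tickets
) AS t
GROUP BY week
ORDER BY week DESC
""").fetchall()
print(rows)
```

This prints [('2020-10-05', 0, 1), ('2020-09-28', 4, 2), ('2020-09-21', 2, 1)] - the two standalone queries merged into one row per week.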

Related

Question: Joining two data sets with date conditions

I'm pretty new with SQL, and I'm struggling to figure out a seemingly simple task.
Here's the situation:
I'm working with two data sets
Data Set A, which is the most accurate but only refreshes every quarter
Data Set B, which has all the data, including the most recent data, but is overall less accurate
My goal is to combine both data sets where I would have Data Set A for all data up to the most recent quarter and Data Set B for anything after (i.e., all recent data not captured in Data Set A)
For example:
Data Set A captures anything from Q1 2020 (January to March)
Let's say we are April 15th
Data Set B captures anything from Q1 2020 to the most current date, April 15th
My goal is to use Data Set A for all data from January to March 2020 (Q1) and then Data Set B for all data from April 1 to 15
Any thoughts or advice on how to do this? Potentially a join function along with a date one?
Any help would be much appreciated.
Thanks in advance for the help.
I hope I got your question right.
I put in some sample data that might match your description: a date and an amount. To keep it simple, there is one row per month. You can extract the quarter from a date, keep it as an additional column, and then filter by it down the line.
WITH
-- some sample data: date and amount ...
indata(dt,amount) AS (
SELECT DATE '2020-01-15', 234.45
UNION ALL SELECT DATE '2020-02-15', 344.45
UNION ALL SELECT DATE '2020-03-15', 345.45
UNION ALL SELECT DATE '2020-04-15', 346.45
UNION ALL SELECT DATE '2020-05-15', 347.45
UNION ALL SELECT DATE '2020-06-15', 348.45
UNION ALL SELECT DATE '2020-07-15', 349.45
UNION ALL SELECT DATE '2020-08-15', 350.45
UNION ALL SELECT DATE '2020-09-15', 351.45
UNION ALL SELECT DATE '2020-10-15', 352.45
UNION ALL SELECT DATE '2020-11-15', 353.45
UNION ALL SELECT DATE '2020-12-15', 354.45
)
-- real query starts here ...
SELECT
EXTRACT(QUARTER FROM dt) AS the_quarter
, CAST(
TIMESTAMPADD(
QUARTER
, CAST(EXTRACT(QUARTER FROM dt) AS INTEGER)-1
, TRUNC(dt,'YEAR')
)
AS DATE
) AS qtr_start
, *
FROM indata;
-- out the_quarter | qtr_start | dt | amount
-- out -------------+------------+------------+--------
-- out 1 | 2020-01-01 | 2020-01-15 | 234.45
-- out 1 | 2020-01-01 | 2020-02-15 | 344.45
-- out 1 | 2020-01-01 | 2020-03-15 | 345.45
-- out 2 | 2020-04-01 | 2020-04-15 | 346.45
-- out 2 | 2020-04-01 | 2020-05-15 | 347.45
-- out 2 | 2020-04-01 | 2020-06-15 | 348.45
-- out 3 | 2020-07-01 | 2020-07-15 | 349.45
-- out 3 | 2020-07-01 | 2020-08-15 | 350.45
-- out 3 | 2020-07-01 | 2020-09-15 | 351.45
-- out 4 | 2020-10-01 | 2020-10-15 | 352.45
-- out 4 | 2020-10-01 | 2020-11-15 | 353.45
-- out 4 | 2020-10-01 | 2020-12-15 | 354.45
If you filter by quarter, you can group your data by that column ...
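As a runnable sketch of the original ask (Data Set A up to the last full quarter, Data Set B after), here is a minimal UNION ALL with a date cutoff. The table names, values, and the cutoff date are all made up, and SQLite via Python's sqlite3 stands in for a real database:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE set_a (dt TEXT, amount REAL);  -- accurate, refreshed quarterly
CREATE TABLE set_b (dt TEXT, amount REAL);  -- fresh, but less accurate
INSERT INTO set_a VALUES ('2020-01-15',100),('2020-02-15',110),('2020-03-15',120);
INSERT INTO set_b VALUES ('2020-01-15', 99),('2020-02-15',108),
                         ('2020-03-15',121),('2020-04-10',130),('2020-04-15',140);
""")

# Hypothetical cutoff: the last day covered by Data Set A (end of Q1 2020).
# ISO date strings compare correctly as text, so no date functions are needed.
cutoff = '2020-03-31'
rows = con.execute("""
SELECT dt, amount FROM set_a WHERE dt <= :cutoff
UNION ALL
SELECT dt, amount FROM set_b WHERE dt > :cutoff
ORDER BY dt
""", {"cutoff": cutoff}).fetchall()
```

Everything up to the cutoff comes from set_a, everything after from set_b, so the overlapping Q1 rows of set_b are ignored.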

How can I aggregate values based on an arbitrary monthly cycle date range in SQL?

Given a table as such:
# SELECT * FROM payments ORDER BY payment_date DESC;
id | payment_type_id | payment_date | amount
----+-----------------+--------------+---------
4 | 1 | 2019-11-18 | 300.00
3 | 1 | 2019-11-17 | 1000.00
2 | 1 | 2019-11-16 | 250.00
1 | 1 | 2019-11-15 | 300.00
14 | 1 | 2019-10-18 | 130.00
13 | 1 | 2019-10-18 | 100.00
15 | 1 | 2019-09-18 | 1300.00
16 | 1 | 2019-09-17 | 1300.00
17 | 1 | 2019-09-01 | 400.00
18 | 1 | 2019-08-25 | 400.00
(10 rows)
How can I SUM the amount column based on an arbitrary date range, not simply a date truncation?
Taking the example of a date range beginning on the 15th of a month, and ending on the 14th of the following month, the output I would expect to see is:
payment_type_id | payment_date | amount
-----------------+--------------+---------
1 | 2019-11-15 | 1850.00
1 | 2019-10-15 | 230.00
1 | 2019-09-15 | 2600.00
1 | 2019-08-15 | 800.00
Can this be done in SQL, or is this something that's better handled in code? I would traditionally do this in code, but I'm looking to extend my knowledge of SQL (which at this stage, isn't much!)
Demo: db<>fiddle
You can use a combination of the CASE clause and the date_trunc() function:
SELECT
payment_type_id,
CASE
WHEN date_part('day', payment_date) < 15 THEN
date_trunc('month', payment_date) + interval '-1month 14 days'
ELSE date_trunc('month', payment_date) + interval '14 days'
END AS payment_date,
SUM(amount) AS amount
FROM
payments
GROUP BY 1,2
date_part('day', ...) returns the day of the month.
The CASE clause separates the dates before the 15th of the month from those on or after it.
date_trunc('month', ...) converts all dates in a month to the first of that month.
So, if a date falls before the 15th of the current month, it is grouped to the 15th of the previous month (this is what + interval '-1month 14 days' calculates: +14 because date_trunc() truncates to the 1st of the month, and 1 + 14 = 15). Otherwise it is grouped to the 15th of the current month.
After calculating these payment dates, you can use them for simple grouping.
I would simply subtract 14 days, truncate the month, and add 14 days back:
select payment_type_id,
date_trunc('month', payment_date - interval '14 day') + interval '14 day' as month_15,
sum(amount)
from payments
group by payment_type_id, month_15
order by payment_type_id, month_15;
No conditional logic is actually needed for this.
Here is a db<>fiddle.
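As a runnable check of the subtract-truncate-add trick, here is the same query with SQLite (via Python's sqlite3) standing in for Postgres; the chained modifiers '-14 days', 'start of month', '+14 days' play the role of date_trunc('month', payment_date - interval '14 day') + interval '14 day':

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE payments (id INT, payment_type_id INT, payment_date TEXT, amount REAL);
INSERT INTO payments VALUES
 (4,1,'2019-11-18',300.00),(3,1,'2019-11-17',1000.00),
 (2,1,'2019-11-16',250.00),(1,1,'2019-11-15',300.00),
 (14,1,'2019-10-18',130.00),(13,1,'2019-10-18',100.00),
 (15,1,'2019-09-18',1300.00),(16,1,'2019-09-17',1300.00),
 (17,1,'2019-09-01',400.00),(18,1,'2019-08-25',400.00);
""")

# Shift back 14 days, truncate to the month, add the 14 days back: every date
# from the 15th through the following 14th lands on the same 15th.
rows = con.execute("""
SELECT payment_type_id,
       date(payment_date, '-14 days', 'start of month', '+14 days') AS month_15,
       SUM(amount)
FROM payments
GROUP BY payment_type_id, month_15
ORDER BY month_15 DESC
""").fetchall()
```

The result matches the expected output in the question: 1850.00 for 2019-11-15, 230.00 for 2019-10-15, 2600.00 for 2019-09-15, and 800.00 for 2019-08-15.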
You can use the generate_series() function and make an inner join comparing month and year, like this:
SELECT specific_date_on_month, SUM(amount)
FROM (SELECT generate_series('2015-01-15'::date, '2015-12-15'::date, '1 month'::interval) AS specific_date_on_month) AS series
INNER JOIN payments
ON (TO_CHAR(payment_date, 'yyyymm')=TO_CHAR(specific_date_on_month, 'yyyymm'))
GROUP BY specific_date_on_month;
The generate_series(<begin>, <end>, <interval>) function generates a series from begin to end at the given interval.
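SQLite has no built-in generate_series over dates, but the same idea can be sketched with a recursive CTE (Python's sqlite3 as the harness; the table contents are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE payments (payment_date TEXT, amount REAL)")
con.executemany("INSERT INTO payments VALUES (?,?)",
                [('2019-08-25', 400.0), ('2019-09-17', 1300.0), ('2019-11-15', 300.0)])

# A recursive CTE plays the role of generate_series(): start at the first
# anchor date and keep adding one month until the end of the range.
rows = con.execute("""
WITH RECURSIVE series(specific_date_on_month) AS (
  SELECT '2019-08-15'
  UNION ALL
  SELECT date(specific_date_on_month, '+1 month')
  FROM series WHERE specific_date_on_month < '2019-11-15'
)
SELECT specific_date_on_month, SUM(amount)
FROM series
INNER JOIN payments
  ON strftime('%Y%m', payment_date) = strftime('%Y%m', specific_date_on_month)
GROUP BY specific_date_on_month
ORDER BY specific_date_on_month
""").fetchall()
```

Note the inner join silently drops series months with no payments (October here); use a LEFT JOIN with COALESCE(SUM(amount), 0) if empty months should appear as zero.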

How to create a table that loops over data in Postgres

I want to create a table that returns the top 10 aggregate cons_name over a given week, that repeats every day.
So for 5/29/2019 it will pull the top 10 cons_name by their sum dating back to 5/22/2019.
Then, for 5/28/2019, the top 10 cons_name by their sum back to 5/21/2019.
A table of top 10 dating back 7 days all the way to 2018-12-01.
I can make the simple code dating back 7 days but, I have tried Windows to no avail.
SELECT cons_name,
pricedate,
sum(shadow)
FROM spp.rtbinds
WHERE pricedate >= current_date - 7
GROUP BY cons_name, shadow, pricedate
ORDER BY shadow asc
LIMIT 10
This query generates the output below
cons_name pricedate sum
"TEMP17_24078" "2019-05-28 00:00:00" "-1473.29723333333"
"TEMP17_24078" "2019-05-28 00:00:00" "-1383.56638333333"
"TMP175_24736" "2019-05-23 00:00:00" "-1378.40504166667"
"TMP159_24149" "2019-05-23 00:00:00" "-1328.847675"
"TMP397_24836" "2019-05-23 00:00:00" "-1221.19560833333"
"TEMP17_24078" "2019-05-28 00:00:00" "-1214.9914"
"TMP175_24736" "2019-05-23 00:00:00" "-1123.83254166667"
"TEMP72_22893" "2019-05-29 00:00:00" "-1105.93840833333"
"TMP164_23704" "2019-05-24 00:00:00" "-1053.051375"
"TMP175_24736" "2019-05-27 00:00:00" "-1043.52104166667"
I would like a table and function that returns a table of each day's top 10 dating back a week.
Using window functions gets you on the right track, but you should read further in the documentation about the possibilities.
We have multiple issues here that we need to solve:
gaps in the data (missing pricedate values) mean a fixed window frame would not always cover the correct number of rows (7) for the overall sum
the calculation itself needs all data rows, so the WHERE clause cannot be used to restrict the input to only the visible days
in order to select the top-10 for each day, we have to generate a row number per partition because the LIMIT clause cannot be applied per group
This is why I came up with the following CTE's:
CTE days: generate the gap-less date series and mark visible days
CTE daily: LEFT JOIN the data to the generated days and produce daily sums (and handle NULL entries)
CTE calc: produce the cumulative sums
CTE numbered: produce row numbers reset each day
select the actual visible rows and limit them to max. 10 per day
So for a specific week (2019-05-26 - 2019-06-01), the query will look like the following:
WITH
days (c_day, c_visible, c_lookback) as (
SELECT gen::date, (CASE WHEN gen::date < '2019-05-26' THEN false ELSE true END), gen::date - 6
FROM generate_series('2019-05-26'::date - 6, '2019-06-01'::date, '1 day'::interval) AS gen
),
daily (cons_name, pricedate, shadow_sum) AS (
SELECT
r.cons_name,
r.pricedate::date,
coalesce(sum(r.shadow), 0)
FROM days
LEFT JOIN spp.rtbinds AS r ON (r.pricedate::date = days.c_day)
GROUP BY 1, 2
),
calc (cons_name, pricedate, shadow_sum) AS (
SELECT
cons_name,
pricedate,
sum(shadow_sum) OVER (PARTITION BY cons_name ORDER BY pricedate ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
FROM daily
),
numbered (cons_name, pricedate, shadow_sum, position) AS (
SELECT
calc.cons_name,
calc.pricedate,
calc.shadow_sum,
ROW_NUMBER() OVER (PARTITION BY calc.pricedate ORDER BY calc.shadow_sum DESC)
FROM calc
)
SELECT
days.c_lookback,
numbered.cons_name,
numbered.shadow_sum
FROM numbered
INNER JOIN days ON (days.c_day = numbered.pricedate AND days.c_visible)
WHERE numbered.position < 11
ORDER BY numbered.pricedate DESC, numbered.shadow_sum DESC;
Online example with generated test data: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=a83a52e33ffea3783207e6b403bc226a
Example output:
c_lookback | cons_name | shadow_sum
------------+--------------+------------------
2019-05-26 | TMP400_27000 | 4578.04474575352
2019-05-26 | TMP700_25000 | 4366.56857151864
2019-05-26 | TMP200_24000 | 3901.50325547671
2019-05-26 | TMP400_24000 | 3849.39595793188
2019-05-26 | TMP700_28000 | 3763.51693260809
2019-05-26 | TMP600_26000 | 3751.72016620729
2019-05-26 | TMP500_28000 | 3610.75970225036
2019-05-26 | TMP300_26000 | 3598.36888491176
2019-05-26 | TMP600_27000 | 3583.89777677553
2019-05-26 | TMP300_21000 | 3556.60386707587
2019-05-25 | TMP400_27000 | 4687.20302128047
2019-05-25 | TMP200_24000 | 4453.61603102228
2019-05-25 | TMP700_25000 | 4319.10566615313
2019-05-25 | TMP400_24000 | 4039.01832416654
2019-05-25 | TMP600_27000 | 3986.68667223025
2019-05-25 | TMP600_26000 | 3879.92447655788
2019-05-25 | TMP700_28000 | 3632.56970774056
2019-05-25 | TMP800_25000 | 3604.1630071504
2019-05-25 | TMP600_28000 | 3572.50801157858
2019-05-25 | TMP500_27000 | 3536.57885829499
2019-05-24 | TMP400_27000 | 5034.53660146287
2019-05-24 | TMP200_24000 | 4646.08844632655
2019-05-24 | TMP600_26000 | 4377.5741555281
2019-05-24 | TMP700_25000 | 4321.11906399066
2019-05-24 | TMP400_24000 | 4071.37184911687
2019-05-24 | TMP600_25000 | 3795.00857752701
2019-05-24 | TMP700_26000 | 3518.6449117614
2019-05-24 | TMP600_24000 | 3368.15348120732
2019-05-24 | TMP200_25000 | 3305.84444172308
2019-05-24 | TMP500_28000 | 3162.57388606668
2019-05-23 | TMP400_27000 | 4057.08620966971
2019-05-23 | TMP700_26000 | 4024.11812392669
...
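The core trick in the numbered CTE - a ROW_NUMBER() that restarts for every pricedate so the top-N can be filtered per day - can be isolated in a small sketch (made-up data; SQLite 3.25+ via Python's sqlite3 stands in for Postgres):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE daily (cons_name TEXT, pricedate TEXT, shadow_sum REAL);
INSERT INTO daily VALUES
 ('A','2019-05-28', 5.0),('B','2019-05-28', 9.0),('C','2019-05-28', 7.0),
 ('A','2019-05-29', 8.0),('B','2019-05-29', 2.0),('C','2019-05-29', 4.0);
""")

# ROW_NUMBER() restarts at 1 for every pricedate, so filtering on it yields a
# per-day top-N; LIMIT alone cannot do that because it applies to the whole
# result set.
rows = con.execute("""
SELECT pricedate, cons_name, shadow_sum
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY pricedate
                               ORDER BY shadow_sum DESC) AS position
  FROM daily
) AS t
WHERE position <= 2
ORDER BY pricedate, position
""").fetchall()
```

Each day contributes its own top two rows, independently of the other days.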

How to write a SQL statement to sum data using group by the same day of every two neighboring months

I have a data table like this:
datetime data
-----------------------
...
2017/8/24 6.0
2017/8/25 5.0
...
2017/9/24 6.0
2017/9/25 6.2
...
2017/10/24 8.1
2017/10/25 8.2
I want to write a SQL statement to sum the data using group by the 24th of every two neighboring months in certain range of time such as : from 2017/7/20 to 2017/10/25 as above.
How to write this SQL statement? I'm using SQL Server 2008 R2.
The expected results table is like this:
datetime_range data_sum
------------------------------------
...
2017/8/24~2017/9/24 100.9
2017/9/24~2017/10/24 120.2
...
One conceptual way to proceed here is to redefine a "month" as ending on the 24th: any date after the 24th belongs to the accounting period that ends in the following month. We can compute the first day of that ending month for each row and aggregate by it; using DATEADD avoids producing a non-existent "month 13" for late-December dates.
WITH cte AS (
SELECT
data,
-- first of the month in which this date's accounting period ends;
-- DATEADD also handles the December -> January (year) rollover
DATEADD(month, DATEDIFF(month, 0,
CASE WHEN DAY(datetime) > 24
THEN DATEADD(month, 1, datetime) ELSE datetime END), 0) AS period_month
FROM yourTable
)
SELECT
CONVERT(varchar(10), DATEADD(day, 24, DATEADD(month, -1, period_month)), 111)
+ '~' +
CONVERT(varchar(10), DATEADD(day, 23, period_month), 111) AS datetime_range,
SUM(data) AS data_sum
FROM cte
GROUP BY period_month;
Note that your suggested ranges seem to include the 24th on both ends, which does not make sense from an accounting point of view. I assume that the month includes and ends on the 24th (i.e. the 25th is the first day of the next accounting period).
Demo
I would suggest dynamically building some date range rows so that you can then join you data to those for aggregation, like this example:
+----+---------------------+---------------------+----------------+
| | period_start_dt | period_end_dt | your_data_here |
+----+---------------------+---------------------+----------------+
| 1 | 24.04.2017 00:00:00 | 24.05.2017 00:00:00 | 1 |
| 2 | 24.05.2017 00:00:00 | 24.06.2017 00:00:00 | 1 |
| 3 | 24.06.2017 00:00:00 | 24.07.2017 00:00:00 | 1 |
| 4 | 24.07.2017 00:00:00 | 24.08.2017 00:00:00 | 1 |
| 5 | 24.08.2017 00:00:00 | 24.09.2017 00:00:00 | 1 |
| 6 | 24.09.2017 00:00:00 | 24.10.2017 00:00:00 | 1 |
| 7 | 24.10.2017 00:00:00 | 24.11.2017 00:00:00 | 1 |
| 8 | 24.11.2017 00:00:00 | 24.12.2017 00:00:00 | 1 |
| 9 | 24.12.2017 00:00:00 | 24.01.2018 00:00:00 | 1 |
| 10 | 24.01.2018 00:00:00 | 24.02.2018 00:00:00 | 1 |
| 11 | 24.02.2018 00:00:00 | 24.03.2018 00:00:00 | 1 |
| 12 | 24.03.2018 00:00:00 | 24.04.2018 00:00:00 | 1 |
+----+---------------------+---------------------+----------------+
DEMO
declare @start_dt date;
set @start_dt = '20170424';
select
period_start_dt, period_end_dt, sum(1) as your_data_here
from (
select
dateadd(month,m.n,start_dt) period_start_dt
, dateadd(month,m.n+1,start_dt) period_end_dt
from (
select @start_dt start_dt ) seed
cross join (
select 0 n union all
select 1 union all
select 2 union all
select 3 union all
select 4 union all
select 5 union all
select 6 union all
select 7 union all
select 8 union all
select 9 union all
select 10 union all
select 11
) m
) r
-- LEFT JOIN YOUR DATA
-- ON yourdata.date >= r.period_start_dt and yourdata.date < r.period_end_dt
group by
period_start_dt, period_end_dt
Please don't be tempted to use "between" when it comes to joining to your data. Follow the note above and use yourdata.date >= r.period_start_dt and yourdata.date < r.period_end_dt; otherwise you could double-count information, as between is inclusive of both the lower and upper boundaries.
I think the simplest way is to subtract 25 days and aggregate by the month:
select year(dateadd(day, -25, datetime)) as yr,
month(dateadd(day, -25, datetime)) as mon,
sum(data)
from t
group by year(dateadd(day, -25, datetime)), month(dateadd(day, -25, datetime));
You can format yr and mon to get the dates for the specific ranges, but this does the aggregation (and the yr/mon columns might be sufficient).
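Here is that subtraction trick run against the question's sample rows, with SQLite (via Python's sqlite3) standing in for SQL Server; strftime('%Y-%m', ...) plays the role of the year/month pair:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (datetime TEXT, data REAL);
INSERT INTO t VALUES
 ('2017-08-24',6.0),('2017-08-25',5.0),
 ('2017-09-24',6.0),('2017-09-25',6.2),
 ('2017-10-24',8.1),('2017-10-25',8.2);
""")

# Shifting every date back 25 days folds each 26th..25th stretch into one
# calendar month, so a plain month GROUP BY does the bucketing.
rows = con.execute("""
SELECT strftime('%Y-%m', date(datetime, '-25 days')) AS mon, SUM(data)
FROM t
GROUP BY mon
ORDER BY mon
""").fetchall()
```

Note this variant keeps the 25th together with the preceding period (the buckets run from the 26th through the 25th); subtract 24 days instead if the 25th should open a new period.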
Step 0: Build a calendar table. Every database needs a calendar table eventually to simplify this sort of calculation.
In this table you may have columns such as:
Date (primary key)
Day
Month
Year
Quarter
Half-year (e.g. 1 or 2)
Day of year (1 to 366)
Day of week (numeric or text)
Is weekend (seems redundant now, but is a huge time saver later on)
Fiscal quarter/year (if your company's fiscal year doesn't start on Jan. 1)
Is Holiday
etc.
If your company starts its month on the 24th, then you can add a "Fiscal Month" column that represents that.
Step 1: Join on the calendar table
Step 2: Group by the columns in the calendar table.
Calendar tables sound weird at first, but once you realize that they are in fact tiny even when they span a couple hundred years, they quickly become a major asset.
Don't try to cheap out on disk space by using computed columns. You want real columns because they are much faster and can be indexed if necessary. (Though honestly, usually just the PK index is enough for even wide calendar tables.)
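To make the calendar-table idea concrete, here is a tiny sketch that builds one with a "fiscal month" column (running from the 25th through the 24th, one possible convention) and joins data to it. All names and values are illustrative; SQLite via Python's sqlite3 is the harness:

```python
import datetime
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE calendar (date TEXT PRIMARY KEY, year INT, month INT, day INT,
                       fiscal_month TEXT);
CREATE TABLE t (datetime TEXT, data REAL);
INSERT INTO t VALUES ('2017-08-24',6.0),('2017-08-25',5.0),
                     ('2017-09-24',6.0),('2017-09-25',6.2);
""")

# Populate the calendar; fiscal months run from the 25th through the 24th and
# are labelled by the calendar month in which they end.
d, end = datetime.date(2017, 8, 1), datetime.date(2017, 10, 31)
cal = []
while d <= end:
    label = d if d.day <= 24 else (d.replace(day=1) + datetime.timedelta(days=32)).replace(day=1)
    cal.append((d.isoformat(), d.year, d.month, d.day, label.strftime('%Y-%m')))
    d += datetime.timedelta(days=1)
con.executemany("INSERT INTO calendar VALUES (?,?,?,?,?)", cal)

# Steps 1 and 2: join on the calendar table, group by its columns.
rows = con.execute("""
SELECT c.fiscal_month, SUM(t.data)
FROM t INNER JOIN calendar c ON c.date = t.datetime
GROUP BY c.fiscal_month
ORDER BY c.fiscal_month
""").fetchall()
```

Once the calendar exists, every odd grouping rule becomes an ordinary GROUP BY on one of its columns.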

How do I group by month when I have data in a time range, accurate up to the second?

I'd like to ask if there's a way to group my data by months in this case:
I have table of orders, with order Ids in a column and the dates the orders were created in another.
For example,
orderId | creationDate
58111 | 2017-01-01 00:00:00
58111 | 2017-01-12 00:00:00
58232 | 2017-01-31 00:00:00
62882 | 2017-02-21 00:00:00
90299 | 2017-03-20 00:00:00
I need to find the number of unique orderIds, grouped by month. Normally this would be simple, but with my creationDates accurate to the second, I have no idea how to segment them into months. Ideally, this is what I'd obtain:
creationMonth | count_orderId
January | 2
February | 1
March | 1
Try this:
select year( creationDate ), month( creationDate ), count( distinct orderId ) as count_orderId
from my_table
group by year( creationDate ), month( creationDate );
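As a runnable sketch against the question's sample data, the same grouping can be done with SQLite via Python's sqlite3, where strftime('%Y-%m', ...) stands in for the year()/month() pair:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (orderId INT, creationDate TEXT);
INSERT INTO orders VALUES
 (58111,'2017-01-01 00:00:00'),(58111,'2017-01-12 00:00:00'),
 (58232,'2017-01-31 00:00:00'),(62882,'2017-02-21 00:00:00'),
 (90299,'2017-03-20 00:00:00');
""")

# strftime('%Y-%m', ...) truncates a timestamp to its month, and
# COUNT(DISTINCT ...) collapses the repeated orderIds within each month.
rows = con.execute("""
SELECT strftime('%Y-%m', creationDate) AS creationMonth,
       COUNT(DISTINCT orderId) AS count_orderId
FROM orders
GROUP BY creationMonth
ORDER BY creationMonth
""").fetchall()
```

This yields one row per month with the duplicate 58111 counted once, matching the expected output (with numeric months rather than names).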