Using subquery in conjunction with a WHERE clause - sql

There exists the following table:
practice=# select * from percentages;
letter | value | year
--------+---------+------
A | 5000.00 | 2021
B | 6000.00 | 2021
C | 6000.00 | 2021
B | 8000.00 | 2022
A | 9000.00 | 2022
C | 7000.00 | 2022
A | 2000.00 | 2021
B | 1000.00 | 2022
C | 3000.00 | 2021
(9 rows)
In order to calculate the percentages of A, B, and C relative to the total value (i.e. the sum of A values in the table divided by the sum of all values in the table), I am using a subquery as follows:
practice=# select letter, cast((group_values/(select sum(value) from percentages)*100) as decimal(4,2)) as group_values from (select letter, sum(value) as group_values from percentages group by letter order by letter) as subquery order by group_values desc;
letter | group_values
--------+--------------
A | 34.04
C | 34.04
B | 31.91
(3 rows)
However, I now want to be able to filter the results by year, e.g. calculate the above only where the year entries are 2022, for instance.
I have tried incorporating a WHERE clause within the subquery to filter by year.
select letter, cast((group_values/(select sum(value) from percentages)*100) as decimal(4,2)) as group_values from (select letter, sum(value) as group_values from percentages where year='2022' group by letter order by letter) as subquery order by group_values desc;
However, we can see that this does not update the total to only include the entries for 2022. Instead, it seems that SQL is calculating the percentage entries for 2022 across A, B, and C for the total across all years.
letter | group_values
--------+--------------
A | 19.15
B | 19.15
C | 14.89
(3 rows)
Similarly, using the WHERE clause outside the subquery results in an error:
select letter, cast((group_values/(select sum(value) from percentages)*100) as decimal(4,2)) as group_values from (select letter, sum(value) as group_values from percentages group by letter order by letter) as subquery where year='2022' order by group_values desc;
ERROR: column "year" does not exist

The scalar subquery that computes the total still sums over all the years; adding the WHERE clause there as well restricts the total to 2022.
I also removed the unnecessary ORDER BY letter from the inner query.
SELECT
letter,
CAST((group_values / (SELECT
SUM(value)
FROM
percentages
WHERE
year = '2022') * 100)
AS DECIMAL (4 , 2 )) AS group_values
FROM
(SELECT
letter, SUM(value) AS group_values
FROM
percentages
WHERE
year = '2022'
GROUP BY letter) AS subquery
ORDER BY group_values DESC;

select letter, year, sum(value) * 100.0 / sum(sum(value)) over (partition by year)
from T
where year = 2022
group by letter, year;
With the filter in place, every remaining row falls into a single partition, so the PARTITION BY is redundant; it matters once the filter is removed, because each year then gets its own total.
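To check that restricting both the groups and the total to one year gives the intended percentages, here is a minimal sketch that loads the question's nine rows into an in-memory SQLite database through Python's sqlite3 module and runs a window-function query. SQLite is only a stand-in for PostgreSQL here (window functions need SQLite 3.25 or later), and the helper name percentages_for_year is made up for illustration.

```python
import sqlite3

# The nine rows from the question's table.
ROWS = [
    ("A", 5000.0, 2021), ("B", 6000.0, 2021), ("C", 6000.0, 2021),
    ("B", 8000.0, 2022), ("A", 9000.0, 2022), ("C", 7000.0, 2022),
    ("A", 2000.0, 2021), ("B", 1000.0, 2022), ("C", 3000.0, 2021),
]


def percentages_for_year(year):
    """Return (letter, percent) pairs, each relative to that year's total."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE percentages (letter TEXT, value REAL, year INTEGER)")
    conn.executemany("INSERT INTO percentages VALUES (?, ?, ?)", ROWS)
    query = """
        SELECT letter,
               ROUND(SUM(value) * 100.0
                     / SUM(SUM(value)) OVER (PARTITION BY year), 2) AS pct
        FROM percentages
        WHERE year = ?
        GROUP BY letter, year
        ORDER BY pct DESC, letter
    """
    result = conn.execute(query, (year,)).fetchall()
    conn.close()
    return result


print(percentages_for_year(2022))
```

For 2022 the groups sum to 9000, 9000 and 7000 out of 25000, so the sketch should report 36, 36 and 28 percent for A, B and C.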

Related

Cumulative Sum Query in SQL table with distinct elements

I have a table like this, with column names as Date of Sale and insurance Salesman Names -
Date of Sale | Salesman Name | Sale Amount
2021-03-01 | Jack | 40
2021-03-02 | Mark | 60
2021-03-03 | Sam | 30
2021-03-03 | Mark | 70
2021-03-02 | Sam | 100
I want to do a group by, using the date of sale. The next column should display the cumulative count of the sellers who have made the sale till that date. But same sellers shouldn't be considered again.
For example,
The following table is incorrect,
Date of Sale | Count(Salesman Name) | Sum(Sale Amount)
2021-03-01 | 1 | 40
2021-03-02 | 3 | 200
2021-03-03 | 5 | 300
The following table is correct,
Date of Sale | Count(Salesman Name) | Sum(Sale Amount)
2021-03-01 | 1 | 40
2021-03-02 | 3 | 200
2021-03-03 | 3 | 300
I am not sure how to frame the SQL query, because there are two conditions involved here, cumulative count while ignoring the duplicates. I think the OVER clause along with the unbounded row preceding may be of some use here? Request your help
Edit - I have added the Sale Amount as a column. I need the cumulative sum for the Sales Amount also. But in this case , all the sale amounts should be considered unlike the salesman name case where only unique names were being considered.
One approach uses a self join and aggregation:
WITH cte AS (
SELECT t1.SaleDate,
COUNT(CASE WHEN t2.Salesman IS NULL THEN 1 END) AS cnt,
SUM(t1.SaleAmount) AS amt
FROM yourTable t1
LEFT JOIN yourTable t2
ON t2.Salesman = t1.Salesman AND
t2.SaleDate < t1.SaleDate
GROUP BY t1.SaleDate
)
SELECT
SaleDate,
SUM(cnt) OVER (ORDER BY SaleDate) AS NumSalesman,
SUM(amt) OVER (ORDER BY SaleDate) AS TotalAmount
FROM cte
ORDER BY SaleDate;
The logic in the CTE is that we try to find, for each salesman, an earlier record for the same salesman. If we can't find such a record, then we assume the record in question is the first appearance. Then we aggregate by date to get the counts per day, and finally take a rolling sum of counts in the outer query.
The best way to do this uses window functions to determine the first time a sales person appears. Then, you just want cumulative sums:
select saledate,
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by saledate) as num_salespersons,
sum(sum(sales)) over (order by saledate) as running_sales
from (select t.*,
row_number() over (partition by salesperson order by saledate) as seqnum
from t
) t
group by saledate
order by saledate;
Note that, in addition to being more concise, this should perform much better than a solution that uses a self-join.
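The first-appearance idea can be tried out on the question's five rows. The following is a minimal sketch that runs the query in an in-memory SQLite database via Python's sqlite3 module; SQLite (3.25 or later) is only a stand-in for whichever database is actually in use, and the table name sales, its column names, and the helper name running_totals are assumptions made for the example.

```python
import sqlite3

# The five sales from the question: (date, salesperson, amount).
SALES = [
    ("2021-03-01", "Jack", 40),
    ("2021-03-02", "Mark", 60),
    ("2021-03-03", "Sam", 30),
    ("2021-03-03", "Mark", 70),
    ("2021-03-02", "Sam", 100),
]


def running_totals():
    """Return (date, cumulative distinct salespersons, cumulative amount)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saledate TEXT, salesperson TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", SALES)
    query = """
        SELECT saledate,
               -- count a salesperson only on the date of their first sale
               SUM(SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END))
                   OVER (ORDER BY saledate) AS num_salespersons,
               -- every sale counts towards the running amount
               SUM(SUM(amount)) OVER (ORDER BY saledate) AS running_sales
        FROM (SELECT s.*,
                     ROW_NUMBER() OVER (PARTITION BY salesperson
                                        ORDER BY saledate) AS seqnum
              FROM sales s) AS first_seen
        GROUP BY saledate
        ORDER BY saledate
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in running_totals():
    print(row)
```

The inner SUM aggregates per date and the outer SUM ... OVER accumulates those per-date values, which is why the aggregate is nested inside the window function.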

Is there a way to select sum on one column based on other DISTINCT column, while grouping by third column(date) only

I have three columns
year | money | id
2020 100 01
2020 100 01
2019 50 02
2018 50 03
2020 40 04
results should be
Year | Money | total people
2020 | 240 | 4
** The first two rows have the same id. I tried the query below:
select year, sum(money), Count( Distinct id) from table
group by year
The result shows 4 people, which is correct, but the sum is wrong because it counts all of the money.
You can aggregate and then aggregate again:
select max(year), sum(money), count(*)
from (select distinct year, money, id
from t
) t;
You can use SUM() and COUNT(DISTINCT x).
For example:
select
year,
sum(money) as money,
(select count(distinct id) from t) as total_people
from t
where year = 2020
group by year;
Result:
YEAR MONEY TOTAL_PEOPLE
----- ------ ------------
2020 240 4
See running example at db<>fiddle.
Not the most performant, but if you wish to avoid a derived table, you can do
select distinct
max(year) over (),
sum(money) over (),
count(*) over ()
from t
group by year, money, id;
If you want this grouped by year, add PARTITION BY year to each OVER clause and select year instead of max(year).
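As a rough illustration of the per-year variant, the sketch below runs the partitioned form of the last query against the question's five rows in an in-memory SQLite database (via Python's sqlite3, SQLite 3.25 or later). SQLite stands in for the unspecified database, and the helper name totals_per_year is hypothetical.

```python
import sqlite3

# The five rows from the question: (year, money, id). The first two are duplicates.
ROWS = [
    (2020, 100, 1),
    (2020, 100, 1),
    (2019, 50, 2),
    (2018, 50, 3),
    (2020, 40, 4),
]


def totals_per_year():
    """Return (year, money, people) with duplicate (year, money, id) rows counted once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (year INTEGER, money INTEGER, id INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ROWS)
    query = """
        SELECT DISTINCT
               year,
               SUM(money) OVER (PARTITION BY year) AS money,
               COUNT(*)   OVER (PARTITION BY year) AS total_people
        FROM t
        GROUP BY year, money, id   -- collapses exact duplicate rows first
        ORDER BY year
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(totals_per_year())
```

The GROUP BY removes the duplicate row before the window functions run, so 2020 is reported as 140 for two people rather than 240 for three rows.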

SQL Server Sum data between two dates group by date

I have following data in my table:
eb |anz
05.03.2020 | 2
06.03.2020 | 3
07.03.2020 | 1
08.03.2020 | 9
09.03.2020 | 10
10.03.2020 | 2
11.03.2020 | 20
12.03.2020 | 25
Now I need to sum the values in specific range for each date.
For example, for "12.03.2020" I want to sum the values of the 12th, 11th, 10th and 9th of March. Additionally, I want to sum the four values before the 9th of March and divide the first sum by the second in the select.
So my calculation would be: (25+20+2+10)/(9+1+3+2) = 3.8
I would like to output the date and the calculated value for each date in table.
I tried to sum the first group for each date (in example 9th to 12th march) but the output is the same as the data in the table.
select
eb,
sum(anz)
from (select eb, count(*) as anz from myTable where eb != '' group by eb) tmp
where
convert(date, eb, 104) >= dateadd(day,-3,convert(datetime, eb, 104))
and convert(date, eb, 104) <= convert(date, eb, 104)
group by eb
order by convert(date, eb, 104)
It looks like the condition is being ignored. Do you have any advice for me?
Thanks a lot
Assuming that the data is stored correctly as a date, you can use window functions:
select t.*,
(sum(anz) over (order by eb rows between 3 preceding and current row) /
sum(anz) over (order by eb rows between 7 preceding and 4 preceding)
)
from t;
Note that if anz is an integer, multiply by 1.0 before dividing to avoid integer division.
This also assumes that there is exactly one row for each date.
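To sanity-check the frame boundaries against the worked example in the question, here is a minimal sketch using an in-memory SQLite database through Python's sqlite3 module. SQLite (3.25 or later) is only a stand-in for SQL Server, the dates are stored as ISO strings so that they sort correctly, and the names my_table and rolling_ratios are assumptions. The second frame covers the rows 4 to 7 positions back, which are the four values before the current four-row window.

```python
import sqlite3

# One row per day, dates rewritten in ISO format: (eb, anz).
ROWS = [
    ("2020-03-05", 2), ("2020-03-06", 3), ("2020-03-07", 1), ("2020-03-08", 9),
    ("2020-03-09", 10), ("2020-03-10", 2), ("2020-03-11", 20), ("2020-03-12", 25),
]


def rolling_ratios():
    """Return (date, sum of last 4 rows / sum of the 4 rows before them)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (eb TEXT, anz INTEGER)")
    conn.executemany("INSERT INTO my_table VALUES (?, ?)", ROWS)
    query = """
        SELECT eb,
               SUM(anz) OVER (ORDER BY eb
                              ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) * 1.0
               / SUM(anz) OVER (ORDER BY eb
                                ROWS BETWEEN 7 PRECEDING AND 4 PRECEDING) AS ratio
        FROM my_table
        ORDER BY eb
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for eb, ratio in rolling_ratios():
    print(eb, ratio)
```

For the first four dates the second frame is empty, so the ratio comes back as NULL (None in Python); for 2020-03-12 it is 57 / 15.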

Vertica SQL for running count distinct and running conditional count

I'm trying to build a department level score table based on a deeper product url level score table.
Date is not consecutive
Not all urls got score updates at same day (independent to each other)
dist_url should be running count distinct (cumulative count distinct)
dist urls and urls score >=30 are both count distinct
What I have now is:
Date url Store Dept Page Score
10/1 a US A X 10
10/1 b US A X 30
10/1 c US A X 60
10/4 a US A X 20
10/4 d US A X 60
10/6 b US A X 22
10/9 a US A X 40
10/9 e US A X 10
What I want is:
Date Store Dept Page dist urls urls score >=30
10/1 US A X 3 2
10/4 US A X 4 3
10/6 US A X 4 2
10/9 US A X 5 2
I think the dist_url can be done by using window function, just not sure on query.
Current query is as below, but it's wrong since not cumulative count distinct:
SELECT
bm.AnalysisDate,
su.SoID AS Store,
su.DptCaID AS DTID,
su.PageTypeID AS PTID,
COUNT(DISTINCT bm.SeoURLID) AS NumURLsWithDupScore,
SUM(CASE WHEN bm.DuplicationScore > 30 THEN 1 ELSE 0 END) AS Over30Count
FROM csn_seo.tblBotifyMetrics bm
INNER JOIN csn_seo.tblSEOURLs su
ON bm.SeoURLID = su.ID
WHERE su.DptCaID IS NOT NULL
AND su.DptCaID <> 0
AND su.PageTypeID IS NOT NULL
AND su.PageTypeID <> -1
AND bm.iscompliant = 1
GROUP BY bm.AnalysisDate, su.SoID, su.DptCaID, su.PageTypeID;
Please let me know if anyone has any idea.
Based on your question, you seem to want two levels of logic:
select date, store, dept,
sum(sum(start)) over (partition by dept, page order by date) as distinct_urls,
sum(sum(start_30)) over (partition by dept, page order by date) as distinct_urls_30
from ((select store, dept, page, url, min(date) as date, 1 as start, 0 as start_30
from t
group by store, dept, page, url
) union all
(select store, dept, page, url, min(date) as date, 0, 1
from t
where score >= 30
group by store, dept, page, url
)
) t
group by date, store, dept, page;
I don't understand how your query is related to your question.
Try as I might, I don't get your output either:
But I think you can avoid UNION SELECTs - Does this do what you expect?
NULLS don't figure in COUNT DISTINCTs - and here you can combine an aggregate expression with an OLAP one ...
And Vertica has named windows to increase readability ....
WITH
input(Date,url,Store,Dept,Page,Score) AS (
SELECT DATE '2019-10-01','a','US','A','X',10
UNION ALL SELECT DATE '2019-10-01','b','US','A','X',30
UNION ALL SELECT DATE '2019-10-01','c','US','A','X',60
UNION ALL SELECT DATE '2019-10-04','a','US','A','X',20
UNION ALL SELECT DATE '2019-10-04','d','US','A','X',60
UNION ALL SELECT DATE '2019-10-06','b','US','A','X',22
UNION ALL SELECT DATE '2019-10-09','a','US','A','X',40
UNION ALL SELECT DATE '2019-10-09','e','US','A','X',10
)
SELECT
date
, store
, dept
, page
, SUM(COUNT(DISTINCT url) ) OVER(w) AS dist_urls
, SUM(COUNT(DISTINCT CASE WHEN score >=30 THEN url END)) OVER(w) AS dist_urls_gt_30
FROM input
GROUP BY
date
, store
, dept
, page
WINDOW w AS (PARTITION BY store,dept,page ORDER BY date)
;
-- out date | store | dept | page | dist_urls | dist_urls_gt_30
-- out ------------+-------+------+------+-----------+-----------------
-- out 2019-10-01 | US | A | X | 3 | 2
-- out 2019-10-04 | US | A | X | 5 | 3
-- out 2019-10-06 | US | A | X | 6 | 3
-- out 2019-10-09 | US | A | X | 8 | 4
-- out (4 rows)
-- out
-- out Time: First fetch (4 rows): 45.321 ms. All rows formatted: 45.364 ms
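The output above sums the per-day distinct counts, so a URL that reappears on a later date is counted again. One way to get a true running distinct count is to flag each URL's first appearance and accumulate only those flags. The sketch below tries that on the question's sample rows; it covers only the "dist urls" column, runs in an in-memory SQLite database (3.25 or later) via Python's sqlite3 as a stand-in for Vertica, and the table name scores, the column name score_date and the helper name running_distinct_urls are assumptions.

```python
import sqlite3

# Sample rows from the question: (date, url, store, dept, page, score).
ROWS = [
    ("2019-10-01", "a", "US", "A", "X", 10),
    ("2019-10-01", "b", "US", "A", "X", 30),
    ("2019-10-01", "c", "US", "A", "X", 60),
    ("2019-10-04", "a", "US", "A", "X", 20),
    ("2019-10-04", "d", "US", "A", "X", 60),
    ("2019-10-06", "b", "US", "A", "X", 22),
    ("2019-10-09", "a", "US", "A", "X", 40),
    ("2019-10-09", "e", "US", "A", "X", 10),
]


def running_distinct_urls():
    """Return (date, cumulative number of distinct urls seen so far)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scores (score_date TEXT, url TEXT, store TEXT, "
        "dept TEXT, page TEXT, score INTEGER)"
    )
    conn.executemany("INSERT INTO scores VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    query = """
        SELECT score_date,
               -- only the first appearance of each url contributes 1
               SUM(SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END))
                   OVER (PARTITION BY store, dept, page
                         ORDER BY score_date) AS dist_urls
        FROM (SELECT s.*,
                     ROW_NUMBER() OVER (PARTITION BY store, dept, page, url
                                        ORDER BY score_date) AS seqnum
              FROM scores s) AS flagged
        GROUP BY score_date, store, dept, page
        ORDER BY score_date
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in running_distinct_urls():
    print(row)
```

On the sample data this yields 3, 4, 4 and 5, matching the "dist urls" column the question asks for. The ">= 30" column depends on each URL's latest score and is not handled by this sketch.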

Postgres count number of rows and group them by timestamp

Let's assume I have one table in postgres with just 2 columns:
ID which is PK for the table (bigint)
time which is type of timestamp
Is there any way to get the row counts grouped by year? For example, a time of 18 February 2005 would fall into the 2005 group, so the result would be:
year number of rows
1998 2
2005 5
And if the number of result rows is smaller than some number (for example 3), the query should return the result by month instead.
Something like
month number of rows
(February 2018) 5
(March 2018) 2
Is there a nice way to do that in PostgreSQL?
You can do it using window functions (as always).
I use this table:
TABLE times;
id | t
----+-------------------------------
1 | 2018-03-14 20:04:39.81298+01
2 | 2018-03-14 20:04:42.92462+01
3 | 2018-03-14 20:04:45.774615+01
4 | 2018-03-14 20:04:48.877038+01
5 | 2017-03-14 20:05:08.94096+01
6 | 2017-03-14 20:05:16.123736+01
7 | 2017-03-14 20:05:19.91982+01
8 | 2017-01-14 20:05:32.249175+01
9 | 2017-01-14 20:05:35.793645+01
10 | 2017-01-14 20:05:39.991486+01
11 | 2016-11-14 20:05:47.951472+01
12 | 2016-11-14 20:05:52.941504+01
13 | 2016-10-14 21:05:52.941504+02
(13 rows)
First, group by month (subquery per_month).
Then add the sum per year with a window function (subquery with_year).
Finally, use CASE to decide which one you will output and remove duplicates with DISTINCT.
SELECT DISTINCT
CASE WHEN yc > 5
THEN mc
ELSE yc
END AS count,
CASE WHEN yc > 5
THEN to_char(t, 'YYYY-MM')
ELSE to_char(t, 'YYYY')
END AS period
FROM (SELECT
mc,
sum(mc) OVER (PARTITION BY date_trunc('year', t)) AS yc,
t
FROM (SELECT
count(*) AS mc,
date_trunc('month', t) AS t
FROM times
GROUP BY date_trunc('month', t)
) per_month
) with_year
ORDER BY 2;
count | period
-------+---------
3 | 2016
3 | 2017-01
3 | 2017-03
4 | 2018
(4 rows)
Just count years. If it's at least 3, then you group by years, else by months:
select
case when (select count(distinct extract(year from time)) from mytable) >= 3 then
to_char(time, 'yyyy')
else
to_char(time, 'yyyy-mm')
end as season,
count(*)
from mytable
group by season
order by season;
(Unlike many other DBMSs, PostgreSQL allows alias names in the GROUP BY clause.)
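The year-or-month switch can be tried on the thirteen timestamps from the earlier answer. The sketch below is a rough translation for an in-memory SQLite database driven from Python's sqlite3 module: SQLite has no to_char or extract, so strftime is used instead, and the column name ts, the helper name counts_by_period and the min_years parameter are assumptions introduced for the example.

```python
import sqlite3

# Timestamps from the sample table (time zone offsets dropped).
TIMES = (
    ["2018-03-14 20:04:39"] * 4
    + ["2017-03-14 20:05:08"] * 3
    + ["2017-01-14 20:05:32"] * 3
    + ["2016-11-14 20:05:47"] * 2
    + ["2016-10-14 21:05:52"]
)


def counts_by_period(min_years=3):
    """Group by year if there are at least min_years distinct years, else by month."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, ts TEXT)")
    conn.executemany("INSERT INTO mytable (ts) VALUES (?)", [(t,) for t in TIMES])
    query = """
        SELECT CASE WHEN (SELECT COUNT(DISTINCT strftime('%Y', ts))
                          FROM mytable) >= ?
                    THEN strftime('%Y', ts)
                    ELSE strftime('%Y-%m', ts)
               END AS season,
               COUNT(*) AS n
        FROM mytable
        GROUP BY season
        ORDER BY season
    """
    result = conn.execute(query, (min_years,)).fetchall()
    conn.close()
    return result


print(counts_by_period(3))  # three distinct years, so grouped by year
print(counts_by_period(4))  # fewer than four years, so grouped by month
```

The scalar subquery is evaluated over the whole table, so every row takes the same branch of the CASE and the result is grouped consistently by either year or month.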