Vertica SQL for running count distinct and running conditional count

I'm trying to build a department-level score table from a more granular, product-URL-level score table.
Dates are not consecutive.
Not all URLs get score updates on the same day (they are independent of each other).
dist_urls should be a running (cumulative) count distinct.
Both dist_urls and the count of URLs with score >= 30 are distinct counts.
What I have now is:
Date  url  Store  Dept  Page  Score
10/1  a    US     A     X     10
10/1  b    US     A     X     30
10/1  c    US     A     X     60
10/4  a    US     A     X     20
10/4  d    US     A     X     60
10/6  b    US     A     X     22
10/9  a    US     A     X     40
10/9  e    US     A     X     10
What I want is:
Date  Store  Dept  Page  dist_urls  urls_score>=30
10/1  US     A     X     3          2
10/4  US     A     X     4          3
10/6  US     A     X     4          2
10/9  US     A     X     5          2
I think dist_urls can be computed with a window function, but I'm not sure how to write the query.
My current query is below, but it is wrong because it does not compute a cumulative count distinct:
SELECT
    bm.AnalysisDate,
    su.SoID AS Store,
    su.DptCaID AS DTID,
    su.PageTypeID AS PTID,
    COUNT(DISTINCT bm.SeoURLID) AS NumURLsWithDupScore,
    SUM(CASE WHEN bm.DuplicationScore > 30 THEN 1 ELSE 0 END) AS Over30Count
FROM csn_seo.tblBotifyMetrics bm
INNER JOIN csn_seo.tblSEOURLs su
    ON bm.SeoURLID = su.ID
WHERE su.DptCaID IS NOT NULL
  AND su.DptCaID <> 0
  AND su.PageTypeID IS NOT NULL
  AND su.PageTypeID <> -1
  AND bm.iscompliant = 1
GROUP BY bm.AnalysisDate, su.SoID, su.DptCaID, su.PageTypeID;
Please let me know if anyone has any idea.

Based on your question, you seem to want two levels of logic:
select date, store, dept, page,
       sum(sum(start)) over (partition by store, dept, page order by date) as distinct_urls,
       sum(sum(start_30)) over (partition by store, dept, page order by date) as distinct_urls_30
from ((select store, dept, page, url, min(date) as date, 1 as start, 0 as start_30
       from t
       group by store, dept, page, url
      ) union all
      (select store, dept, page, url, min(date) as date, 0, 1
       from t
       where score >= 30
       group by store, dept, page, url
      )
     ) t
group by date, store, dept, page;
I don't understand how your query is related to your question.
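The "first appearance" idea can be checked on the sample rows from the question. What follows is a minimal sketch, not the answerer's query: it uses Python's sqlite3 module instead of Vertica, keeps only the date, url and score columns (the sample has a single store, dept and page), and uses illustrative names (scores, first_seen, days, running_counts). It also assumes that "urls score >= 30" means "URLs whose most recent score as of that date is at least 30", which is only one possible reading of the question.

```python
import sqlite3

# Sample rows from the question, with ISO dates so string order equals date order.
ROWS = [
    ("2019-10-01", "a", 10), ("2019-10-01", "b", 30), ("2019-10-01", "c", 60),
    ("2019-10-04", "a", 20), ("2019-10-04", "d", 60),
    ("2019-10-06", "b", 22),
    ("2019-10-09", "a", 40), ("2019-10-09", "e", 10),
]

QUERY = """
WITH first_seen AS (      -- first day on which each url appears
    SELECT url, MIN(day) AS first_day FROM scores GROUP BY url
),
days AS (                 -- every day that has at least one score row
    SELECT DISTINCT day FROM scores
)
SELECT d.day,
       -- cumulative distinct urls: urls first seen on or before this day
       (SELECT COUNT(*) FROM first_seen f WHERE f.first_day <= d.day) AS dist_urls,
       -- assumption: count urls whose latest score as of this day is >= 30
       (SELECT COUNT(*) FROM scores s
         WHERE s.score >= 30
           AND s.day = (SELECT MAX(s2.day) FROM scores s2
                         WHERE s2.url = s.url AND s2.day <= d.day)
       ) AS urls_ge_30
FROM days d
ORDER BY d.day
"""

def running_counts(rows):
    """Return (day, dist_urls, urls_ge_30) tuples for the given score rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (day TEXT, url TEXT, score INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

result = running_counts(ROWS)
for row in result:
    print(row)
```

On this data the dist_urls column comes out as 3, 4, 4, 5, matching the question. Under the "latest score" assumption the second column is 2, 3, 2, 3, which differs from the question's table in the last row (3 rather than 2).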

Try as I might, I can't reproduce your output either.
But I think you can avoid the UNION SELECTs. Does this do what you expect?
NULLs are not counted by COUNT(DISTINCT ...), and here you can combine an aggregate expression with an OLAP (window) one.
Vertica also has named windows to improve readability.
WITH
input(Date,url,Store,Dept,Page,Score) AS (
SELECT DATE '2019-10-01','a','US','A','X',10
UNION ALL SELECT DATE '2019-10-01','b','US','A','X',30
UNION ALL SELECT DATE '2019-10-01','c','US','A','X',60
UNION ALL SELECT DATE '2019-10-04','a','US','A','X',20
UNION ALL SELECT DATE '2019-10-04','d','US','A','X',60
UNION ALL SELECT DATE '2019-10-06','b','US','A','X',22
UNION ALL SELECT DATE '2019-10-09','a','US','A','X',40
UNION ALL SELECT DATE '2019-10-09','e','US','A','X',10
)
SELECT
date
, store
, dept
, page
, SUM(COUNT(DISTINCT url) ) OVER(w) AS dist_urls
, SUM(COUNT(DISTINCT CASE WHEN score >=30 THEN url END)) OVER(w) AS dist_urls_gt_30
FROM input
GROUP BY
date
, store
, dept
, page
WINDOW w AS (PARTITION BY store,dept,page ORDER BY date)
;
-- out date | store | dept | page | dist_urls | dist_urls_gt_30
-- out ------------+-------+------+------+-----------+-----------------
-- out 2019-10-01 | US | A | X | 3 | 2
-- out 2019-10-04 | US | A | X | 5 | 3
-- out 2019-10-06 | US | A | X | 6 | 3
-- out 2019-10-09 | US | A | X | 8 | 4
-- out (4 rows)
-- out
-- out Time: First fetch (4 rows): 45.321 ms. All rows formatted: 45.364 ms

Related

How to calculate the turnover 1 month ago with the day and month values kept as int in SQL Server

This is my table:
id  Total  Date
1   3      410
2   4      121
3   7      630
4   8      629
5   9      101
The date is stored as an int combining month and day (for example, 410 is April 10). How do I find the total amount made in the month before the current month?
Try the following query:
Select Date_/100 as Date_Month, Sum(Total) as MonthlyTotal
From YourTable
-- DATEADD makes January look back to December (12) instead of month 0
Where MONTH(DATEADD(month, -1, GETDATE())) = (Date_/100)
Group By (Date_/100)
See a demo from db<>fiddle.
I agree with the comments above. I understand that you can't change the way you store your "dates", but you should make whoever made that decision miserable: storing dates like that is asking for trouble.
Having said that:
Here I rely on the fact that in SQL Server, dividing one integer by another returns an integer, with any fractional part truncated.
WITH
-- your input, don't use in final query ...
-- I renamed the third column more aptly to "monthday"
indata(id,Total,monthday) AS (
SELECT 1,3.00,410
UNION ALL SELECT 2,4.55,121
UNION ALL SELECT 3,7.40,630
UNION ALL SELECT 4,8.00,629
)
-- real query starts here - replace following comma with "WITH"
,
per_month AS (
SELECT
monthday / 100 AS monthno
, SUM(total) AS monthtot
FROM indata
GROUP BY monthday / 100
-- ctl monthno | monthtot
-- ctl ---------+----------
-- ctl 1 | 4.55
-- ctl 4 | 3.00
-- ctl 6 | 15.40
)
SELECT
*
FROM per_month
WHERE monthno = MONTH(DATEADD(month, -1, GETDATE()));
-- out monthno | monthtot
-- out ---------+----------
-- out 6 | 15.40
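The arithmetic behind both answers (integer division by 100 to get the month, then a lookup of the previous month) can be sketched outside SQL. The Python below is only an illustration: the function names are made up, and the wrap from January back to December is an addition that a plain "current month minus 1" does not handle. Since no year is stored, all years are lumped together.

```python
from collections import defaultdict

def month_of(monthday: int) -> int:
    """410 -> 4, 1231 -> 12: integer division drops the day part."""
    return monthday // 100

def previous_month(month: int) -> int:
    """Month before the given one, wrapping January back to December."""
    return 12 if month == 1 else month - 1

def previous_month_total(rows, current_month: int):
    """Sum Total over all rows whose month is the one before current_month."""
    totals = defaultdict(int)
    for _id, total, monthday in rows:
        totals[month_of(monthday)] += total
    return totals.get(previous_month(current_month), 0)

# (id, Total, Date) rows from the question
rows = [(1, 3, 410), (2, 4, 121), (3, 7, 630), (4, 8, 629), (5, 9, 101)]

print(previous_month_total(rows, 7))  # run in July: sums the June rows
print(previous_month_total(rows, 1))  # run in January: looks at December
```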

Doing a distinct count on an employee history table, based on departments at a current point in time

So I have an employee table with data on all employees since the beginning. It contains everything I should need: the employee's startdate and enddate (null if none) and the name of the department. If an employee changes department, that employee gets a new row with the new department value. Two date columns, "DepValidFrom" and "DepValidto", give the period during which the employee was in that department.
My goal is a matrix with all the departments as rows, year and month as columns, and the number of employees in that department at that time as values. I have all the data; I just cannot find the right way to write my Power BI measure, or perhaps a SQL query.
I am trying to pull this into Power BI, and I am getting an incomplete view. I want my data to look like the following:
Department | Jan | Feb | Mar | Apr |
Dep1 | 3 | 5 | 6 | 4 |
Dep2 | 2 | 3 | 2 | 3 |
Dep3 | 1 | 1 | 2 | 3 |
Right now I am using a very simple DISTINCTCOUNT(Emp_Table[EmployeeInitials]), which gives an incomplete view: it only counts employees on the specific date and does not carry the number forward as a running total, leaving a lot of empty values.
I hope someone can understand what I mean, and that someone can help!
Thanks!
You can start by unpivoting the dates and generating a query that gives the number of employees per department and date:
select e.dept, x.dt, sum(cnt) over(partition by dept order by dt) cnt
from employees e
cross apply (values (startdate, 1), (enddate, -1)) as x(dt, cnt)
where dt is not null
Then, you can do conditional aggregation to pivot the results - this requires enumerating the dates though:
select dept,
max(case when dt >= '20200101' and dt < '20200201' then cnt else 0 end) cnt_202001,
max(case when dt >= '20200201' and dt < '20200301' then cnt else 0 end) cnt_202002,
...
from (
select e.dept, x.dt, sum(cnt) over(partition by dept order by dt) cnt
from employees e
cross apply (values (startdate, 1), (enddate, -1)) as x(dt, cnt)
where dt is not null
) t
group by dept
When an employee changes department in the middle of a month, that employee is counted in both departments for that month.
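The unpivot step treats every start date as a +1 event and every end date as a -1 event, and the running sum of those events per department is the headcount. A small Python sketch of that idea follows; the rows, column order and function name are hypothetical, and it uses the department validity dates rather than the employment dates, which is an assumption about which columns apply.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical rows: (employee, dept, valid_from, valid_to); None means "still there".
rows = [
    ("AB", "Dep1", "2020-01-01", "2020-03-01"),
    ("CD", "Dep1", "2020-02-01", None),
    ("EF", "Dep2", "2020-01-15", None),
]

def headcount_changes(rows):
    """Return {dept: [(date, headcount after that date's events), ...]}."""
    deltas = defaultdict(lambda: defaultdict(int))
    for _emp, dept, start, end in rows:
        deltas[dept][start] += 1        # employee joins the department
        if end is not None:
            deltas[dept][end] -= 1      # employee leaves the department
    result = {}
    for dept, per_date in deltas.items():
        dates = sorted(per_date)        # ISO strings sort chronologically
        running = accumulate(per_date[d] for d in dates)
        result[dept] = list(zip(dates, running))
    return result

headcounts = headcount_changes(rows)
for dept, series in headcounts.items():
    print(dept, series)
```

To fill a month-by-month matrix, each month would take the last headcount value at or before the month's end, so months without any event still carry the previous number forward.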

Those who listened to more than 10 mins each month in the last 6 months

I'm trying to figure out the count of users who listened to more than 10 mins each month in the last 6 months.
We have this event: Song_stopped_listen, and one of its attributes is session_progress_ms.
Now I'm trying to see the monthly evolution of the count of this cohort over the last 6 months.
I'm using BigQuery and this is the query I tried, but I feel that something is off semantically and I can't put my finger on it:
SELECT
CONCAT(CAST(EXTRACT(YEAR FROM DATE (timestamp)) AS STRING),"-",CAST(EXTRACT(MONTH FROM DATE (timestamp)) AS STRING)) AS date
,SUM(absl.session_progress_ms/(1000*60*10)) as total_10_ms, COUNT(DISTINCT u.id) as total_10_listeners
FROM ios.song_stopped_listen as absl
LEFT JOIN ios.users u on absl.user_id = u.id
WHERE absl.timestamp > '2018-05-01'
Group by 1
HAVING(total_10_ms > 1)
Please help figure out what I'm doing wrong here.
Thank you.
data Sample:
user_id | session_progress_ms | timestamp
1 | 10000 | 2017-10-10 14:34:25.656 UTC
What I want to have:
Month-year | Count of users who listened to more than 10 mins
2018-5     | 500
2018-6     | 600
2018-7     | 300
2018-8     | 5100
2018-9     | 4500
2018-10    | 1500
2018-11    | 1500
2018-12    | 2500
Use multiple levels of aggregation:
select user_id
from (select ssl.user_id, timestamp_trunc(ssl.timestamp, month) as mon,
             sum(ssl.session_progress_ms / (1000 * 60)) as total_minutes
      from ios.song_stopped_listen as ssl
      -- the six full calendar months before the current month
      where date(ssl.timestamp) < date_trunc(current_date, month) and
            date(ssl.timestamp) >= date_sub(date_trunc(current_date, month), interval 6 month)
      group by 1, 2
     ) u
where total_minutes >= 10
group by user_id
having count(*) = 6;
To get the count, just use this as a subquery with count(*).
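The two levels of aggregation can be traced by hand in a few lines of Python. This is a sketch with made-up events and a hypothetical function name; the month is passed in as a 'YYYY-MM' string rather than derived from a timestamp, and the threshold is "at least 10 minutes", as in the query above.

```python
from collections import defaultdict

def steady_listeners(events, months, min_minutes=10):
    """events: iterable of (user_id, 'YYYY-MM', session_progress_ms).

    Returns the sorted user ids that reach min_minutes in every listed month.
    """
    wanted = set(months)
    # Level 1: total listening time per (user, month).
    ms_per_user_month = defaultdict(int)
    for user_id, month, progress_ms in events:
        if month in wanted:
            ms_per_user_month[(user_id, month)] += progress_ms
    # Level 2: number of qualifying months per user.
    qualifying_months = defaultdict(int)
    for (user_id, _month), total_ms in ms_per_user_month.items():
        if total_ms >= min_minutes * 60_000:
            qualifying_months[user_id] += 1
    return sorted(u for u, n in qualifying_months.items() if n == len(wanted))

MONTHS = ["2018-07", "2018-08", "2018-09", "2018-10", "2018-11", "2018-12"]
events = []
for m in MONTHS:
    # user 1: two sessions of 5.5 minutes every month
    events += [(1, m, 330_000), (1, m, 330_000)]
    # user 2: 11 minutes every month except September (9 minutes)
    events.append((2, m, 540_000 if m == "2018-09" else 660_000))
# user 3: one long session in a single month
events.append((3, "2018-12", 3_600_000))

listeners = steady_listeners(events, MONTHS)
print(listeners, len(listeners))
```

Only user 1 qualifies here: user 2 misses the threshold in one month and user 3 appears in only one month, which is what the having count(*) = 6 condition filters out.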

How to get the count of distinct values until a time period Impala/SQL?

I have a raw table recording customer ids coming to a store over a particular time period. Using Impala, I would like to calculate the number of distinct customer IDs coming to the store until each day. (e.g., on day 3, 5 distinct customers visited so far)
Here is a simple example of the raw table I have:
Day ID
1 1234
1 5631
1 1234
2 1234
2 4456
2 5631
3 3482
3 3452
3 1234
3 5631
3 1234
Here is what I would like to get:
Day Count(distinct ID) until that day
1 2
2 3
3 5
Is there way to easily do this in a single query?
I'm not 100% sure this will work on Impala, but it should if you have a days table, or a way to create a derived table on the fly in Impala.
CREATE TABLE days ("DayC" int);
INSERT INTO days
("DayC")
VALUES (1), (2), (3);
OR
CREATE TABLE days AS
SELECT DISTINCT "Day" AS "DayC"
FROM sales
You can use this query
SqlFiddleDemo in Postgresql
SELECT "DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN days
WHERE "Day" <= "DayC"
GROUP BY "DayC"
OUTPUT
| DayC | count |
|------|-------|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
UPDATED VERSION
SELECT T."DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN (SELECT DISTINCT "Day" as "DayC" FROM sales) T
WHERE "Day" <= T."DayC"
GROUP BY T."DayC"
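As a quick check of the cross join approach, here is a sketch that runs the updated query on the sample rows from the question. It uses Python's sqlite3 module rather than Impala, so it only shows that the logic gives the expected counts, not that the syntax is accepted by Impala; the table name sales follows the answer.

```python
import sqlite3

# (Day, ID) rows from the question
VISITS = [
    (1, 1234), (1, 5631), (1, 1234),
    (2, 1234), (2, 4456), (2, 5631),
    (3, 3482), (3, 3452), (3, 1234), (3, 5631), (3, 1234),
]

QUERY = """
SELECT T."DayC", COUNT(DISTINCT "ID")
FROM sales
CROSS JOIN (SELECT DISTINCT "Day" AS "DayC" FROM sales) T
WHERE "Day" <= T."DayC"          -- every visit up to and including DayC
GROUP BY T."DayC"
ORDER BY T."DayC"
"""

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE sales ("Day" INTEGER, "ID" INTEGER)')
conn.executemany("INSERT INTO sales VALUES (?, ?)", VISITS)
cumulative = conn.execute(QUERY).fetchall()
conn.close()

print(cumulative)
```

Each day is paired with all visits on or before it, so the join grows roughly with the number of days times the number of rows; that is fine for a sketch but may be slow on a large table.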
try this one:
select day, count(distinct(id)) from yourtable group by day

How to find N Consecutive records in a table using SQL

I have the following table definition with sample data. In the following table, Customer, Product and Date are the key fields.
Table One
Customer Product Date SALE
X A 01/01/2010 YES
X A 02/01/2010 YES
X A 03/01/2010 NO
X A 04/01/2010 NO
X A 05/01/2010 YES
X A 06/01/2010 NO
X A 07/01/2010 NO
X A 08/01/2010 NO
X A 09/01/2010 YES
X A 10/01/2010 YES
X A 11/01/2010 NO
X A 12/01/2010 YES
In the above table, I need to find runs of N or more consecutive records where there was no sale (SALE value 'NO').
For example, if N is 2, the result set would be the following:
Customer Product Date SALE
X A 03/01/2010 NO
X A 04/01/2010 NO
X A 06/01/2010 NO
X A 07/01/2010 NO
X A 08/01/2010 NO
Can someone help me with a SQL query to get the desired results? I am using SQL Server 2005. I started experimenting with ROW_NUMBER() and PARTITION BY clauses, but had no luck.
Thanks for any help
You need to match your table against itself, as if there were two tables. So you use two aliases, o1 and o2, to refer to your table:
SELECT DISTINCT o1.customer, o1.product, o1.datum, o1.sale
FROM one o1, one o2
WHERE (o1.datum = o2.datum-1 OR o1.datum = o2.datum +1)
AND o1.sale = 'NO'
AND o2.sale = 'NO';
customer | product | datum | sale
----------+---------+------------+------
X | A | 2010-01-03 | NO
X | A | 2010-01-04 | NO
X | A | 2010-01-06 | NO
X | A | 2010-01-07 | NO
X | A | 2010-01-08 | NO
Note that I ran the query on a PostgreSQL database. The syntax may differ on MS SQL Server, perhaps in the alias ('FROM one AS o1'), and you may not be able to add to or subtract from dates in that way.
A different approach, inspired by munchs last line.
For a given date, get the first date with YES later than it and the last date with YES earlier than it. These form the boundaries between which our dates must fit.
SELECT (o1.datum),
MAX (o3.datum) - MIN (o2.datum) AS diff
FROM one o1, one o2, one o3
WHERE o1.sale = 'NO'
AND o3.datum <
(SELECT MIN (datum)
FROM one
WHERE datum >= o1.datum
AND SALE = 'YES')
AND o2.datum >
(SELECT MAX (datum)
FROM one
WHERE datum <= o1.datum
AND SALE = 'YES')
GROUP BY o1.datum
HAVING MAX (o3.datum) - MIN (o2.datum) >= 2
ORDER BY o1.datum;
Maybe it needs some kind of optimization, because table one is involved five times in the query. :)
Thanks to everyone for posting your solutions. I thought I would also share mine. As an FYI, I received this solution from a SQL Server Central forum member, so I am not going to take credit for it.
DECLARE @CNT INT
SELECT @CNT = 3
SELECT * FROM
(
    SELECT
        [Customer], [Product], [Date], [Sale], groupID,
        COUNT(*) OVER (PARTITION BY [Customer], [Product], [Sale], groupID) AS groupCnt
    FROM
    (
        SELECT
            [Customer], [Product], [Date], [Sale],
            ROW_NUMBER() OVER (PARTITION BY [Customer], [Product] ORDER BY [Date])
            - ROW_NUMBER() OVER (PARTITION BY [Customer], [Product], [Sale] ORDER BY [Date]) AS groupID
        FROM
            [TableSales]
    ) T1
) T2
WHERE
    T2.[Sale] = 'NO' AND T2.[groupCnt] >= @CNT
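The difference of the two ROW_NUMBER() values stays constant within a run of equal Sale values, so it works as a group id for each run. Below is a sketch of the same query run on the question's sample data through Python's sqlite3 module (which needs SQLite 3.25 or later for window functions) instead of SQL Server. The snake_case table and column names and the function name are illustrative, and the dates are read as January 1 to 12, 2010.

```python
import sqlite3

# Sample data from the question for customer X, product A.
SALES = [
    ("2010-01-01", "YES"), ("2010-01-02", "YES"), ("2010-01-03", "NO"),
    ("2010-01-04", "NO"), ("2010-01-05", "YES"), ("2010-01-06", "NO"),
    ("2010-01-07", "NO"), ("2010-01-08", "NO"), ("2010-01-09", "YES"),
    ("2010-01-10", "YES"), ("2010-01-11", "NO"), ("2010-01-12", "YES"),
]

QUERY = """
SELECT sale_date
FROM (
    SELECT sale_date, sale,
           -- size of the run this row belongs to
           COUNT(*) OVER (PARTITION BY customer, product, sale, group_id) AS group_cnt
    FROM (
        SELECT customer, product, sale_date, sale,
               -- constant within a run of consecutive equal sale values
               ROW_NUMBER() OVER (PARTITION BY customer, product ORDER BY sale_date)
             - ROW_NUMBER() OVER (PARTITION BY customer, product, sale ORDER BY sale_date)
               AS group_id
        FROM table_sales
    ) AS t1
) AS t2
WHERE sale = 'NO' AND group_cnt >= ?
ORDER BY sale_date
"""

def no_sale_runs(rows, n):
    """Dates that belong to a run of at least n consecutive 'NO' rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table_sales (customer TEXT, product TEXT, sale_date TEXT, sale TEXT)"
    )
    conn.executemany(
        "INSERT INTO table_sales VALUES ('X', 'A', ?, ?)", rows
    )
    dates = [r[0] for r in conn.execute(QUERY, (n,))]
    conn.close()
    return dates

print(no_sale_runs(SALES, 2))
print(no_sale_runs(SALES, 3))
```

With N = 2 this returns the five dates listed in the question (the 3rd, 4th, 6th, 7th and 8th); with N = 3 only the 6th to 8th remain.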
Ok, we need a variable answer. We search for a date, where we have N following dates, all with the sale-field being NO.
SELECT d1.datum
FROM one d1, one d2, i
WHERE d1.sale = 'NO' AND d2.sale = 'NO'
AND d1.datum = (d2.datum - i)
AND i > 0 AND i < 4
GROUP BY d1.datum
HAVING COUNT (*) = 3;
This will give us the date, which we use for subquerying.
Notes:
I used 'datum' instead of date, because date is a reserved keyword in PostgreSQL.
In Oracle you can use a virtual dummy table that contains anything you ask for, like 'SELECT foo FROM dual WHERE foo in (1, 2, 3);', which will give you 1, 2, 3, if I remember correctly. Depending on the vendor, there might be other tricks to get a sequence 1 to N. I created a table i with a column i, filled it with the values 1 to 100, and I expect N not to exceed 100. For a few versions now, PostgreSQL has had a function generate_series(from, to), which would solve the problem too and might resemble solutions for your specific database. But table i should work independently of the vendor.
If N == 17, you have to change 3 to 17 in three places.
The final query will be:
SELECT o4.*
FROM one o3, one o4
WHERE o3.datum = (
SELECT d1.datum
FROM one d1, one d2, i
WHERE d1.sale = 'NO' AND d2.sale = 'NO'
AND d1.datum = (d2.datum - i)
AND i > 0 AND i <= 3
GROUP BY d1.datum
HAVING COUNT (*) = 3)
AND o4.datum <= o3.datum + 3
AND o4.datum >= o3.datum;
customer | product | datum | sale
----------+---------+------------+------
X | A | 2010-02-06 | NO
X | A | 2010-02-07 | NO
X | A | 2010-02-08 | NO
X | A | 2010-02-09 | NO