Using analytical Count(distinct) on Vertica is not supported - sql

After thorough Google research, it seems that Vertica simply does not support count(distinct <col>) over(<partition by>), as it raises:
"ERROR 4249: Only MIN/MAX are allowed to use DISTINCT"
I'm looking for an easy workaround for this one.
Meanwhile, I'm using joins or nested queries.
For example:
select campaign_id, segment_id, COUNT(DECODE(rank, 1, 1, NULL)) over ()
from (select campaign_id, segment_id,
             row_number() over (partition by segment_id order by segment_id) as rank
      from cs) t
But my query is very long and I need to invent tricks all over the way. Any idea for a better approach?
Thanks!
(Working at HPE? Please implement this, as you did for all the other common analytical functions!)

I had to build a similar nested counting structure for counting distinct values cumulatively over a date range. It boiled down to the same gathering up of ROW_NUMBER() = 1 rows, though I used CASE:
COUNT(CASE WHEN rank = 1 THEN userID END) OVER (...)
It wasn't pretty to look at, but it was mercifully not slow.
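In general the pattern looks like this (a minimal sketch; the table events and its columns campaign_id, segment_id are made up to mirror the question): ROW_NUMBER() partitioned by both the group and the value marks exactly one row per distinct value, and a windowed count of the rank-1 rows then emulates COUNT(DISTINCT ...) OVER (PARTITION BY ...).
-- Emulates COUNT(DISTINCT segment_id) OVER (PARTITION BY campaign_id)
SELECT campaign_id,
       segment_id,
       SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END)
           OVER (PARTITION BY campaign_id) AS distinct_segments
FROM (
    SELECT campaign_id,
           segment_id,
           ROW_NUMBER() OVER (PARTITION BY campaign_id, segment_id
                              ORDER BY segment_id) AS rn
    FROM events
) t;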
"I need to invent tricks all over the way"
Yeah, I think that just happens when you bump into missing features.

Related

dynamically select arbitrary n columns with window function

Abstract of my real select statement:
select
lag(somecol) over (partition by thing_id) as prev_1,
lag(somecol,2) over (partition by thing_id) as prev_2,
lag(somecol,3) over (partition by thing_id) as prev_3,
othercol,
...
In the real query the over is much more complex which leads to pretty dense, unreadable code. Additionally, getting the last 3 rows is hardcoded (vs n=whatever).
Is there any way in straight SQL to iteratively or recursively specify these prev_x columns, so that 1) the code is more readable and 2) you could dynamically specify the number n of prev cols?
To answer just the first question, on making the code more readable: Postgres lets you define the window once, name it, and then reference it several times in the query.
See the docs for Window Functions:
When a query involves multiple window functions, it is possible to write out each one with a separate OVER clause, but this is duplicative and error-prone if the same windowing behavior is wanted for several functions. Instead, each windowing behavior can be named in a WINDOW clause and then referenced in OVER. For example:
SELECT sum(salary) OVER w, avg(salary) OVER w
FROM empsalary
WINDOW w AS (PARTITION BY depname ORDER BY salary DESC);
The WINDOW clause is part of the SQL standard, but support varies between products; SQL Server, for example, did not support it until SQL Server 2022.
So, your query would look like this:
select
lag(somecol) over w as prev_1,
lag(somecol,2) over w as prev_2,
lag(somecol,3) over w as prev_3,
othercol,
...
from
...
WINDOW w AS (partition by thing_id)
;
Regarding your second question, how to "dynamically specify the number n of prev cols": you'll need to generate the text of the SELECT statement dynamically to achieve that. Relational databases assume a stable schema, i.e. the number of columns in tables and queries is fixed when the statement is parsed, not dynamic.
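If you do go the dynamic route, here is a minimal sketch (plain Postgres SQL; the table things and the columns somecol, othercol, thing_id mirror the abstract query above and are otherwise hypothetical) of a helper that builds the SELECT text for n previous columns:
-- Builds the text of a SELECT with n LAG(...) columns over a named window.
-- The generated string is then run via EXECUTE in PL/pgSQL or client-side.
CREATE OR REPLACE FUNCTION prev_cols_sql(n int)
RETURNS text
LANGUAGE sql
AS $$
    SELECT 'select '
           || string_agg(format('lag(somecol, %s) over w as prev_%s', i, i),
                         ', ' ORDER BY i)
           || ', othercol from things window w as (partition by thing_id)'
    FROM generate_series(1, n) AS g(i);
$$;

-- SELECT prev_cols_sql(3);  -- produces the query text with prev_1 .. prev_3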

How do I find the previous line without writing an inefficient subquery?

So I have this query (and I have encountered or written a bunch of similar ones in my life :) ), which is extremely inefficient because of the correlated subquery.
I'm running PostgreSQL currently, but I have had this issue with MySQL and MSSQL as well.
Sometimes I can use MAX(), but here I have two different columns: runners.id (the one I need to find my data by) and runners.updated_at (the one I could do MAX() on).
Any tips?
SELECT
    ROUND(CAST(AVG(DATE_PART('day', current_claim_event.updated_at - claims.created_at)) AS NUMERIC), 1) AS average,
    count(*)
FROM claims_events current_claim_event
INNER JOIN claims ON claims.id = current_claim_event.claim_id
WHERE current_claim_event.id = (
    SELECT runners.id
    FROM claims_events runners
    WHERE runners.claim_id = current_claim_event.claim_id
    ORDER BY runners.updated_at DESC
    LIMIT 1
);
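One common rewrite (just a sketch; whether it is actually faster depends on the planner and your indexes) is to rank the events per claim with a window function and keep only the latest one, instead of using the correlated LIMIT 1 subquery:
-- Same aggregation, but the latest event per claim is picked with ROW_NUMBER().
SELECT
    ROUND(CAST(AVG(DATE_PART('day', e.updated_at - c.created_at)) AS NUMERIC), 1) AS average,
    count(*)
FROM (
    SELECT ce.*,
           ROW_NUMBER() OVER (PARTITION BY ce.claim_id
                              ORDER BY ce.updated_at DESC) AS rn
    FROM claims_events ce
) e
INNER JOIN claims c ON c.id = e.claim_id
WHERE e.rn = 1;
On Postgres specifically, SELECT DISTINCT ON (claim_id) ... ORDER BY claim_id, updated_at DESC is another common way to keep only the latest row per claim.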

Using part of the select clause without rewriting it

I am using an Oracle database and I am trying to count the number of terms starting with each letter in a dictionary.
Here is my query:
SELECT Substr(Lower(Dict.Term),0,1) AS Initialchar,
Count(Lower(Dict.Term))
FROM Dict
GROUP BY Substr(Lower(Dict.Term),0,1)
ORDER BY Substr(Lower(Dict.Term),0,1);
This query works as expected, but the thing I'm not really happy about is that I have to repeat the long "Substr(Lower(Dict.Term),0,1)" in the GROUP BY and ORDER BY clauses. Is there any way to reuse the one I defined in the SELECT part?
Thanks
You can use a subquery. Note that, following the SQL standard, Oracle's substr() starts counting at 1. Oracle does explicitly allow 0 ("If position is 0, then it is treated as 1"), but I find that misleading, because 0 and 1 then refer to the same position.
So:
select first_letter, count(*)
from (select d.*, substr(lower(d.term), 1, 1) as first_letter
from dict d
) d
group by first_letter
order by first_letter;
Not directly. Output column aliases can only be referred to in the ORDER BY clause, not used in any other way. The only alternative would be to make it a subselect, but that wouldn't be any clearer and might cause performance issues.
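For instance, a shortened form that relies on the alias only where Oracle allows it (using the question's table):
SELECT SUBSTR(LOWER(term), 1, 1) AS initialchar,
       COUNT(*)
FROM dict
GROUP BY SUBSTR(LOWER(term), 1, 1)   -- alias not allowed here
ORDER BY initialchar;                -- alias allowed here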
I prefer subquery factoring for this purpose.
with init as (
select substr(lower(d.term), 1, 1) as Initialchar
from dict d)
select Initialchar, count(*)
from init
group by Initialchar
order by Initialchar;
Contrary to the concern above, IMO this makes the query much clearer and defines a natural order, especially when more subqueries are used.
I'm not aware of any performance caveats, but there are some limitations, such as not being able to use a WITH clause inside another WITH clause: ORA-32034: unsupported use of WITH clause.

JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions

How would you overcome the above restriction?
I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and then calculate some aggregations (count, sum, etc.) of their attributes.
When I run my queries on a small sample of data, everything is fine and the GROUP BY runs OK, but when running on a larger data set I get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."
In many other cases switching to GROUP EACH BY does the trick...
However, as I use window functions, I cannot use EACH...
Any suggestions? Best practices?
Here is a sample query based on the Wikipedia sample data. It shows how often a title is edited by different contributors. The WHERE condition is just there to limit the response size: with only the "A" we get results, but if we add the "B" we get the "use EACH" recommendation.
SELECT title,
       COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 ELSE NULL END) AS different,
       COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 ELSE NULL END) AS same,
       COUNT(*) AS total
FROM (
  SELECT title, contributor_id,
         LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
  FROM [publicdata:samples.wikipedia]
  WHERE REGEXP_MATCH(title, r'^[A,B]') = true
)
GROUP BY title
Thanks
I guess your particular use case is different to the sample query, but let me comment on what I'm able to see:
You found a way to make GROUP EACH and OVER possible: Surrounding the OVER() query with another one allows you to change the GROUP BY to GROUP EACH BY. However, this query's problem is not there.
Let's forget about GROUP and GROUP EACH. Let's look at the core query:
SELECT title, contributor_id, LEAD(contributor_id)
OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
FROM [publicdata:samples.wikipedia]
WHERE REGEXP_MATCH(title, r'^[A,B]')
This query fails with r'^[A,B]' and works with r'^[A]', and it highlights an OVER() limitation: like GROUP BY and ORDER BY, it only works when the data fits in one machine, because these operations are not parallelizable. As the r'^[A]' answer shows, that can still be a lot of data, though sometimes not enough. That's why BigQuery offers the parallelizable GROUP EACH BY. However, there is no parallelizable OVER EACH BY we could use here.
The workaround I would apply here is exactly what you are doing: Do the OVER() with just a fraction of the data.
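A sketch of that idea in legacy BigQuery SQL (which the question uses): because the OVER() is partitioned by title, slicing the table by first letter never splits a partition, so each slice can be computed separately and the slices combined with a comma (legacy SQL's UNION ALL) before the GROUP EACH BY:
SELECT title,
       COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 END) AS different,
       COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 END) AS same,
       COUNT(*) AS total
FROM
  -- each slice runs its own OVER() on a fraction of the data
  (SELECT title, contributor_id,
          LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
   FROM [publicdata:samples.wikipedia]
   WHERE REGEXP_MATCH(title, r'^A')),
  (SELECT title, contributor_id,
          LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
   FROM [publicdata:samples.wikipedia]
   WHERE REGEXP_MATCH(title, r'^B'))
GROUP EACH BY title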
(btw, let me say I love the sample query... it's an interesting question with an interesting answer!)

display the cab with the highest overall maintenance cost

I'm not sure how to get the max of the sum. I thought I could just display it in descending order and then use "rownum = 1", but that didn't work. Any suggestions? Here's my code.
select ca_make, sum(ma_cost)
from cab join maintain on ca_cabnum = ma_cabnum
Where rownum =1
group by ca_make
order by sum(ma_cost) desc
ROWNUM is assigned before the ORDER BY (and the GROUP BY) is applied, so WHERE rownum = 1 just keeps one arbitrary joined row. You need to use a sub-query:
select * from (
select ca_make, sum(ma_cost)
from cab join maintain on ca_cabnum = ma_cabnum
group by ca_make
order by sum(ma_cost) desc
)
where rownum = 1
There are several different ways of implementing [top-n] queries in Oracle. Find out more by searching SO for [oracle] [top-n].
First of all, you probably want to use LEFT JOIN. By using JOIN, you're excluding all cabs that have had no maintenance at all. (obviously, this won't matter for finding the highest cost, but it will make a difference when looking for the lowest cost; and it would seriously skew any stats you tried to compile from this query).
Now, to answer your question... try this:
select * from
(select ca_make, sum(ma_cost)
from cab
left join maintain on ca_cabnum = ma_cabnum
group by ca_make
order by sum(ma_cost) desc)
where rownum = 1
Here is a good explanation of ROWNUM. Your case is addressed specifically, a little less than half-way down the page (but the whole page is probably worth a read if you're going to use the feature).
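For completeness, here is a sketch of the same top-1 query using analytic ranking instead of ROWNUM (table and column names follow the question); RANK() also keeps ties for the top spot, and on Oracle 12c or later FETCH FIRST 1 ROW ONLY is yet another option:
-- Rank the per-make totals, then keep the top-ranked make(s).
SELECT ca_make, total_cost
FROM (
    SELECT ca_make,
           SUM(ma_cost) AS total_cost,
           RANK() OVER (ORDER BY SUM(ma_cost) DESC) AS rnk
    FROM cab
    LEFT JOIN maintain ON ca_cabnum = ma_cabnum
    GROUP BY ca_make
)
WHERE rnk = 1;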