I've seen other posts on SO about getting one's head around PARTITION BY and ORDER BY. I kind of get it, but I'm still a bit confused.
Here is a query provided by my colleague that works:
SELECT EMAIL, SUBSCRIPTION_NAME, SOURCE, BILLING_SYSTEM,
RATE_PLAN, NEXT_CHARGE_DATE, SERVICE_ACTIVATION_DATE, CONTRACT_EFFECTIVE_DATE,
SUBSCRIPTION_END_DATE, STATUS, LAST_MODIFIED_DATE, PRODUCT_NAME,
RATE_PLAN_NAME, LOAD_DATE
FROM theDB
QUALIFY COUNT(*) OVER (PARTITION BY EMAIL,CONTRACT_EFFECTIVE_DATE ) > 1
Is this query saying, in plain English: return the selected fields only for records where the combination of EMAIL and CONTRACT_EFFECTIVE_DATE appears more than once?
Put another way, is it doing the same thing as this query, which does not run (I'm using Teradata and receive the error message "Improper use of aggregate function" - when I see that message, should I think "use QUALIFY and PARTITION BY"?):
SELECT EMAIL, SUBSCRIPTION_NAME, SOURCE, BILLING_SYSTEM,
RATE_PLAN, NEXT_CHARGE_DATE, SERVICE_ACTIVATION_DATE, CONTRACT_EFFECTIVE_DATE,
SUBSCRIPTION_END_DATE, STATUS, LAST_MODIFIED_DATE, PRODUCT_NAME,
RATE_PLAN_NAME, LOAD_DATE
FROM RDMATBLSANDBOX.TmpNIMSalesForceDB
WHERE COUNT(CONTRACT_EFFECTIVE_DATE) >1
GROUP BY EMAIL
Not quite. Your query, if it ran, would return one row per email (at least it would as MySQL interprets this non-standard syntax). The original version will return multiple rows for each email.
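For reference, the immediate cause of the "Improper use of aggregate function" error is that an aggregate such as COUNT() cannot appear in WHERE. A sketch of the legal GROUP BY form (which, again, returns only one row per EMAIL and CONTRACT_EFFECTIVE_DATE rather than the original detail rows; the alias dup_count is just illustrative):

SELECT EMAIL, CONTRACT_EFFECTIVE_DATE, COUNT(*) AS dup_count
FROM theDB
GROUP BY EMAIL, CONTRACT_EFFECTIVE_DATE
HAVING COUNT(*) > 1   -- the aggregate condition belongs in HAVING, not WHERE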
The equivalent query is essentially:
select q.*
from (<your query here>
) q join
(select EMAIL, CONTRACT_EFFECTIVE_DATE
from theDB
group by EMAIL, CONTRACT_EFFECTIVE_DATE
having count(*) > 1
) filter
on q.email = filter.email and q.CONTRACT_EFFECTIVE_DATE = filter.CONTRACT_EFFECTIVE_DATE;
There is a subtle difference, which is usually immaterial. The original QUALIFY version treats NULL values in either or both columns as a group of their own, so duplicated rows with NULLs are kept. This join version filters those rows out, even if there are duplicates, because the equality conditions never match NULLs.
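To make the NULL point concrete, here is a small sketch: the window count puts rows with a NULL EMAIL (or NULL date) into the same partition, so the QUALIFY version keeps them when they are duplicated, while the equality join never matches them.

-- sketch: duplicated rows with EMAIL = NULL get grp_cnt = 2 here,
-- so QUALIFY grp_cnt > 1 would keep them; the join version drops them
SELECT EMAIL, CONTRACT_EFFECTIVE_DATE,
       COUNT(*) OVER (PARTITION BY EMAIL, CONTRACT_EFFECTIVE_DATE) AS grp_cnt
FROM theDB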
EDIT:
If you just want the list of emails, use group by:
select email
from theDB t
where CONTRACT_EFFECTIVE_DATE between #start and #end
group by email
having count(*) = 5
(or whatever the specific conditions are).
If you need more information about the email or joins, join back to the original tables.
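A sketch of that join-back step, reusing the placeholders from the query above:

SELECT t.*
FROM theDB t
JOIN (SELECT email
      FROM theDB
      WHERE CONTRACT_EFFECTIVE_DATE between #start and #end
      GROUP BY email
      HAVING count(*) = 5
     ) e                         -- e holds the emails that meet the condition
  ON t.email = e.email           -- join back for the detail rows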
When you are comfortable with this process, you can think about using window/analytic functions to do the same thing. My concern is that the conditions that you really want may become more complicated and doing the logic in two steps (get the emails, get the additional information) will help you refine it.
Related
I have a table where I save authors and songs, along with other columns. The same song can appear multiple times, and it obviously always comes from the same author. I would like to select the author that has the fewest songs, counting the repeated ones, i.e. the one that is listened to the least.
The final table should show only one author name.
Clearly, one step is to find the count for every author, which can be done with an elementary aggregate query. Then, if you order by that count, you can just select the first row, which solves your problem. One approach is to use ROWNUM in an outer query. This is a very elementary approach, quite efficient, and it works in all versions of Oracle (it doesn't use any advanced features).
select author
from (
select author
from your_table
group by author
order by count(*)
)
where rownum = 1
;
Note that in the subquery we don't need to select the count (since we don't need it in the output). We can still use it in order by in the subquery, which is all we need it for.
The only tricky part here is to remember that you need to order the rows in the subquery, and then apply the ROWNUM filter in the outer query. This is because ORDER BY is the very last thing that is processed in any query - it comes after ROWNUM is assigned to rows in the output. So, moving the WHERE clause into the subquery (and doing everything in a single query, instead of a subquery and an outer query) does not work.
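For example, this sketch of the single-query version runs, but ROWNUM = 1 is applied to the raw rows before the grouping and ordering ever happen, so it returns an arbitrary author:

select author
from your_table
where rownum = 1      -- keeps one arbitrary row, then groups just that row
group by author
order by count(*);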
You can use analytical functions as follows:
Select * from
(Select t.*,
Row_number() over (order by cnt_author) as rn
From
(Select t.*,
Count(*) over (partition by author) as cnt_author
From your_table t) t ) t
Where rn = 1
I am trying to migrate to Standard SQL from BigQuery Legacy SQL. The Legacy product offered the ability to query "WITHIN RECORD" which came in handy on numerous occasions.
I am looking for an efficient alternative to WITHIN RECORD. I could always just use a few subqueries and join them but wondering if there may be a more efficient way using ARRAY + ORDINAL.
EXAMPLE: Consider the following Standard SQL
WITH
sessPageVideoPlays AS (
SELECT fullVisitorId, visitNumber, h.page.pagePath,
# This would previously use WITHIN RECORD in Legacy SQL:
ARRAY( SELECT eventInfo.eventAction FROM UNNEST(hits)
WHERE eventInfo.eventCategory="videoPlay"
ORDER BY hitNumber DESC
)[ORDINAL(1)] AS lastVideoSeen
FROM
`proj.ga_sessions`, UNNEST(hits) as h
GROUP BY fullVisitorId, visitNumber, h.page.pagePath, lastVideoSeen
)
SELECT
pagePath, lastVideoSeen, numOccur
FROM
(SELECT
pagePath, lastVideoSeen, count(1) numOccur
FROM
sessPageVideoPlays
GROUP BY
pagePath, lastVideoSeen
)
The resulting output (a table of pagePath, lastVideoSeen and numOccur) is omitted here.
Questions:
1) I would like to see the last video play event on a given page, which is what I used to accomplish using WITHIN RECORD, but am attempting the ARRAY + ORDINAL approach shown above. However, for this to work, I'm thinking the SELECT statement within ARRAY() must be synchronized to the outer record since it is now flattened? Is that accurate?
2) I would also like to get a COUNT of DISTINCT videos played on a given page, and am wondering if the more efficient approach would be joining to a separate query OR inserting another inline aggregate function, as done with ARRAY above.
Any suggestions would be appreciated.
1) I would like to see the last video play event on a given page, which is what I used to accomplish using WITHIN RECORD, but am attempting the ARRAY + ORDINAL approach shown above. However, for this to work, I'm thinking the SELECT statement within ARRAY() must be synchronized to the outer record since it is now flattened? Is that accurate?
I think that is correct. With your query, the UNNEST(hits) in the inner query would be independent of the outer UNNEST, which is probably not what you wanted.
I think maybe one way to write it is this:
WITH
sessPageVideoPlays AS (
SELECT fullVisitorId, visitNumber,
ARRAY(
SELECT AS STRUCT pagePath, lastVideoSeen FROM (
SELECT
page.pagePath,
eventInfo.eventAction AS lastVideoSeen,
ROW_NUMBER() OVER (PARTITION BY page.pagePath ORDER BY hitNumber DESC) AS rank
FROM UNNEST(hits)
WHERE eventInfo.eventCategory="videoPlay")
WHERE rank = 1
) AS lastVideoSeenOnPage
FROM
`proj.ga_sessions`
)
SELECT
pagePath, lastVideoSeen, numOccur
FROM (
SELECT
pagePath, lastVideoSeen, count(1) numOccur
FROM
sessPageVideoPlays, UNNEST(lastVideoSeenOnPage)
GROUP BY
pagePath, lastVideoSeen
)
2) I would also like to get a COUNT of DISTINCT videos played on a given page, and am wondering if the more efficient approach would be joining to a separate query OR inserting another inline aggregate function, as done with ARRAY above.
I think both are OK, but inserting another inline aggregate function keeps the evaluations closer together, so it might be a bit easier for the query engine to optimize if there is an opportunity to do so.
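For illustration, a minimal sketch of the inline option (the alias distinctVideosInSession is mine): a second subquery over the same hits array counts distinct videoPlay actions per session; scoping the count to a single page would need the same per-page handling as the rewritten query above.

SELECT
  fullVisitorId,
  visitNumber,
  -- inline aggregate over the same hits array the ARRAY() expression uses
  (SELECT COUNT(DISTINCT eventInfo.eventAction)
   FROM UNNEST(hits)
   WHERE eventInfo.eventCategory = "videoPlay") AS distinctVideosInSession
FROM
  `proj.ga_sessions`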
I'm trying to figure the best way to perform this query in postgresql. I have a messages table and I want to grab the last message a user received from each distinct user. I need to select everything from the row.
I would think this is where I want to group by the sender's id, "msgfromid", but when I do that it complains that I haven't included everything from my SELECT statement in my GROUP BY statement, and I only want to group by that one column, not all of them. If I instead try DISTINCT ON with that one column, it forces me to order by the DISTINCT ON column ("msgfromid") first, which prevents me from getting the correct row, since I need the rows ordered by when the sender's last message was sent ("msgsenttime").
My goal is to make this as efficient as possible on my server and database.
This is what I have right now, not working. (This is a sub-query of another query I use to join relevant information afterwards but I figure that is irrelevant)
SELECT DISTINCT ON ("msgfromid") "msgfromid", "msgid", "msgtoid", "msgsenttime", "msgreadtime", "msgcontent", "msgreportstatus", "senderun", "recipientun"
FROM "messages"
WHERE "msgtoid" = ?
ORDER BY "msgsenttime" desc, "msgfromid"
I thought that if I pre-ordered them in a sub-query it would work, but it just seems to pick one at random anyway. And even if it did work, it can't be very efficient, since I'm pulling every message out to begin with, right?
SELECT DISTINCT ON ("msgfromid") "msgfromid", "msgid", "msgtoid", "msgsenttime", "msgreadtime", "msgcontent", "msgreportstatus", "senderun", "recipientun"
FROM
(
SELECT * FROM "messages"
WHERE "msgtoid" = ?
ORDER BY "msgsenttime" desc
) as "mqo"
Thanks for any help.
Your order by keys are in the wrong order:
SELECT DISTINCT ON ("msgfromid") m.*
FROM "messages" m
WHERE "msgtoid" = ?
ORDER BY "msgfromid", "msgsenttime" desc;
For DISTINCT ON, the keys in parentheses need to be the first keys in the ORDER BY.
If you want the final result ordered in a different way, then you need to use a subquery, with a different ORDER BY on the outer query.
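For example, a sketch that keeps the latest message per sender but lists the newest conversations first:

SELECT *
FROM (
    SELECT DISTINCT ON ("msgfromid") m.*
    FROM "messages" m
    WHERE "msgtoid" = ?
    ORDER BY "msgfromid", "msgsenttime" desc
) last_per_sender            -- the inner query picks one row per sender
ORDER BY "msgsenttime" desc;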
I have a table with 3 columns:
userid mac_address count
The entries for one user could look like this:
57193 001122334455 42
57193 000C6ED211E6 15
57193 FFFFFFFFFFFF 2
I want to create a view that displays only those MACs that are considered "commonly used" for this user. For example, I want to filter out the MACs that are used less than 10% as often as the most-used MAC address for that user. Furthermore, I want one row per user. This could easily be achieved with GROUP BY, HAVING & GROUP_CONCAT:
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
And indeed, the result is as follows:
57193 001122334455,000C6ED211E6 42
However I really don't want the count-column in my view. But if I take it out of the SELECT statement, I get the following error:
#1054 - Unknown column 'count' in 'having clause'
Is there any way I can perform this operation without being forced to have a nasty count-column in my view? I know I can probably do it using inner queries, but I would like to avoid doing that for performance reasons.
Your help is very much appreciated!
As HAVING explicitly refers to the column names in the select list, what you want is not possible.
However, you can use your select as a subselect of an outer select that returns only the columns you want.
SELECT a.userid, a.macs
FROM
(
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
) as a
UPDATE:
Because of a limitation of MySQL (a view's FROM clause cannot contain a subquery), this is not possible in a view definition, although it works in other DBMSs such as Oracle.
One solution would be to create a view for the subquery. Another solution seems cleaner:
CREATE VIEW YOUR_VIEW (userid, macs) AS
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
This will declare the view as returning only the columns userid and macs although the underlying SELECT statement returns more columns than those two.
Although I am not sure whether MySQL supports this or not...
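For completeness, a sketch of the other workaround mentioned above (putting the subquery into its own view; the view names are mine):

CREATE VIEW mac_counts AS
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count);

CREATE VIEW common_macs AS
SELECT userid, macs          -- expose only the two columns you want
FROM mac_counts;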
You cannot (should not) put non-aggregates in the SELECT line of a GROUP BY query.
I would, however, like to access one of the non-aggregates associated with the max. In plain English, I want a table with the oldest id of each kind.
CREATE TABLE stuff (
id int,
kind int,
age int
);
This query gives me the information I'm after:
SELECT kind, MAX(age)
FROM stuff
GROUP BY kind;
But it's not in the most useful form. I really want the id associated with each row so I can use it in later queries.
I'm looking for something like this:
SELECT id, kind, MAX(age)
FROM stuff
GROUP BY kind;
The best I've been able to come up with is this join, which does output what I'm after:
SELECT stuff.*
FROM
stuff,
( SELECT kind, MAX(age)
FROM stuff
GROUP BY kind) maxes
WHERE
stuff.kind = maxes.kind AND
stuff.age = maxes.age
It really seems like there should be a way to get this information without needing to join. I just need the SQL engine to remember the other columns when it's calculating the max.
You can't get the id of the row that MAX found, because there might be more than one id with the maximum age.
You cannot (should not) put non-aggregates in the SELECT line of a GROUP BY query.
You can, and have to, define what you are grouping by for the aggregate function to return the correct result.
MySQL (and SQLite) decided in their infinite wisdom to go against the spec and accept GROUP BY clauses that omit columns listed in the SELECT - this effectively makes such queries non-portable.
It really seems like there should be a way to get this information without needing to join.
Without access to the analytic/ranking/windowing functions that MySQL doesn't support, the self join to a derived table/inline view is the most portable means of getting the result you desire.
I think it is indeed tempting to ask the system to solve the problem in one pass rather than having to do the job twice (find the max, then find the corresponding id). You can do it using CONCAT (as suggested in the article Naktibalda referred to); I'm not sure it would be more efficient:
SELECT MAX(CONCAT(LPAD(age, 10, '0'), '-', id))
FROM stuff
GROUP BY kind;
This should work; you then have to split the result to get the age and the id back out.
(That's really ugly though)
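A sketch of the splitting step in MySQL (the aliases are mine):

SELECT kind,
       -- everything after the '-' separator is the id
       SUBSTRING_INDEX(MAX(CONCAT(LPAD(age, 10, '0'), '-', id)), '-', -1) AS id,
       -- the first 10 zero-padded characters are the age
       CAST(LEFT(MAX(CONCAT(LPAD(age, 10, '0'), '-', id)), 10) AS UNSIGNED) AS max_age
FROM stuff
GROUP BY kind;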
In more recent databases you can use window functions such as max() over (partition by ...) to solve this problem:
select id, kind, age as max_age from (
    select id, kind, age, max(age) over (partition by kind) as mage
    from stuff
) t
where age = mage
This can then be done in a single pass.
PostgreSQL's DISTINCT ON will be useful here.
SELECT DISTINCT ON (kind) kind, id, age
FROM stuff
ORDER BY kind, age DESC;
This groups by kind and returns the first row per group in the given order. As we have ordered by age in descending order, we get the row with the max age for each kind.
P.S. The columns in DISTINCT ON must appear first in the ORDER BY.
You have to have a join because the aggregate function MAX scans many rows and picks out the maximum value.
So you need a join to retrieve the row that the aggregate function has found.
To put it a different way: how would you expect the query to behave if you replaced MAX with SUM?
An inner join might be more efficient than your sub query though.
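For example, the derived-table lookup written as an explicit inner join:

SELECT s.*
FROM stuff s
INNER JOIN (SELECT kind, MAX(age) AS max_age
            FROM stuff
            GROUP BY kind) m
        ON s.kind = m.kind
       AND s.age = m.max_age;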