Multiple group by keeping their orders - sql

Let's consider this example :
Clients Routes City Timestamp
1 10 NY 0
1 11 NY 10
1 12 WDC 11
1 13 NY 20
2 22 LA 15
What I want as an output is something like this :
Clients Routes_number City min(Timestamp)
1 2 NY 0
1 1 WDC 11
1 1 NY 20
2 1 LA 15
The idea is to do a group by that preserves the order of the rows. For example, looking at the cities for client 1, we can see that he travelled NY -> WDC -> NY (in the same day). So I want a group by that counts the routes and takes the minimum timestamp, but starts a new group EACH TIME the city changes. If I do a global group by, I get something like this:
Clients Routes_number City min(Timestamp)
1 3 NY 0
1 1 WDC 11
2 1 LA 15
With an output like this, we lose the information that the client went NY -> WDC and then NY AGAIN. It looks as if he only went NY -> WDC one way...
I don't even know if it's possible to do such a request in SQL or if I have to do it in my code (I am a newbie in Spark & Scala, but Scala is the language that I use).
Thank you !

This is a gaps-and-islands problem that can be solved with a difference of row numbers:
select client, city, count(*), min(timestamp)
from (select t.*,
             row_number() over (partition by client, city order by timestamp) as seqnum_1,
             row_number() over (partition by client order by timestamp) as seqnum_2
      from t
     ) t
group by client, city, (seqnum_2 - seqnum_1);
Here is a db<>fiddle.
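It can be tricky to see how the difference of row numbers identifies adjacent rows with the same city value. Within a run of adjacent rows for one city, both row numbers increase by one per row, so their difference stays constant. When another city comes in between, seqnum_2 keeps advancing while seqnum_1 does not, so the next run of that city gets a different difference. If you look at the results of the subquery, you'll get a good idea of how it works.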
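To make the subquery results easier to inspect, here is a minimal sketch that runs the same query on the question's sample data. It assumes an in-memory SQLite database (3.25 or later, for window functions) driven from Python's `sqlite3` module. The table name `t` follows the answer, and the column is called `ts` instead of `timestamp` purely for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (client INTEGER, route INTEGER, city TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, 10, "NY", 0), (1, 11, "NY", 10), (1, 12, "WDC", 11),
     (1, 13, "NY", 20), (2, 22, "LA", 15)],
)

# The subquery from the answer: two row numbers per row.
numbered = """
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY client, city ORDER BY ts) AS seqnum_1,
           ROW_NUMBER() OVER (PARTITION BY client ORDER BY ts) AS seqnum_2
    FROM t
"""

# Inspect the difference: it is constant within each run of one city.
islands = conn.execute(
    f"SELECT client, city, ts, seqnum_2 - seqnum_1 AS grp "
    f"FROM ({numbered}) ORDER BY client, ts"
).fetchall()

# Group on the difference to get one output row per run.
rows = conn.execute(
    f"""
    SELECT client, COUNT(*) AS routes_number, city, MIN(ts) AS min_ts
    FROM ({numbered})
    GROUP BY client, city, seqnum_2 - seqnum_1
    ORDER BY client, min_ts
    """
).fetchall()

for row in islands:
    print(row)
for row in rows:
    print(row)
```

For client 1 the differences come out as 0, 0, 2, 1: the two NY runs get different values (0 and 1), which is what keeps them apart in the final group by.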

Related

SQL Query: Combining group by with different record entries

Given the following table and records:
location interaction session
us 5 xyz
us 10 xyz
us 20 xyz
us 5 qrs
us 10 qrs
us 20 qrs
de 5 abc
de 10 abc
de 20 abc
fr 5 mno
fr 10 mno
I'm trying to create a query that will get a count of locations for all sessions that have interactions of 5 and 10, but NOT 20.
So assuming the above, the query will return
count location
2 us
1 de
FR will not be in the results, because session 'mno' did not have an interaction of 20.
As far as I can tell, I need to group by session first, then group by location afterwards. Which might mean using a nested select statement. I've tried a few things, but am not sure how to proceed. Would appreciate any help on how to approach a query like this.
Maybe your example is wrong: if a session must have interactions 5 and 10 but not 20, then only 'fr' should be in the result, since every 'us' and 'de' session does have 20.
SQL fiddle here.
with t as
(
  select location, array_agg(interaction) arr
  from the_table
  group by location, session
)
select count(*) cnt, location
from t
where arr @> array[5,10] and not arr @> array[20]
group by location;
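The query above relies on PostgreSQL arrays. As a rough, portable sketch of the same filter, the per-session check can also be written with conditional aggregation in a HAVING clause. This is a swapped-in technique, not the answer's own. The example assumes an in-memory SQLite database via Python's `sqlite3` and reuses the table name `the_table` from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (location TEXT, interaction INTEGER, session TEXT)")

# Sample data from the question: (location, session, interactions seen).
sessions = [
    ("us", "xyz", (5, 10, 20)),
    ("us", "qrs", (5, 10, 20)),
    ("de", "abc", (5, 10, 20)),
    ("fr", "mno", (5, 10)),
]
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [(loc, i, sess) for loc, sess, inters in sessions for i in inters],
)

rows = conn.execute("""
    SELECT COUNT(*) AS cnt, location
    FROM (
        -- one row per session that has 5 and 10 but never 20
        SELECT location, session
        FROM the_table
        GROUP BY location, session
        HAVING SUM(CASE WHEN interaction = 5  THEN 1 ELSE 0 END) > 0
           AND SUM(CASE WHEN interaction = 10 THEN 1 ELSE 0 END) > 0
           AND SUM(CASE WHEN interaction = 20 THEN 1 ELSE 0 END) = 0
    ) AS s
    GROUP BY location
""").fetchall()

print(rows)
```

On the question's data only the 'fr' session qualifies, which matches the remark that the expected output in the question looks inverted.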

Need Distinct address, ID etc with the different amount in one table using Plsql

Need help on the below scenario, please.
I want distinct address, ID, etc with the different amount in one table using plsql or
For example below is the current table
Address aRea zipcode ID Amount amount2 qua number
123 Howe's drive AL 1234 1234567 100 20 1 666666
123 Howe's drive AL 1234 1234567 5 05 2 abcccc
123 east drive AZ 456 8910112 200 11 1 777777
123 east drive AZ 456 8910112 5 5 2 SDN133
116 WOOD Ave NL 1234 2325890 3.23 1.25 1 10483210
116 WOOD Ave NL 1234 2325890 3.24 1.26 2 10483211
I need the output as below.
Address aRea zipcode ID Amount amount2 qua number
123 Howe's drive AL 1234 1234567 100 20 1 666666
5 05 2 abcccc
123 east drive AZ 456 8910112 200 11 1 777777
5 5 2 SDN133
116 WOOD Ave NL 1234 2325890 3.23 1.25 1 10483210
3.24 1.26 2 10483211
The following is for BigQuery Standard SQL.
I would recommend this approach:
#standardSQL
SELECT address, area, zipcode, id,
ARRAY_AGG(STRUCT(amount, amount2, qua, number)) info
FROM `project.dataset.table`
GROUP BY address, area, zipcode, id
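Applied to the sample data from your question, the output has one row per distinct address, area, zipcode and id, with the amount, amount2, qua and number values nested in the info array.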
This type of task is usually better done on the application side.
You can do this in SQL, but you need a column that orders the records consistently:
select
    case when rn = 1 then address end as address,
    case when rn = 1 then area end as area,
    case when rn = 1 then zipcode end as zipcode,
    case when rn = 1 then id end as id,
    amount,
    amount2,
    qua,
    number
from (
    select
        t.*,
        row_number() over(
            partition by address, area, zipcode, id
            order by ??
        ) rn
    from mytable t
) t
-- qualify with t so the sort uses the underlying columns, not the blanked-out aliases
order by t.address, t.area, t.zipcode, t.id, ??
The partition by clause of the window function lists the columns that you want to "group" together; you can modify it as you really need.
The order by clause of row_number() indicates how rows should be sorted within the partition: you need to decide which column (or set of columns) you want to use. For the output to make sense, you also need an order by clause in the query, where the partition and ordering columns are repeated.
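As an illustration of the row_number() approach, here is a small sketch on a trimmed version of the sample data. Several things in it are assumptions: `qua` is used as the ordering column in place of the `??` placeholder, only four of the columns are kept to stay short, and the database is an in-memory SQLite (3.25+) driven from Python's `sqlite3`. The final ORDER BY qualifies the columns with the subquery alias so that it sorts on the underlying values rather than on the blanked-out output columns of the same name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (address TEXT, id INTEGER, amount REAL, qua INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        ("123 Howe's drive", 1234567, 100, 1),
        ("123 Howe's drive", 1234567, 5, 2),
        ("123 east drive", 8910112, 200, 1),
        ("123 east drive", 8910112, 5, 2),
        ("116 WOOD Ave", 2325890, 3.23, 1),
        ("116 WOOD Ave", 2325890, 3.24, 2),
    ],
)

rows = conn.execute("""
    SELECT CASE WHEN rn = 1 THEN address END AS address,
           CASE WHEN rn = 1 THEN id END AS id,
           amount,
           qua
    FROM (
        SELECT m.*,
               -- ordering by qua is an assumption standing in for ??
               ROW_NUMBER() OVER (PARTITION BY address, id ORDER BY qua) AS rn
        FROM mytable m
    ) t
    ORDER BY t.address, t.id, t.qua
""").fetchall()

for row in rows:
    print(row)
```

Only the first row of each group carries the address and id; the following rows show NULL (None in Python) in those columns.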

Limit column value repeats to top 2

So I have this query:
SELECT
Search.USER_ID,
Search.SEARCH_TERM,
COUNT(*) AS Search.count
FROM Search
GROUP BY 1,2
ORDER BY 3 DESC
Which returns a response that looks like this:
USER_ID SEARCH_TERM count
bob dog 50
bob cat 45
sally cat 38
john mouse 30
sally turtle 10
sally lion 5
john zebra 3
john leopard 1
And my question is: How would I change the query, so that it only returns the top 2 most-searched-for-terms for any given user? So in the example above, the last row for Sally would be dropped, and the last row for John would also be dropped, leaving a total of 6 rows; 2 for each user, like so:
USER_ID SEARCH_TERM count
bob dog 50
bob cat 45
sally cat 38
john mouse 30
sally turtle 10
john zebra 3
In SQL Server, you can put the original query into a CTE and add the ROW_NUMBER() function to it. Then, in the new main query, just add a WHERE clause to limit by the row number. Your query would look something like this:
;WITH OriginalQuery AS
(
    SELECT
        s.[User_id]
        ,s.Search_Term
        ,COUNT(*) AS 'count'
        ,ROW_NUMBER() OVER (PARTITION BY s.[USER_ID] ORDER BY COUNT(*) DESC) AS rn
    FROM Search s
    GROUP BY s.[User_id], s.Search_Term
)
SELECT oq.User_id
    ,oq.Search_Term
    ,oq.count
FROM OriginalQuery oq
WHERE rn <= 2
ORDER BY oq.count DESC
EDIT: I specified SQL Server as the dbms I used here, but the above should be ANSI-compliant and work in Snowflake.
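A quick way to try the idea without SQL Server or Snowflake is the sketch below. It assumes an in-memory SQLite database (3.25+) via Python's `sqlite3`, uses made-up small counts instead of the question's numbers, and splits the counting and the ranking into two CTEs (`counted` and `ranked`, both hypothetical names), which is a slightly more conservative shape than the single CTE above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search (user_id TEXT, search_term TEXT)")

# Hypothetical search log: each (user, term) pair repeated n times.
counts = {
    ("bob", "dog"): 3, ("bob", "cat"): 2, ("bob", "fish"): 1,
    ("sally", "cat"): 3, ("sally", "turtle"): 2, ("sally", "lion"): 1,
    ("john", "mouse"): 3, ("john", "zebra"): 2, ("john", "leopard"): 1,
}
conn.executemany(
    "INSERT INTO search VALUES (?, ?)",
    [pair for pair, n in counts.items() for _ in range(n)],
)

rows = conn.execute("""
    WITH counted AS (
        SELECT user_id, search_term, COUNT(*) AS cnt
        FROM search
        GROUP BY user_id, search_term
    ),
    ranked AS (
        SELECT counted.*,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY cnt DESC) AS rn
        FROM counted
    )
    SELECT user_id, search_term, cnt
    FROM ranked
    WHERE rn <= 2
    ORDER BY cnt DESC, user_id
""").fetchall()

for row in rows:
    print(row)
```

Each user keeps at most two rows. If two terms tie on count, ROW_NUMBER() picks between them arbitrarily unless a tiebreaker column is added to the window's ORDER BY.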

Selecting Properties from Property_Features list

I have two SQL tables:
PROPERTY
PID Address
1 123 Center Road
2 23 North Road
3 3a/34 Crest Avenue
5 49 Large Road
6 2 Kingston Way
7 4/232 Center Road
8 2/19 Ash Grove
9 54 Vintage Street
10 15 Charming Street
PROPERTY_FEATURE
P.PID Feature
1 Wine Cellar
1 Helipad
2 Tennis Court
2 Showroom
7 Swimming Pool - Above Ground
9 Swimming Pool - Below Ground
9 Wine Cellar
I want to Select the properties which contains specific features. For example, I would like to select the property ID which has the features Wine Cellar and Helipad, it would return the Property with the ID of 1.
Any ideas?
You can do this using Group By and Having clause
select PID
From PROPERTY_FEATURE
Group by PID
Having COUNT(case when Feature = 'Wine Cellar' then 1 end) > 0 -- 1
   and COUNT(case when Feature = 'Helipad' then 1 end) > 0 -- 2
1) COUNT only counts rows where Feature = 'Wine Cellar', and > 0 makes sure at least one 'Wine Cellar' row exists for the PID.
2) COUNT only counts rows where Feature = 'Helipad', and > 0 makes sure at least one 'Helipad' row exists for the PID.
AND makes sure both 1 and 2 are satisfied before the PID is returned.
You can do this by filtering on the required features, and then grouping and counting in a HAVING clause. You could also group directly (without filtering first) but if the table is very large, with many pid's, that will result in a lot of unnecessary grouping of rows that won't be used in the end.
Something like this:
select pid
from property_feature
where feature in ('Wine Cellar', 'Helipad')
group by pid
having count(feature) = 2;
This assumes there are no duplicates in the table (so you can't have 1 'Helipad' twice, messing up the count). If there can be duplicates, change the last line to count (distinct feature) = 2.
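Both HAVING variants can be checked side by side on the question's PROPERTY_FEATURE rows. The sketch below assumes an in-memory SQLite database via Python's `sqlite3`, and uses the `count(distinct feature)` form for the second query so that it also tolerates duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE property_feature (pid INTEGER, feature TEXT)")
conn.executemany(
    "INSERT INTO property_feature VALUES (?, ?)",
    [
        (1, "Wine Cellar"), (1, "Helipad"),
        (2, "Tennis Court"), (2, "Showroom"),
        (7, "Swimming Pool - Above Ground"),
        (9, "Swimming Pool - Below Ground"), (9, "Wine Cellar"),
    ],
)

# Variant 1: conditional counts, no WHERE filter.
conditional = [r[0] for r in conn.execute("""
    SELECT pid
    FROM property_feature
    GROUP BY pid
    HAVING COUNT(CASE WHEN feature = 'Wine Cellar' THEN 1 END) > 0
       AND COUNT(CASE WHEN feature = 'Helipad' THEN 1 END) > 0
""")]

# Variant 2: filter first, then require both features to be present.
filtered = [r[0] for r in conn.execute("""
    SELECT pid
    FROM property_feature
    WHERE feature IN ('Wine Cellar', 'Helipad')
    GROUP BY pid
    HAVING COUNT(DISTINCT feature) = 2
""")]

print(conditional, filtered)
```

Property 9 has a wine cellar but no helipad, so both queries return only property 1.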

selecting top N rows for each group in a table

I am facing a very common issue regarding "Selecting top N rows for each group in a table".
Consider a table with id, name, hair_colour, score columns.
I want a result set that gives me, for each hair colour, the names of the top 3 scorers.
To solve this, I found exactly what I need in Rick Osborne's blog post "sql-getting-top-n-rows-for-a-grouped-query".
However, that solution doesn't work as expected when scores are equal.
In the example above, the result is as follows.
id name hair score ranknum
---------------------------------
12 Kit Blonde 10 1
9 Becca Blonde 9 2
8 Katie Blonde 8 3
3 Sarah Brunette 10 1
4 Deborah Brunette 9 2   <-- row in question
1 Kim Brunette 8 3
Consider the row 4 Deborah Brunette 9 2. If this row also had the same score as Sarah (10), then ranknum would be 2, 2, 3 for the "Brunette" hair type.
What's the solution to this?
If you're using SQL Server 2005 or newer, you can use the ranking functions and a CTE to achieve this:
;WITH HairColors AS
(SELECT id, name, hair, score,
        ROW_NUMBER() OVER(PARTITION BY hair ORDER BY score DESC) as 'RowNum'
 FROM girl -- assumed table name
)
SELECT id, name, hair, score
FROM HairColors
WHERE RowNum <= 3
This CTE will "partition" your data by the value of the hair column, and each partition is then ordered by score (descending) and gets a row number; the highest score in each partition is 1, then 2, etc.
So if you want the top 3 of each group, select only those rows from the CTE that have a RowNum of 3 or less (1, 2, 3) --> there you go!
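Since the question is specifically about ties, it may help to see how ROW_NUMBER() compares with the other ranking functions on tied scores. The sketch below is not part of the answer above: it assumes an in-memory SQLite database (3.25+) via Python's `sqlite3`, a table named `girl`, and the hypothetical case where Deborah's score is 10, like Sarah's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE girl (id INTEGER, name TEXT, hair TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO girl VALUES (?, ?, ?, ?)",
    [
        (3, "Sarah", "Brunette", 10),
        (4, "Deborah", "Brunette", 10),  # tied with Sarah (hypothetical)
        (1, "Kim", "Brunette", 8),
    ],
)

rows = conn.execute("""
    SELECT name,
           -- id is added as a tiebreaker so ROW_NUMBER is deterministic
           ROW_NUMBER() OVER (PARTITION BY hair ORDER BY score DESC, id) AS row_num,
           RANK()       OVER (PARTITION BY hair ORDER BY score DESC)     AS rnk,
           DENSE_RANK() OVER (PARTITION BY hair ORDER BY score DESC)     AS dense_rnk
    FROM girl
    ORDER BY score DESC, id
""").fetchall()

for row in rows:
    print(row)
```

ROW_NUMBER() always hands out 1, 2, 3, so filtering on it returns exactly three rows per group and breaks ties by whatever ordering is given. RANK() gives tied rows the same number and then skips (1, 1, 3), while DENSE_RANK() does not skip (1, 1, 2). Which one fits depends on whether tied rows should share a place.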
The way the algorithm comes up with the rank, is to count the number of rows in the cross-product with a score equal to or greater than the girl in question, in order to generate rank. Hence in the problem case you're talking about, Sarah's grid would look like
a.name | a.score | b.name | b.score
-------+---------+---------+--------
Sarah | 9 | Sarah | 9
Sarah | 9 | Deborah | 9
and similarly for Deborah, which is why both girls get a rank of 2 here.
The problem is that when there's a tie, all girls take the lowest value in the tied range due to this count, when you'd want them to take the highest value instead. I think a simple change can fix this:
Instead of a greater-than-or-equal comparison, use a strict greater-than comparison to count the number of girls who are strictly better. Then, add one to that and you have your rank (which will deal with ties as appropriate). So the inner select would be:
SELECT a.id, COUNT(*) + 1 AS ranknum
FROM girl AS a
INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score < b.score)
GROUP BY a.id
HAVING COUNT(*) <= 3
Can anyone see any problems with this approach that have escaped my notice?
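One possible issue, shown here as a sketch rather than a definitive diagnosis: with an INNER JOIN on a strict comparison, a top scorer has no "better" rows to join to and so drops out of the result entirely, and `HAVING COUNT(*) <= 3` lets a rank of 4 through. A LEFT JOIN that counts only matched rows avoids both. The example assumes an in-memory SQLite database via Python's `sqlite3`, with Deborah tied at 10 and a hypothetical fourth blonde row (Ann) added to show the HAVING boundary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE girl (id INTEGER, name TEXT, hair TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO girl VALUES (?, ?, ?, ?)",
    [
        (12, "Kit", "Blonde", 10), (9, "Becca", "Blonde", 9),
        (8, "Katie", "Blonde", 8), (7, "Ann", "Blonde", 7),  # Ann is hypothetical
        (3, "Sarah", "Brunette", 10), (4, "Deborah", "Brunette", 10),
        (1, "Kim", "Brunette", 8),
    ],
)

# The query as posted: rows with nobody above them never appear.
inner = dict(conn.execute("""
    SELECT a.id, COUNT(*) + 1 AS ranknum
    FROM girl AS a
    INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score < b.score)
    GROUP BY a.id
    HAVING COUNT(*) <= 3
""").fetchall())

# LEFT JOIN variant: COUNT(b.id) is 0 for top scorers, so they get rank 1.
left = dict(conn.execute("""
    SELECT a.id, COUNT(b.id) + 1 AS ranknum
    FROM girl AS a
    LEFT JOIN girl AS b ON (a.hair = b.hair) AND (a.score < b.score)
    GROUP BY a.id
    HAVING COUNT(b.id) < 3
""").fetchall())

print(inner)
print(left)
```

In the posted form Kit, Sarah and Deborah are missing and Ann appears with rank 4. In the LEFT JOIN variant the two tied brunettes both get rank 1 and Kim gets rank 3.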
Use this compound select, which handles the OP's problem properly:
SELECT g.* FROM girls as g
WHERE g.score > IFNULL( (SELECT g2.score FROM girls as g2
WHERE g.hair=g2.hair ORDER BY g2.score DESC LIMIT 3,1), 0)
Note that you need IFNULL here to handle the case where the girls table has fewer rows for some hair type than we want to return (in the OP's case, 3 items).