How to use qualify in spark.sql

How do I write the SQL query below in pyspark.sql? I can't find an alternative to qualify.
select * from table where id = 1 qualify transaction_ts = max(transaction_ts) over (partition by city)

As no expected input/output was provided, my answer may not be accurate.
QUALIFY does not exist in core Spark (though it is available in Databricks, for example), but I think you can do what you want with a window function in a subquery.
Here is my example in Python:
import datetime
import pyspark.sql.functions as F
x = [
(1, "Warsaw", datetime.date(2020, 10, 25)),
(1, "Warsaw", datetime.date(2020, 10, 22)),
(1, "Warsaw", datetime.date(2020, 10, 26)),
(2, "Cracow", datetime.date(2020, 10, 22)),
(2, "Cracow", datetime.date(2020, 10, 15)),
]
df = spark.createDataFrame(x, schema=["id", "city", "ts"])
df.createOrReplaceTempView("test_table")
spark.sql(
    """
    select id, city, ts
    from (
        -- the window value is computed here and filtered on in the outer query
        select id, city, ts, max(ts) over (partition by city) as max_ts
        from test_table
        where id = 1
    )
    where ts = max_ts
    """
).show()
Output:
+---+------+----------+
| id|  city|        ts|
+---+------+----------+
|  1|Warsaw|2020-10-26|
+---+------+----------+
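The subquery rewrite is plain SQL, so it can be checked without a Spark session. The sketch below is only a stand-in: it uses Python's built-in sqlite3 module instead of Spark, stores the dates as ISO strings (an assumption made so that string comparison orders them correctly), and needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Same sample rows as in the Spark example; dates kept as ISO strings so that
# MAX() and "=" compare them chronologically.
rows = [
    (1, "Warsaw", "2020-10-25"),
    (1, "Warsaw", "2020-10-22"),
    (1, "Warsaw", "2020-10-26"),
    (2, "Cracow", "2020-10-22"),
    (2, "Cracow", "2020-10-15"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (id INTEGER, city TEXT, ts TEXT)")
conn.executemany("INSERT INTO test_table VALUES (?, ?, ?)", rows)

# QUALIFY emulation: compute the window value in a subquery,
# then filter on it in the outer WHERE clause.
query = """
SELECT id, city, ts
FROM (
    SELECT id, city, ts,
           MAX(ts) OVER (PARTITION BY city) AS max_ts
    FROM test_table
    WHERE id = 1
) AS t
WHERE ts = max_ts
"""
latest = conn.execute(query).fetchall()
print(latest)
```

The inner WHERE plays the role of the original WHERE clause, and the outer WHERE plays the role of QUALIFY.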

Related

prestoSQL aggregate columns and rows into one column

I would like to aggregate some columns and rows into one column in prestoSQL table.
with example_table as (
select * from (
values ('A', 'nh', 7), ('A', 'mn', 4), ('A', 'sv', 3),
('B', 'tb', 6), ('B', 'ty', 5), ('A', 'rw', 2),
('C', 'op', 9), ('C', 'au', 8)
) example_table("id", "time", "value")
)
select id, array_agg(value, time) -- fails: Unexpected parameters (integer, VARCHAR(2)) for function array_agg. Expected: array_agg(T) T
from example_table
group by id
I would like to combine the columns "time" and "value" into one column and then aggregate all rows by "id", such that:
id  time_value_agg
A   [['nh', 7], ['mn', 4], ['sv', 3], ['rw', 2]]
B   [['tb', 6], ['ty', 5]]
C   [['op', 9], ['au', 8]]
The column time_value_agg should be an array of str. If the "time" column is not str, cast it to str.
I am not sure which function can be used for this.
Thanks

array_agg can be applied to a single column only. If the times are unique per id, you can turn the data into a map:
select id, map(array_agg(time), array_agg(value)) time_value_agg
from example_table
group by id
Output:
id  time_value_agg
C   {op=9, au=8}
A   {mn=4, sv=3, rw=2, nh=7}
B   {ty=5, tb=6}
Or turn data into ROW type (or map) before aggregation:
select id,
    array_agg(arr) time_value_agg
from (
    select id, cast(row(time, value) as row(time varchar, value integer)) arr
    from example_table
)
group by id
Output:
id  time_value_agg
C   [{time=op, value=9}, {time=au, value=8}]
A   [{time=nh, value=7}, {time=mn, value=4}, {time=sv, value=3}, {time=rw, value=2}]
B   [{time=tb, value=6}, {time=ty, value=5}]
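For readers who want to see the target shape outside Presto, here is a minimal Python sketch of the same two steps: build the pair first, then collect the pairs per id. It is not Presto code, and casting value to str is my reading of the "array of str" requirement in the question.

```python
from collections import defaultdict

# Same rows as example_table in the question: (id, time, value).
rows = [
    ("A", "nh", 7), ("A", "mn", 4), ("A", "sv", 3),
    ("B", "tb", 6), ("B", "ty", 5), ("A", "rw", 2),
    ("C", "op", 9), ("C", "au", 8),
]

time_value_agg = defaultdict(list)
for id_, time, value in rows:
    # Step 1: combine the two columns into one pair (the ROW step),
    # casting value to str so that every element is a string.
    pair = [time, str(value)]
    # Step 2: collect the pairs per id (the array_agg step).
    time_value_agg[id_].append(pair)

for id_, pairs in sorted(time_value_agg.items()):
    print(id_, pairs)
```

Python keeps the input order inside each list; as far as I know, array_agg in Presto gives no such guarantee unless an ORDER BY is placed inside the aggregate.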

Conditional group by with window function in Snowflake query

I have a table in Snowflake in the following format:
create table temp_test(name string, split string, value int);
insert into temp_test
values ('A','a', 100), ('A','b', 200), ('A','c',300), ('A', 'd', 400), ('A', 'e',500), ('B', 'a', 1000), ('B','b', 2000), ('B','c', 3000), ('B', 'd',4000), ('B','e', 5000);
As a first step, I needed only the top 2 values per name (sorted by value), so I used the following query:
select name, split, value,
row_number() over (PARTITION BY (name) order by value desc) as row_num
from temp_test
qualify row_num <= 2
Which gives me the following result set:
NAME SPLIT VALUE ROW_NUM
A e 500 1
A d 400 2
B e 5000 1
B d 4000 2
Now I need to sum the values outside the top 2 and put them in a separate split named "Others", like this:
NAME SPLIT VALUE
A e 500
A d 400
A Others 600
B e 5000
B d 4000
B Others 6000
How can I do that in a Snowflake query, or in SQL in general?

with data as (
    select name, split, value,
        row_number() over (partition by name order by value desc) as row_num
    from temp_test
)
select
    name,
    case when row_num <= 2 then split else 'Others' end as split,
    sum(value) as value
from data
group by name, case when row_num <= 2 then split else 'Others' end
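To try the "top 2 plus Others" grouping without a Snowflake account, the sketch below runs the same idea against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in here (window functions need version 3.25 or later); the alias split_label and the ORDER BY clause are my additions, chosen to avoid any ambiguity with the split column and to get a stable row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_test (name TEXT, split TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO temp_test VALUES (?, ?, ?)",
    [
        ("A", "a", 100), ("A", "b", 200), ("A", "c", 300), ("A", "d", 400), ("A", "e", 500),
        ("B", "a", 1000), ("B", "b", 2000), ("B", "c", 3000), ("B", "d", 4000), ("B", "e", 5000),
    ],
)

query = """
WITH data AS (
    SELECT name, split, value,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY value DESC) AS row_num
    FROM temp_test
)
SELECT name,
       -- rows ranked 1 and 2 keep their own split, everything else is 'Others'
       CASE WHEN row_num <= 2 THEN split ELSE 'Others' END AS split_label,
       SUM(value) AS value
FROM data
-- group by the same expression that is selected
GROUP BY name, CASE WHEN row_num <= 2 THEN split ELSE 'Others' END
-- MIN(row_num) puts the top rows first and 'Others' last within each name
ORDER BY name, MIN(row_num)
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```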

Shawnt00's answer is good, but for the record, in Snowflake this can be written more simply.
Firstly, the GROUP BY at the end can refer to the result columns by index or by name:
GROUP BY 1,2
or
GROUP BY name, split
Also, as the CASE only has two branches, an IFF can be used, and since you are using a CTE to add the row_number, you can push the IFF into the CTE as well:
WITH data AS (
SELECT name, value,
ROW_NUMBER() OVER (PARTITION BY name ORDER BY value DESC) AS row_num,
IFF(row_num < 3, split, 'Others') as n_split
FROM VALUES ('A','a', 100), ('A','b', 200), ('A','c',300), ('A', 'd', 400),
('A', 'e',500), ('B', 'a', 1000), ('B','b', 2000), ('B','c', 3000),
('B', 'd',4000), ('B','e', 5000)
v(name, split, value)
)
SELECT
name,
n_split,
SUM(value) AS value
FROM data
GROUP BY name, n_split;
And if you are keen on making the SQL even smaller, push the ROW_NUMBER into the IFF:
WITH data AS (
SELECT name, value,
IFF(ROW_NUMBER() OVER (PARTITION BY name ORDER BY value DESC) < 3, split, 'Others') as n_split
FROM VALUES ('A','a', 100), ('A','b', 200), ('A','c',300), ('A', 'd', 400),
('A', 'e',500), ('B', 'a', 1000), ('B','b', 2000), ('B','c', 3000),
('B', 'd',4000), ('B','e', 5000)
v(name, split, value)
)
SELECT
name,
n_split AS split,
SUM(value) AS value
FROM data
GROUP BY name, n_split;
gives:
NAME SPLIT VALUE
A e 500
A d 400
A Others 600
B e 5000
B d 4000
B Others 6000

Python sqlite3 SQL query Get all entries with newest date but limit per single unique column

I have a table called 'fileEvents'. It has four columns (there are more but not relevant to the question): id, fileId, action and time.
The same fileId, action and time values can appear in multiple rows.
The query I want is simple but I can't think of a working one: Get the latest entry since a specific time for every fileId.
I tried the following.
First, I tried to just get all entries since a specific time, sorted by time:
SELECT * FROM `fileEvents` WHERE `time` < 1000 ORDER BY `time` DESC
The result is of course fine (id, action, fileId, time):
[(6, 0, 3, 810), (5, 0, 3, 410), (2, 0, 1, 210), (3, 0, 2, 210), (4, 0, 3, 210), (1, 0, 1, 200)]
So it is all sorted. But now I only want unique fileIds, so I add a GROUP BY `fileId`:
SELECT * FROM `fileEvents` WHERE `time` < 1000 GROUP BY `fileId` ORDER BY `time` DESC
Which of course is wrong. Because first it will group the results and then sort them, but they are already grouped so there is no sorting:
[(3, 0, 2, 210), (4, 0, 3, 210), (1, 0, 1, 200)]
When I try to reverse the GROUP BY and ORDER BY, I get an OperationalError: near "GROUP": syntax error
Also, when I try a subquery where I first get the sorted list and then group it, the result is wrong:
SELECT * FROM `fileEvents` WHERE `id` IN (
SELECT `id` FROM `fileEvents` WHERE `time` < 1000 ORDER BY `time` DESC
) GROUP BY `fileId`
With the (wrong) result:
[(1, 0, 1, 200), (3, 0, 2, 210), (4, 0, 3, 210)]
The result I am looking for is:
[(6, 0, 3, 810), (2, 0, 1, 210), (3, 0, 2, 210)]
Does anyone have an idea how I could get the result I want? What am I missing?
Thanks a lot!

With ROW_NUMBER() window function:
select * -- replace * with the columns that you want in the result
from (
select *, row_number() over (partition by fileid order by time desc) rn
from fileevents
where time < 1000
) t
where rn = 1
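Since the question is about Python's sqlite3, here is a small sketch that runs this ROW_NUMBER() query on the rows shown in the question. The column order (id, action, fileId, time) is taken from the question's result listing; the final ORDER BY is my addition to get a deterministic order, and the query assumes an SQLite library of version 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fileEvents (id INTEGER PRIMARY KEY, action INTEGER, fileId INTEGER, time INTEGER)"
)
conn.executemany(
    "INSERT INTO fileEvents VALUES (?, ?, ?, ?)",
    [(6, 0, 3, 810), (5, 0, 3, 410), (2, 0, 1, 210),
     (3, 0, 2, 210), (4, 0, 3, 210), (1, 0, 1, 200)],
)

query = """
SELECT id, action, fileId, time
FROM (
    -- rn = 1 marks the newest row of each fileId among rows before the cutoff
    SELECT *, ROW_NUMBER() OVER (PARTITION BY fileId ORDER BY time DESC) AS rn
    FROM fileEvents
    WHERE time < ?
) AS t
WHERE rn = 1
ORDER BY time DESC, id
"""
latest_per_file = conn.execute(query, (1000,)).fetchall()
print(latest_per_file)
```

If two rows of the same fileId share the newest time, ROW_NUMBER() keeps only one of them, and which one is arbitrary unless a tie-breaker such as id is added to the window's ORDER BY.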

A typical solution to this top-1-per-group problem is to filter with a correlated subquery:
select fe.*
from fileevents fe
where fe.time = (
select max(fe1.time)
from fileevents fe1
where fe1.fileid = fe.fileid and fe1.time < 1000
)
For performance with this query, you want an index on (fileid, time).
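A minimal sqlite3 sketch of the correlated-subquery variant, including the suggested index. The index name and the extra row with id 7 are made up for illustration: the extra row duplicates the newest time of fileId 3 to show that this approach returns every row tied for the latest time, which matters because the question says the same values can appear in multiple rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fileEvents (id INTEGER PRIMARY KEY, action INTEGER, fileId INTEGER, time INTEGER)"
)
# Index on (fileId, time) as suggested above; the name is hypothetical.
conn.execute("CREATE INDEX idx_fileevents_fileid_time ON fileEvents (fileId, time)")
conn.executemany(
    "INSERT INTO fileEvents VALUES (?, ?, ?, ?)",
    [(6, 0, 3, 810), (5, 0, 3, 410), (2, 0, 1, 210),
     (3, 0, 2, 210), (4, 0, 3, 210), (1, 0, 1, 200),
     (7, 0, 3, 810)],  # hypothetical duplicate of the newest time for fileId 3
)

query = """
SELECT fe.id, fe.action, fe.fileId, fe.time
FROM fileEvents AS fe
WHERE fe.time = (
    -- newest time before the cutoff for the same fileId
    SELECT MAX(fe1.time)
    FROM fileEvents AS fe1
    WHERE fe1.fileId = fe.fileId AND fe1.time < ?
)
ORDER BY fe.time DESC, fe.id
"""
result = conn.execute(query, (1000,)).fetchall()
print(result)
```

Both rows with time 810 come back for fileId 3. If exactly one row per fileId is required, a tie-breaking rule has to be added.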

PostgreSQL - SELECT rows only for the latest range found

I have this select statement
SELECT id, liked, markers, search_body, remote_bare_jid, direction
FROM mam_message where user_id='20' AND remote_bare_jid =
'5a95c47078f92c6337019521' ORDER BY id DESC;
I want to retrieve only the rows of the latest range of direction 'I' -> 'I'. For the current data, this query returns them:
SELECT id, liked, markers, search_body, remote_bare_jid, direction
FROM mam_message
where user_id ='20'
AND remote_bare_jid = '5a95c47078f92c6337019521'
ORDER BY id DESC
limit 4;
Even when that range is not at the top of the result, I still want to get only the latest range of direction 'I'.

You can find the transition rows (where direction changes from O to I) using the window function lag(). Mark these rows as 1 and all others as 0. Next, calculate the cumulative sum of these marks. The group sought will have the sum = 1. Example:
with example(id, direction) as (
values
(1, 'O'),
(2, 'I'),
(3, 'I'),
(4, 'I'),
(5, 'O'),
(6, 'I')
)
select id, direction
from (
select id, direction, sum(mark) over w
from (
select
id, direction,
(lag(direction, 1, 'O') over w = 'O' and direction = 'I')::int mark
from example
window w as (order by id)
) s
window w as (order by id)
) s
where direction = 'I' and sum = 1
order by id
id | direction
----+-----------
2 | I
3 | I
4 | I
(3 rows)
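The mark-and-cumulative-sum idea can be tried outside PostgreSQL as well. The sketch below ports it to SQLite via Python's sqlite3 module (an assumption, not the answer's own setup; it needs SQLite 3.25 or later). The function name runs_of_i and its order parameter are hypothetical; the parameter only switches the direction of both windows.

```python
import sqlite3

def runs_of_i(order="ASC"):
    """Return the ids of the first run of 'I' rows when scanning by id in the given order."""
    if order not in ("ASC", "DESC"):
        raise ValueError("order must be 'ASC' or 'DESC'")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE example (id INTEGER, direction TEXT)")
    conn.executemany(
        "INSERT INTO example VALUES (?, ?)",
        [(1, "O"), (2, "I"), (3, "I"), (4, "I"), (5, "O"), (6, "I")],
    )
    query = f"""
    SELECT id
    FROM (
        -- cumulative sum of the marks numbers the runs 1, 2, 3, ...
        SELECT id, direction,
               SUM(mark) OVER (ORDER BY id {order}) AS grp
        FROM (
            -- mark = 1 on the first 'I' after an 'O' (or at the very start)
            SELECT id, direction,
                   (LAG(direction, 1, 'O') OVER (ORDER BY id {order}) = 'O'
                    AND direction = 'I') AS mark
            FROM example
        ) AS marked
    ) AS grouped
    WHERE direction = 'I' AND grp = 1
    ORDER BY id
    """
    return [row[0] for row in conn.execute(query)]

print(runs_of_i("ASC"))
print(runs_of_i("DESC"))
```

With ascending windows this reproduces the ids from the example. With descending windows, run number 1 is the one with the highest ids, which may be closer to the "latest range" the question asks about.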

Logic to check if exact ids (3+ records) are present in a group in SQL Server

I have some sample data like:
INSERT INTO mytable
([FK_ID], [TYPE_ID])
VALUES
(10, 1),
(11, 1), (11, 2),
(12, 1), (12, 2), (12, 3),
(14, 1), (14, 2), (14, 3), (14, 4),
(15, 1), (15, 2), (15, 4)
Here I am trying to check whether, within each FK_ID group, the TYPE_ID values exactly match 1, 2 and 3.
So, the expected output is like:
(10, 1) this should fail
As in group FK_ID = 10 we only have one record
(11, 1), (11, 2) this should also fail
As in group FK_ID = 11 we have two records.
(12, 1), (12, 2), (12, 3) this should pass
As in group FK_ID = 12 we have three records.
And all the TYPE_ID are exactly matching 1, 2 & 3 values.
(14, 1), (14, 2), (14, 3), (14, 4) this should also fail
As we have 4 records here.
(15, 1), (15, 2), (15, 4) this should also fail
Even though we have three records, it should fail as the TYPE_ID here (1, 2, 4) are not matching with required match (1, 2, 3).
Here is my attempt:
select * from mytable t1
where exists (select COUNT(t2.TYPE_ID)
from mytable t2 where t2.FK_ID = t1.FK_ID
and t2.TYPE_ID IN (1, 2, 3)
group by t2.FK_ID having COUNT(t2.TYPE_ID) = 3);
This is not working as expected, because it also passes for FK_ID = 14, which has four records.
Demo: SQL Fiddle
Also, how can we make it generic, so that if we need to check for 4 or more TYPE_ID values such as (1,2,3,4) or (1,2,3,4,5), we can do that easily by updating a few values?

The following query will do what you want:
select fk_id
from t
group by fk_id
having sum(case when type_id in (1, 2, 3) then 1 else 0 end) = 3 and
sum(case when type_id not in (1, 2, 3) then 1 else 0 end) = 0;
This assumes that you have no duplicate pairs (although depending on how you want to handle duplicates, it might be as easy as using, from (select distinct * from t) t).
As for "genericness", you need to update the in lists and the 3.
If you want something more generic:
with vals as (
select id
from (values (1), (2), (3)) v(id)
)
select fk_id
from t
group by fk_id
having sum(case when type_id in (select id from vals) then 1 else 0 end) = (select count(*) from vals) and
sum(case when type_id not in (select id from vals) then 1 else 0 end) = 0;
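To check the two-condition HAVING clause against the sample data, here is a sketch that runs it in SQLite through Python's sqlite3 module instead of SQL Server. The helper name groups_matching is hypothetical; it builds the IN lists from a Python list so that the required set only has to be written once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (fk_id INTEGER, type_id INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(10, 1),
     (11, 1), (11, 2),
     (12, 1), (12, 2), (12, 3),
     (14, 1), (14, 2), (14, 3), (14, 4),
     (15, 1), (15, 2), (15, 4)],
)

def groups_matching(conn, required):
    """Return the fk_id values whose type_id set is exactly `required` (no duplicate pairs assumed)."""
    placeholders = ", ".join("?" for _ in required)
    query = f"""
    SELECT fk_id
    FROM mytable
    GROUP BY fk_id
    -- every required value is present ...
    HAVING SUM(CASE WHEN type_id IN ({placeholders}) THEN 1 ELSE 0 END) = ?
    -- ... and nothing outside the required set is present
       AND SUM(CASE WHEN type_id NOT IN ({placeholders}) THEN 1 ELSE 0 END) = 0
    ORDER BY fk_id
    """
    params = [*required, len(required), *required]
    return [row[0] for row in conn.execute(query, params)]

print(groups_matching(conn, [1, 2, 3]))
print(groups_matching(conn, [1, 2, 3, 4]))
```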

You can use this code:
SELECT y.fk_id FROM
(SELECT x.fk_id, COUNT(x.type_id) AS count, SUM(x.type_id) AS sum
FROM mytable x GROUP BY (x.fk_id)) AS y
WHERE y.count = 3 AND y.sum = 6
To make it generic, compare y.count with N and y.sum with N*(N+1)/2, where N is the number of values you are looking for (1, 2, ..., N). This relies on the TYPE_ID values within a group being distinct positive integers.
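The count-and-sum test rests on the identity 1 + 2 + ... + N = N(N+1)/2, so for N = 3 the target sum is 6. The short Python sketch below (function names are mine) applies the check to the sample groups and also shows where it stops being reliable.

```python
def expected_sum(n):
    # 1 + 2 + ... + n
    return n * (n + 1) // 2

def passes_count_sum_check(type_ids, n):
    # Mirrors "count = N and sum = N*(N+1)/2" from the query.
    return len(type_ids) == n and sum(type_ids) == expected_sum(n)

# Sample data from the question, grouped by FK_ID.
groups = {10: [1], 11: [1, 2], 12: [1, 2, 3], 14: [1, 2, 3, 4], 15: [1, 2, 4]}
passing = [fk for fk, ids in groups.items() if passes_count_sum_check(ids, 3)]
print(passing)

# Pitfall: with duplicate values the check can be fooled,
# because 1 + 1 + 4 is also 6.
print(passes_count_sum_check([1, 1, 4], 3))
```

The check is only safe when the values in a group are distinct positive integers, since then 1..N is the only set of N values with that sum.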

You can try this query. COUNT with DISTINCT is used to ignore duplicate records.
SELECT
[FK_ID]
FROM
#mytable T
GROUP BY
[FK_ID]
HAVING
COUNT(DISTINCT CASE WHEN [TYPE_ID] IN (1,2,3) THEN [TYPE_ID] END) = 3
AND COUNT(CASE WHEN [TYPE_ID] NOT IN (1,2,3) THEN [TYPE_ID] END) = 0

Try this:
select FK_ID,count(distinct TYPE_ID) from mytable
where TYPE_ID<=3
group by FK_ID
having count(distinct TYPE_ID)=3

You can use CTEs and pass in the value you mentioned in the question.
WITH CTE
AS (
SELECT FK_ID,
COUNT(*) CNT
FROM #mytable
GROUP BY FK_ID
HAVING COUNT(*) = 3), -- pass the value you want here
CTE1
AS (
SELECT T.[ID],
T.[FK_ID],
T.[TYPE_ID],
ROW_NUMBER() OVER(PARTITION BY T.[FK_ID] ORDER BY
(
SELECT NULL
)) RN
FROM #mytable T
INNER JOIN CTE C ON C.FK_ID = T.FK_ID),
CTE2
AS (
SELECT C1.FK_ID
FROM CTE1 C1
GROUP BY C1.FK_ID
HAVING SUM(C1.TYPE_ID) = SUM(C1.RN))
SELECT TT1.*
FROM CTE2 C2
INNER JOIN #mytable TT1 ON TT1.FK_ID = C2.FK_ID;
The SQL above produces this result (I passed 3):
ID FK_ID TYPE_ID
4 12 1
5 12 2
6 12 3
