How to use QUALIFY in spark.sql

How do I write the SQL query below in spark.sql? I am not able to find an alternative for QUALIFY.
select * from table where id = 1 qualify transaction_ts = max(transaction_ts) over (partition by city)

As there was no expected input/output provided, my answer may not be accurate.
QUALIFY does not exist in core Spark (although it is available, for example, in Databricks), but you can get the same result with a window function in a sub-query.
Here is my example in Python:
import datetime
import pyspark.sql.functions as F
x = [
    (1, "Warsaw", datetime.date(2020, 10, 25)),
    (1, "Warsaw", datetime.date(2020, 10, 22)),
    (1, "Warsaw", datetime.date(2020, 10, 26)),
    (2, "Cracow", datetime.date(2020, 10, 22)),
    (2, "Cracow", datetime.date(2020, 10, 15)),
]
df = spark.createDataFrame(x, schema=["id", "city", "ts"])
df.createOrReplaceTempView("test_table")
spark.sql("""
    SELECT id, city, ts
    FROM (
        SELECT id, city, ts,
               MAX(ts) OVER (PARTITION BY city) AS max_ts
        FROM test_table
        WHERE id = 1
    )
    WHERE ts = max_ts
""").show()
Output:
+---+------+----------+
| id| city| ts|
+---+------+----------+
| 1|Warsaw|2020-10-26|
+---+------+----------+
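A sketch of an equivalent rewrite that mirrors QUALIFY more directly uses ROW_NUMBER() in the sub-query (against the same test_table; note it keeps exactly one row per city even if several rows share the maximum ts):
SELECT id, city, ts
FROM (
    SELECT id, city, ts,
           ROW_NUMBER() OVER (PARTITION BY city ORDER BY ts DESC) AS rn
    FROM test_table
    WHERE id = 1
)
WHERE rn = 1
This can be passed to spark.sql() exactly like the query above.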

Related

prestoSQL aggregate columns and rows into one column

I would like to aggregate some columns and rows into one column in a PrestoSQL table.
with example_table as (
select * from (
values ('A', 'nh', 7), ('A', 'mn', 4), ('A', 'sv', 3),
('B', 'tb', 6), ('B', 'ty', 5), ('A', 'rw', 2),
('C', 'op', 9), ('C', 'au', 8)
) example_table("id", "time", "value")
)
select id, array_agg(value, time) -- Unexpected parameters (integer, VARCHAR(2)) for function array_agg. Expected: array_agg(T) T
from example_table
group by id
I would like to combine the "time" and "value" columns into one column and then aggregate all rows by "id", so that:
id   time_value_agg
A    [['nh', 7], ['mn', 4], ['sv', 3], ['rw', 2]]
B    [['tb', 6], ['ty', 5]]
C    [['op', 9], ['au', 8]]
The time_value_agg column should be an array of str. If the "time" column is not str, cast it to str.
I am not sure which function can be used for this?
Thanks.
array_agg can be applied to a single column only. If times are unique per id, you can turn the data into a map:
select id, map(array_agg(time), array_agg(value)) time_value_agg
from example_table
group by id
Output:
 id | time_value_agg
----+--------------------------
 C  | {op=9, au=8}
 A  | {mn=4, sv=3, rw=2, nh=7}
 B  | {ty=5, tb=6}
Or turn the data into a ROW type (or a map) before aggregation:
select id,
       array_agg(arr) time_value_agg
from (
    select id, cast(row(time, value) as row(time varchar, value integer)) arr
    from example_table
)
group by id
Output:
 id | time_value_agg
----+----------------------------------------------------------------------------------
 C  | [{time=op, value=9}, {time=au, value=8}]
 A  | [{time=nh, value=7}, {time=mn, value=4}, {time=sv, value=3}, {time=rw, value=2}]
 B  | [{time=tb, value=6}, {time=ty, value=5}]
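If you specifically want an array of [time, value] pairs with everything kept as text, closer to the output sketched in the question, a minimal variant (a sketch, not tested against your cluster) is:
select id,
       array_agg(array[time, cast(value as varchar)]) time_value_agg
from example_table
group by id
Each id then gets a single array(array(varchar)) value, e.g. [[nh, 7], [mn, 4], [sv, 3], [rw, 2]] for id A.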

Conditional group by with window function in Snowflake query

I have a table in Snowflake in the following format:
create table temp_test(name string, split string, value int)
insert into temp_test
values ('A','a', 100), ('A','b', 200), ('A','c', 300), ('A','d', 400), ('A','e', 500),
       ('B','a', 1000), ('B','b', 2000), ('B','c', 3000), ('B','d', 4000), ('B','e', 5000)
First step, I needed only the top 2 values per name (sorted on value), so I used the following query to get that:
select name, split, value,
row_number() over (PARTITION BY (name) order by value desc) as row_num
from temp_test
qualify row_num <= 2
Which gives me following resultset:
NAME SPLIT VALUE ROW_NUM
A e 500 1
A d 400 2
B e 5000 1
B d 4000 2
Now I need to sum the values other than the top 2 and put them in a different split named "Others", like this:
NAME SPLIT VALUE
A e 500
A d 400
A Others 600
B e 5000
B d 4000
B Others 6000
How can I do that in a Snowflake query, or in SQL in general?
with data as (
select name, split, value,
row_number() over (partition by (name) order by value desc) as row_num
from temp_test
)
select
name,
case when row_num <= 2 then split else 'Others' end as split,
sum(value) as value
from data
group by name, case when row_num <= 2 then row_num else 3 end
Shawnt00's answer is good, but for the record, in Snowflake this can be written more simply.
Firstly, the GROUP BY at the end can refer to the result columns by index or by name:
GROUP BY 1,2
or
GROUP BY name, split
Also, as the CASE has only two branches, an IFF can be used, and since you are already using a CTE to add the row_number, you can push the IFF into the CTE as well:
WITH data AS (
SELECT name, value,
ROW_NUMBER() OVER (PARTITION BY name ORDER BY value DESC) AS row_num,
IFF(row_num < 3, split, 'Others') as n_split
FROM VALUES ('A','a', 100), ('A','b', 200), ('A','c',300), ('A', 'd', 400),
('A', 'e',500), ('B', 'a', 1000), ('B','b', 2000), ('B','c', 3000),
('B', 'd',4000), ('B','e', 5000)
v(name, split, value)
)
SELECT
name,
n_split,
SUM(value) AS value
FROM data
GROUP BY name, n_split;
And if you are keen on compact SQL, you can push the ROW_NUMBER into the IFF:
WITH data AS (
SELECT name, value,
IFF(ROW_NUMBER() OVER (PARTITION BY name ORDER BY value DESC) < 3, split, 'Others') as n_split
FROM VALUES ('A','a', 100), ('A','b', 200), ('A','c',300), ('A', 'd', 400),
('A', 'e',500), ('B', 'a', 1000), ('B','b', 2000), ('B','c', 3000),
('B', 'd',4000), ('B','e', 5000)
v(name, split, value)
)
SELECT
name,
n_split AS split,
SUM(value) AS value
FROM data
GROUP BY name, n_split;
gives:
NAME SPLIT VALUE
A e 500
A d 400
A Others 600
B e 5000
B d 4000
B Others 6000
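For an engine without IFF or Snowflake's ability to reference a column alias defined in the same SELECT, a portable sketch of the same approach (assuming the temp_test table from the question) groups directly on the derived split expression:
with ranked as (
    select name, split, value,
           row_number() over (partition by name order by value desc) as row_num
    from temp_test
)
select name,
       case when row_num <= 2 then split else 'Others' end as split,
       sum(value) as value
from ranked
group by name,
         case when row_num <= 2 then split else 'Others' end
Grouping by the full CASE expression keeps this standard SQL: each top-2 row keeps its own split as its group, and everything else collapses into 'Others', giving the same six rows as above.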

Python sqlite3 SQL query Get all entries with newest date but limit per single unique column

I have a table called 'fileEvents'. It has four columns (there are more but not relevant to the question): id, fileId, action and time.
The same fileId, action and time values can appear in multiple rows.
The query I want is simple but I can't think of a working one: Get the latest entry since a specific time for every fileId.
I tried the following.
First I will try to just get all entries sorted by time since a specific time:
SELECT * FROM `fileEvents` WHERE `time` < 1000 ORDER BY `time` DESC
The result is of course fine (id, action, fileId, time):
[(6, 0, 3, 810), (5, 0, 3, 410), (2, 0, 1, 210), (3, 0, 2, 210), (4, 0, 3, 210), (1, 0, 1, 200)]
So it is all sorted. But now I only want unique fileIds, so I add a GROUP BY `fileId`:
SELECT * FROM `fileEvents` WHERE `time` < 1000 GROUP BY `fileId` ORDER BY `time` DESC
Which of course is wrong, because it first groups the results and then sorts them, but by then each fileId has already been collapsed to a single (arbitrary) row:
[(3, 0, 2, 210), (4, 0, 3, 210), (1, 0, 1, 200)]
When I try to reverse the GROUP BY and ORDER BY, I get an OperationalError: near "GROUP": syntax error.
Also, when I try a sub-query where I first get the sorted list and then group it, the result is wrong:
SELECT * FROM `fileEvents` WHERE `id` IN (
SELECT `id` FROM `fileEvents` WHERE `time` < 1000 ORDER BY `time` DESC
) GROUP BY `fileId`
With the (wrong) result:
[(1, 0, 1, 200), (3, 0, 2, 210), (4, 0, 3, 210)]
The result I am looking for is:
[(6, 0, 3, 810), (2, 0, 1, 210), (3, 0, 2, 210)]
Does anyone have an idea how I could get the result I want? What am I missing?
Thanks a lot!
With ROW_NUMBER() window function:
select * -- replace * with the columns that you want in the result
from (
select *, row_number() over (partition by fileid order by time desc) rn
from fileevents
where time < 1000
) t
where rn = 1
A typical solution to this top-1-per-group problem is to filter with a correlated subquery:
select fe.*
from fileevents fe
where fe.time = (
select max(fe1.time)
from fileevents fe1
where fe1.fileid = fe.fileid and fe1.time < 1000
)
For performance with this query, you want an index on (fileid, time).
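If you are on an SQLite build without window-function support (older than 3.25), SQLite also offers a documented but non-standard shortcut: when MAX() or MIN() is the only aggregate in the query, the bare columns in the select list are taken from the row holding that maximum or minimum. A sketch against fileEvents (SQLite-only, not portable; with tied times one of the tied rows is returned):
SELECT id, action, fileId, MAX(time) AS time
FROM fileEvents
WHERE time < 1000
GROUP BY fileId
ORDER BY time DESC;
For the sample data in the question this returns (6, 0, 3, 810), (2, 0, 1, 210), (3, 0, 2, 210).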

PostgreSQL - SELECT rows only for the latest range found

I have this select statement
SELECT id, liked, markers, search_body, remote_bare_jid, direction
FROM mam_message
WHERE user_id = '20' AND remote_bare_jid = '5a95c47078f92c6337019521'
ORDER BY id DESC;
that returns a result set (shown as a screenshot in the original question).
I want to retrieve the rows of the latest contiguous range of direction 'I', which for that data happens to be what this query returns:
SELECT id, liked, markers, search_body, remote_bare_jid, direction
FROM mam_message
where user_id ='20'
AND remote_bare_jid = '5a95c47078f92c6337019521'
ORDER BY id DESC
limit 4;
Even when the latest range is not at the top, I still want to get only the latest range of direction 'I' (the rows highlighted in the screenshot).
You can find the transition rows (where direction changes from 'O' to 'I') using the window function lag(). Mark these rows with 1 (and the others with 0). Next, calculate the cumulative sum of these marks. The group you are looking for will have sum = 1. Example:
with example(id, direction) as (
values
(1, 'O'),
(2, 'I'),
(3, 'I'),
(4, 'I'),
(5, 'O'),
(6, 'I')
)
select id, direction
from (
select id, direction, sum(mark) over w
from (
select
id, direction,
(lag(direction, 1, 'O') over w = 'O' and direction = 'I')::int mark
from example
window w as (order by id)
) s
window w as (order by id)
) s
where direction = 'I' and sum = 1
order by id
id | direction
----+-----------
2 | I
3 | I
4 | I
(3 rows)
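The example above picks out the first 'I' range in ascending id order. For the question's "latest range", assuming a higher id means a newer message, the same idea can be applied with the windows ordered by id desc, so the running sum starts counting from the newest rows (a sketch against mam_message):
select id, liked, markers, search_body, remote_bare_jid, direction
from (
    select s.*, sum(mark) over (order by id desc) as grp
    from (
        select m.*,
               (lag(direction, 1, 'O') over (order by id desc) = 'O'
                and direction = 'I')::int as mark
        from mam_message m
        where user_id = '20'
          and remote_bare_jid = '5a95c47078f92c6337019521'
    ) s
) s
where direction = 'I' and grp = 1
order by id desc;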

Logic to check if exact ids (3+ records) are present in a group in SQL Server

I have some sample data like:
INSERT INTO mytable
([FK_ID], [TYPE_ID])
VALUES
(10, 1),
(11, 1), (11, 2),
(12, 1), (12, 2), (12, 3),
(14, 1), (14, 2), (14, 3), (14, 4),
(15, 1), (15, 2), (15, 4)
Now I am trying to check whether, in each group of FK_ID, we have an exact match of the TYPE_ID values 1, 2 & 3.
So, the expected output is like:
(10, 1) this should fail
As in group FK_ID = 10 we only have one record
(11, 1), (11, 2) this should also fail
As in group FK_ID = 11 we have two records.
(12, 1), (12, 2), (12, 3) this should pass
As in group FK_ID = 12 we have three records,
and all the TYPE_ID values exactly match 1, 2 & 3.
(14, 1), (14, 2), (14, 3), (14, 4) this should also fail
As we have 4 records here.
(15, 1), (15, 2), (15, 4) this should also fail
Even though we have three records, it should fail because the TYPE_ID values here (1, 2, 4) do not match the required set (1, 2, 3).
Here is my attempt:
select * from mytable t1
where exists (select COUNT(t2.TYPE_ID)
from mytable t2 where t2.FK_ID = t1.FK_ID
and t2.TYPE_ID IN (1, 2, 3)
group by t2.FK_ID having COUNT(t2.TYPE_ID) = 3);
This is not working as expected, because it also passes for FK_ID = 14, which has four records.
Demo: SQL Fiddle
Also, how can we make it generic, so that if we need to check for 4 or more TYPE_ID values, like (1,2,3,4) or (1,2,3,4,5), we can do that easily by updating a few values?
The following query will do what you want:
select fk_id
from t
group by fk_id
having sum(case when type_id in (1, 2, 3) then 1 else 0 end) = 3 and
sum(case when type_id not in (1, 2, 3) then 1 else 0 end) = 0;
This assumes that you have no duplicate pairs (although depending on how you want to handle duplicates, it might be as easy as using from (select distinct * from t) t).
As for making it generic, you need to update the IN lists and the 3.
If you want something more generic:
with vals as (
select id
from (values (1), (2), (3)) v(id)
)
select fk_id
from t
group by fk_id
having sum(case when type_id in (select id from vals) then 1 else 0 end) = (select count(*) from vals) and
sum(case when type_id not in (select id from vals) then 1 else 0 end) = 0;
You can use this code:
SELECT y.fk_id FROM
(SELECT x.fk_id, COUNT(x.type_id) AS count, SUM(x.type_id) AS sum
FROM mytable x GROUP BY (x.fk_id)) AS y
WHERE y.count = 3 AND y.sum = 6
For making it generic, you can compare y.count with N and y.sum with N*(N+1)/2, where N is the number of required values (1, 2, ..., N).
You can try this query. COUNT and DISTINCT are used to eliminate duplicate records.
SELECT
[FK_ID]
FROM
#mytable T
GROUP BY
[FK_ID]
HAVING
COUNT(DISTINCT CASE WHEN [TYPE_ID] IN (1,2,3) THEN [TYPE_ID] END) = 3
AND COUNT(CASE WHEN [TYPE_ID] NOT IN (1,2,3) THEN [TYPE_ID] END) = 0
Try this:
select FK_ID,count(distinct TYPE_ID) from mytable
where TYPE_ID<=3
group by FK_ID
having count(distinct TYPE_ID)=3
You can use a CTE and pass the value you mentioned in the question dynamically.
WITH CTE
AS (
SELECT FK_ID,
COUNT(*) CNT
FROM #mytable
GROUP BY FK_ID
HAVING COUNT(*) = 3), -- pass the value you want here (the number of TYPE_ID values to match)
CTE1
AS (
SELECT T.[ID],
T.[FK_ID],
T.[TYPE_ID],
ROW_NUMBER() OVER(PARTITION BY T.[FK_ID] ORDER BY
(
SELECT NULL
)) RN
FROM #mytable T
INNER JOIN CTE C ON C.FK_ID = T.FK_ID),
CTE2
AS (
SELECT C1.FK_ID
FROM CTE1 C1
GROUP BY C1.FK_ID
HAVING SUM(C1.TYPE_ID) = SUM(C1.RN))
SELECT TT1.*
FROM CTE2 C2
INNER JOIN #mytable TT1 ON TT1.FK_ID = C2.FK_ID;
The above SQL command produces this result (I passed 3):
ID FK_ID TYPE_ID
4 12 1
5 12 2
6 12 3
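If duplicate pairs are possible and you also want the generic version, the COUNT(DISTINCT ...) check can be combined with a values list like the one in the earlier answer. A sketch against mytable (required is just an illustrative CTE name; edit its VALUES list and nothing else to change the required set):
WITH required(TYPE_ID) AS (
    SELECT v.TYPE_ID FROM (VALUES (1), (2), (3)) v(TYPE_ID)
)
SELECT t.FK_ID
FROM mytable t
GROUP BY t.FK_ID
HAVING COUNT(DISTINCT CASE WHEN t.TYPE_ID IN (SELECT TYPE_ID FROM required) THEN t.TYPE_ID END)
           = (SELECT COUNT(*) FROM required)
   AND COUNT(CASE WHEN t.TYPE_ID NOT IN (SELECT TYPE_ID FROM required) THEN t.TYPE_ID END) = 0;
This returns the matching FK_ID values; join back to mytable if you need the full rows, as in the CTE answer above.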