I'm working with AWS Athena, which uses Presto. Let's say I have a SQL table with columns A, B, C, and D. Assume the table is sorted by column C, ascending.
I need to compare each row to all the other rows and check whether the current row's D value is the maximum value out of all rows whose C values are less than the current row's C value, then append a boolean value in a new column F. In Python the code would look something like:
D_val_list = []
for index, row in df.iterrows():
    max_val_D = df[:index]['D'].max()  # df is sorted on column C
    if row['D'] < max_val_D:
        D_val_list.append(False)
    else:
        D_val_list.append(True)
df['F'] = D_val_list
Using the provisioned Jupyter notebook in Athena times out (the dataset is millions of rows long), and I figure connecting to AWS from a local Jupyter instance would have similar issues.
In SQL, you would use window functions -- something like this:
select t.*,
       (case when d < coalesce(max(d) over (order by c
                                            rows between unbounded preceding and 1 preceding),
                               d - 1)
             then 0 else 1
        end) as flag
from t;
This logic works assuming that c is unique. That said, there might be alternatives depending on the exact nature of the data, as sketched below.
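If c can contain duplicates, one possibility is a sketch along these lines (assuming the table is named t, as above): take the maximum d per c value, compute the running maximum over strictly smaller c values, and join it back:
with per_c as (
      select c, max(d) as max_d
      from t
      group by c
     ),
     running as (
      select c,
             max(max_d) over (order by c
                              rows between unbounded preceding and 1 preceding) as max_prev_d
      from per_c
     )
select t.*,
       (case when t.d < r.max_prev_d then 0 else 1 end) as flag
from t join
     running r
     on t.c = r.c;
Because the running maximum is computed over distinct c values, rows that share the same c are never compared against each other.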
You have to explicitly order your rows on c in Athena because of its distributed nature. You can then use window functions on top of the ordered set to achieve the desired result:
SELECT
    a,
    b,
    c,
    d,
    CASE
        WHEN lag(max_so_far) OVER () IS NULL THEN true  -- first row: no earlier rows to compare against
        WHEN d >= lag(max_so_far) OVER () THEN true
        ELSE false
    END AS f
FROM (
    SELECT
        a,
        b,
        c,
        d,
        max(d) OVER (ROWS BETWEEN unbounded preceding AND current row) AS max_so_far
    FROM (
        -- sorted on c
        SELECT
            a,
            b,
            c,
            d
        FROM dataset.table
        ORDER BY c
    )
)
I have a table with two columns, id and next. Both hold the SHA-256 of a file, id being the primary key and next being nullable, referencing another row's id.
What I'm trying to do is select the rows from the table ordered randomly, but at the same time: if a row contains a value in next, the next row's id/pk MUST be the value of next from the previous row. It would essentially be a random query, but keeping certain rows that depend on each other in sequence.
The random part would be easy, just something like SELECT * FROM table ORDER BY rand(), but I didn't find anything about ordering based on a previous row's value. Another option would be manually sorting in the client after the query, but that might be too costly depending on the table's size.
Example:
id   next
a    null
b    c
c    d
d    null
e    null
f    null
g    null
h    e
i    null
Expected result:
id   next
f    null
i    null
h    e
e    null
b    c
c    d
d    null
g    null
a    null
(Note that the results are shuffled, but h is followed by e, b is followed by c, which is followed by d)
Is it possible to do such a query in SQLite?
This is a graph-walking problem. I assume that your structure is a set of linked lists:
No cycles.
next is unique.
These assumptions are based on your naming.
With this assumption, you can use a relatively simple recursive CTE to construct the path to each id and then order by that path:
with recursive cte as (
      select id, next, cast(id as text) as path
      from t
      where not exists (select 1 from t t2 where t2.next = t.id)
      union all
      select t.id, t.next, cte.path || coalesce('->' || t.id, '')
      from cte join
           t
           on cte.next = t.id
     )
select id, next
from cte
order by path;
Here is a db<>fiddle.
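The query above walks each chain in order but does not itself shuffle the chains. As a minimal sketch of one way to add that (assuming SQLite and the same table t as above), give every chain head a random key, carry it down the chain, and sort by that key before the path:
with recursive cte as (
      select id, next, cast(id as text) as path, random() as chain_key
      from t
      where not exists (select 1 from t t2 where t2.next = t.id)
      union all
      select t.id, t.next, cte.path || '->' || t.id, cte.chain_key
      from cte join t on cte.next = t.id
     )
select id, next
from cte
order by chain_key, path;
Each chain keeps its internal order (by path) while whole chains are shuffled relative to each other (by chain_key).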
I need to filter data using different conditions. One is that I need to check whether the values in one column (column d) are unique IF the values in another column (c) are greater than 1.
Let's assume:
Columns a, b, c, d
So I don't want any entries where c is greater than 1 while d has non-unique values.
Select TOP 100 * From table
Where (a = 'Max' AND b = '2019') -- just an additional filter, which always applies
  AND (c = 1 -- if c is one, that is fine
       OR (c > 1 AND -- here I want to check if c is bigger than 1 AND whether d is unique; but that's the part I need help with
  );
Thank you very much in advance!
Create a CTE where you count the distinct values of column d and use it in the WHERE clause:
with cte as (
select count(distinct d) counter from tablename
)
...........................................
Where ....(c > 1 AND (select counter from cte) = 1)
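If "unique" means that no d value may appear twice, a different sketch (a rough variant, assuming the SQL Server-style TOP and the tablename placeholder from the question) counts how many rows share each d value with a window function and filters on that count:
SELECT TOP 100 *
FROM (
    SELECT t.*,
           COUNT(*) OVER (PARTITION BY d) AS d_count
    FROM tablename t
) t
WHERE (a = 'Max' AND b = '2019')
  AND (c = 1 OR (c > 1 AND d_count = 1));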
How can I mix 2 tables (A, B) into 1 table (AB) with a special order?
There are 2 tables, A and B, each with only 1 column, so each is effectively a list/array.
I must order the rows like this:
A.col1,A.col1,B.col1,B.col1,A.col1,A.col1,B.col1,B.col1,A.col1,A.col1,B.col1,B.col1 and so on.
To see it easily, it must be:
A,A,B,B,A,A,B,B,A,A
So 2 rows from A, 2 rows from B, 2 rows from A, 2 rows from B, and so on.
I would prefer the DB2 SQL dialect, but if it isn't DB2-specific, a solution in any SQL dialect would be useful.
Thanks!
Try this:
SELECT Field1 FROM
    (SELECT Field1, 1 AS S, ROW_NUMBER() OVER() AS N, (ROW_NUMBER() OVER() - 1) / 2 AS G FROM A
     UNION ALL
     SELECT Field1, 2 AS S, ROW_NUMBER() OVER() AS N, (ROW_NUMBER() OVER() - 1) / 2 AS G FROM B) AS T
ORDER BY G, S, N
That should work in DB2. Unfortunately, I don't have a DB2 database handy to test it, so the code comes without any warranty.
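With the (ROW_NUMBER() OVER() - 1) / 2 key, the first two rows of each table get G = 0, the next two get G = 1, and so on; ORDER BY G, S, N therefore emits two rows from A, then two rows from B, then the next two from A, and so forth.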
Are these queries exactly the same, or is it possible to get different results depending on the data?
SELECT A, B, C, D
FROM Table_A
GROUP BY A, B, C, D , E
HAVING A in (1,2) AND E = 1 AND MIN(status) = 100
SELECT A, B, C, D
FROM Table_A
WHERE A IN (1,2) AND E = 1 AND status = 100
GROUP BY A, B, C, D , E
They're not equal.
When you consider the following block
create table Table_A(A int, B int, C int, D int, E int, status int);
insert into Table_A values(1,1,1,1,1,100);
insert into Table_A values(1,1,1,1,1,10);
insert into Table_A values(2,1,1,1,1,10);
SELECT A, B, C, D, 'First Query' as query
FROM Table_A
GROUP BY A, B, C, D , E
HAVING A in (1,2) AND E = 1 AND MIN(status) = 100;
SELECT A, B, C, D, 'Second Query' as query
FROM Table_A
WHERE A IN (1,2) AND E = 1 AND status = 100
GROUP BY A, B, C, D , E
you get
A B C D query
- - - - -------------
1 1 1 1 Second Query
as a result (only the second query returns a row),
since for both groupings, (1,1,1,1,1) and (2,1,1,1,1), min(status) = 10.
For this reason the min(status) = 100 condition is never satisfied, and the first query returns no rows.
Rextester Demo
A couple of things:
HAVING MIN(status) = 100
and
WHERE status = 100
are different. The where condition filters out any row whose status is not 100 before grouping, so those rows never even reach the aggregate. The having clause is only evaluated after every record has been read, and it looks at the result of the aggregate function (min) for each group.
Also, a more subtle difference is that the "where" clause is preferable for non-aggregate conditions because it can make use of any index on the table and, equally important, it prevents records from being grouped and sorted unnecessarily.
For example
having E = 1
and
where E = 1
functionally do the same thing. The difference is you need to collect, group and sort a bunch of records only to discard them using "having," whereas the "where" option removes them before any grouping ever occurs. Also, in this example, with the "where" option, you can remove E from the grouping criteria since it is always 1.
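For example, a sketch of the second query rewritten that way (same Table_A as above), with E checked only in the WHERE clause and dropped from the GROUP BY:
SELECT A, B, C, D
FROM Table_A
WHERE A IN (1, 2) AND E = 1 AND status = 100
GROUP BY A, B, C, D;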
At a high level:
The where clause specifies search conditions for the rows returned by the query and limits rows to a meaningful set.
The having clause works as a filter on top of grouped rows.
I am seeking a way to SELECT rows conditionally when the only key is the compound key A, B (refer to the picture).
Furthermore, I need to select the groups where both a negative and a positive value of column C are present, skipping 0. There may be any number of rows per A, B group; the minimum is 2, where C has one negative and one positive row.
The data found below is already queried.
Note: I was able to add another column D, because we can't use the actual values of C:
D = CASE WHEN C < 0 THEN 1 ELSE 2 END
So the logic could be SELECT * WHERE SUM(D) >= 3.
I am fully able to complete this task with another language such as C#, but I have to get this done using only SQL.
I would also like to avoid temporary tables. Column D is not required.
Would this work?
Select tblA.*
FROM tblA
INNER JOIN
    (select A, B
     from tblA
     Group By A, B
     HAVING SUM(case when C < 0 then 1 else 2 end) >= 3
    ) X
    on X.A = tblA.A and X.B = tblA.B
SQLFiddle
http://sqlfiddle.com/#!9/2078f/2
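If the requirement is strictly "at least one negative and at least one positive C per (A, B) group, ignoring zeros", a sketch of a more explicit variant of the same join (same tblA) could be:
SELECT tblA.*
FROM tblA
INNER JOIN
    (SELECT A, B
     FROM tblA
     WHERE C <> 0
     GROUP BY A, B
     HAVING SUM(CASE WHEN C < 0 THEN 1 ELSE 0 END) >= 1
        AND SUM(CASE WHEN C > 0 THEN 1 ELSE 0 END) >= 1
    ) X
    ON X.A = tblA.A AND X.B = tblA.B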