Optimizing SQL Cross Join that checks if any array value in other column - sql

Let's say I have a table events with structure:
id
value_array
XXXX
[a,b,c,d]
...
...
I have a second table values_of_interest with structure:
value
x
y
z
a
I want to find id's that have any of the values found in values_of_interest. All else equal, what would be the most performant SQL to make this happen? (I am using BigQuery, but feel free to answer more generally)
My current thought is:
SELECT
DISTINCT e.id
FROM
events e, values_of_interest vi
WHERE
EXISTS(
SELECT
value
FROM
UNNEST(e.value_array) value
JOIN
vi ON vi.value = e.value
)

Few quick options for BigQuery Standard SQL
Option 1
select id
from `project.dataset.events`
where exists (
select 1
from `project.dataset.values_of_interest`
where value in unnest(value_array)
)
Option 2
select id
from `project.dataset.events` t
where (
select count(1)
from t.value_array as value
join `project.dataset.values_of_interest`
using(value)
) > 0

I would write this using exists and a join:
select e.id
from `project.dataset.events` e
where exists (select 1
from unnest(e.value_array) val join
`project.dataset.values_of_interest` voi
on val = voi.value
);

Related

How to convert list of comma separated Ids into their name?

I have a table that contains:
id task_ids
1 10,15
2 NULL
3 17
I have the table that has the names of this tasks:
id task_name
10 a
15 b
17 c
I want to generate the following output
id task_ids task_names
1 10,15 a,b
2 null null
3 17 c
I know this structure isn't ideal but this is legacy table which I will not change now.
Is there easy way to get the output ?
I'm using Presto but I think this can be solved with native sql
WITH data AS (
SELECT * FROM (VALUES (1, '10,15'), (2, NULL)) x(id, task_ids)
),
task AS (
SELECT * FROM (VALUES ('10', 'a'), ('15', 'b')) x(id, task_name)
)
SELECT
d.id, d.task_ids
-- array_agg will obviously capture NULL task_name comping from LEFT JOIN, so we need to filter out such results
IF(array_agg(t.task_name) IS NOT DISTINCT FROM ARRAY[NULL], NULL, array_agg(t.task_name)) task_names
FROM data d
-- split task_ids by `,`, convert into numbers, UNNEST into separate rows
LEFT JOIN UNNEST (split(d.task_ids, ',')) AS e(task_id) ON true
-- LEFT JOIN with task to pull the task name
LEFT JOIN task t ON e.task_id = t.id
-- aggregate back
GROUP BY d.id, d.task_ids;
You have a horrible data model, but you can do what you want with a bit of effort. Arrays are better than strings, so I'll just use that:
select t.id, t.task_id, array_agg(tt.task_name) as task_names
from t left join lateral
unnest(split(t.task_ids, ',')) u(task_id)
on 1=1 left join
tasks tt
on tt.task_id = u.task_id
group by t.id, t.task_id;
I don't have Presto on hand to test this. But this or some minor variant should do what you want.
EDIT:
This version might work:
select t.id, t.task_id,
(select array_agg(tt.task_name)
from unnest(split(t.task_ids, ',')) u(task_id) join
tasks tt
on tt.task_id = u.task_id
) as task_names
from t ;

Oracle SQL XOR condition with > 14 tables

I have a question on sql desgin.
Context:
I have a table called t_master and 13 other tables (lets call them a,b,c... for simplicity) where it needs to compared.
Logic:
t_master will be compared to table 'a' where t_master.gen_val =
a.value.
If record exist in t_master, retrieve t_master record, else retrieve 'a' record.
I do not need to retrieve the records if it exists in both tables (t_master and a) - XOR condition
Repeat this comparison with the remaining 12 tables.
I have some idea on doing this, using WITH to subquery the non-master tables (a,b,c...) first with their respective WHERE clause.
Then use XOR statement to retrieve the records.
Something like
WITH a AS (SELECT ...),
b AS (SELECT ...)
SELECT field1,field2...
FROM t_master FULL OUTER JOIN a FULL OUTER JOIN b FULL OUTER JOIN c...
ON t_master.gen_value = a.value
WHERE ((field1 = x OR field2 = y ) AND NOT (field1 = x AND field2 = y))
AND ....
.
.
.
.
Seeing that I have 13 tables that I need to full outer join, is there a better way/design to handle this?
Otherwise I would have at least 2*13 lines of WHERE clause which I'm not sure if that will have impact on the performance as t_master is sort of a log table.
**Assume I cant change any schema.
Currently I'm not sure if this SQL will working correctly yet, so I'm hoping someone can guide me in the right direction regarding this.
update from used_by_already's suggestion:
This is what I'm trying to do (comparison between 2 tables first, before I add more, but I am unable to get values from ATP_R.TBL_HI_HDR HI_HDR as it is in the NOT EXISTS subquery.
How do i overcome this?
SELECT LOG_REPO.UNIQ_ID,
LOG_REPO.REQUEST_PAYLOAD,
LOG_REPO.GEN_VAL,
LOG_REPO.CREATED_BY,
TO_CHAR(LOG_REPO.CREATED_DT,'DD/MM/YYYY') AS CREATED_DT,
HI_HDR.HI_NO R_VALUE,
HI_HDR.CREATED_BY R_CREATED_BY,
TO_CHAR(HI_HDR.CREATED_DT,'DD/MM/YYYY') AS R_CREATED_DT
FROM ATP_COMMON.VW_CMN_LOG_GEN_REPO LOG_REPO JOIN ATP_R.TBL_HI_HDR HI_HDR ON LOG_REPO.GEN_VAL = HI_HDR.HI_NO
WHERE NOT EXISTS
(SELECT NULL
FROM ATP_R.TBL_HI_HDR HI_HDR
WHERE LOG_REPO.GEN_VAL = HI_HDR.HI_NO
)
UNION ALL
SELECT LOG_REPO.UNIQ_ID,
LOG_REPO.REQUEST_PAYLOAD,
LOG_REPO.GEN_VAL,
LOG_REPO.CREATED_BY,
TO_CHAR(LOG_REPO.CREATED_DT,'DD/MM/YYYY') AS CREATED_DT,
HI_HDR.HI_NO R_VALUE,
HI_HDR.CREATED_BY R_CREATED_BY,
TO_CHAR(HI_HDR.CREATED_DT,'DD/MM/YYYY') AS R_CREATED_DT
FROM ATP_R.TBL_HI_HDR HI_HDR JOIN ATP_COMMON.VW_CMN_LOG_GEN_REPO LOG_REPO ON HI_HDR.HI_NO = LOG_REPO.GEN_VAL
WHERE NOT EXISTS
(SELECT NULL
FROM ATP_COMMON.VW_CMN_LOG_GEN_REPO LOG_REPO
WHERE HI_HDR.HI_NO = LOG_REPO.GEN_VAL
)
Full outer joins used to exclude all matching rows can be an expensive query. You don't supply much detail, but perhaps using NOT EXISTS would be simpler and maybe it will produce a better explain plan. Something along these lines.
select
cola,colb,colc
from t_master m
where not exists (
select null from a where m.keycol = a.fk_to_m
)
and not exists (
select null from b where m.keycol = b.fk_to_m
)
and not exists (
select null from c where m.keycol = c.fk_to_m
)
union all
select
cola,colb,colc from a
where not exists (
select null from t_master m where a.fk_to_m = m.keycol
)
union all
select
cola,colb,colc from b
where not exists (
select null from t_master m where b.fk_to_m = m.keycol
)
union all
select
cola,colb,colc from c
where not exists (
select null from t_master m where c.fk_to_m = m.keycol
)
You could union the 13 a,b,c ... tables to simplify the coding, but that may not perform so well.

Compare fields from different rows

First off I am using SQL Server.
I am joining a table on itself like in the example below:
SELECT t.theDate,
s.theDate,
t.bitField,
s.bitField,
t.NAME,
s.NAME
FROM table1 t
INNER JOIN table1 s ON t.NAME = s.NAME
If I take a random row (i.e. X) from the dataset produced.
Can I compare values in any field on row X to values in any field on row X-1 OR row X+1?
Example: I want to compare t.theDate on row 5 to s.theDate on row 4 or s.theDate on row 3.
Sample data looks like:
Desired results:
I want to pull all pairs of rows where the t.bitfield and s.bitfield are opposite and t.theDate and s.theDate are opposite.
From the image the would be row (3 & 4), (5 & 6), (7 & 8) ... etc.
I really appreciate any help!
Can it be done?
Varinant 1: It looks like you would like to use ranking function.
if objcet_id('tempdb..#TmpOrderedTable') is not null drop table #TmpOrderedTable
select *, row_number(order by columnlist, (select 0)) rn
into #TmpOrderedTable
from table1 t
select *
from #TmpOrderedTable t0
inner join #TmpOrderedTable tplus on t0.rn = tplus.rn + 1 -- next one
inner join #TmpOrderedTable tminus on t0.rn = tminus.rn - 1 -- previous one
Varinant 2:
To get scalar values you can use ranking function lag and lead. Or subquery.
Varinant 3:
You can use selfjoin, but you have to specify unique nonarbitary key if you don't want duplicates.
Varinant 4:
You can use apply.
Your question isn't too clear, so i hope it was your goal.
How about this?
WITH ts as (
SELECT t.theDate as theDate1, s.theDate as theDate2,
t.bitField as bitField1, s.bitField as bitField2,
t.NAME -- there is only one name
FROM table1 t INNER JOIN
table1 s
ON t.NAME = s.NAME
)
SELECT ts.*
FROM ts
WHERE EXISTS (SELECT 1
FROM ts ts2
WHERE ts2.name = ts.name AND
ts2.theDate1 = ts.theDate2 AND
ts2.theDate2 = ts.theDate1 AND
ts2.bitField1 = ts.bitField2 AND
ts2.bitField2 = ts.bitField1
);

PostgreSQL - how to query "result IN ALL OF"?

I am new to PostgreSQL and I have a problem with the following query:
WITH relevant_einsatz AS (
SELECT einsatz.fahrzeug,einsatz.mannschaft
FROM einsatz
INNER JOIN bergefahrzeug ON einsatz.fahrzeug = bergefahrzeug.id
),
relevant_mannschaften AS (
SELECT DISTINCT relevant_einsatz.mannschaft
FROM relevant_einsatz
WHERE relevant_einsatz.fahrzeug IN (SELECT id FROM bergefahrzeug)
)
SELECT mannschaft.id,mannschaft.rufname,person.id,person.nachname
FROM mannschaft,person,relevant_mannschaften WHERE mannschaft.leiter = person.id AND relevant_mannschaften.mannschaft=mannschaft.id;
This query is working basically - but in "relevant_mannschaften" I am currently selecting each mannschaft, which has been to an relevant_einsatz with at least 1 bergefahrzeug.
Instead of this, I want to select into "relevant_mannschaften" each mannschaft, which has been to an relevant_einsatz WITH EACH from bergefahrzeug.
Does anybody know how to formulate this change?
The information you provide is rather rudimentary. But tuning into my mentalist skills, going out on a limb, I would guess this untangled version of the query does the job much faster:
SELECT m.id, m.rufname, p.id, p.nachname
FROM person p
JOIN mannschaft m ON m.leiter = p.id
JOIN (
SELECT e.mannschaft
FROM einsatz e
JOIN bergefahrzeug b ON b.id = e.fahrzeug -- may be redundant
GROUP BY e.mannschaft
HAVING count(DISTINCT e.fahrzeug)
= (SELECT count(*) FROM bergefahrzeug)
) e ON e.mannschaft = m.id
Explain:
In the subquery e I count how many DISTINCT mountain-vehicles (bergfahrzeug) have been used by a team (mannschaft) in all their deployments (einsatz): count(DISTINCT e.fahrzeug)
If that number matches the count in table bergfahrzeug: (SELECT count(*) FROM bergefahrzeug) - the team qualifies according to your description.
The rest of the query just fetches details from matching rows in mannschaft and person.
You don't need this line at all, if there are no other vehicles in play than bergfahrzeuge:
JOIN bergefahrzeug b ON b.id = e.fahrzeug
Basically, this is a special application of relational division. A lot more on the topic under this related question:
How to filter SQL results in a has-many-through relation
Do not know how to explain it, but here is an example how I solved this problem, just in case somebody has the some question one day.
WITH dfz AS (
SELECT DISTINCT fahrzeug,mannschaft FROM einsatz WHERE einsatz.fahrzeug IN (SELECT id FROM bergefahrzeug)
), abc AS (
SELECT DISTINCT mannschaft FROM dfz
), einsatzmannschaften AS (
SELECT abc.mannschaft FROM abc WHERE (SELECT sum(dfz.fahrzeug) FROM dfz WHERE dfz.mannschaft = abc.mannschaft) = (SELECT sum(bergefahrzeug.id) FROM bergefahrzeug)
)
SELECT mannschaft.id,mannschaft.rufname,person.id,person.nachname
FROM mannschaft,person,einsatzmannschaften WHERE mannschaft.leiter = person.id AND einsatzmannschaften.mannschaft=mannschaft.id;

Why is my SQL 'NOT IN' clause producing different results from 'NOT EXISTS'

I have two SQL queries producing different results when I would expect them to produce the same result. I am trying to find the number of events that do not have a corresponding location. All locations have an event but events can also link to non-location records.
The following query produces a count of 16244, the correct value.
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE NOT EXISTS
(SELECT * FROM locations AS l WHERE l.event_id = e.event_id)
The following query produces a count of 0.
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE e.event_id NOT IN (SELECT l.event_id FROM locations AS l)
The following SQL does some summaries of the data set
SELECT 'Event Count',
COUNT(DISTINCT event_id)
FROM events
UNION ALL
SELECT 'Locations Count',
COUNT(DISTINCT event_id)
FROM locations
UNION ALL
SELECT 'Event+Location Count',
COUNT(DISTINCT l.event_id)
FROM locations AS l JOIN events AS e ON l.event_Id = e.event_id
And returns the following results
Event Count 139599
Locations Count 123355
Event+Location Count 123355
Can anyone shed any light on why the 2 initial queries do not produce the same figure.
You have a NULL in the subquery SELECT l.event_id FROM locations AS l so NOT IN will always evaluate to unknown and return 0 results
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE e.event_id NOT IN (SELECT l.event_id FROM locations AS l)
The reason for this behaviour can be seen from the below example.
'x' NOT IN (NULL,'a','b')
≡ 'x' <> NULL and 'x' <> 'a' and 'x'
<> 'b'
≡ Unknown and True and True
≡ Unknown
The NOT IN form works differently for NULLs. The presence of a single NULL will cause the entire statement to fail, thus returning no results.
So you have at least one event_id in locations that is NULL.
Also, your query might be better written as a join:
SELECT
COUNT(DISTINCT e.event_id)
FROM
events AS e
LEFT JOIN locations AS l ON e.event_id = l.event_id
WHERE
l.event_id IS NULL
[UPDATE: apparently, the NOT EXISTS version is faster.]
In and Exists are processed very very differently.
Select * from T1 where x in ( select y from T2 )
is typically processed as:
select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;
The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to
the original table -- typically.
As opposed to
select * from t1 where exists ( select null from t2 where y = x )
That is processed more like:
for x in ( select * from t1 )
loop
if ( exists ( select null from t2 where y = x.x )
then
OUTPUT THE RECORD
end if
end loop
It always results in a full scan of T1 whereas the first query can make use of an index
on T1(x).