BigQuery WHERE key NOT IN mismatch? - google-bigquery

How is this possible?
SELECT DISTINCT key FROM dataset.first_table
-- 5,026,143
SELECT DISTINCT key FROM dataset.first_table
WHERE key IN (SELECT key FROM dataset.second_table)
-- 2,630,635
SELECT DISTINCT key FROM dataset.first_table
WHERE key NOT IN (SELECT key FROM dataset.second_table)
-- 0
How can the last statement return no results?
I don't know what to add here. I guess it's just some kind of weird syntax mistake.
I'm sure that second_table does not contain all keys from the first_table:
SELECT key FROM dataset.first_table LIMIT 1
-- "a"
SELECT key FROM dataset.second_table WHERE key = "a"
-- no results
Also:
SELECT DISTINCT key FROM dataset.first_table
LEFT JOIN dataset.second_table USING (key)
WHERE second_table.key IS NULL
-- 2,395,612

I think the key column in your dataset.second_table contains NULL values.
Would you try the query below and see if it works?
SELECT DISTINCT key FROM dataset.first_table
WHERE key NOT IN (SELECT key FROM dataset.second_table WHERE key IS NOT NULL);
Semantic rules of the IN operator:
When using the NOT IN operator, the following semantics apply in this order:
Returns TRUE if value_set is empty.
Returns NULL if search_value is NULL.
Returns FALSE if value_set contains a value equal to search_value.
Returns NULL if value_set contains a NULL.
Returns TRUE.
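The rules above can be reproduced on a tiny dataset. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for BigQuery (an assumption; both apply the same three-valued logic to NOT IN), and the three sample keys are invented for illustration.

```python
import sqlite3

# SQLite stands in for BigQuery here (an assumption made for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE first_table (key TEXT);
    CREATE TABLE second_table (key TEXT);
    INSERT INTO first_table VALUES ('a'), ('b'), ('c');
    INSERT INTO second_table VALUES ('b'), (NULL);  -- note the NULL key
""")

def keys(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# 'a' and 'c' have no match, but the NULL in the subquery turns
# NOT IN into NULL for them, so the WHERE clause drops every row.
with_null = keys("""
    SELECT DISTINCT key FROM first_table
    WHERE key NOT IN (SELECT key FROM second_table)
""")

# Filtering the NULLs out of the subquery restores the expected result.
without_null = keys("""
    SELECT DISTINCT key FROM first_table
    WHERE key NOT IN (SELECT key FROM second_table WHERE key IS NOT NULL)
""")

print(with_null)     # []
print(without_null)  # ['a', 'c']
```

A single NULL in second_table is enough to empty the result, which matches the count of 0 in the question.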

Related

Getting null result when applying not in SQL operator?

I have two tables: opcsourcetags and real_raw_points.
There is a foreign key to opcsourcetags in the real_raw_points table. I want to get the rows from opcsourcetags for which there is no matching ID in real_raw_points, but I am getting an empty result. Here is my query:
select *
from OPC_SourceTags opc
where opc.Source_Tag_Id not in (
select rt.Source_Tag_Id_Fk
from Real_Raw_Points rt
)
You should be able to do the following:
SELECT opc.* FROM OPC_SourceTags opc
LEFT JOIN Real_Raw_Points rt ON (rt.Source_Tag_Id_Fk = opc.Source_Tag_Id)
WHERE rt.Source_Tag_Id_Fk IS NULL;
This is because IN/NOT IN uses three-valued logic. When SQL Server checks the list of values you supply, any comparison with NULL evaluates to UNKNOWN, so it is unknown whether the source tag ID appears in the set.
IN discards rows for which it cannot say with certainty that the predicate is TRUE. NOT IN, in turn, evaluates to UNKNOWN (i.e. NULL) for every row without a match as soon as the list contains a NULL, because it cannot say with certainty that the value is not among the values you provided.
I would recommend reading more about three-valued logic, as it's not a simple thing to explain. You can use Mukesh's method, but then the subquery skips the rows of Real_Raw_Points where Source_Tag_Id_Fk is NULL. Best practice is to use NOT EXISTS instead of NOT IN. NOT EXISTS uses two-valued logic: it only ever returns TRUE or FALSE.
select *
from OPC_SourceTags opc
where not exists (
select rt.Source_Tag_Id_Fk
from Real_Raw_Points rt
where opc.Source_Tag_Id = rt.Source_Tag_id_Fk
)
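To compare the three formulations side by side, here is a minimal sketch using Python's sqlite3 module in place of SQL Server (an assumption; the NULL handling shown is standard SQL). The tables are reduced to the two columns involved, and the sample IDs are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OPC_SourceTags (Source_Tag_Id INTEGER);
    CREATE TABLE Real_Raw_Points (Source_Tag_Id_Fk INTEGER);
    INSERT INTO OPC_SourceTags VALUES (1), (2), (3);
    INSERT INTO Real_Raw_Points VALUES (1), (NULL);  -- one NULL foreign key
""")

def ids(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# NOT IN: the NULL in the subquery makes the predicate UNKNOWN for tags 2 and 3.
not_in = ids("""
    SELECT opc.Source_Tag_Id FROM OPC_SourceTags opc
    WHERE opc.Source_Tag_Id NOT IN (
        SELECT rt.Source_Tag_Id_Fk FROM Real_Raw_Points rt)
""")

# NOT EXISTS: only asks whether a matching row exists, so NULLs do no harm.
not_exists = ids("""
    SELECT opc.Source_Tag_Id FROM OPC_SourceTags opc
    WHERE NOT EXISTS (
        SELECT 1 FROM Real_Raw_Points rt
        WHERE opc.Source_Tag_Id = rt.Source_Tag_Id_Fk)
""")

# LEFT JOIN / IS NULL: keeps the tags for which the join found no partner.
left_join = ids("""
    SELECT opc.Source_Tag_Id FROM OPC_SourceTags opc
    LEFT JOIN Real_Raw_Points rt ON rt.Source_Tag_Id_Fk = opc.Source_Tag_Id
    WHERE rt.Source_Tag_Id_Fk IS NULL
""")

print(not_in)      # []
print(not_exists)  # [2, 3]
print(left_join)   # [2, 3]
```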
Try the updated query below. I just added a WHERE condition to your subquery: rt.Source_Tag_Id_Fk is not null.
Note: If the column rt.Source_Tag_Id_Fk contains a NULL value, the whole query returns no rows. If it contains no NULL values, your original query works fine. Hence you must apply IS NOT NULL to the rt.Source_Tag_Id_Fk column in your subquery:
select *
from OPC_SourceTags opc
where opc.Source_Tag_Id not in (
select rt.Source_Tag_Id_Fk
from Real_Raw_Points rt where rt.Source_Tag_Id_Fk is not null
)

How to join on a nested value from a jsonb column?

I have a PostgreSQL 11 database with these tables:
CREATE TABLE stats (
id integer NOT NULL,
uid integer NOT NULL,
date date NOT NULL,
data jsonb DEFAULT '[]'::json NOT NULL
);
INSERT INTO stats(id, uid, date, data) VALUES
(1, 1, '2020-10-01', '{"somerandomhash":{"source":"thesource"}}');
CREATE TABLE links(
id integer NOT NULL,
uuid uuid NOT NULL,
path text NOT NULL
);
INSERT INTO links(id, uuid, path) VALUES
(1, 'acbd18db-4cc2-f85c-edef-654fccc4a4d8', 'thesource');
My goal is to create a new table reports with data from the stats table, but with a new key from the links table. It will look like this:
CREATE TABLE reports(
id integer NOT NULL,
uid integer NOT NULL,
date date NOT NULL,
data jsonb DEFAULT '[]'::json NOT NULL
);
INSERT INTO reports(id, uid, date, data) VALUES
(1, 1, '2020-10-01', '{"uuid":{"source":"thesource"}}');
To this end, I tried to left join the table links in order to retrieve the uuid column value - without luck:
SELECT s.uid, s.date, s.data->jsonb_object_keys(data)->>'source' as path, s.data->jsonb_object_keys(data) as data, l.uuid
FROM stats s LEFT JOIN links l ON s.data->jsonb_object_keys(data)->>'source' = l.path
I tried to use the result of s.data->jsonb_object_keys(data)->>'source' in the left join, but got the error:
ERROR: set-returning functions are not allowed in JOIN conditions
I tried using LATERAL but still did not get a valid result.
How to make this work?
jsonb_object_keys() is a set-returning function, which cannot be used the way you do, as the error message tells you. What's more, jsonb_object_keys() returns the top-level key(s), but it seems you are only interested in the value. Try jsonb_each() instead:
SELECT s.id
, s.uid
, s.date
, jsonb_build_object(l.uuid::text, o.value) AS new_data
FROM stats s
CROSS JOIN LATERAL jsonb_each(s.data) o -- defaults to column names (key, value)
LEFT JOIN links l ON l.path = o.value->>'source';
jsonb_each() returns top-level key and value. Proceed using only the value.
The nested JSON object seems to have the constant key name 'source'. So the join condition is l.path = o.value->>'source'.
Finally, build the new jsonb value with jsonb_build_object().
While this works as demonstrated, a couple of questions remain:
The above assumes there is always exactly one top-level key in stats.data. If not, you'd have to define what to do ...
The above assumes there is always exactly one match in table links. If not, you'd have to define what to do ...
Most importantly:
If data is as regular as you make it out to be, consider a plain "uuid" column (or drop it as the value is in table links anyway) and a plain column "source" to replace the jsonb column. Much simpler and more efficient.
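For readers who want to trace the logic without a PostgreSQL instance, the query's steps can be mimicked in plain Python. This is an illustration only, not a replacement for the SQL: the sample rows are copied from the question, build_reports is a hypothetical helper name, and storing None for unmatched rows is a choice made for this sketch.

```python
# Sample rows copied from the question.
stats = [
    {"id": 1, "uid": 1, "date": "2020-10-01",
     "data": {"somerandomhash": {"source": "thesource"}}},
]
links = [
    {"id": 1, "uuid": "acbd18db-4cc2-f85c-edef-654fccc4a4d8", "path": "thesource"},
]

def build_reports(stats, links):
    """Mimic the jsonb_each / LEFT JOIN / jsonb_build_object steps."""
    # Assumes each path appears at most once in links.
    uuid_by_path = {link["path"]: link["uuid"] for link in links}
    reports = []
    for s in stats:
        # Like jsonb_each(s.data): one (key, value) pair per top-level key.
        for _key, value in s["data"].items():
            # Like the LEFT JOIN on l.path = o.value->>'source'.
            uuid = uuid_by_path.get(value.get("source"))
            # Like jsonb_build_object(l.uuid::text, o.value).
            # Unmatched rows get None here (a choice made for this sketch).
            new_data = {uuid: value} if uuid is not None else None
            reports.append({"id": s["id"], "uid": s["uid"],
                            "date": s["date"], "new_data": new_data})
    return reports

reports = build_reports(stats, links)
print(reports[0]["new_data"])
# {'acbd18db-4cc2-f85c-edef-654fccc4a4d8': {'source': 'thesource'}}
```

The inner loop makes the first open question visible: a stats row with several top-level keys produces several report rows.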
It looks like you want to join by the "source" key from the JSON column.
Instead of
s.data->jsonb_object_keys(data)->>'source'
Try this
s.data ->> 'source'
If my assumptions are correct, the whole query can look like this:
SELECT
s.uid,
s.date,
s.data ->> 'source' AS path,
s.data -> jsonb_object_keys(data) AS data,
l.uuid
FROM stats s
LEFT JOIN links l ON s.data ->> 'source' = l.path

How to write a select statement that outputs all key values in all rows

My Hive table has a map column holding zero or more key-value pairs. I don't even know most of the keys. I want to write a select statement that outputs all key values in all rows.
Something like:
select t.additional_fields[*]
from mytable as t
map_keys(map<K,V>) returns array of all keys, you can explode it. The following query will return all distinct keys:
select
s.key
from
(
select m.key
from mytable t
lateral view explode(map_keys(t.additional_fields)) m as key
) s
group by s.key
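What explode(map_keys(...)) followed by GROUP BY computes can be pictured with a few lines of plain Python. This is just an analogy for the query above, not Hive code; the sample rows and the helper name distinct_keys are invented for illustration.

```python
# Hypothetical rows; additional_fields plays the role of the Hive map column.
rows = [
    {"additional_fields": {"color": "red", "size": "L"}},
    {"additional_fields": {}},  # an empty map contributes no keys
    {"additional_fields": {"size": "M", "weight": "2kg"}},
]

def distinct_keys(rows):
    """Collect every key that occurs in any row's map, without duplicates."""
    keys = set()
    for row in rows:
        # Like explode(map_keys(t.additional_fields)): one entry per key.
        keys.update(row["additional_fields"].keys())
    # Like GROUP BY s.key: each key is reported once.
    return sorted(keys)

all_keys = distinct_keys(rows)
print(all_keys)  # ['color', 'size', 'weight']
```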

How to update a table if values of the attributes are contained within another table?

I've got a database like this one:
I'm trying to create a query that would enable me to update the value of the status attribute inside the incident table whenever the values of all of these three attributes: tabor_vatrogasci, tabor_policija, and tabor_hitna are contained inside the izvještaj_tabora table as a value of the oznaka_tabora attribute. If, for example, the values of the tabor_vatrogasci, tabor_policija, and tabor_hitna attributes are 3, 4 and 5 respectively, the incident table should be updated if (and only if) 3, 4, and 5 are contained inside the izvještaj_tabora table.
This is what I tried, but it didn't work:
UPDATE incident SET status='Otvoren' FROM tabor,izvjestaj_tabora
WHERE (incident.tabor_policija=tabor.oznaka
OR incident.tabor_vatrogasci=tabor.oznaka
OR incident.tabor_hitna=tabor.oznaka)
AND izvjestaj_tabora.oznaka_tabora=tabor.oznaka
AND rezultat_izvjestaja='Riješen' AND
((SELECT EXISTS(SELECT DISTINCT oznaka_tabora FROM izvjestaj_tabora)
WHERE oznaka_tabora=incident.tabor_policija) OR tabor_policija=NULL) AND
((SELECT EXISTS(SELECT DISTINCT oznaka_tabora FROM izvjestaj_tabora)
WHERE oznaka_tabora=incident.tabor_vatrogasci) OR tabor_vatrogasci=NULL) AND
((SELECT EXISTS(SELECT DISTINCT oznaka_tabora FROM izvjestaj_tabora)
WHERE oznaka_tabora=incident.tabor_hitna) OR tabor_hitna=NULL);
Does anyone have any idea on how to accomplish this?
Assuming INCIDENT.OZNAKA is the key and you need all 3 to be related for the event to open (I am Slovenian, that's why I understand ;) ):
UPDATE incident
SET status='Otvoren'
WHERE oznaka in (
SELECT DISTINCT i.oznaka
FROM incident i
INNER JOIN izvještaj_tabora t1 ON i.tabor_vatrogasci = t1.oznaka_tabora
INNER JOIN izvještaj_tabora t2 ON i.tabor_policija = t2.oznaka_tabora
INNER JOIN izvještaj_tabora t3 ON i.tabor_hitna = t3.oznaka_tabora
WHERE t1.rezultat_izvjestaja='Riješen' AND t2.rezultat_izvjestaja='Riješen' AND t3.rezultat_izvjestaja='Riješen'
)
According to your description the query should look something like this:
UPDATE incident i
SET status = 'Otvoren'
WHERE (tabor_policija IS NULL OR
EXISTS (
SELECT 1 FROM izvjestaj_tabora t
WHERE t.oznaka_tabora = i.tabor_policija
)
)
AND (tabor_vatrogasci IS NULL OR
EXISTS (
SELECT 1 FROM izvjestaj_tabora t
WHERE t.oznaka_tabora = i.tabor_vatrogasci
)
)
AND (tabor_hitna IS NULL OR
EXISTS (
SELECT 1 FROM izvjestaj_tabora t
WHERE t.oznaka_tabora = i.tabor_hitna
)
)
I wonder though, why the connecting table tabor is irrelevant to the operation.
Among other things you fell victim to two widespread misconceptions:
1)
tabor_policija=NULL
This expression always results in NULL. Since NULL is considered "unknown", if you compare it to anything, the outcome is "unknown" as well. I quote the manual on Comparison Operators:
Do not write expression = NULL because NULL is not "equal to" NULL.
(The null value represents an unknown value, and it is not known
whether two unknown values are equal.)
2)
EXISTS(SELECT DISTINCT oznaka_tabora FROM ...)
In an EXISTS semi-join SELECT items are completely irrelevant. (I use SELECT 1 instead). As the term implies, only existence is checked. The expression returns TRUE or FALSE, SELECT items are ignored. It is particularly pointless to add a DISTINCT clause there.
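Both points can be checked quickly. The sketch below uses Python's sqlite3 module rather than PostgreSQL (an assumption; the behavior shown is standard SQL), and it trims the schema to a single tabor column with invented sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified schema: only one of the three tabor columns is modelled.
    CREATE TABLE incident (oznaka INTEGER, tabor_policija INTEGER);
    CREATE TABLE izvjestaj_tabora (oznaka_tabora INTEGER);
    INSERT INTO incident VALUES (1, 3), (2, NULL), (3, 9);
    INSERT INTO izvjestaj_tabora VALUES (3), (3);
""")

def oznake(where):
    """Return the incident keys whose row satisfies the given WHERE clause."""
    sql = "SELECT oznaka FROM incident i WHERE " + where + " ORDER BY oznaka"
    return [row[0] for row in conn.execute(sql)]

# 1) "= NULL" is never TRUE, so it matches nothing; IS NULL finds the row.
equals_null = oznake("i.tabor_policija = NULL")
is_null = oznake("i.tabor_policija IS NULL")

# 2) The SELECT list inside EXISTS does not affect the outcome.
exists_1 = oznake("""EXISTS (SELECT 1 FROM izvjestaj_tabora t
                             WHERE t.oznaka_tabora = i.tabor_policija)""")
exists_distinct = oznake("""EXISTS (SELECT DISTINCT oznaka_tabora
                                    FROM izvjestaj_tabora t
                                    WHERE t.oznaka_tabora = i.tabor_policija)""")

print(equals_null)      # []
print(is_null)          # [2]
print(exists_1)         # [1]
print(exists_distinct)  # [1]
```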

Excluding a Null value returns 0 rows in a sub query

I'm trying to clean up some data in SQL Server and add a foreign key between the two tables.
I have a large quantity of orphaned rows in one of the tables that I would like to delete. I don't know why the following query returns 0 rows in MS SQL Server.
--This Query returns no Rows
select * from tbl_A where ID not in ( select distinct ID from tbl_B
)
When I include IS NOT NULL in the subquery I get the results that I expect.
-- Rows are returned that contain all of the records in tbl_A but Not in tbl_B
select * from tbl_A where ID not in ( select distinct ID from tbl_B
where ID is not null )
The ID column is nullable and does contain null values. If I run just the subqueries, I get the exact same results, except that the first one returns one extra NULL row, as expected.
This is the expected behavior of the NOT IN subquery. When a subquery returns even a single NULL value, NOT IN will not match any rows.
If you don't want to add an explicit NULL check, then you will want to use NOT EXISTS:
select *
from tbl_A A
where not exists (select distinct ID
from tbl_B b
where a.id = b.id)
As to why the NOT IN is causing issues, here are some posts that discuss it:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL
NOT EXISTS vs NOT IN
What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL?
Matching on NULL with equals (=) will return NULL or UNKNOWN as opposed to true/false from a logic standpoint. E.g. see http://msdn.microsoft.com/en-us/library/aa196339(v=sql.80).aspx for discussion.
If you want to include finding NULL values in table A where there is no NULL in table B (if B is the "parent" and A is the "child" in the "foreign key" relationship you desire) then you would need a second statement, something like the following. Also I would recommend qualifying the ID field with a table prefix or alias since the field names are the same in both tables. Finally, I would not recommend having NULL values as the key. But in any case:
select * from tbl_A as A where (A.ID not in ( select distinct B.ID from tbl_B as B ))
or (A.ID is NULL and not exists(select * from tbl_B as B where B.ID is null))
The problem is the non-comparability of nulls. If you are asking "not in" and there are nulls in the subquery, it cannot say that anything is definitely not in, because it treats those nulls as "unknown", and so the answer is always "unknown" in the three-valued logic that SQL uses.
Now of course that is all assuming you have ANSI_NULLS ON (which is the default). If you turn that off, then suddenly NULLs become comparable and it will give you results, and probably the results you expect.
If the ids are never negative, you might consider something like:
select *
from tbl_A
where coalesce(ID, -1) not in ( select distinct coalesce(ID, -1) from tbl_B )
(Or if id is a string, use something like coalesce(id, '<null>').)
This may not work in all cases, but it has the virtue of simplicity on the coding level.
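A small sketch of the sentinel idea, run through Python's sqlite3 module instead of SQL Server (an assumption; COALESCE and NOT IN behave the same way here). The sample IDs are invented, and -1 is assumed never to occur as a real ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_A (ID INTEGER);
    CREATE TABLE tbl_B (ID INTEGER);
    INSERT INTO tbl_A VALUES (1), (2), (NULL);
    INSERT INTO tbl_B VALUES (1), (NULL);
""")

def ids(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# Plain NOT IN: the NULL in tbl_B suppresses every row.
plain = ids("SELECT ID FROM tbl_A WHERE ID NOT IN (SELECT DISTINCT ID FROM tbl_B)")

# With the sentinel, NULL becomes -1 on both sides and compares normally.
# The NULL row of tbl_A is excluded because tbl_B also contains a NULL.
sentinel = ids("""
    SELECT ID FROM tbl_A
    WHERE COALESCE(ID, -1) NOT IN (SELECT DISTINCT COALESCE(ID, -1) FROM tbl_B)
""")

print(plain)     # []
print(sentinel)  # [2]
```

Note that the sentinel treats NULL as an ordinary value: if tbl_B held no NULL, the NULL row of tbl_A would be returned as an orphan.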
You probably have ANSI_NULLS switched off. With that setting NULL values are compared as equal, so NULL = NULL returns true.
Prefix the first query with:
SET ANSI_NULLS ON
GO