Why does WHERE NOT EXISTS return ids that exist? - sql

my query:
select a.id, a.affiliation
FROM public.affiliation AS a
WHERE NOT EXISTS (
SELECT *
FROM ncbi.affi_known1 AS b
WHERE a.id = b.id
)
limit 5000
it returns:
| id      | affiliation |
| 4683763 | Psychopharmacology Unit, Dorothy Hodgkin Building, University of Bristol, Whitson Street, Bristol, BS1 3NY, UK. |
as the first row.
But
select * from ncbi.affi_known1 where id = 4683763
does return the row with id = 4683763.
Both id columns are of type int8.
table a
CREATE TABLE "public"."affiliation" (
"id" int8 NOT NULL,
"affiliation" text COLLATE "pg_catalog"."default",
"tsv_affiliation" tsvector,
CONSTRAINT "affiliation_pkey" PRIMARY KEY ("id")
)
;
CREATE INDEX "affi_idx_tsv" ON "public"."affiliation" USING gin (
to_tsvector('english'::regconfig, affiliation) "pg_catalog"."tsvector_ops"
);
CREATE INDEX "tsv_affiliation_idx" ON "public"."affiliation" USING gin (
"tsv_affiliation" "pg_catalog"."tsvector_ops"
);
table b
CREATE TABLE "ncbi"."affi_known1" (
"id" int8 NOT NULL,
"affi_raw" text COLLATE "pg_catalog"."default",
"affi_main" text COLLATE "pg_catalog"."default",
"affi_known" bool,
"divide" text COLLATE "pg_catalog"."default",
"divide_known" bool,
"sub_divides" text[] COLLATE "pg_catalog"."default",
"country" text COLLATE "pg_catalog"."default",
CONSTRAINT "affi_known_pkey" PRIMARY KEY ("id")
)
;
Update:
After creating an index on id, everything works well.
After deleting that index, it seems to go wrong again.
So why does the primary key on id fail here?
Update 2:
Table b was generated from table a, using:
query = '''
select a.id, a.affiliation
FROM public.affiliation AS a
WHERE NOT EXISTS (
SELECT 1
FROM ncbi.affi_known AS b
WHERE a.id = b.id
)
limit 2000000
'''
data = pd.read_sql(query, conn)
while len(data):
    for i, row in tqdm(data.iterrows()):
        ...
        curser_insert.execute(
            'insert into ncbi.affi_known(id,affi_raw, affi_main ,affi_known,divide,country) values ( %s, %s, %s,%s,%s,%s) ',
            [affi_id, affi_raw, affi_main, affi_known, devide, country]
        )
        conn2.commit()
    conn2.commit()
    conn.commit()
    data = pd.read_sql(query, conn)
and the script exited improperly.

Your understanding of how EXISTS works might be off. Your current NOT EXISTS query is saying that id 4683763 exists in the affiliation table but not in the affi_known1 table. So, the following query should return the single record:
SELECT a.id, a.affiliation
FROM public.affiliation a
WHERE a.id = 4683763;

I am assuming the requirement is to fetch rows only when the id is not present in the second table, so you can try this:
select a.id, a.affiliation
FROM public.affiliation AS a
WHERE a.id NOT IN (
SELECT id
FROM ncbi.affi_known1
)
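
As a rough check of that idea, here is a minimal sketch using Python's built-in sqlite3 module rather than PostgreSQL. The table names are borrowed from the question, but the three sample rows are made up for illustration. With NOT NULL ids, NOT EXISTS and NOT IN should both act as an anti-join and return the same ids:

```python
import sqlite3

# In-memory database with made-up sample rows (assumption, not the asker's data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE affiliation (id INTEGER PRIMARY KEY, affiliation TEXT);
    CREATE TABLE affi_known1 (id INTEGER PRIMARY KEY, affi_raw TEXT);
    INSERT INTO affiliation VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO affi_known1 VALUES (2, 'B');
""")

# Anti-join written with NOT EXISTS, as in the question.
not_exists_ids = [r[0] for r in conn.execute("""
    SELECT a.id FROM affiliation AS a
    WHERE NOT EXISTS (SELECT 1 FROM affi_known1 AS b WHERE a.id = b.id)
    ORDER BY a.id
""")]

# The same anti-join written with NOT IN.
not_in_ids = [r[0] for r in conn.execute("""
    SELECT a.id FROM affiliation AS a
    WHERE a.id NOT IN (SELECT id FROM affi_known1)
    ORDER BY a.id
""")]

print(not_exists_ids, not_in_ids)
```

Note that NOT IN only matches NOT EXISTS here because id is declared NOT NULL; a NULL in the subquery would make NOT IN return no rows.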

If id were an integer, your query would do what you want.
If id is a string, you could have issues with "look-alikes". It is very hard to say what the problem is -- there could be spaces in the id, hidden characters, or something else. And this could be in either table.
Assuming the ids look like numbers, you could filter "bad" ids out using regular expressions:
select id
from ncbi.affi_known1
where not id ~ '^[0-9]*$';

Related

Redshift create list and search different table with it

I think there are a few ways to tackle this, but I'm not sure how to do any of them.
I have two tables, the first has ID's and Numbers. The ID's and numbers can potentially be listed more than once, so I create a result table that lists the unique numbers grouped by ID.
My second table has rows (100 million) with the ID and Numbers again. I need to search that table for any ID that has a Number not in the list of Numbers from the result table.
Can Redshift run a query that checks whether the ID matches and the Number exists in the list from the result table? Can this all be done in memory/one statement?
DROP TABLE IF EXISTS `myTable`;
CREATE TABLE `myTable` (
`id` mediumint(8) unsigned NOT NULL auto_increment,
`ID` varchar(255),
`Numbers` mediumint default NULL,
PRIMARY KEY (`id`)
) AUTO_INCREMENT=1;
INSERT INTO `myTable` (`ID`,`Numbers`)
VALUES
("CRQ44MPX1SZ",1890),
("UHO21QQY3TW",4370),
("JTQ62CBP6ER",1825),
("RFD95MLC2MI",5014),
("URZ04HGG2YQ",2859),
("CRQ44MPX1SZ",1891),
("UHO21QQY3TW",4371),
("JTQ62CBP6ER",1826),
("RFD95MLC2MI",5015),
("URZ04HGG2YQ",2860),
("CRQ44MPX1SZ",1892),
("UHO21QQY3TW",4372),
("JTQ62CBP6ER",1827),
("RFD95MLC2MI",5016),
("URZ04HGG2YQ",2861);
SELECT ID, listagg(distinct Numbers,',') as Number_List, count(Numbers) as Numbers_Count
FROM myTable
GROUP BY ID
AS result
DROP TABLE IF EXISTS `myTable2`;
CREATE TABLE `myTable2` (
`id` mediumint(8) unsigned NOT NULL auto_increment,
`ID` varchar(255),
`Numbers` mediumint default NULL,
PRIMARY KEY (`id`)
) AUTO_INCREMENT=1;
INSERT INTO `myTable2` (`ID`,`Numbers`)
VALUES
("CRQ44MPX1SZ",1870),
("UHO21QQY3TW",4350),
("JTQ62CBP6ER",1825),
("RFD95MLC2MI",5014),
("URZ04HGG2YQ",2859),
("CRQ44MPX1SZ",1891),
("UHO21QQY3TW",4371),
("JTQ62CBP6ER",1826),
("RFD95MLC2MI",5015),
("URZ04HGG2YQ",2860),
("CRQ44MPX1SZ",1882),
("UHO21QQY3TW",4372),
("JTQ62CBP6ER",1827),
("RFD95MLC2MI",5016),
("URZ04HGG2YQ",2861);
Pseudo Code
Select ID, listagg(distinct Numbers) as Violation
Where Numbers NOT IN result.Numbers_List
or possibly: WHERE Numbers NOT LIKE '%' || result.Numbers_List|| '%'
Desired Output
("CRQ44MPX1SZ", "1870,1882")
("UHO21QQY3TW", "4350")
EDIT
Going the JOIN route, I am not getting the right results...but I'm pretty sure my WHERE implementation is wrong.
SELECT mytable1.ID, listagg(distinct mytable2.Numbers, ',') as unauth_list, count(mytable2.Numbers) as unauth_count
FROM mytable1
LEFT JOIN mytable2 on mytable1.id = mytable2.id
WHERE (mytable1.id = mytable2.id)
AND (mytable1.Numbers <> mytable2.Numbers)
GROUP BY mytable1.id
Expected output:
("CRQ44MPX1SZ", "1870,1882", 2)
("UHO21QQY3TW", "4350", 1)
Just left join the two tables on ID and Numbers, then check in the WHERE clause for rows where no match was found. There shouldn't be any need for listagg() and complex comparisons. Or did I miss part of the question?
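
A minimal sketch of that left-join approach, run against SQLite through Python's sqlite3 module rather than Redshift. It uses a subset of the question's sample rows and, as a simplification, drops the auto-increment key column so each table has only ID and Numbers. The grouping into a list is done in Python here; on Redshift a GROUP BY with listagg, as in the question, would presumably do the same job:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified tables (assumption): only the ID and Numbers columns.
    CREATE TABLE myTable  (ID TEXT, Numbers INTEGER);
    CREATE TABLE myTable2 (ID TEXT, Numbers INTEGER);
    INSERT INTO myTable VALUES
        ('CRQ44MPX1SZ', 1890), ('CRQ44MPX1SZ', 1891), ('CRQ44MPX1SZ', 1892),
        ('UHO21QQY3TW', 4370), ('UHO21QQY3TW', 4371), ('UHO21QQY3TW', 4372),
        ('JTQ62CBP6ER', 1825);
    INSERT INTO myTable2 VALUES
        ('CRQ44MPX1SZ', 1870), ('CRQ44MPX1SZ', 1891), ('CRQ44MPX1SZ', 1882),
        ('UHO21QQY3TW', 4350), ('UHO21QQY3TW', 4371), ('UHO21QQY3TW', 4372),
        ('JTQ62CBP6ER', 1825);
""")

# Left join on both columns; rows of myTable2 with no partner in myTable
# come back with NULLs on the myTable side, and those are the violations.
rows = conn.execute("""
    SELECT t2.ID, t2.Numbers
    FROM myTable2 AS t2
    LEFT JOIN myTable AS t1
           ON t1.ID = t2.ID AND t1.Numbers = t2.Numbers
    WHERE t1.ID IS NULL
    ORDER BY t2.ID, t2.Numbers
""").fetchall()

# Group the unmatched numbers per ID.
violations = {}
for id_, number in rows:
    violations.setdefault(id_, []).append(number)

print(violations)
```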

Get data from one table with nested relations

I am new to databases. I have a table topics, and in this table I have a foreign key master_topic_id that references the id column of the same topics table.
Schema:
CREATE TABLE public.topics (
id bigserial NOT NULL,
created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
published_at timestamp NULL,
master_topic_id int8 NULL,
CONSTRAINT t_pkey PRIMARY KEY (id),
CONSTRAINT t_master_topic_id_fkey FOREIGN KEY (master_topic_id) REFERENCES topics(id)
);
I wrote a query: SELECT * FROM topics WHERE id = 10. But if this record has a master_topic_id, I also need to get the row that master_topic_id points to.
I tried using a JOIN, but a join just concatenates the records, and I need the data for master_topic_id as a new row.
Any help?
I think you are describing:
select t.*
from topics t
where t.id = 10 or
exists (select 1
from topics t2
where t2.master_topic_id = t.id and t2.id = 10
);
However, you might just want:
where 10 in (id, master_topic_id)
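
To see how the two variants differ, here is a small sketch using Python's sqlite3 module with a cut-down topics table. The four sample rows are invented for illustration: topic 10 has master 1, and topic 11 has master 10. The EXISTS form returns topic 10 plus its master, while the IN form returns topic 10 plus the topics that point at it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced schema and made-up rows (assumption).
    CREATE TABLE topics (
        id INTEGER PRIMARY KEY,
        master_topic_id INTEGER REFERENCES topics(id)
    );
    INSERT INTO topics VALUES (1, NULL), (10, 1), (11, 10), (12, NULL);
""")

# Topic 10 and the row its master_topic_id refers to.
with_master = [r[0] for r in conn.execute("""
    SELECT t.id FROM topics t
    WHERE t.id = 10
       OR EXISTS (SELECT 1 FROM topics t2
                  WHERE t2.master_topic_id = t.id AND t2.id = 10)
    ORDER BY t.id
""")]

# Topic 10 and the rows whose master_topic_id is 10.
with_children = [r[0] for r in conn.execute("""
    SELECT id FROM topics
    WHERE 10 IN (id, master_topic_id)
    ORDER BY id
""")]

print(with_master, with_children)
```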
Use OR in your WHERE condition:
SELECT *
FROM topics
WHERE id = 10
or master_topic_id = 10
You can use UNION ALL as well:
SELECT *
FROM topics
WHERE id = 10
union all
SELECT *
FROM topics
WHERE master_topic_id = 10

SQL- How to select data from two different table?

I am trying to select data from 2 different tables but I can't figure it out. If I use INNER JOIN it shows nothing. Any help is welcome, and thanks.
My First table:
CREATE TABLE P_N(
PN_ID int NOT NULL,
PN VARCHAR (1000),
primary key (PN_ID)
);
My second Table:
CREATE TABLE NAME (
NAME_ID VARCHAR(60) PRIMARY key,
NAME VARCHAR (40)
);
My select code :
SELECT DISTINCT NAME.NAME_ID, PN.PN_ID
FROM NAME
FULL JOIN P_N
ON PN.PN =NAME.NAME_ID;
If I use LEFT or FULL JOIN, this is the result:
NAME_ID PN_ID
nm0006300 NULL
nm0006400 NULL
nm0006500 NULL
nm0006600 NULL
nm0006700 NULL
And if I use RIGHT JOIN:
NAME_ID PN_ID
null 921691
null 921692
null 921693
null 921694
This is what I want the result to look like, for example:
NAME_ID PN_ID
nm0006300 921691
nm0006400 921692
nm0006500 921693
nm0006600 921694
You don't seem to have a JOIN key. You can add one with ROW_NUMBER():
SELECT n.NAME_ID, PN.PN_ID
FROM (SELECT n.*, ROW_NUMBER() OVER (ORDER BY NAME_ID) as seqnum
FROM NAME n
) n JOIN
(SELECT pn.*, ROW_NUMBER() OVER (ORDER BY PN) as seqnum
FROM P_N pn
) pn
ON PN.seqnum = n.seqnum;
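
A rough sketch of this pairing idea, run on SQLite through Python's sqlite3 module. It assumes an SQLite build with window-function support (3.25 or later). The PN and NAME values are invented, and the second ROW_NUMBER() is ordered by PN_ID instead of PN, purely so the sample output lines up with the result the question asks for:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE P_N (PN_ID INTEGER PRIMARY KEY, PN TEXT);
    CREATE TABLE NAME (NAME_ID TEXT PRIMARY KEY, NAME TEXT);
    -- Made-up PN and NAME values (assumption); the two tables share no key.
    INSERT INTO P_N VALUES (921691, 'a'), (921692, 'b'), (921693, 'c');
    INSERT INTO NAME VALUES ('nm0006300', 'x'), ('nm0006400', 'y'), ('nm0006500', 'z');
""")

# Number the rows of each table independently, then join on that position.
pairs = conn.execute("""
    SELECT n.NAME_ID, pn.PN_ID
    FROM (SELECT NAME_ID, ROW_NUMBER() OVER (ORDER BY NAME_ID) AS seqnum
          FROM NAME) AS n
    JOIN (SELECT PN_ID, ROW_NUMBER() OVER (ORDER BY PN_ID) AS seqnum
          FROM P_N) AS pn
      ON pn.seqnum = n.seqnum
    ORDER BY n.seqnum
""").fetchall()

print(pairs)
```

The pairing is purely positional, so it is only meaningful if the two sort orders really correspond to each other.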
Try this:
select DISTINCT NAME.NAME_ID, PN.PN_ID
from NAME,P_N as PN
where PN.PN =NAME.NAME_ID

SQLITE3: converting IDs to codes when there are hundreds of columns to convert

I have a table A that has several hundred columns (let's say 301 for example) with the first column being the primary key and the rest being IDs from table B i.e.
CREATE TABLE "A" (
ko_index_id INTEGER NOT NULL,
ko1 INTEGER NOT NULL,
ko2 INTEGER NOT NULL,
...
ko300 INTEGER NOT NULL,
PRIMARY KEY (ko_index_id)
);
CREATE TABLE "B" (
id INTEGER NOT NULL,
name INTEGER NOT NULL,
PRIMARY KEY (id)
);
I would like to be able to convert the IDs into names. For example:
SELECT name FROM B WHERE id in (SELECT * FROM A);
Except the SELECT * part means that ko_index_id will be fed into B which is wrong. If there were only two columns in A I could just write
SELECT name FROM B WHERE id in (SELECT ko1, ko2 FROM A);
but table A has 300 columns!
Can anyone help me around this?
300+ columns? How about redoing table A by pivoting those columns into rows? You could have a key-name column and a value column. For example:
select * from A:
id, ko_name, ko_value
1, ko1, 5
1, ko2, 6
Then selecting the keys becomes much easier; e.g:
SELECT name FROM B WHERE id in (SELECT ko_value FROM A where ko_name in ('ko1', 'ko2')) ;
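
A minimal sketch of that key/value layout using Python's sqlite3 module. The composite primary key, the text type for B.name, and the sample rows are assumptions made for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per (id, ko_name) instead of one column per ko (assumed layout).
    CREATE TABLE A (
        id INTEGER NOT NULL,
        ko_name TEXT NOT NULL,
        ko_value INTEGER NOT NULL,
        PRIMARY KEY (id, ko_name)
    );
    CREATE TABLE B (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO A VALUES (1, 'ko1', 5), (1, 'ko2', 6), (1, 'ko3', 7);
    INSERT INTO B VALUES (5, 'five'), (6, 'six'), (7, 'seven');
""")

# Look up names only for the keys of interest; no column list to maintain.
names = [r[0] for r in conn.execute("""
    SELECT name FROM B
    WHERE id IN (SELECT ko_value FROM A WHERE ko_name IN ('ko1', 'ko2'))
    ORDER BY id
""")]

print(names)
```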
I agree with @Gordon's comment. If you can afford to change your data model, I would suggest you use an intersection table. It's the typical way to model "many-to-many" relationships in a database.
Example:
CREATE TABLE "A" (
id INTEGER NOT NULL,
...
PRIMARY KEY (id)
);
CREATE TABLE "B" (
id INTEGER NOT NULL,
name INTEGER NOT NULL,
...
PRIMARY KEY (id)
);
CREATE TABLE "AB" (
a_id INTEGER NOT NULL,
b_id INTEGER NOT NULL
);
SELECT A.id, B.name
FROM A
INNER JOIN AB ON A.id=AB.a_id
INNER JOIN B ON AB.b_id=B.id;

SQL : Create a full record from 2 tables

I've got a DB structure as follows (simplified as much as possible for clarity):
Table "entry" ("id" integer primary key)
Table "fields" ("name" varchar primary key, and others)
Table "entry_fields" ("entryid" integer primary key, "name" varchar primary key, "value")
I would like to get, for a given "entry.id", the detail of this entry, ie. all the "entry_fields" linked to this entry, in a single SQL query.
An example would be better perhaps:
"fields":
"result"
"output"
"code"
"command"
"entry" contains:
id : 842
id : 850
"entry_fields" contains:
entryid : 842, name : "result", value : "ok"
entryid : 842, name : "output", value : "this is an output"
entryid : 842, name : "code", value : "42"
entryid : 850, name : "result", value : "ko"
entryid : 850, name : "command", value : "print ko"
The wanted output would be:
| id | command | output | code | result |
| 842 | NULL | "this is an output" | 42 | ok |
| 850 | "print ko" | NULL | NULL | ko |
The aim is to be able to add a "field" without changing anything to "entry" table structure
I tried something like:
SELECT e.*, (SELECT name FROM fields) FROM entry AS e
but Postgres complains:
ERROR: more than one row returned by a subquery used as an expression
Hope someone can help me!
Solution as requested
While stuck with this unfortunate design, the fastest query would be with crosstab(), provided by the additional module tablefunc. Ample details in this related answer:
PostgreSQL Crosstab Query
For the question asked:
SELECT * FROM crosstab(
$$SELECT e.id, ef.name, ef.value
FROM entry e
LEFT JOIN entry_fields ef
ON ef.entryid = e.id
AND ef.name = ANY ('{result,output,code,command}'::text[])
ORDER BY 1, 2$$
,$$SELECT unnest('{result,output,code,command}'::text[])$$
) AS ct (id int, result text, output text, code text, command text);
Database design
If you don't have a huge number of different fields, it will be much simpler and more efficient to merge all three tables into one simple table:
CREATE TABLE entry (
entry_id serial PRIMARY KEY
,field1 text
,field2 text
, ... more fields
);
Fields without values can be NULL. NULL storage is very cheap (basically 1 bit per column in the NULL bitmap):
How much disk-space is needed to store a NULL value using postgresql DB?
Do nullable columns occupy additional space in PostgreSQL?
Even if you have hundreds of different columns, and only few are filled per entry, this will still use much less disk space.
Your query becomes trivial:
SELECT entry_id, result, output, code, command
FROM entry;
If you have too many columns [1], and that's not just a misguided design (often, this can be folded into far fewer columns), consider the data types hstore or json / jsonb (in Postgres 9.4) for EAV storage.
[1] Per Postgres "About" page:
Maximum Columns per Table 250 - 1600 depending on column types
Consider this related answer with alternatives:
Use case for hstore against multiple columns
And this question about typical use cases / problems of EAV structures on dba.SE:
Is there a name for this database structure?
Dynamic SQL:
CREATE TABLE fields (name varchar(100) PRIMARY KEY)
INSERT INTO FIELDS VALUES ('RESULT')
INSERT INTO FIELDS VALUES ('OUTPUT')
INSERT INTO FIELDS VALUES ('CODE')
INSERT INTO FIELDS VALUES ('COMMAND')
CREATE TABLE ENTRY_fields (ENTRYID INT, name varchar(100), VALUE VARCHAR(100) CONSTRAINT PK PRIMARY KEY(ENTRYID, name))
INSERT INTO ENTRY_fields VALUES(842, 'RESULT', 'OK')
INSERT INTO ENTRY_fields VALUES(842, 'OUTPUT', 'THIS IS AN OUTPUT')
INSERT INTO ENTRY_fields VALUES(842, 'CODE', '42')
INSERT INTO ENTRY_fields VALUES(850, 'RESULT', 'KO')
INSERT INTO ENTRY_fields VALUES(850, 'COMMAND', 'PRINT KO')
CREATE TABLE ENTRY (ID INT PRIMARY KEY)
INSERT INTO ENTRY VALUES(842)
INSERT INTO ENTRY VALUES(850)
DECLARE @COLS NVARCHAR(MAX), @SQL NVARCHAR(MAX)
select @Cols = stuff((select ', ' + quotename(dt)
from (select DISTINCT name as dt
from fields) X
FOR XML PATH('')),1,2,'')
PRINT @COLS
SET @SQL = 'SELECT * FROM (SELECT id, f.name, value
from fields F CROSS join ENTRY LEFT JOIN entry_fields ef on ef.name = f.name AND ID = ef.ENTRYID
) Y PIVOT (max(value) for name in ('+ @Cols +'))PVT '
--print @SQL
exec (@SQL)
If you think your values are going to be constant in the fields table:
SELECT * FROM (SELECT id, f.name ,value
from fields F CROSS join ENTRY LEFT JOIN entry_fields ef on ef.name = f.name AND ID = ef.ENTRYID
) Y PIVOT (max(value) for name in ([CODE], [COMMAND], [OUTPUT], [RESULT]))PVT
Query that may work with postgresql:
SELECT ID, MAX(CODE) as CODE, MAX(COMMAND) as COMMAND, MAX(OUTPUT) as OUTPUT, MAX(RESULT) as RESULT
FROM (SELECT ID,
CASE WHEN f.name = 'CODE' THEN VALUE END AS CODE,
CASE WHEN f.name = 'COMMAND' THEN VALUE END AS COMMAND,
CASE WHEN f.name = 'OUTPUT' THEN VALUE END AS OUTPUT,
CASE WHEN f.name = 'RESULT' THEN VALUE END AS RESULT
from fields F CROSS join ENTRY LEFT JOIN entry_fields ef on ef.name = f.name AND ID = ENTRYID
) Y
GROUP BY ID
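
For a quick check of the conditional-aggregation idea, here is a sketch run on SQLite through Python's sqlite3 module with the question's sample data. As a simplification (my assumption, not part of the answer above), it joins entry directly to entry_fields and skips the CROSS JOIN with fields, since the field names are hard-coded in the CASE expressions anyway:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (id INTEGER PRIMARY KEY);
    CREATE TABLE entry_fields (
        entryid INTEGER, name TEXT, value TEXT,
        PRIMARY KEY (entryid, name)
    );
    INSERT INTO entry VALUES (842), (850);
    INSERT INTO entry_fields VALUES
        (842, 'result', 'ok'),
        (842, 'output', 'this is an output'),
        (842, 'code', '42'),
        (850, 'result', 'ko'),
        (850, 'command', 'print ko');
""")

# One MAX(CASE ...) per field turns the name/value rows into columns;
# a field with no row for an entry stays NULL (None in Python).
rows = conn.execute("""
    SELECT e.id,
           MAX(CASE WHEN ef.name = 'command' THEN ef.value END) AS command,
           MAX(CASE WHEN ef.name = 'output'  THEN ef.value END) AS output,
           MAX(CASE WHEN ef.name = 'code'    THEN ef.value END) AS code,
           MAX(CASE WHEN ef.name = 'result'  THEN ef.value END) AS result
    FROM entry AS e
    LEFT JOIN entry_fields AS ef ON ef.entryid = e.id
    GROUP BY e.id
    ORDER BY e.id
""").fetchall()

for row in rows:
    print(row)
```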
The subquery (SELECT name FROM fields) would return 4 rows. You can't stuff 4 rows into 1 in SQL. You can use crosstab, which I'm not familiar enough with to answer. Or you can use a crude query like this:
SELECT e.*,
(SELECT value FROM entry_fields AS ef WHERE name = 'command' AND ef.entryid = e.id) AS command,
(SELECT value FROM entry_fields AS ef WHERE name = 'output' AND ef.entryid = e.id) AS output,
(SELECT value FROM entry_fields AS ef WHERE name = 'code' AND ef.entryid = e.id) AS code,
(SELECT value FROM entry_fields AS ef WHERE name = 'result' AND ef.entryid = e.id) AS result
FROM entry AS e