I've got the following DB structure (simplified as much as possible for clarity):
Table "entry" ("id" integer primary key)
Table "fields" ("name" varchar primary key, and others)
Table "entry_fields" ("entryid" integer primary key, "name" varchar primary key, "value")
I would like to get, for a given "entry.id", the details of this entry, i.e. all the "entry_fields" linked to this entry, in a single SQL query.
Perhaps an example will make it clearer:
"fields":
"result"
"output"
"code"
"command"
"entry" contains:
id : 842
id : 850
"entry_fields" contains:
entryid : 842, name : "result", value : "ok"
entryid : 842, name : "output", value : "this is an output"
entryid : 842, name : "code", value : "42"
entryid : 850, name : "result", value : "ko"
entryid : 850, name : "command", value : "print ko"
The wanted output would be:
| id | command | output | code | result |
| 842 | NULL | "this is an output" | 42 | ok |
| 850 | "print ko" | NULL | NULL | ko |
The aim is to be able to add a "field" without changing anything in the "entry" table structure.
I tried something like:
SELECT e.*, (SELECT name FROM fields) FROM entry AS e
but Postgres complains:
ERROR: more than one row returned by a subquery used as an expression
Hope someone can help me!
Solution as requested
While stuck with this unfortunate design, the fastest query would be with crosstab(), provided by the additional module tablefunc. Ample details in this related answer:
PostgreSQL Crosstab Query
For the question asked:
SELECT * FROM crosstab(
$$SELECT e.id, ef.name, ef.value
FROM entry e
LEFT JOIN entry_fields ef
ON ef.entryid = e.id
AND ef.name = ANY ('{result,output,code,command}'::text[])
ORDER BY 1, 2$$
,$$SELECT unnest('{result,output,code,command}'::text[])$$
) AS ct (id int, result text, output text, code text, command text);
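Note that crosstab() only becomes available after the tablefunc extension has been installed, once per database:

CREATE EXTENSION tablefunc;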
Database design
If you don't have a huge number of different fields, it will be much simpler and more efficient to merge all three tables into one simple table:
CREATE TABLE entry (
entry_id serial PRIMARY KEY
,field1 text
,field2 text
, ... more fields
);
Fields without values can be NULL. NULL storage is very cheap (basically 1 bit per column in the NULL bitmap):
How much disk-space is needed to store a NULL value using postgresql DB?
Do nullable columns occupy additional space in PostgreSQL?
Even if you have hundreds of different columns, and only a few are filled per entry, this will still use much less disk space.
Your query becomes trivial:
SELECT entry_id, result, output, code, command
FROM   entry;
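With this design, adding a new "field" later is a plain DDL statement; the column name here is just an example:

ALTER TABLE entry ADD COLUMN command text;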
If you have too many columns¹, and that's not just misguided design (often, this can be folded into far fewer columns), consider the data types hstore or json / jsonb (in Postgres 9.4) for EAV storage.
¹ Per the Postgres "About" page:
Maximum Columns per Table: 250 - 1600, depending on column types
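A minimal sketch of the jsonb variant, assuming all dynamic fields live in a single jsonb column (names are illustrative):

CREATE TABLE entry (
   entry_id serial PRIMARY KEY
 , fields   jsonb
);

SELECT entry_id
     , fields->>'result'  AS result
     , fields->>'output'  AS output
     , fields->>'code'    AS code
     , fields->>'command' AS command
FROM   entry;

Missing keys simply come out as NULL, which matches the wanted output above.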
Consider this related answer with alternatives:
Use case for hstore against multiple columns
And this question about typical use cases / problems of EAV structures on dba.SE:
Is there a name for this database structure?
Dynamic SQL (SQL Server syntax):
CREATE TABLE fields (name varchar(100) PRIMARY KEY)
INSERT INTO FIELDS VALUES ('RESULT')
INSERT INTO FIELDS VALUES ('OUTPUT')
INSERT INTO FIELDS VALUES ('CODE')
INSERT INTO FIELDS VALUES ('COMMAND')
CREATE TABLE ENTRY_fields (ENTRYID INT, name varchar(100), VALUE VARCHAR(100), CONSTRAINT PK PRIMARY KEY(ENTRYID, name))
INSERT INTO ENTRY_fields VALUES(842, 'RESULT', 'OK')
INSERT INTO ENTRY_fields VALUES(842, 'OUTPUT', 'THIS IS AN OUTPUT')
INSERT INTO ENTRY_fields VALUES(842, 'CODE', '42')
INSERT INTO ENTRY_fields VALUES(850, 'RESULT', 'KO')
INSERT INTO ENTRY_fields VALUES(850, 'COMMAND', 'PRINT KO')
CREATE TABLE ENTRY (ID INT PRIMARY KEY)
INSERT INTO ENTRY VALUES(842)
INSERT INTO ENTRY VALUES(850)
DECLARE @Cols NVARCHAR(MAX), @SQL NVARCHAR(MAX)

-- build a quoted, comma-separated column list from the field names
SELECT @Cols = STUFF((SELECT ', ' + QUOTENAME(dt)
                      FROM (SELECT DISTINCT name AS dt
                            FROM fields) X
                      FOR XML PATH('')), 1, 2, '')

PRINT @Cols

SET @SQL = 'SELECT * FROM (SELECT id, f.name, value
            FROM fields f CROSS JOIN entry LEFT JOIN entry_fields ef ON ef.name = f.name AND id = ef.entryid
            ) Y PIVOT (MAX(value) FOR name IN (' + @Cols + ')) PVT'

--PRINT @SQL
EXEC (@SQL)
If you think your values are going to be constant in the fields table:
SELECT * FROM (SELECT id, f.name ,value
from fields F CROSS join ENTRY LEFT JOIN entry_fields ef on ef.name = f.name AND ID = ef.ENTRYID
) Y PIVOT (max(value) for name in ([CODE], [COMMAND], [OUTPUT], [RESULT]))PVT
A query that may work with PostgreSQL:
SELECT ID, MAX(CODE) as CODE, MAX(COMMAND) as COMMAND, MAX(OUTPUT) as OUTPUT, MAX(RESULT) as RESULT
FROM (SELECT ID,
CASE WHEN f.name = 'CODE' THEN VALUE END AS CODE,
CASE WHEN f.name = 'COMMAND' THEN VALUE END AS COMMAND,
CASE WHEN f.name = 'OUTPUT' THEN VALUE END AS OUTPUT,
CASE WHEN f.name = 'RESULT' THEN VALUE END AS RESULT
from fields F CROSS join ENTRY LEFT JOIN entry_fields ef on ef.name = f.name AND ID = ENTRYID
) Y
GROUP BY ID
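In Postgres 9.4 or later, the same conditional-aggregation pivot reads more compactly with the FILTER clause; a sketch against the tables from the question:

SELECT e.id
     , max(ef.value) FILTER (WHERE ef.name = 'code')    AS code
     , max(ef.value) FILTER (WHERE ef.name = 'command') AS command
     , max(ef.value) FILTER (WHERE ef.name = 'output')  AS output
     , max(ef.value) FILTER (WHERE ef.name = 'result')  AS result
FROM   entry e
LEFT   JOIN entry_fields ef ON ef.entryid = e.id
GROUP  BY e.id;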
The subquery (SELECT name FROM fields) returns 4 rows, and you can't stuff 4 rows into 1 column in SQL. You could use crosstab(), which I'm not familiar enough with to cover here. Or you can use a crude query like this:
SELECT e.*,
  (SELECT value FROM entry_fields AS ef WHERE name = 'command' AND ef.entryid = e.id) AS command,
  (SELECT value FROM entry_fields AS ef WHERE name = 'output' AND ef.entryid = e.id) AS output,
  (SELECT value FROM entry_fields AS ef WHERE name = 'code' AND ef.entryid = e.id) AS code,
  (SELECT value FROM entry_fields AS ef WHERE name = 'result' AND ef.entryid = e.id) AS result
FROM entry AS e
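The same shape can also be produced with one LEFT JOIN per field, which planners often handle better than correlated subqueries; a sketch over the same tables:

SELECT e.id
     , cm.value AS command
     , ou.value AS output
     , cd.value AS code
     , re.value AS result
FROM entry e
LEFT JOIN entry_fields cm ON cm.entryid = e.id AND cm.name = 'command'
LEFT JOIN entry_fields ou ON ou.entryid = e.id AND ou.name = 'output'
LEFT JOIN entry_fields cd ON cd.entryid = e.id AND cd.name = 'code'
LEFT JOIN entry_fields re ON re.entryid = e.id AND re.name = 'result';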
Related
Let's say I have some data that looks like this:
create table test.from
(
id integer primary key,
data json
);
insert into test.from
values
(4, '{ "some_field": 11 }'),
(9, '{ "some_field": 22 }');
create table test.to
(
id integer primary key,
some_field int
);
I would like the rows in the "to" table to have the same id key as the "from" row, and to expand the json into separate columns. But using json_populate_record like below will, unsurprisingly, give me null as the key.
Method 1:
insert into test.to
select l.*
from test.from fr
cross join lateral json_populate_record(null::test.to, fr.data) l;
I can achieve what I'm looking for by naming columns like below
Method 2:
insert into test.to (id, some_field)
select
fr.id as id,
l.some_field
from test.from fr
cross join lateral json_populate_record(null::test.to, fr.data) l;
The challenge is that I want to avoid naming any columns other than the id column, both since it gets tedious, but also since I'd like to do this in a function where the column names are not known.
What modifications do I have to do to Method 1 to update the record with the correct id?
Just append the id key to your data like this:
insert into test.to
select l.*
from
test.from fr
cross join lateral jsonb_populate_record(
null::test.to,
fr.data::jsonb || jsonb_build_object('id', fr.id)) l;
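This works because || on two jsonb values merges the objects, with keys from the right-hand operand winning on conflict, so an 'id' key already present in data would simply be overwritten by fr.id. The merged object can be inspected on its own:

select fr.data::jsonb || jsonb_build_object('id', fr.id)
from test.from fr;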
my query:
select a.id, a.affiliation
FROM public.affiliation AS a
WHERE NOT EXISTS (
SELECT *
FROM ncbi.affi_known1 AS b
WHERE a.id = b.id
)
limit 5000
it returns:

id      | affiliation
--------+------------
4683763 | Psychopharmacology Unit, Dorothy Hodgkin Building, University of Bristol, Whitson Street, Bristol, BS1 3NY, UK.

as the first row.
but
select * from ncbi.affi_known1 where id = 4683763
does return the row with id = 4683763.
Both id columns are of type int8.
table a
CREATE TABLE "public"."affiliation" (
"id" int8 NOT NULL,
"affiliation" text COLLATE "pg_catalog"."default",
"tsv_affiliation" tsvector,
CONSTRAINT "affiliation_pkey" PRIMARY KEY ("id")
)
;
CREATE INDEX "affi_idx_tsv" ON "public"."affiliation" USING gin (
to_tsvector('english'::regconfig, affiliation) "pg_catalog"."tsvector_ops"
);
CREATE INDEX "tsv_affiliation_idx" ON "public"."affiliation" USING gin (
"tsv_affiliation" "pg_catalog"."tsvector_ops"
);
table b
CREATE TABLE "ncbi"."affi_known1" (
"id" int8 NOT NULL,
"affi_raw" text COLLATE "pg_catalog"."default",
"affi_main" text COLLATE "pg_catalog"."default",
"affi_known" bool,
"divide" text COLLATE "pg_catalog"."default",
"divide_known" bool,
"sub_divides" text[] COLLATE "pg_catalog"."default",
"country" text COLLATE "pg_catalog"."default",
CONSTRAINT "affi_known_pkey" PRIMARY KEY ("id")
)
;
Update:
After creating an index on id, everything works well.
After deleting the index, it seems to go wrong again.
So why does the primary key on id fail here?
Update 2:
Table b is generated from table a, using:
query = '''
select a.id, a.affiliation
FROM public.affiliation AS a
WHERE NOT EXISTS (
SELECT 1
FROM ncbi.affi_known AS b
WHERE a.id = b.id
)
limit 2000000
'''
data = pd.read_sql(query, conn)
while len(data):
    for i, row in tqdm(data.iterrows()):
        ...
        curser_insert.execute(
            'insert into ncbi.affi_known(id, affi_raw, affi_main, affi_known, divide, country) values (%s, %s, %s, %s, %s, %s)',
            [affi_id, affi_raw, affi_main, affi_known, devide, country]
        )
        conn2.commit()
    conn2.commit()
    conn.commit()
    data = pd.read_sql(query, conn)
and the code exits improperly.
Your understanding of how EXISTS works might be off. Your current query is saying that id 4683763 exists in the affiliation table but not in the affi_known1 table. So the following query should return the single record:
SELECT a.id, a.affiliation
FROM public.affiliation a
WHERE a.id = 4683763;
I am assuming the requirement is to fetch rows only when the id is not present in the second table, so you can try this:
select a.id, a.affiliation
FROM public.affiliation AS a
WHERE a.id NOT IN (
SELECT id
FROM ncbi.affi_known1
)
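One caveat with NOT IN: it returns no rows at all if the subquery produces even a single NULL id, so on nullable columns NOT EXISTS (as in the original query) is the safer pattern. Here id is a primary key, so it cannot be NULL and the two forms behave the same.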
If id were an integer, your query would do what you want.
If id is a string, you could have issues with "look-alikes". It is very hard to say what the problem is -- there could be spaces in the id, hidden characters, or something else. And this could be in either table.
Assuming the ids look like numbers, you could filter "bad" ids out using regular expressions:
select id
from ncbi.affi_known1
where not id ~ '^[0-9]*$';
I have 2 tables, as follows:
Table ErrorCodes:
type_code desc
01 Error101
02 Error99
03 Error120
Table ErrorXML:
row_index typeCode
1 87
2 02
3 01
The output should be the description (column desc) of the first matched type_code between the 2 tables.
Expected output : Error99
This is how far I have gotten:

select isnull(descript, 'unknown') as DESCRIPTION
from (select top 1 a.[desc] as descript
      from ErrorCodes a, ErrorXML b
      where a.type_code = b.typeCode
      order by b.row_index) t

But this query doesn't return the string UNKNOWN when there is no common typecode (join condition) between the 2 tables. In this case, I'm getting null.
How can I resolve this?
This is an interesting question. I believe the following can be an intuitive and beautiful solution (I used desc_ as the column name rather than desc, which is a reserved word):
select (select desc_ from ErrorCodes x where x.type_code = a.typeCode) desc_
from ErrorXML a
where (select desc_ from ErrorCodes x where x.type_code = a.typeCode) is not null
order by row_index
limit 1;
If you also need to handle the case where the query returns no row, then for MySQL the following syntax should suffice. For other databases you can use similar encapsulation with isnull, nvl, etc.:
select ifnull((select (select desc_ from ErrorCodes x where x.type_code = a.typeCode) desc_ from ErrorXML a where (select desc_ from ErrorCodes x where x.type_code = a.typeCode) is not null order by row_index limit 1), 'UNKNOWN');
To test, I used the following scripts, and it seems to work properly:
create database if not exists stackoverflow;
use stackoverflow;
drop table if exists ErrorCodes;
create table ErrorCodes
(
type_code varchar(2),
desc_ varchar(10)
);
insert into ErrorCodes(type_code, desc_) values
('01', 'Error101'),
('02', 'Error99'),
('03', 'Error120');
drop table if exists ErrorXML;
create table ErrorXML
(
row_index integer,
typeCode varchar(2)
);
insert into ErrorXML(row_index, typeCode) values
('1', '87'),
('2', '02'),
('3', '01');
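For what it's worth, the first query can also be written as a plain join, which avoids repeating the correlated subquery (same tables, MySQL syntax):

select x.desc_
from ErrorXML a
join ErrorCodes x on x.type_code = a.typeCode
order by a.row_index
limit 1;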
Next-to-final note: while generating your tables, try to use the same column names as much as possible. E.g. I'd suggest ErrorXML use type_code rather than typeCode.
Final note: I choose to use lowercase letters in SQL, since capital letters should be reserved for emphasizing an important point. I also suggest that style.
What about this: do a subquery to bring back the first row_index for each type_code, then do a LEFT OUTER JOIN on the ErrorCodes table so that you get NULLs as well.
SELECT
    ISNULL(ErrorCodes.[desc], 'unknown') AS description,
    ErrorXML.row_index
FROM ErrorCodes
LEFT OUTER JOIN (
    SELECT type_code, MIN(row_index) AS row_index
    FROM ErrorXML
    GROUP BY type_code
) AS ErrorXML ON ErrorCodes.type_code = ErrorXML.type_code
I have 2 tables and 1 junction table:
table 1 (Log): | Id | Title | Date | ...
table 2 (Category): | Id | Title | ...
junction table between table 1 and 2:
LogCategory: | Id | LogId | CategoryId
Now I want a SQL query to get all logs with all category titles in one field, something like this:
LogId, LogTitle, ..., Categories (containing all category titles assigned to this log id)
Can anyone help me solve this? Thanks!
Try this code:
DECLARE @results TABLE
(
idLog int,
LogTitle varchar(20),
idCategory int,
CategoryTitle varchar(20)
)
INSERT INTO @results
SELECT l.idLog, l.LogTitle, c.idCategory, c.CategoryTitle
FROM
LogCategory lc
INNER JOIN Log l
ON lc.IdLog = l.IdLog
INNER JOIN Category c
ON lc.IdCategory = c.IdCategory
SELECT DISTINCT
idLog,
LogTitle,
STUFF (
(SELECT ', ' + r1.CategoryTitle
FROM @results r1
WHERE r1.idLog = r2.idLog
ORDER BY r1.idLog
FOR XML PATH ('')
), 1, 2, '')
FROM
@results r2
I'm sure this query can be written using only one select, but this way it is readable and I can explain what the code does.
The first select takes all Log - Category matches into a table variable.
The second part uses FOR XML to select the category names and return the result as XML instead of as a table. By using FOR XML PATH ('') and placing a ', ' in the select, all the XML tags are removed from the result.
And finally, the STUFF instruction replaces the initial ', ' characters of every row and writes an empty string instead; this way the string formatting is correct.
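On SQL Server 2017 or later, STRING_AGG gives the same result without the XML workaround; a sketch assuming the Id/Title column names from the question:

SELECT l.Id AS LogId,
       l.Title AS LogTitle,
       STRING_AGG(c.Title, ', ') AS Categories
FROM Log l
INNER JOIN LogCategory lc ON lc.LogId = l.Id
INNER JOIN Category c ON c.Id = lc.CategoryId
GROUP BY l.Id, l.Title;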
I have an Access table of the form (I'm simplifying it a bit)
ID AutoNumber Primary Key
SchemeName Text (50)
SchemeNumber Text (15)
This contains some data eg...
ID SchemeName SchemeNumber
--------------------------------------------------------------------
714 Malcolm ABC123
80 Malcolm ABC123
96 Malcolms Scheme ABC123
101 Malcolms Scheme ABC123
98 Malcolms Scheme DEF888
654 Another Scheme BAR876
543 Whatever Scheme KJL111
etc...
Now, I want to remove duplicate names under the same SchemeNumber, but leave the record which has the longest SchemeName for that scheme number. If there are duplicate records with the same longest length, I just want to keep one, say, the lowest ID (but any one will do really). From the above example I would want to delete IDs 714, 80 and 101 (leaving only 96).
I thought this would be relatively easy to achieve but it's turning into a bit of a nightmare! Thanks for any suggestions. I know I could loop through it programmatically but I'd rather have a single DELETE query.
See if this query returns the rows you want to keep:
SELECT r.SchemeNumber, r.SchemeName, Min(r.ID) AS MinOfID
FROM
(SELECT
SchemeNumber,
SchemeName,
Len(SchemeName) AS name_length,
ID
FROM tblSchemes
) AS r
INNER JOIN
(SELECT
SchemeNumber,
Max(Len(SchemeName)) AS name_length
FROM tblSchemes
GROUP BY SchemeNumber
) AS w
ON
(r.SchemeNumber = w.SchemeNumber)
AND (r.name_length = w.name_length)
GROUP BY r.SchemeNumber, r.SchemeName
ORDER BY r.SchemeName;
If so, save it as qrySchemes2Keep. Then create a DELETE query to discard rows from tblSchemes whose ID value is not found in qrySchemes2Keep.
DELETE
FROM tblSchemes AS s
WHERE Not Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID);
Just beware, if you later use Access' query designer to make changes to that DELETE query, it may "helpfully" convert the SQL to something like this:
DELETE s.*, Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID)
FROM tblSchemes AS s
WHERE (((Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID))=False));
DELETE FROM Table t1
WHERE EXISTS (SELECT 1 from Table t2
WHERE t1.SchemeNumber = t2.SchemeNumber
AND Length(t2.SchemeName) > Length(t1.SchemeName)
)
Depending on your RDBMS, you may need a different length function (Oracle: LENGTH, MySQL: LENGTH, SQL Server: LEN).
delete ShortScheme
from Scheme ShortScheme
join Scheme LongScheme
on ShortScheme.SchemeNumber = LongScheme.SchemeNumber
and (len(ShortScheme.SchemeName) < len(LongScheme.SchemeName) or (len(ShortScheme.SchemeName) = len(LongScheme.SchemeName) and ShortScheme.ID > LongScheme.ID))
(SQL Server flavored)
Now updated to include the specified tie resolution. You may get better performance, though, doing it in two queries: first deleting the schemes with shorter names as in my original query, then going back and deleting the higher ID where there was a tie in name length.
I'd do this in multiple steps. Large delete operations done in a single step make me too nervous -- what if you make a mistake? There's no SQL 'undo' statement.
-- Setup the data
DROP Table foo;
DROP Table bar;
DROP Table bat;
DROP Table baz;
CREATE TABLE foo (
id int(11) NOT NULL,
SchemeName varchar(50),
SchemeNumber varchar(15),
PRIMARY KEY (id)
);
insert into foo values (714, 'Malcolm', 'ABC123' );
insert into foo values (80, 'Malcolm', 'ABC123' );
insert into foo values (96, 'Malcolms Scheme', 'ABC123' );
insert into foo values (101, 'Malcolms Scheme', 'ABC123' );
insert into foo values (98, 'Malcolms Scheme', 'DEF888' );
insert into foo values (654, 'Another Scheme ', 'BAR876' );
insert into foo values (543, 'Whatever Scheme ', 'KJL111' );
-- Find all the records that have dups, find the longest one
create table bar as
select max(length(SchemeName)) as max_length, SchemeNumber
from foo
group by SchemeNumber
having count(*) > 1;
-- Find the one we want to keep
create table bat as
select min(a.id) as id, a.SchemeNumber
from foo a join bar b on a.SchemeNumber = b.SchemeNumber
and length(a.SchemeName) = b.max_length
group by SchemeNumber;
-- Select into this table all the rows to delete
create table baz as
select a.id from foo a join bat b on a.SchemeNumber = b.SchemeNumber
and a.id != b.id;
This will give you a new table with only records for rows that you want to remove.
Now check these out and make sure that they contain only the rows you want deleted. This way you can make sure that when you do the delete, you know exactly what to expect. It should also be pretty fast.
Then, when you're ready, delete the rows with:
delete from foo where id in (select id from baz);
This seems like more work because of the different tables, but it's safer and probably just as fast as the other ways. Plus you can stop at any step and make sure the data is what you want before you do any actual deletes.
If your platform supports ranking functions and common table expressions:
with cte as (
  select row_number()
    over (partition by SchemeNumber order by len(SchemeName) desc, ID) as rn
  from Table)
delete from cte where rn > 1;
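Note that deleting from a CTE like this deletes from the underlying table; SQL Server supports it as long as the CTE reads from a single table.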
try this:
Select * From Table t
Where Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber )
And Id >
(Select Min (Id)
From Table
Where SchemeNumber = t.SchemeNumber
And SchemeName = t.SchemeName)
or this:
Select * From Table t
Where Id >
(Select Min(Id) From Table
Where SchemeNumber = t.SchemeNumber
And Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber))
If either of these selects the records that should be deleted, just change it to a delete:
Delete
From Table t
Where Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber )
And Id >
(Select Min (Id)
From Table
Where SchemeNumber = t.SchemeNumber
And SchemeName = t.SchemeName)
or using the second construction:
Delete From Table t Where Id >
(Select Min(Id) From Table
Where SchemeNumber = t.SchemeNumber
And Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber))