SQL JOIN To Find Records That Don't Have a Matching Record With a Specific Value - sql

I'm trying to speed up some code that I wrote years ago for my employer's purchase authorization app. Basically I have a SLOW subquery that I'd like to replace with a JOIN (if it's faster).
When the director logs into the application he sees a list of purchase requests he has yet to authorize or deny. That list is generated with the following query:
SELECT * FROM SA_ORDER WHERE ORDER_ID NOT IN
(SELECT ORDER_ID FROM SA_SIGNATURES WHERE TYPE = 'administrative director');
There are only about 900 records in sa_order and 1800 records in sa_signature, yet this query still takes about 5 seconds to execute. I've tried using a LEFT JOIN to retrieve the records I need, but I've only been able to get sa_order records with NO matching records in sa_signature, and what I need are sa_order records with no matching records of type 'administrative director'. Your help is greatly appreciated!
The schema for the two tables is as follows:
CREATE TABLE sa_order
(
    `order_id` BIGINT PRIMARY KEY AUTO_INCREMENT,
    `order_number` BIGINT NOT NULL,
    `submit_date` DATE NOT NULL,
    `vendor_id` BIGINT NOT NULL,
    `denied` BOOLEAN NOT NULL DEFAULT FALSE,
    `memo` MEDIUMTEXT,
    `year_id` BIGINT NOT NULL,
    `advisor` VARCHAR(255) NOT NULL,
    `deleted` BOOLEAN NOT NULL DEFAULT FALSE
);
CREATE TABLE sa_signature
(
    `signature_id` BIGINT PRIMARY KEY AUTO_INCREMENT,
    `order_id` BIGINT NOT NULL,
    `signature` VARCHAR(255) NOT NULL,
    `proxy` BOOLEAN NOT NULL DEFAULT FALSE,
    `timestamp` TIMESTAMP NOT NULL DEFAULT NOW(),
    `username` VARCHAR(255) NOT NULL,
    `type` VARCHAR(255) NOT NULL
);

Create an index on sa_signatures (type, order_id).
With that index in place, it is not necessary to convert the query into a LEFT JOIN: the NOT IN will perform just as well (unless sa_signatures allows NULLs in order_id). However, just in case you're curious:
SELECT o.*
FROM sa_order o
LEFT JOIN sa_signatures s
    ON s.order_id = o.order_id
    AND s.type = 'administrative director'
WHERE s.type IS NULL
For the WHERE clause to work correctly, you should pick a NOT NULL column from sa_signatures (such as type here), so that NULL can only mean "no matching signature".
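For reference, the suggested index could be created like this (MySQL syntax; the index name is illustrative, and note that the question mixes the table names sa_signature and sa_signatures, so use whichever actually exists):
CREATE INDEX ix_sa_signatures_type_order
    ON sa_signatures (type, order_id);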

You could replace the [NOT] IN operator with EXISTS for faster performance.
So you'll have:
SELECT * FROM SA_ORDER WHERE NOT EXISTS
    (SELECT ORDER_ID FROM SA_SIGNATURES
     WHERE TYPE = 'administrative director'
     AND ORDER_ID = SA_ORDER.ORDER_ID);
Reason : "When using “NOT IN”, the query performs nested full table scans, whereas for “NOT EXISTS”, query can use an index within the sub-query."
Source : http://decipherinfosys.wordpress.com/2007/01/21/32/

The following query should work; however, I suspect your real issue is that you don't have the proper indexes in place. You should have an index on the SA_SIGNATURES table on the ORDER_ID column.
SELECT *
FROM SA_ORDER
LEFT JOIN SA_SIGNATURES
    ON SA_ORDER.ORDER_ID = SA_SIGNATURES.ORDER_ID
    AND TYPE = 'administrative director'
WHERE SA_SIGNATURES.ORDER_ID IS NULL;

SELECT * FROM sa_order AS o
INNER JOIN sa_signature AS s
    ON o.order_id = s.order_id
    AND s.type = 'administrative director';
Also, you can create a non-clustered index on type in the sa_signature table.
Even better: have a master table for types, with typeid and typename, and then instead of saving the type as text in your sa_signature table, simply save it as an integer; comparing integers is much faster than comparing text. A sketch of that layout follows.
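A minimal sketch of that normalization, using hypothetical names (signature_type, type_id, type_name):
CREATE TABLE signature_type
(
    `type_id` BIGINT PRIMARY KEY AUTO_INCREMENT,
    `type_name` VARCHAR(255) NOT NULL UNIQUE
);
-- sa_signature would then replace its `type` text column with an integer key:
-- `type_id` BIGINT NOT NULL  -- references signature_type (type_id)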

Related

How to find the columns that need to be indexed?

I'm starting to learn SQL and relational databases. Below is the table that I have; it has around 10 million records. My composite key is (reltype, from_product_id, to_product_id).
What strategy should I follow when selecting the columns to be indexed? I have also documented the operations that will be performed on the table. Please help me determine which columns, or combinations of columns, need to be indexed.
Table DDL is shown below.
Table name: prod_rel.
Database schema name : public
CREATE TABLE public.prod_rel (
    reltype varchar NULL,
    assocsequence float4 NULL,
    action varchar NULL,
    from_product_id varchar NOT NULL,
    to_product_id varchar NOT NULL,
    status varchar NULL,
    starttime varchar NULL,
    endtime varchar NULL,
    PRIMARY KEY (reltype, from_product_id, to_product_id)
);
Operations performed on table:
select distinct(reltype)
from public.prod_rel;
update public.prod_rel
set status = ? , starttime = ?
where from_product_id = ?;
update public.prod_rel
set status = ? , endtime = ?
where from_product_id = ?;
select *
from public.prod_rel
where from_product_id in (select distinct (from_product_id)
from public.prod_rel
where status = ?
and action in ('A', 'E', 'C', 'P')
and reltype = ?
fetch first 1000 rows only);
Note: I'm not performing any JOIN operations. Also, please ignore the inconsistent case of table and column names; I'm just getting started.
Ideal would be two indexes:
CREATE INDEX ON prod_rel (from_product_id);
CREATE INDEX ON prod_rel (status, reltype)
WHERE action IN ('A', 'E', 'C', 'P');
Your primary key (which is also implemented using an index) cannot support queries 2 and 3 because from_product_id is not at the beginning. If you redefine the primary key as from_product_id, to_product_id, reltype, you don't need the first index I suggested.
Why does order matter? Imagine you are looking for a book in a library where the books are ordered by “last name, first name”. You can use this ordering to find all books by “Dickens” quickly, but not all books by any “Charles”.
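As a sketch, the redefinition in PostgreSQL might look like this (assuming the default constraint name prod_rel_pkey; verify the actual name first):
ALTER TABLE public.prod_rel
    DROP CONSTRAINT prod_rel_pkey,
    ADD PRIMARY KEY (from_product_id, to_product_id, reltype);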
But let me also comment on your queries.
The first one will perform badly if there are lots of different reltype values; try raising work_mem in that case. It is always a sequential scan of the whole table, and no index can help.
I have changed the order of the primary key columns as shown below, as per a_horse_with_no_name's suggestion, and created only one index, on the (from_product_id, reltype, status, action) columns.
CREATE TABLE public.prod_rel (
    reltype varchar NULL,
    assocsequence float4 NULL,
    action varchar NULL,
    from_product_id varchar NOT NULL,
    to_product_id varchar NOT NULL,
    status varchar NULL,
    starttime varchar NULL,
    endtime varchar NULL,
    PRIMARY KEY (from_product_id, to_product_id, reltype)
);
Also, I have gone through the site suggested by a_horse_with_no_name. It was amazing; I learned a lot of new things about indexing.
https://use-the-index-luke.com/

Left join returns duplicate rows

I am just learning SQL and I'm really struggling to understand why my left join is returning duplicate rows. This is the query I'm using:
SELECT "id", "title"
FROM "posts"
LEFT JOIN "comments" "comment"
ON "comment"."post_id"="id" AND ("comment"."status" = 'hidden')
It returns 4 rows, but it should only return 3. Two of the returned rows are duplicates (same values). I can fix this by adding DISTINCT to the select list:
SELECT DISTINCT "id", "title"
FROM "posts"
LEFT JOIN "comments" "comment"
ON "comment"."post_id"="id" AND ("comment"."status" = 'hidden')
The query returns 3 rows and I get desired result. But I'm still wondering why in the world I would get a duplicate row from the first query in the first place? I'm trying to write an aggregation query and this seems to be the issue I'm having.
I'm using PostgreSQL.
More specifically, here is the actual schema (as created by my ORM):
Shift DDL
CREATE TABLE shift (
id uuid DEFAULT uuid_generate_v4() PRIMARY KEY,
"gigId" uuid REFERENCES gig(id) ON DELETE CASCADE,
"categoryId" uuid REFERENCES category(id),
notes text,
"createdAt" timestamp without time zone NOT NULL DEFAULT now(),
"updatedAt" timestamp without time zone NOT NULL DEFAULT now(),
"salaryFixed" numeric,
"salaryHourly" numeric,
"salaryCurrency" character varying(3) DEFAULT 'SEK'::character varying,
"staffingMethod" character varying(255) NOT NULL DEFAULT 'auto'::character varying,
"staffingIspublished" boolean NOT NULL DEFAULT false,
"staffingActivateon" timestamp with time zone,
"staffingTarget" integer NOT NULL DEFAULT 0
);
ShiftEmployee DDL
CREATE TABLE "shiftEmployee" (
"employeeId" uuid REFERENCES employee(id) ON DELETE CASCADE,
"shiftId" uuid REFERENCES shift(id) ON DELETE CASCADE,
status character varying(255) NOT NULL,
"updatedAt" timestamp without time zone NOT NULL DEFAULT now(),
"salaryFixed" numeric,
"salaryHourly" numeric,
"salaryCurrency" character varying(3) DEFAULT 'SEK'::character varying,
CONSTRAINT "PK_6acfd2e8f947cee5a62ebff08a5" PRIMARY KEY ("employeeId", "shiftId")
);
Query
SELECT "id", "staffingTarget" FROM "shift" LEFT JOIN "shiftEmployee" "se" ON "se"."shiftId"="id" AND ("se"."status" = 'confirmed');
Result
id staffingTarget
68bb0892-9bce-4d08-b40e-757cb0889e87 3
12d88ff7-9144-469f-8de5-3e316c4b3bbd 6
73c65656-e028-4f97-b855-43b00f953c7b 5
68bb0892-9bce-4d08-b40e-757cb0889e88 3
e3279b37-2ba5-4f1d-b896-70085f2ba345 4
e3279b37-2ba5-4f1d-b896-70085f2ba346 5
e3279b37-2ba5-4f1d-b896-70085f2ba346 5
789bd2fb-3915-4cda-a3d7-2186cf5bb01a 3
If a post has more than one hidden comment, you will see that post multiple times because a join returns one row for each match - that's the nature of a join. And an outer join doesn't behave differently.
If your intention is to list only posts with hidden comments, it's better to use an EXISTS query instead:
SELECT p.id, p.title
FROM posts p
WHERE EXISTS (SELECT *
              FROM comments c
              WHERE c.post_id = p.id
                AND c.status = 'hidden');
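Since the stated goal is an aggregation query, it is worth noting that the duplicate rows are exactly what GROUP BY consumes. A minimal sketch counting hidden comments per post, assuming the same posts/comments schema:
SELECT p.id, p.title, COUNT(c.post_id) AS hidden_comments
FROM posts p
LEFT JOIN comments c
    ON c.post_id = p.id
    AND c.status = 'hidden'
GROUP BY p.id, p.title;
COUNT(c.post_id) counts only matched comment rows, so posts without hidden comments show 0 instead of disappearing.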

Query optimization: connecting meta data to a value list table

I have a database containing a table with data and a meta data table. I want to create a View that selects certain meta data belonging to an item and list it as a column.
The basic query for the view is: SELECT * FROM item. The item table is defined as:
CREATE TABLE item (
    id      INTEGER PRIMARY KEY AUTOINCREMENT UNIQUE NOT NULL,
    traceid INTEGER REFERENCES trace (id) NOT NULL,
    freq    BIGINT NOT NULL,
    value   REAL NOT NULL
);
The metadata to be added are the rows where metadata.parameter = 'name'.
The meta table is defined as:
CREATE TABLE metadata (
    id        INTEGER PRIMARY KEY AUTOINCREMENT UNIQUE NOT NULL,
    parameter STRING NOT NULL COLLATE NOCASE,
    value     STRING NOT NULL COLLATE NOCASE,
    datasetid INTEGER NOT NULL REFERENCES dataset (id),
    traceid   INTEGER REFERENCES trace (id),
    itemid    INTEGER REFERENCES item (id)
);
The "name" parameter should be selected this way:
if a record exists where parameter is "name" and itemid matches item.id, then its value should be included in the record.
otherwise, if a record exists where parameter is "name", "itemid" is NULL, and traceid matches item.traceid, its value should be used
otherwise, the result should be NULL, but the record from the item table should be included anyway
Currently, I use the following query to achieve this goal:
SELECT i.*,
       COALESCE(
           MAX(CASE WHEN m.parameter = 'name' THEN m.value END),
           MAX(CASE WHEN m2.parameter = 'name' THEN m2.value END)
       ) AS itemname
FROM item i
LEFT JOIN metadata m
    ON (m.itemid = i.id AND m.parameter = 'name')
LEFT JOIN metadata m2
    ON (m2.itemid IS NULL AND m2.traceid = i.traceid AND m2.parameter = 'name')
GROUP BY i.id;
This query, however, is somewhat inefficient, as the metadata table is joined twice and contains many more records than just the 'name' ones. So I am looking for a way to improve speed, especially with the following extensions about to be implemented:
- a third level, "dataset", should be included: a parameter = 'name' record with the same datasetid as the item (looked up per item via another table that connects traceid and datasetid) should be used if no parameter = 'name' record matches on either itemid or traceid;
- more metadata should be queried by the view following the same scheme.
Any help is appreciated.
First of all, you can use one join instead of two, like this:
JOIN metadata m
    ON (m.parameter = 'name'
        AND (m.itemid = i.id OR (m.itemid IS NULL AND m.traceid = i.traceid)))
Then you can remove the COALESCE and use a simple select:
SELECT i.*, m.value AS itemname
The result should look like this:
SELECT i.*, m.value AS itemname
FROM item i
LEFT JOIN metadata m
    ON (m.parameter = 'name'
        AND (m.itemid = i.id OR (m.itemid IS NULL AND m.traceid = i.traceid)))
GROUP BY i.id;
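One caveat on this approach: when an item has both an item-level and a trace-level 'name', the single join produces two rows per group, and a bare m.value then picks an arbitrary one. If the item-level value must still take precedence (rules 1 and 2 in the question), the original CASE logic can be kept on top of the single join; a sketch, relying on SQLite allowing i.* alongside GROUP BY i.id:
SELECT i.*,
       COALESCE(
           MAX(CASE WHEN m.itemid = i.id THEN m.value END),  -- item-level name wins
           MAX(CASE WHEN m.itemid IS NULL THEN m.value END)  -- otherwise trace-level
       ) AS itemname
FROM item i
LEFT JOIN metadata m
    ON m.parameter = 'name'
    AND (m.itemid = i.id OR (m.itemid IS NULL AND m.traceid = i.traceid))
GROUP BY i.id;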

Bad SQLite query performance with outer joins

I have an SQLite database as part of an iOS app which works fine for the most part but certain small changes to a query can result in it taking 1000x longer to complete. Here's the 2 tables I have involved:
create table "journey_item" ("id" SERIAL NOT NULL PRIMARY KEY,
"position" INTEGER NOT NULL,
"last_update" BIGINT NOT NULL,
"rank" DOUBLE PRECISION NOT NULL,
"skipped" BOOLEAN NOT NULL,
"item_id" INTEGER NOT NULL,
"journey_id" INTEGER NOT NULL);
create table "content_items" ("id" SERIAL NOT NULL PRIMARY KEY,
"full_id" VARCHAR(32) NOT NULL,
"title" VARCHAR(508),
"timestamp" BIGINT NOT NULL,
"item_size" INTEGER NOT NULL,
"http_link" VARCHAR(254),
"local_url" VARCHAR(254),
"creator_id" INTEGER NOT NULL,
"from_id" INTEGER,"location_id" INTEGER);
Tables have indexes on primary and foreign keys.
And here are 2 queries which give a good example of my problem
SELECT * FROM content_items ci
INNER JOIN journey_item ji ON ji.item_id = ci.id WHERE ji.journey_id = 1
SELECT * FROM content_items ci
LEFT OUTER JOIN journey_item ji ON ji.item_id = ci.id WHERE ji.journey_id = 1
The first query takes 167 ms to complete while the second takes 3.5 minutes and I don't know why the outer join would make such a huge difference.
Edit:
Without the WHERE part the second query only takes 267 ms
The two queries should have the same result set (the WHERE clause turns the LEFT JOIN into an inner join). However, SQLite probably doesn't recognize this.
If you have an index on journey_item(journey_id, item_id), then this would be used for the inner join version. However, the second version is probably scanning the first table for the join. An index on journey_item(item_id) would help, but probably still not match the performance of the first query.
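For reference, the two indexes mentioned above could be created like this (the names are illustrative):
CREATE INDEX idx_journey_item_journey ON journey_item (journey_id, item_id);
CREATE INDEX idx_journey_item_item ON journey_item (item_id);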

Index Guidance for SQL Query

Anyone have guidance on how to approach building indexes for the following query? The query works as expected, but I can't seem to get around full table scans. Working with Oracle 11g.
SELECT v.volume_id
FROM ( SELECT MIN (usv.volume_id) volume_id
FROM user_stage_volume usv
WHERE usv.status = 'NEW'
AND NOT EXISTS
(SELECT 1
FROM user_stage_volume kusv
WHERE kusv.deal_num = usv.deal_num
AND kusv.locked = 'Y')
GROUP BY usv.deal_num, usv.volume_type
ORDER BY MAX (usv.priority) DESC, MIN (usv.last_update) ASC) v
WHERE ROWNUM = 1;
Please request any more info you may need in comments and I'll edit.
Here is the create script for the table. The PK is VOLUME_ID. DEAL_NUM is not unique.
CREATE TABLE ENDUR.USER_STAGE_VOLUME
(
    DEAL_NUM        NUMBER(38) NOT NULL,
    EXTERNAL_ID     NUMBER(38) NOT NULL,
    VOLUME_TYPE     NUMBER(38) NOT NULL,
    EXTERNAL_TYPE   VARCHAR2(100 BYTE) NOT NULL,
    GMT_START       DATE NOT NULL,
    GMT_END         DATE NOT NULL,
    VALUE           FLOAT(126) NOT NULL,
    VOLUME_ID       NUMBER(38) NOT NULL,
    PRIORITY        INTEGER NOT NULL,
    STATUS          VARCHAR2(100 BYTE) NOT NULL,
    LAST_UPDATE     DATE NOT NULL,
    LOCKED          CHAR(1 BYTE) NOT NULL,
    RETRY_COUNT     INTEGER DEFAULT 0 NOT NULL,
    INS_DATE        DATE NOT NULL
);
ALTER TABLE ENDUR.USER_STAGE_VOLUME ADD (
    PRIMARY KEY (VOLUME_ID));
An index on (deal_num) would help the subquery greatly. In fact, an index on (deal_num, locked) would allow the subquery to avoid the table itself altogether.
You should expect a full table scan on the main query, as it filters on status which is not indexed (and most likely would not benefit from being indexed, unless 'NEW' is a fairly rare value for status).
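As a sketch, in Oracle syntax (the index name is illustrative):
CREATE INDEX usv_deal_num_locked_ix
    ON endur.user_stage_volume (deal_num, locked);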
I think it's running your inner subquery (inside the NOT EXISTS) once for every row of the outer query.
That will be where performance takes a hit: it runs through all of user_stage_volume for each row in user_stage_volume, which is O(n^2), n being the number of rows in usv.
An alternative would be to create a view for the inner subquery and use that view, or to define it inline as a named subquery using WITH, as sketched below.
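A sketch of the WITH variant of that suggestion (untested against the real data; it factors the locked deals out once and anti-joins against them):
WITH locked_deals AS (
    SELECT DISTINCT deal_num
    FROM user_stage_volume
    WHERE locked = 'Y'
)
SELECT v.volume_id
FROM (SELECT MIN(usv.volume_id) volume_id
      FROM user_stage_volume usv
      LEFT JOIN locked_deals ld
          ON ld.deal_num = usv.deal_num
      WHERE usv.status = 'NEW'
        AND ld.deal_num IS NULL
      GROUP BY usv.deal_num, usv.volume_type
      ORDER BY MAX(usv.priority) DESC, MIN(usv.last_update) ASC) v
WHERE ROWNUM = 1;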