Using Oracle 10g, I have a table that looks like this (syntax shortened for brevity):
CREATE TABLE "BUZINESS"."CATALOG"
( "ID" NUMBER PRIMARY KEY,
"ENTRY_ID" VARCHAR2(40 BYTE) NOT NULL,
"MSG_ID" VARCHAR2(40 BYTE) NOT NULL,
"PUBLISH_STATUS" VARCHAR2(30 BYTE) NOT NULL, /* Can be NEW or PUBLISHED */
CONSTRAINT "CATALOG_UN1" UNIQUE (ENTRY_ID, MSG_ID)
)
One process, Process A, writes Catalog entries with a PUBLISH_STATUS of 'NEW'. A second process, Process B, then comes in, grabs all 'NEW' messages, and then changes the PUBLISH_STATUS to 'PUBLISHED'.
I need to write a query that will grab all PUBLISH_STATUS='NEW' rows, BUT
I'm trying to prevent an out of order fetch, so that if Process B marks a row as PUBLISH_STATUS='PUBLISHED' with MSG_ID '1000', and then Process A writes an out of order row as PUBLISH_STATUS='NEW' with MSG_ID '999', the query will never fetch that row when grabbing all 'NEW' rows.
So, if I start with the data:
INSERT INTO BUZINESS.CATALOG VALUES (1, '1000', '999', 'NEW');
INSERT INTO BUZINESS.CATALOG VALUES (2, '1000', '1000', 'PUBLISHED');
INSERT INTO BUZINESS.CATALOG VALUES (3, '1000', '1001', 'NEW');
INSERT INTO BUZINESS.CATALOG VALUES (4, '2000', '1999', 'NEW');
INSERT INTO BUZINESS.CATALOG VALUES (5, '2000', '2000', 'PUBLISHED');
INSERT INTO BUZINESS.CATALOG VALUES (6, '2000', '2001', 'NEW');
INSERT INTO BUZINESS.CATALOG VALUES (7, '3000', '3001', 'NEW');
Then my query should grab only rows with ID:
3, 6, 7
I then have to join these rows with other data, so the result needs to be JOINable.
So far, I have a very large, ugly query UNIONing two correlated subqueries to do this. Could someone help me write a better query?
Requiring non-presence of joinable data is best solved with an outer join that filters out matching joins (leaving just the non-matches).
In your case, the join condition is a "published" row for the same entry with a later (higher) message id.
This query produces your desired output:
select t1.*
from buziness_catalog t1
left join buziness_catalog t2
on t2.entry_id = t1.entry_id
and to_number(t2.msg_id) > to_number(t1.msg_id)
and t2.publish_status = 'PUBLISHED'
where t1.publish_status = 'NEW'
and t2.id is null
order by t1.id
See the live demo of this query working with your sample data to produce your desired output. Note that I used a table name of "buziness_catalog" rather than "buziness.catalog" so the demo would run - you'll have to change the underscore back to a dot.
Being a join, and not based on an exists correlated subquery, this will perform quite well.
This query would have been a little simpler had your msg_id column been a numeric type (the conversion from character to numeric would not have been needed). If your ID data is actually numeric, consider changing the datatype of entry_id and msg_id to a numeric type.
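As a sanity check, here is a hedged sketch of the same anti-join run against an in-memory SQLite database (SQLite stands in for Oracle here, so CAST replaces TO_NUMBER, and the table is named buziness_catalog as in the demo):

```python
import sqlite3

# Build the sample data from the question in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE buziness_catalog (
    id INTEGER PRIMARY KEY,
    entry_id TEXT NOT NULL,
    msg_id TEXT NOT NULL,
    publish_status TEXT NOT NULL,
    UNIQUE (entry_id, msg_id)
);
INSERT INTO buziness_catalog VALUES
    (1, '1000', '999',  'NEW'),
    (2, '1000', '1000', 'PUBLISHED'),
    (3, '1000', '1001', 'NEW'),
    (4, '2000', '1999', 'NEW'),
    (5, '2000', '2000', 'PUBLISHED'),
    (6, '2000', '2001', 'NEW'),
    (7, '3000', '3001', 'NEW');
""")

# Anti-join: keep a NEW row only if no later PUBLISHED row
# exists for the same entry (t2.id IS NULL means "no match").
rows = conn.execute("""
    SELECT t1.id
    FROM buziness_catalog t1
    LEFT JOIN buziness_catalog t2
           ON t2.entry_id = t1.entry_id
          AND CAST(t2.msg_id AS INTEGER) > CAST(t1.msg_id AS INTEGER)
          AND t2.publish_status = 'PUBLISHED'
    WHERE t1.publish_status = 'NEW'
      AND t2.id IS NULL
    ORDER BY t1.id
""").fetchall()
print([r[0] for r in rows])  # [3, 6, 7]
```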
Reading between the lines, I think this might work:
select
*
from
buziness.catalog b1
where
b1.publish_status = 'NEW' and
not exists (
select
'x'
from
buziness.catalog b2
where
b1.entry_id = b2.entry_id and
b2.publish_status = 'PUBLISHED' and
to_number(b2.msg_id) > to_number(b1.msg_id) -- store numbers as numbers!
);
#Laurence's query looks good, but just to satisfy my curiosity, would you mind EXPLAINing this query too?
I think that numbers stored as VARCHAR will kill your ability to use indexes once wrapped in TO_NUMBER(), but I'm not sure about Oracle, so you'd better check that.
In case they do, you can always add additional number columns that you update with a trigger when rows are edited — so that you don't break the original design.
SELECT *
FROM buziness b1
WHERE PUBLISH_STATUS = 'NEW'
AND TO_NUMBER(msg_id) > COALESCE((
SELECT MAX(TO_NUMBER(msg_id))
FROM buziness b2
WHERE PUBLISH_STATUS = 'PUBLISHED'
AND b2.entry_id = b1.entry_id
), 0)
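For comparison, a quick check of this correlated-MAX approach against the question's sample data, sketched with SQLite standing in for Oracle (CAST replaces TO_NUMBER; the table name is shortened to buziness as in the query above):

```python
import sqlite3

# Load the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buziness (id INTEGER, entry_id TEXT, "
             "msg_id TEXT, publish_status TEXT)")
conn.executemany("INSERT INTO buziness VALUES (?,?,?,?)", [
    (1, '1000', '999',  'NEW'), (2, '1000', '1000', 'PUBLISHED'),
    (3, '1000', '1001', 'NEW'), (4, '2000', '1999', 'NEW'),
    (5, '2000', '2000', 'PUBLISHED'), (6, '2000', '2001', 'NEW'),
    (7, '3000', '3001', 'NEW'),
])

# Keep NEW rows whose msg_id exceeds the highest PUBLISHED
# msg_id for the same entry (0 when nothing is published yet).
ids = [r[0] for r in conn.execute("""
    SELECT id FROM buziness b1
    WHERE publish_status = 'NEW'
      AND CAST(msg_id AS INTEGER) > COALESCE((
            SELECT MAX(CAST(msg_id AS INTEGER))
            FROM buziness b2
            WHERE b2.publish_status = 'PUBLISHED'
              AND b2.entry_id = b1.entry_id), 0)
    ORDER BY id
""")]
print(ids)  # [3, 6, 7]
```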
Although this is a very old post, I still feel the need to reply here, as I suspect it is based on a misconception/misunderstanding. Oracle, like many other RDBMSes, holds to the principles of ACID, where the I stands for Isolation. No process X will see the result of another process Y before Y has committed and X started after Y. So one process altering another process's view of the data is not possible.
If you're not convinced, run the query that updates and don't commit. Start another session and query the data again and again until the first query changes it. It will never change for the other sessions until you commit your changes in the first session; you will keep reading the snapshot of the data in the state it was in when you started your query, before the other process committed it for all to see.
Related
A piece of SQL that I have written is not behaving as intended. A vital piece of logic involves counting how many guests are VIPs, but the SQL seems to consistently get an incorrect answer.
The following database has 6 guests, 3 of whom are VIPs.
CREATE TABLE `guest` (
`GuestID` int(11) NOT NULL DEFAULT '0',
`fullname` varchar(255) DEFAULT NULL,
`vip` tinyint(1) DEFAULT '0'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
--
-- Dumping data for table `guest`
--
INSERT INTO `guest` (`GuestID`, `fullname`, `vip`) VALUES
(912, 'Sam', 0),
(321, 'Sev', 0),
(629, 'Joe', 0),
(103, 'Tom', 1),
(331, 'Cao', 1),
(526, 'Conor', 1);
Initially the SQL returned a value saying that there were 5 VIPs, which is incorrect as there are only 3 VIPs. This is quite a complicated database, and in generating a minimum viable example for the sake of this question (with a reproducible error) the script now states that there are only 2 VIPs. Again, this is incorrect.
The SQL in question is
SELECT slotguest.FK_SlotNo, Count(CASE WHEN guest.vip = 1 THEN 1 END) AS guest_count
FROM guest
INNER JOIN slotguest ON guest.GuestID = slotguest.FK_guest
GROUP BY slotguest.FK_SlotNo;
The slotguest structure and content is as follows
CREATE TABLE `slotguest` (
`FK_SlotNo` int(11) NOT NULL,
`FK_guest` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
--
-- Dumping data for table `slotguest`
--
INSERT INTO `slotguest` (`FK_SlotNo`, `FK_guest`) VALUES
(396, 912),
(396, 321),
(396, 629),
(396, 103),
(396, 331),
(396, 526);
What is causing Count to come up with a consistently incorrect answer?
As indicated in the comments (from users #Fábio Amorim and #Rajat), your query seems to work as intended. Since you set a value with the CASE WHEN, it might be better to use SUM.
It might become more visible if you break out the counts per VIP category, to find where the extra rows are coming from.
SELECT guest.vip, slotguest.FK_SlotNo, COUNT(*) AS guest_per_category
FROM guest
INNER JOIN slotguest ON guest.GuestID = slotguest.FK_guest
GROUP BY guest.vip,slotguest.FK_SlotNo;
Smells like "explode-implode". Given
SELECT ... COUNT(*)
FROM a JOIN b ...
GROUP BY ...
The query is performed thus:
JOIN the tables. Assuming the tables are not trivially 1:1, this leads to more rows than either of the tables.
Do aggregates (such as COUNT) against that temp table.
Only then does the GROUP BY shrink back to the originally desired number of rows.
The solution is to avoid doing aggregates with more than the one table that contains the data being counted/summed. Sometimes the pattern is
SELECT ...
FROM ( SELECT x, COUNT(*) AS ct FROM a GROUP BY x ) AS b
JOIN c ON ...
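A small runnable sketch of the "explode-implode" effect, using hypothetical tables `a` and `b` (not the question's tables): when each `a` row matches several `b` rows, COUNT(*) after the JOIN counts joined rows, not `a` rows.

```python
import sqlite3

# Hypothetical data: 2 rows in `a` and 3 rows in `b` for the same key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (x INTEGER, val INTEGER);
CREATE TABLE b (x INTEGER, other INTEGER);
INSERT INTO a VALUES (1, 10), (1, 20);              -- 2 rows for x = 1
INSERT INTO b VALUES (1, 100), (1, 200), (1, 300);  -- 3 rows for x = 1
""")

# Counting after the JOIN: the join produces 2 x 3 = 6 rows.
inflated = conn.execute(
    "SELECT COUNT(*) FROM a JOIN b ON a.x = b.x GROUP BY a.x"
).fetchone()[0]
print(inflated)  # 6

# Pre-aggregate `a` in a subquery before joining, as the pattern suggests.
correct = conn.execute("""
    SELECT sub.ct
    FROM (SELECT x, COUNT(*) AS ct FROM a GROUP BY x) AS sub
    JOIN b ON sub.x = b.x
    GROUP BY sub.x
""").fetchone()[0]
print(correct)  # 2
```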
To explain what's wrong, and give an answer closer to the O.P.'s query ...
(I assume the O.P. is a cut-down example of what's going wrong, and the actual query is more complex. If we knew the bigger picture, I suspect I wouldn't code it like that.)
In the O.P.'s query, CASE WHEN guest.vip = 1 THEN 1 END has no ELSE branch. That's a conditional expression; it returns no explicit value for some of the rows retrieved by the query -- that is, for rows where guest.vip <> 1.
For those rows the CASE implicitly returns Null, and Count( ) ignores Nulls -- which is why it produces the expected answer in the commenters' tests. Relying on that implicit Null is fragile and easy to misread, though; this is one of the more horrible consequences of Null in SQL.
So, as per #Fábio Amorim's comment, give the CASE an ELSE. Count( ) would then count the 0s too and give an unhelpful result, so get the ELSE to return 0 and Sum( ) the 1s and 0s:
SELECT slotguest.FK_SlotNo, Sum(CASE WHEN guest.vip = 1 THEN 1 ELSE 0 END) AS guest_count
FROM guest
INNER JOIN slotguest ON guest.GuestID = slotguest.FK_guest
GROUP BY slotguest.FK_SlotNo;
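A minimal reproduction with the question's own data, run in SQLite: SUM(CASE ... ELSE 0 END) counts exactly the 3 VIP guests in slot 396.

```python
import sqlite3

# Recreate the guest and slotguest data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE guest (GuestID INTEGER, fullname TEXT, vip INTEGER DEFAULT 0);
CREATE TABLE slotguest (FK_SlotNo INTEGER, FK_guest INTEGER);
INSERT INTO guest VALUES (912,'Sam',0),(321,'Sev',0),(629,'Joe',0),
                         (103,'Tom',1),(331,'Cao',1),(526,'Conor',1);
INSERT INTO slotguest VALUES (396,912),(396,321),(396,629),
                             (396,103),(396,331),(396,526);
""")

# SUM over an explicit 1/0 per row: every joined row contributes a value.
slot, vips = conn.execute("""
    SELECT slotguest.FK_SlotNo,
           SUM(CASE WHEN guest.vip = 1 THEN 1 ELSE 0 END) AS guest_count
    FROM guest
    INNER JOIN slotguest ON guest.GuestID = slotguest.FK_guest
    GROUP BY slotguest.FK_SlotNo
""").fetchone()
print(slot, vips)  # 396 3
```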
I have two table one of which contains the rule for another
create table t1(id int, query string)
create table t2(id int, place string)
insert into t1 values (1,'id < 10')
insert into t1 values (2,'id == 10')
And the values in t2 are
insert into t2 values (11,'Nevada')
insert into t2 values (20,'Texas')
insert into t2 values (10,'Arizona')
insert into t2 values (2,'Abegal')
I need to select from the second table according to the value in the first table's query column, like:
select * from t2 where {query}
or
with x(query)
as
(select c2 from test)
select * from test where query;
but neither is helping.
There are a couple of problems with storing criteria in a table like this:
First, as has already been noted, you'll likely have to resort to dynamic SQL, which can get messy, and limits how you can use it.
It's going to be problematic (to say the least) to validate and parse your criteria. What if someone writes a rule of [id] *= 10, or [this_field_doesn't_exist] = blah?
If you're just storing potential values for your [id] column, one solution would be to have your t1 (storing your queries) include a min value and max value, like this:
CREATE TABLE t1
(
[id] INT IDENTITY(1,1) PRIMARY KEY,
min_value INT NULL,
max_value INT NULL
)
Note that both the min and max values can be null. Your provided criteria would then be expressed as this:
INSERT INTO t1
([id], min_value, max_value)
VALUES
(1, NULL, 10),
(2, 10, 10)
Note that I've explicitly referenced what attributes we're inserting into, as you should also be doing (to prevent issues with attributes being added/modified down the line).
A null value on min_value means no lower limit; a null max_value means no upper limit.
To then get results from t2 that meet all your t1 criteria, simply do an INNER JOIN:
SELECT t2.*
FROM t2
INNER JOIN t1 ON
(t2.id <= t1.max_value OR t1.max_value IS NULL)
AND
(t2.id >= t1.min_value OR t1.min_value IS NULL)
Note that, as I said, this will only return results that match all your criteria. If you need to more complex logic (for example, show records that meet Rules 1, 2 and 3, or meet Rule 4), you'll likely have to resort to dynamic SQL (or at the very least some ugly JOINs).
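A runnable sketch of the min/max rule table joined against t2, using SQLite. I've added DISTINCT, since a t2 row (id 10 here) can match more than one rule; the original rule 'id < 10' is encoded as (1, NULL, 10) exactly as in the INSERT above.

```python
import sqlite3

# Rule table with NULL meaning "no bound", plus the question's t2 data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INTEGER PRIMARY KEY, min_value INTEGER, max_value INTEGER);
CREATE TABLE t2 (id INTEGER, place TEXT);
INSERT INTO t1 (id, min_value, max_value) VALUES (1, NULL, 10), (2, 10, 10);
INSERT INTO t2 VALUES (11,'Nevada'),(20,'Texas'),(10,'Arizona'),(2,'Abegal');
""")

# A t2 row qualifies when it falls inside some rule's [min, max] range.
rows = conn.execute("""
    SELECT DISTINCT t2.id, t2.place
    FROM t2
    INNER JOIN t1
       ON (t2.id <= t1.max_value OR t1.max_value IS NULL)
      AND (t2.id >= t1.min_value OR t1.min_value IS NULL)
    ORDER BY t2.id
""").fetchall()
print(rows)  # [(2, 'Abegal'), (10, 'Arizona')]
```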
As stated in a comment, however, you want to have more complex rules, which might mean you have to end up using dynamic SQL. However, you still have the problem of validating and parsing your rule. How do you handle cases where a user enters an invalid rule?
A better solution might be to store your rules in a format that can easily be parsed and validated. For example, come up with an XML schema that defines a valid rule/criterion. Then, your Rules table would have a rule XML attribute, tied to that schema, so users could only enter valid rules. You could then either shred that XML document, or create the SQL client-side to come up with your query.
I found the answer myself, and I am putting it below.
I used the Python CLI to do the job (as Snowflake does not support dynamic queries).
I believe one can use the same approach for other DBs (tedious, but doable).
setting up configuration to connect
import json
import snowflake.connector

CONFIG_PATH = "/root/config/snowflake.json"
with open(CONFIG_PATH) as f:
    config = json.load(f)

# snowflake
snf_user = config['snowflake']['user']
snf_pwd = config['snowflake']['pwd']
snf_account = config['snowflake']['account']
snf_region = config['snowflake']['region']
snf_role = config['snowflake']['role']

ctx = snowflake.connector.connect(
    user=snf_user,
    password=snf_pwd,
    account=snf_account,
    region=snf_region,
    role=snf_role
)
I used multiple cursors because, inside the loop, we don't want a recursive connection:
cs = ctx.cursor()
cs1 = ctx.cursor()
query = "select c2 from test"
cs.execute(query)
for (x) in cs:
    y = "select * from test1 where {0}".format(', '.join(x).replace("'", ""))
    cs1.execute(y)
    for (y1) in cs1:
        print('{0}'.format(y1))
And boom done
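The same two-cursor idea is portable. Here is a hedged sketch of it against SQLite with the question's t1/t2 data; note that interpolating stored rules into SQL is inherently unsafe with untrusted input, so validate rules before running them.

```python
import sqlite3

# Question's data: t1 holds the rules, t2 holds the rows to filter.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INTEGER, query TEXT);
CREATE TABLE t2 (id INTEGER, place TEXT);
INSERT INTO t1 VALUES (1, 'id < 10'), (2, 'id == 10');
INSERT INTO t2 VALUES (11,'Nevada'),(20,'Texas'),(10,'Arizona'),(2,'Abegal');
""")

# Fetch all rules first, then run each one as a dynamic WHERE clause.
rules = [r[0] for r in conn.execute("SELECT query FROM t1 ORDER BY id")]
results = []
for rule in rules:
    results.extend(
        conn.execute("SELECT * FROM t2 WHERE {0} ORDER BY id".format(rule)))
print(results)  # [(2, 'Abegal'), (10, 'Arizona')]
```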
I have a table that contains {service_id, service_name, region_name}.
As input my procedure gets i_service_id and i_svc_region, a list of key/value pairs which has {service_name, region}.
I have to insert into the table if the record does not already exist. I know it is a very simple query, but do the two queries below make any difference in performance?
which one is better and why?
MERGE INTO SERVICE_REGION_MAP table1
USING
(SELECT i_svc_region(i).key as service_name,i_enabled_regions(i).value as region
FROM dual) table2
ON (table1.service_id =i_service_id and table1.region=table2.region)
WHEN NOT MATCHED THEN
INSERT (service_id,service_name ,region) VALUES (i_service_id ,table2.service_name,table2.region);
i_service_id - is passed as it is.
MERGE INTO SERVICE_REGION_MAP table1
USING
(SELECT i_service_id as service_id, i_svc_region(i).key as service_name,i_enabled_regions(i).value as region
FROM dual) table2
ON (table1.service_id =table2.service_id and table1.region=table2.region)
WHEN NOT MATCHED THEN
INSERT (service_id,service_name ,region) VALUES (table2.service_id,table2.service_name,table2.region);
i_service_id is considered as column in table.
Does this really make any difference?
You should be using the FORALL statement. It will result in much faster performance than any looping we could write. Check out the documentation, starting with https://docs.oracle.com/database/121/LNPLS/forall_statement.htm#LNPLS01321
As #Brian Leach suggests the FORALL will give you a single round trip to SQL engine for all of the elements (i's) in your table. This can give between 10 and 100 times improvement depending on table size and many other things beyond me.
Also you are only using the INSERT capability of MERGE so a time honoured INSERT statement should make life easier/faster for the database. MERGE has more bells and whistles which can slow it down.
So try something like:
FORALL i IN 1 .. i_svc_region.COUNT
  INSERT INTO SERVICE_REGION_MAP
    (service_id, service_name, region)
  SELECT
    i_service_id,
    i_svc_region(i).KEY,
    i_enabled_regions(i).VALUE
  FROM DUAL
  WHERE NOT EXISTS
    ( SELECT *
      FROM SERVICE_REGION_MAP t
      WHERE t.service_id = i_service_id
        AND t.region = i_enabled_regions(i).VALUE
    );
CREATE TABLE object (
object_id serial,
object_attribute_1 integer,
object_attribute_2 VARCHAR(255)
)
-- primary key object_id
-- btree index on object_attribute_1, object_attribute_2
Here is what I currently have:
SELECT * FROM object
WHERE (object_attribute_1=100 AND object_attribute_2='Some String') OR
(object_attribute_1=200 AND object_attribute_2='Some other String') OR
(..another row..) OR
(..another row..)
When the query returns, I check for what is missing (thus, does not exist in the database).
Then I will make an multiple row insert:
INSERT INTO object (object_attribute_1, object_attribute_2)
VALUES (info, info), (info, info),(info, info)
Then I will select what I just inserted
SELECT ... WHERE (condition) OR (condition) OR ...
And at last, I will merge the two selects on the client side.
Is there a way to combine these 3 queries into one single query, where I provide all the data, INSERT the records that do not already exist, and then do a SELECT at the end?
Your suspicion was well founded. Do it all in a single statement using a data-modifying CTE (Postgres 9.1+):
WITH list(object_attribute_1, object_attribute_2) AS (
VALUES
(100, 'Some String')
, (200, 'Some other String')
, .....
)
, ins AS (
INSERT INTO object (object_attribute_1, object_attribute_2)
SELECT l.*
FROM list l
LEFT JOIN object o1 USING (object_attribute_1, object_attribute_2)
WHERE o1.object_attribute_1 IS NULL
RETURNING *
)
SELECT * FROM ins -- newly inserted rows
UNION ALL -- append pre-existing rows
SELECT o.*
FROM list l
JOIN object o USING (object_attribute_1, object_attribute_2);
Note, there is a tiny time frame for a race condition. So this might break if many clients try it at the same time. If you are working under heavy concurrent load, consider this related answer, in particular the part on locking or serializable transaction isolation:
Postgresql batch insert or ignore
I have a simple table as follows
SQL> select * from test;
ID STUFF
---------- ------------------------------------------------------------
1 a
2 b
3 c
4 d
5 e
6 f
7 g
7 rows selected.
I'd like to construct a query that returns something like this:
STUFF A STUFF B
---------- --------------------------------------
a e
b f
c g
d NULL
That is, take two ranges determined by the id, with missing values padded by NULL. The ranges are continuous, may overlap, and are different lengths.
Is this possible? If so, what's the query?
Temp table sql:
CREATE TABLE test(id number, stuff VARCHAR(20));
INSERT INTO test VALUES (1, 'a');
INSERT INTO test VALUES (2, 'b');
INSERT INTO test VALUES (3, 'c');
INSERT INTO test VALUES (4, 'd');
INSERT INTO test VALUES (5, 'e');
INSERT INTO test VALUES (6, 'f');
INSERT INTO test VALUES (7, 'g');
select a.stuff as stuffa, b.stuff as stuffb
from test as a
left join test as b
on (a.id-:minida) = (b.id-:minidb)
and b.id between :minidb and :maxidb
where a.id between :minida and :maxida
(where the colon denotes placeholders for values that get bound to a prepared statement) should work if (maxidb-minidb) <= (maxida-minida). But this doesn't work in a completely symmetrical way (where either range may be larger than the other).
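A runnable check of the offset left join, with SQLite's named parameters standing in for the prepared-statement binds; range A is ids 1..4, range B is ids 5..7:

```python
import sqlite3

# Build the 7-row test table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, stuff TEXT)")
conn.executemany("INSERT INTO test VALUES (?,?)",
                 enumerate('abcdefg', start=1))

# Align the two ranges by offset from each range's start;
# B rows past the end of the range simply fail to match (NULL).
params = {"minida": 1, "maxida": 4, "minidb": 5, "maxidb": 7}
rows = conn.execute("""
    SELECT a.stuff AS stuffa, b.stuff AS stuffb
    FROM test AS a
    LEFT JOIN test AS b
      ON (a.id - :minida) = (b.id - :minidb)
     AND b.id BETWEEN :minidb AND :maxidb
    WHERE a.id BETWEEN :minida AND :maxida
    ORDER BY a.id
""", params).fetchall()
print(rows)  # [('a', 'e'), ('b', 'f'), ('c', 'g'), ('d', None)]
```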
A totally symmetrical query as you describe could no doubt be written as a remarkably tedious UNION: each part repeated (with appropriate swapping), adding the above <= expression as a condition the first time, and > instead the second time (asymmetrically, in case the ranges are equal;-), so that one of the two halves of the union is guaranteed to be empty. But I'd have to be seriously compensated for the tedium of actually writing out said union;-).
If your favorite SQL dialect supports FULL OUTER JOIN then that could help... but many dialects, such as MySQL and SQLite, do not support the FULL version.
I'm having a hard time wrapping my head around what you're trying to do in general. But this statement will work on the example you gave. Hopefully that points you in the right direction for the more generic solution.
select stuff as stuffa,
       (select top 1 stuff from test where id >= t.id + 4) as stuffb
from test t
where t.id <= 4
#Alex
Nice work. I echo your concerns for the tedium in generating symmetric results. Don't bother. If it's going to become complex for this simple example, it will become a nightmare for production code (not to mention the possibility of comparing more than two ranges).
#tetra
Basically any valid range for id. The lengths and position of the ranges are unknown until deployment.
#Everyone
A neat solution is preferred, and "it can't be done neatly" is a fine solution. I can tell the client, 'it can't be done'.