SQL Query Optimization to retrieve non-null entries - sql

Need help in optimizing SQL query, I have figured a way to solve the problem by using UNIONALL, but my worry is that performance will be impacted as the record set is huge in production env.
I have a table of records in below format, I need help in retrieving the non-null entries if available otherwise pick the null entries.
In the below case; Query should exclude RowIds 1,7 and retrieve everything else, i.e because there are non-null entries for that combination.
RowID
UniqueID
TrackId
1
325
NULL
2
325
8zUAC
3
325
99XER
4
427
NULL
5
632
2kYCV
6
533
NULL
7
774
NULL
8
774
94UAC
--UNIONALL Command
SELECT A.* FROM
( SELECT * FROM [MY_PKG].[TEMP] WHERE TRACKID is not null) A
WHERE A.UNIQUEID in
( SELECT UNIQUEID FROM [MY_PKG].[TEMP] WHERE TRACKID is null
)
UNION ALL
SELECT B.* FROM
( SELECT * FROM [MY_PKG].[TEMP] WHERE TRACKID is null) B
WHERE B.UNIQUEID not in
( SELECT UNIQUEID FROM [MY_PKG].[TEMP] WHERE TRACKID is not null
)
Temp Table Creation Scrip
CREATE TABLE MY_PKG.TEMP
( UNIQUEID varchar(3),
TRACKID varchar(5)
);
INSERT INTO MY_PKG.TEMP
( UNIQUEID, TRACKID)
VALUES
('325',null),
('325','8zUAC'),
('325','99XER'),
('427',null),
('632','2kYCV'),
('533','2kYCV'),
('774',null),
('774','94UAC')

You can use the NOT EXISTS operator with a correlated subquery:
SELECT * FROM TEMP T
WHERE TRACKID IS NOT NULL
OR (TRACKID IS NULL
AND NOT EXISTS(
SELECT 1 FROM TEMP D
WHERE D.UNIQUEID = T.UNIQUEID AND
D.TRACKID IS NOT NULL)
)
See demo

Related

Needing system defined function to select updated or unmatched new records from two tables

I am having a live data table in which the old values are placed,in a new table i am moving data from that live table to this one how to find updated or new records that are inserted or updated in new table with out using except,checksum(binary_checksum) and join ,i am looking for a solution using System Defined Function.
The requirement is interesting as the best solutions are to use EXCEPT or a FULL JOIN. What you are trying to do is what is referred to as an left anti semi join. Here's a good article about the topic.
Note this sample data and the solutions (note that my solution that does not use EXCEPT or a join is the last solution):
-- sample data
if object_id('tempdb.dbo.orig') is not null drop table dbo.orig;
if object_id('tempdb.dbo.new') is not null drop table dbo.new;
create table dbo.orig (someid int, col1 int, constraint uq_cl_orig unique (someid, col1));
create table dbo.new (someid int, col1 int, constraint uq_cl_new unique (someid, col1));
insert dbo.orig values (1,100),(2,110),(3,120),(4,2000)
insert dbo.new values (1,100),(2,110),(3,122),(5,999);
Here's the EXCEPT version
select someid
from
(
select * from dbo.new except
select * from dbo.orig
) n
union -- union "distict"
select someid
from
(
select * from dbo.orig except
select * from dbo.new
) o;
Here's a FULL JOIN Solution which will also tell you if the record was removed, changed or added:
select
someid = isnull(n.someid, o.someid),
[status] =
case
when count(isnull(n.someid, o.someid)) > 1 then 'changed'
when max(n.col1) is null then 'removed' else 'added'
end
from dbo.new n
full join dbo.orig o
on n.col1=o.col1 and n.someid = o.someid
where n.col1 is null or o.col1 is null
group by isnull(n.someid, o.someid);
But, because those efficient solutions are not an option - you will need to go with a NOT IN or NOT EXISTS subquery.... And because it has to be a function, I am encapsulating the logic into a function.
create function dbo.newOrChangedOrRemoved()
returns table as return
-- get the new records
select someid, [status] = 'new'
from dbo.new n
where n.someid not in (select someid from dbo.orig)
union all
-- get the removed records
select someid, 'removed'
from dbo.orig o
where o.someid not in (select someid from dbo.new)
union all
-- get the changed records
select someid, 'changed'
from dbo.orig o
where exists
(
select *
from dbo.new n
where o.someid = n.someid and o.col1 <> n.col1
);
Results:
someid status
----------- -------
5 new
4 removed
3 changed

SQL Server match and count on substring

I am using SQL Server 2008 R2 and have a table like this:
ID Record
1 IA12345
2 IA33333
3 IA33333
4 IA44444
5 MO12345
I am trying to put together some SQL to return the two rows that contain IA12345 and MO12345. So, I need to match on the partial string of the column "Record". What is complicating my SQL is that I don't want to return matches like IA33333 and IA33333. Clear as mud?
I am getting twisted up in substrings, group by, count and the like!
SELECT ID, Record FROM Table WHERE Record LIKE '%12345'
Select *
from MyTable
where Record like '%12345%'
This will find repeating and/or runs. For example 333 or 123 or 321
Think of it as Rummy 500
Declare #YourTable table (ID int,Record varchar(25))
Insert Into #YourTable values
( 1,'IA12345'),
( 2,'IA33333'),
( 3,'IA33333'),
( 4,'IA44444'),
( 5,'MO12345'),
( 6,'M785256') -- Will be excluded because there is no pattern
Declare #Num table (Num int);Insert Into #Num values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
Select Distinct A.*
From #YourTable A
Join (
Select Patt=replicate(Num,3) from #Num
Union All
Select Patt=right('000'+cast((Num*100+Num*10+Num)+12 as varchar(5)),3) from #Num where Num<8
Union All
Select Patt=reverse(right('000'+cast((Num*100+Num*10+Num)+12 as varchar(5)),3)) from #Num where Num<8
) B on CharIndex(Patt,Record)>0
Returns
ID Record
1 IA12345
2 IA33333
3 IA33333
4 IA44444
5 MO12345
EDIT
I should add that runs of 3 is too small, it is a small matter tweak the sub-queries so 333 becomes 3333 and 123 becomes 1234

How can I get a sql query that always would be multiple of n?

I need to obtain a query result that would be showing in multiples of a determined number (10 in my case), independent of the real quantity of rows (actually to solve a jasper problem).
For example, in this link, I build a example schema: http://sqlfiddle.com/#!3/c3dba/1/0
I'd like the result to be like this:
1 Item 1 1 10
2 Item 2 2 30
3 Item 3 5 15
4 Item 4 2 10
null null null null null
null null null null null
null null null null null
null null null null null
null null null null null
null null null null null
I have found this explanation, but doesn't work in SQLServer and I can't convert: http://community.jaspersoft.com/questions/514706/need-table-fixed-size-detail-block
Another option is to use a recursive CTE to get the pre-determined number of rows, then use a nested CTE construct to union rows from the recursive CTE with the original table and finally use a TOP clause to get the desired number of rows.
DECLARE #n INT = 10
;WITH Nulls AS (
SELECT 1 AS i
UNION ALL
SELECT i + 1 AS i
FROM Nulls
WHERE i < #n
),
itemsWithNulls AS
(
SELECT * FROM itens
UNION ALL
SELECT NULL, NULL, NULL, NULL FROM Nulls
)
SELECT TOP (#n) *
FROM itemsWithNulls
EDIT:
By reading the requirements more carefully, the OP actually wants the total number of rows returned to be a multiple of 10. E.g. if table itens has 4 rows then 10 rows should be returned, if itens has 12 rows then 20 rows should be return, etc.
In this case #n should be set to:
DECLARE #n INT = ((SELECT COUNT(*) FROM itens) / 10 + 1) * 10
We can actually fit everything inside a single sql statement with the use of nested CTEs:
;WITH NumberOfRows AS (
SELECT n = ((SELECT COUNT(*) FROM itens) / 10 + 1) * 10
), Nulls AS (
SELECT 1 AS i
UNION ALL
SELECT i + 1 AS i
FROM Nulls
WHERE i < (SELECT n FROM NumberOfRows)
),
itemsWithNulls AS
(
SELECT * FROM itens
UNION ALL
SELECT NULL, NULL, NULL, NULL FROM Nulls
)
SELECT TOP (SELECT n FROM NumberOfRows) *
FROM itemsWithNulls
SQL Fiddle here
This might work for you - use an arbitrary cross join to create a large number of null rows, and then union them back in with your real data. You'll need to pay extra attention to the ORDERING to ensure that it is the nulls at the bottom.
DECLARE #NumRows INT = 50;
SELECT TOP (#NumRows) *
FROM
(
SELECT * FROM itens
UNION
SELECT TOP (#NumRows) NULL, NULL, NULL, NULL
FROM sys.objects o1 CROSS JOIN sys.objects o2
) x
ORDER BY CASE WHEN x.ID IS NULL THEN 9999 ELSE ID END
Fiddle Demo
This is super simple. You use a tally as the main table in your query.
http://sqlfiddle.com/#!3/c3dba/20
You can read more about tally tables here.
http://www.sqlservercentral.com/articles/T-SQL/62867/
In the context of a proc/script, you can do your initial query into a table variable or temp table, check ##ROWCOUNT, or query the count of rows in the table, and then do a FOR loop to populate the rest of the rows. Finally, select * from your table variable/temp table.

Can this ORDER BY on a CASE clause be made faster?

I'm selecting results from a table of ~350 million records, and it's running extremely slowly - around 10 minutes. The culprit seems to be the ORDER BY, as if I remove it the query only takes a moment. Here's the gist:
SELECT TOP 100
(columns snipped)
FROM (
SELECT
CASE WHEN (e2.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
ORDER BY p1.RecordExists
Basically, I'm ordering the results by whether Files have a corresponding Record, as those without need to be handled first. I could run two queries with WHERE clauses, but I'd rather do it in a single query if possible.
Is there any way to speed this up?
The ultimate issue is that the use of CASE in the sub-query introduces an ORDER BY over something that is not being used in a sargable manner. Thus the entire intermediate result-set must first be ordered to find the TOP 100 - this is all 350+ million records!2
In this particular case, moving the CASE to the outside SELECT and use a DESC ordering (to put NULL values, which means "0" in the current RecordExists, first) should do the trick1. It's not a generic approach, though .. but the ordering should be much, much faster iff Files.ID is indexed. (If the query is still slow, consult the query plan to find out why ORDER BY is not using an index.)
Another alternative might be to include a persisted computed column for RecordExists (that is also indexed) that can be used as an index in the ORDER BY.
Once again, the idea is that the ORDER BY works over something sargable, which only requires reading sequentially inside the index (up to the desired number of records to match the outside limit) and not ordering 350+ million records on-the-fly :)
SQL Server is then able to push this ordering (and limit) down into the sub-query, instead of waiting for the intermediate result-set of the sub-query to come up. Look at the query plan differences based on what is being ordered.
1 Example:
SELECT TOP 100
-- If needed
CASE WHEN (p1.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM (
SELECT
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
-- Hopefully ID is indexed, DESC makes NULLs (!RecordExists) go first
ORDER BY p1.ID DESC
2 Actually, it seems like it could hypothetically just stop after the first 100 0's without a full-sort .. at least under some extreme query planner optimization under a closed function range, but that depends on when the 0's are encountered in the intermediate result set (in the first few thousand or not until the hundreds of millions or never?). I highly doubt SQL Server accounts for this extreme case anyway; that is, don't count on this still non-sargable behavior.
Give this form a try
SELECT TOP(100) *
FROM (
SELECT TOP(100)
0 AS RecordExists
--,(columns snipped)
FROM dbo.Files AS e1
WHERE NOT EXISTS (SELECT * FROM dbo.Records e2 WHERE e1.FID = e2.FID)
ORDER BY SecondaryOrderColumn
) X
UNION ALL
SELECT * FROM (
SELECT TOP(100)
1 AS RecordExists
--,(columns snipped)
FROM dbo.Files AS e1
INNER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
ORDER BY SecondaryOrderColumn
) X
ORDER BY SecondaryOrderColumn
Key indexes:
Records (FID)
Files (FID, SecondaryOrdercolumn)
Well the reason it is much slower is because it is really a very different query without the order by clause.
With the order by clause:
Find all matching records out of the entire 350 million rows. Then sort them.
Without the order by clause:
Find the first 100 matching records. Stop.
Q: If you say the only difference is "with/outout" the "order by", then could you somehow move the "top 100" into the inner select?
EXAMPLE:
SELECT
(columns snipped)
FROM (
SELECT TOP 100
CASE WHEN (e2.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
ORDER BY p1.RecordExists
In SQL Server, null values collate lower than any value in the domain. Given these two tables:
create table dbo.foo
(
id int not null identity(1,1) primary key clustered ,
name varchar(32) not null unique nonclustered ,
)
insert dbo.foo ( name ) values ( 'alpha' )
insert dbo.foo ( name ) values ( 'bravo' )
insert dbo.foo ( name ) values ( 'charlie' )
insert dbo.foo ( name ) values ( 'delta' )
insert dbo.foo ( name ) values ( 'echo' )
insert dbo.foo ( name ) values ( 'foxtrot' )
go
create table dbo.bar
(
id int not null identity(1,1) primary key clustered ,
foo_id int null foreign key references dbo.foo(id) ,
name varchar(32) not null unique nonclustered ,
)
go
insert dbo.bar( foo_id , name ) values( 1 , 'golf' )
insert dbo.bar( foo_id , name ) values( 5 , 'hotel' )
insert dbo.bar( foo_id , name ) values( 3 , 'india' )
insert dbo.bar( foo_id , name ) values( 5 , 'juliet' )
insert dbo.bar( foo_id , name ) values( 6 , 'kilo' )
go
The query
select *
from dbo.foo foo
left join dbo.bar bar on bar.foo_id = foo.id
order by bar.foo_id, foo.id
yields the following result set:
id name id foo_id name
-- ------- ---- ------ -------
2 bravo NULL NULL NULL
4 delta NULL NULL NULL
1 alpha 1 1 golf
3 charlie 3 3 india
5 echo 2 5 hotel
5 echo 4 5 juliet
6 foxtrot 5 6 kilo
(7 row(s) affected)
This should allow the query optimizer to use a suitable index (if such exists); however, it does not guarantee than any such index would be used.
Can you try this?
SELECT TOP 100
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
ORDER BY e2.ID ASC
This should give you where e2.ID is null first. Also, make sure Records.ID is indexed. This should give you the ordering you were wanting.

In postgresql, how can I fill in missing values within a column?

I'm trying to figure out how to fill in values that are missing from one column with the non-missing values from other rows that have the same value on a given column. For instance, in the below example, I'd want all the "1" values to be equal to Bob and all of the "2" values to be equal to John
ID # | Name
-------|-----
1 | Bob
1 | (null)
1 | (null)
2 | John
2 | (null)
2 | (null)
`
EDIT: One caveat is that I'm using postgresql 8.4 with Greenplum and so correlated subqueries are not supported.
CREATE TABLE bobjohn
( ID INTEGER NOT NULL
, zname varchar
);
INSERT INTO bobjohn(id, zname) VALUES
(1,'Bob') ,(1, NULL) ,(1, NULL)
,(2,'John') ,(2, NULL) ,(2, NULL)
;
UPDATE bobjohn dst
SET zname = src.zname
FROM bobjohn src
WHERE dst.id = src.id
AND dst.zname IS NULL
AND src.zname IS NOT NULL
;
SELECT * FROM bobjohn;
NOTE: this query will fail if more than one name exists for a given Id. (and it won't touch records for which no non-null name exists)
If you are on a postgres version >-9, you could use a CTE to fetch the source tuples (this is equivalent to a subquery, but is easier to write and read (IMHO). The CTE also tackles the duplicate values-problem (in a rather crude way):
--
-- CTE's dont work in update queries for Postgres version below 9
--
WITH uniq AS (
SELECT DISTINCT id
-- if there are more than one names for a given Id: pick the lowest
, min(zname) as zname
FROM bobjohn
WHERE zname IS NOT NULL
GROUP BY id
)
UPDATE bobjohn dst
SET zname = src.zname
FROM uniq src
WHERE dst.id = src.id
AND dst.zname IS NULL
;
SELECT * FROM bobjohn;
UPDATE tbl
SET name = x.name
FROM (
SELECT DISTINCT ON (id) id, name
FROM tbl
WHERE name IS NOT NULL
ORDER BY id, name
) x
WHERE x.id = tbl.id
AND tbl.name IS NULL;
DISTINCT ON does the job alone. Not need for additional aggregation.
In case of multiple values for name, the alphabetically first one (according to the current locale) is picked - that's what the ORDER BY id, name is for. If name is unambiguous you can omit that line.
Also, if there is at least one non-null value per id, you can omit WHERE name IS NOT NULL.
If you know for a fact that there are no conflicting values (multiple rows with the same ID but different, non-null names) then something like this will update the table appropriately:
UPDATE some_table AS t1
SET name = (
SELECT name
FROM some_table AS t2
WHERE t1.id = t2.id
AND name IS NOT NULL
LIMIT 1
)
WHERE name IS NULL;
If you only want to query the table and have this information filled in on the fly, you can use a similar query:
SELECT
t1.id,
(
SELECT name
FROM some_table AS t2
WHERE t1.id = t2.id
AND name IS NOT NULL
LIMIT 1
) AS name
FROM some_table AS t1;