CTE vs. IN clause performance - sql

Alright SQL Server Gurus, fire up your analyzers.
I have a list of titles in application memory (250 or so).
I have a database table "books" with more than a million records. One of its columns is "title", of type nvarchar.
The "books" table has another column called "ISBN".
books.title is not a primary key, is not unique, but is indexed.
So I'd like to know which is more efficient:
WITH titles AS (select 'Catcher and the Rye' as Title
union all select 'Harry Potter ...'
...
union all select 'The World Is Flat')
select ISBN from books, titles where books.title = titles.title;
OR:
select ISBN from books where title in ('Catcher and the Rye','Harry Potter',...,'The World Is Flat');
OR:
???

I hope you have ISBN as an included column on the title index, to avoid key lookups:
CREATE INDEX IX_Titles ON dbo.Books (Title) INCLUDE (ISBN)
Now, IN vs JOIN vs EXISTS is a common question here. The CTE is irrelevant except for readability. Personally, I'd use EXISTS, because a JOIN returns a book once per matching row in the list, so a title that appears twice in the list gives you duplicates, which folk often forget.
;WITH titles AS (select 'Catcher and the Rye' as Title
union all select 'Harry Potter ...'
...
union all select 'The World Is Flat')
SELECT
ISBN
FROM
books
WHERE
EXISTS (SELECT * -- or NULL; it makes no difference
FROM
titles
WHERE
titles.Title = books.Title)
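The difference between the JOIN and EXISTS forms can be checked on a toy data set. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table contents and ISBN values are made up. It shows that a join repeats a book once per matching row in the title list, while EXISTS returns each book at most once.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (isbn TEXT, title TEXT);
    CREATE INDEX ix_titles ON books (title, isbn);
    INSERT INTO books VALUES
        ('111', 'Harry Potter'),
        ('222', 'The World Is Flat'),
        ('333', 'Dune');

    -- The search list, with 'Harry Potter' deliberately listed twice.
    CREATE TEMP TABLE titles (title TEXT);
    INSERT INTO titles VALUES
        ('Harry Potter'), ('Harry Potter'), ('The World Is Flat');
""")

# JOIN: one output row per (book, matching list row) pair.
join_rows = conn.execute("""
    SELECT b.isbn
    FROM books b
    JOIN titles t ON t.title = b.title
    ORDER BY b.isbn
""").fetchall()

# EXISTS: a semi-join, so each book appears at most once.
exists_rows = conn.execute("""
    SELECT b.isbn
    FROM books b
    WHERE EXISTS (SELECT 1 FROM titles t WHERE t.title = b.title)
    ORDER BY b.isbn
""").fetchall()

print(join_rows)    # [('111',), ('111',), ('222',)]
print(exists_rows)  # [('111',), ('222',)]
```

This says nothing about which form is faster on SQL Server; it only demonstrates the difference in result sets.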
However, one construct I'd consider is this, to force "intermediate materialisation" of my list of search titles. This also applies to an EXISTS or CTE solution. It is likely to help the optimiser considerably.
Edit: a temp table is a better option, really, as Steve mentioned in his comment
SELECT
ISBN
FROM
(
SELECT TOP 2000000000
Title
FROM
(select 'Catcher and the Rye' as Title
union all select 'Harry Potter ...'
...
union all select 'The World Is Flat'
) foo
ORDER BY
Title
) bar
JOIN
books On bar.Title = books.Title
SELECT
ISBN
FROM
books
WHERE
EXISTS (SELECT * -- or NULL; it makes no difference
FROM
(
SELECT TOP 2000000000
Title
FROM
(select 'Catcher and the Rye' as Title
union all select 'Harry Potter ...'
...
union all select 'The World Is Flat'
) foo
ORDER BY
Title
) bar
WHERE
bar.Title = books.Title)
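The temp table option mentioned in the edit above could look roughly like this from the application side. This is a minimal sketch under stated assumptions: Python's sqlite3 stands in for SQL Server, the table name search_titles and the sample rows are hypothetical, and INSERT OR IGNORE is SQLite syntax for skipping duplicate titles.

```python
import sqlite3

# The list held in application memory (about 250 titles in the question).
titles_in_memory = ["Catcher and the Rye", "The World Is Flat", "Not In The Table"]

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in (assumption)
conn.executescript("""
    CREATE TABLE books (isbn TEXT, title TEXT);
    CREATE INDEX ix_titles ON books (title, isbn);
    INSERT INTO books VALUES
        ('111', 'Catcher and the Rye'),
        ('222', 'The World Is Flat'),
        ('333', 'Dune');

    -- Hypothetical temp table; the primary key keeps the list duplicate-free.
    CREATE TEMP TABLE search_titles (title TEXT PRIMARY KEY);
""")

# Load the list with bound parameters instead of building a long SQL string.
conn.executemany(
    "INSERT OR IGNORE INTO search_titles (title) VALUES (?)",
    [(t,) for t in titles_in_memory],
)

isbns = [row[0] for row in conn.execute("""
    SELECT b.isbn
    FROM books b
    WHERE EXISTS (SELECT 1 FROM search_titles s WHERE s.title = b.title)
    ORDER BY b.isbn
""")]

print(isbns)  # ['111', '222']
```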

Given the choice between the two options, avoid the IN clause: as the number of items in the list goes up, the query plan changes and can very quickly switch from a potential seek to a scan.
The normal tipping point (and I double-checked on AdventureWorks) is the 65th item, where the plan changes from a seek to a scan.

Duplicate JSON element in row_to_json() with self-join

This is a follow-up to this excellent Q&A: 13227142.
I almost have to do the same thing (with the constraint of PostgreSQL 9.2) but I'm using only one table. Therefore the query uses a self-join (in order to produce the correct JSON format) which results in a duplicate id field. How can I avoid this?
Example:
CREATE TABLE books
(
id serial primary key,
isbn text,
author text,
title text,
edition text,
teaser text
);
SELECT row_to_json(row)
FROM
(
SELECT id AS bookid,
author,
cover
FROM books
INNER JOIN
(
SELECT id, title, edition, teaser
FROM books
) cover(id, title, edition, teaser)
USING (id)
) row;
Result:
{
"bookid": 1,
"author": "Bjarne Stroustrup",
"cover": {
"id": 1,
"title": "Design and Evolution of C++",
"edition": "1st edition",
"teaser": "This book focuses on the principles, processes and decisions made during the development of the C++ programming language"
}
}
I want to get rid of "id" in "cover".
This turned out to be a tricky task. As far as I can see it's impossible to achieve with a simple query.
One solution is to use a predefined data type:
CREATE TYPE bookcovertype AS (title text, edition text, teaser text);
SELECT row_to_json(row)
FROM
(
SELECT books.id AS bookid, books.author,
row_to_json(row(books.title, books.edition, books.teaser)::bookcovertype) as cover
FROM books
) row;
You need id to join, so without id you can't write such a short query. You need to build the structure yourself. Something like:
select row_to_json(row,true)
FROM
(
with a as (select id,isbn,author,row_to_json((title,edition,teaser)) r from books
)
select a.id AS bookid,a.author, concat('{"title":',r->'f1',',"edition":',r->'f2',',"teaser":',r->'f3','}')::json as cover
from a
) row;
row_to_json
--------------------------------------------------------
{"bookid":1, +
"author":"\"b\"", +
"cover":{"title":"c","edition":"d","teaser":"\"b\""}}
(1 row)
Also, without the join you use about half the resources.
For the sake of completeness, I've stumbled upon another answer myself: the additional fields can be eliminated with string functions. However, I prefer AlexM's answer because it will be faster and is still compatible with PostgreSQL 9.2.
SELECT regexp_replace(
(
SELECT row_to_json(row)
FROM
(
SELECT id AS bookid,
author,
cover
FROM books
INNER JOIN
(
SELECT id, title, edition, teaser
FROM books
) cover(id, title, edition, teaser)
USING (id)
) row
)::text,
'"id":\d+,',
'')

How to show one to many relationship between two tables in SQL?

I have two tables A and B.
Table A contains
postid, postname, CategoryURL
and
table B contains
postid, CategoryImageURL
One postid can have multiple CategoryImageURL values assigned. I want to display those CategoryImageURL values alongside table A, so that for one postid the output looks like CategoryImageURL1,CategoryImageURL2.
I want to show this one-to-many relationship for each postid. What logic should be written in the SQL function?
As I read it, you want to display all related CategoryImageURLs from the second table on one line, separated by commas?
Then you will need a recursive operation. A CTE (Common Table Expression) might do the trick; see below. I have added another key to the second table so that I can check whether all rows of the second table have been processed for the corresponding row in the first table.
Maybe this helps:
with a_cte (post_id, url_id, name, list, rrank) as
(
select
a.post_id
, b.url_id
, a.name
, cast(b.urln + ', ' as nvarchar(100)) as list
, 0 as rrank
from
dbo.a
join dbo.b
on a.post_id = b.post_id
union all
select
c.post_id
, a1.url_id
, c.name
, cast(c.list + case when rrank = 0 then '' else ', ' end + a1.urln as nvarchar(100))
, c.rrank + 1
from a_cte c
join ( select
b.post_id
, b.url_id
, a.name
, b.urln
from dbo.a
join dbo.b
on a.post_id = b.post_id
) a1
on c.post_id = a1.post_id
and c.url_id < a1.url_id -- ==> take care, that there is no endless loop
)
select d.name, d.list
from
(
select name, list, rank() over (partition by post_id order by rrank desc)
from a_cte
) d (name, list, rank)
where rank = 1
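For comparison, engines with a built-in string aggregate (for example STRING_AGG in SQL Server 2017 and later) can produce the same comma-separated list without recursion. The sketch below is not the recursive CTE above; it swaps in SQLite's group_concat, run through Python's sqlite3, and the sample rows are invented for illustration. The table and column names follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in (assumption)
conn.executescript("""
    CREATE TABLE a (post_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (post_id INTEGER, url_id INTEGER, urln TEXT);
    INSERT INTO a VALUES (1, 'first post'), (2, 'second post');
    INSERT INTO b VALUES
        (1, 1, 'img1.png'),
        (1, 2, 'img2.png'),
        (2, 3, 'img3.png');
""")

# One output row per post, with all of its image URLs joined by ', '.
rows = conn.execute("""
    SELECT a.name, group_concat(b.urln, ', ') AS list
    FROM a
    JOIN b ON b.post_id = a.post_id
    GROUP BY a.post_id, a.name
    ORDER BY a.post_id
""").fetchall()

# group_concat does not promise an order inside each list,
# so split and sort before comparing or displaying.
url_lists = {name: sorted(urls.split(", ")) for name, urls in rows}
print(url_lists)
```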
You are asking the wrong sort of question. This is about normalization.
As it stands, you have a redundancy, where each postname and CategoryURL is represented by an ID field.
For whatever reason, CategoryImageUrl was separated into its own table and linked to each set of postname and CategoryURL.
If the relation is actually one ID to each postname, then you can denormalize by adding the column CategoryImageUrl to your first table.
Postid, postname, CategoryURL, CategoryImageUrl
Or if you wish to keep the normalization, combine like fields into their own table like so:
--TableA:
Postid, postname, <any other field dependent on postname >
--TableB:
Postid, CategoryURL, CategoryImageUrl
Now this groups the category fields together, but it is redundant in that the same CategoryURL is stored multiple times. However, each Postid has only one CategoryUrl.
To remove this redundancy in our table, we could use a Star Schema strategy like this:
-- Post Table
Postid, postname
-- Category table
CategoryID, CategoryURL, <any other info dependent only on CategoryURL>
-- Fact Table
Postid, CategoryID, CategoryImageURL
DISCLAIMER: Naturally I assumed aspects of your data and might be off. However, the strategy of normalization is still the same.
Also, remember that SQL is relational and deals with sets of data. Inheritance is incompatible with relational set theory. Every table can be queried forwards and backwards, much the way every page and chapter in a book is treated as part of the book. At no point would we see a chapter independent of a book.

Efficient way of getting group ID without sorting

Imagine I have a denormalized table like so:
CREATE TABLE Persons
(
Id int identity primary key,
FirstName nvarchar(100),
CountryName nvarchar(100)
)
INSERT INTO Persons
VALUES ('Mark', 'Germany'),
('Chris', 'France'),
('Grace', 'Italy'),
('Antonio', 'Italy'),
('Francis', 'France'),
('Amanda', 'Italy');
I need to construct a query that returns the name of each person, and a unique ID for their country. The IDs do not necessarily have to be contiguous; more importantly, they do not have to be in any order. What is the most efficient way of achieving this?
The simplest solution appears to be DENSE_RANK:
SELECT FirstName,
CountryName,
DENSE_RANK() OVER (ORDER BY CountryName) AS CountryId
FROM Persons
-- FirstName CountryName CountryId
-- Chris France 1
-- Francis France 1
-- Mark Germany 2
-- Amanda Italy 3
-- Grace Italy 3
-- Antonio Italy 3
However, this incurs a sort on my CountryName column, which is a wasteful performance hog. I came up with this alternative, which uses ROW_NUMBER with the well-known trick for suppressing its sort:
SELECT P.FirstName,
P.CountryName,
C.CountryId
FROM Persons P
JOIN (
SELECT CountryName,
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CountryId
FROM Persons
GROUP BY CountryName
) C
ON C.CountryName = P.CountryName
-- FirstName CountryName CountryId
-- Mark Germany 2
-- Chris France 1
-- Grace Italy 3
-- Antonio Italy 3
-- Francis France 1
-- Amanda Italy 3
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)? Are there factors that might make a difference either way (such as an index on CountryName)? Is there a more elegant way of expressing it?
Why would you think that an aggregation would be cheaper than a window function? I ask, because I have some experience with both, and don't have a strong opinion on the matter. If pressed, I would guess the window function is faster, because it does not have to aggregate all the data and then join the result back in.
The two queries will have very different execution paths. The right way to see which performs better is to try it out. Run both queries on large enough samples of data in your environment.
By the way, I don't think there is a right answer, because performance depends on several factors:
Which columns are indexed?
How large is the data? Does it fit in memory?
How many different countries are there?
If you are concerned about performance, and just want a unique number, you could consider using checksum() instead. This does run the risk of collisions. That risk is very, very small for 200 or so countries. Plus you can test for it and do something about it if it does occur. The query would be:
SELECT FirstName, CountryName, CheckSum(CountryName) AS CountryId
FROM Persons;
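The "test for it" step can be sketched outside the database: hash every distinct country name and report any hash shared by more than one name. This is only an illustration. zlib.crc32 is used as a stand-in and is NOT the algorithm behind SQL Server's CHECKSUM, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def crc32_id(name):
    # Stand-in hash (assumption): not SQL Server's CHECKSUM algorithm.
    return zlib.crc32(name.encode("utf-8"))


def find_collisions(names, hash_fn=crc32_id):
    """Return {hash: [names]} for every hash shared by two or more distinct names."""
    buckets = defaultdict(set)
    for name in names:
        buckets[hash_fn(name)].add(name)
    return {h: sorted(group) for h, group in buckets.items() if len(group) > 1}


countries = ["Germany", "France", "Italy", "Italy", "France", "Italy"]
print(find_collisions(countries))

# A deliberately weak hash (string length) shows what a collision report looks like.
print(find_collisions(["Italy", "Spain", "France"], hash_fn=len))  # {5: ['Italy', 'Spain']}
```

An empty result means every distinct name got its own ID; a non-empty one lists the names that would need special handling.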
Your second query would most probably avoid sorting as it would use a hash match aggregate to build the inner query, then use a hash match join to map the ID to the actual records.
This does not sort indeed, but has to scan the original table twice.
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)?
Not necessarily. If you created a clustered index on CountryName, sorting would be a non-issue and everything would be done in a single pass.
Is there a more elegant way of expressing it?
A "correct" plan would be doing the hashing and hash lookups in one go.
Each record, as it's read, would have to be matched against the hash table. On a match, the stored ID would be returned; on a miss, the new country would be added into the hash table, assigned with new ID and that newly assigned ID would be returned.
But I can't think of a way to make SQL Server use such a plan in a single query.
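The single-pass plan described above is easy to express in application code, even if the optimiser will not produce it. A minimal sketch, assuming the rows are streamed to the client as (FirstName, CountryName) pairs; the function name is hypothetical.

```python
def assign_country_ids(rows):
    """Single pass: look each country up in a hash table, assigning a new ID on a miss."""
    ids = {}      # country name -> assigned ID
    result = []
    for first_name, country in rows:
        # On a hit the stored ID is returned; on a miss the next ID is stored and returned.
        country_id = ids.setdefault(country, len(ids) + 1)
        result.append((first_name, country, country_id))
    return result


persons = [
    ("Mark", "Germany"),
    ("Chris", "France"),
    ("Grace", "Italy"),
    ("Antonio", "Italy"),
    ("Francis", "France"),
    ("Amanda", "Italy"),
]

for row in assign_country_ids(persons):
    print(row)
# ('Mark', 'Germany', 1)
# ('Chris', 'France', 2)
# ('Grace', 'Italy', 3)
# ('Antonio', 'Italy', 3)
# ('Francis', 'France', 2)
# ('Amanda', 'Italy', 3)
```

The IDs follow the order in which countries are first seen, so they are unique per country but not sorted, which is all the question asks for.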
Update:
If you have lots of records, few countries and, most importantly, a non-clustered index on CountryName, you could emulate loose scan to build a list of countries:
DECLARE @country TABLE
(
id INT NOT NULL IDENTITY PRIMARY KEY,
countryName VARCHAR(MAX)
)
;
WITH country AS
(
SELECT TOP 1
countryName
FROM persons
ORDER BY
countryName
UNION ALL
SELECT (
SELECT countryName
FROM (
SELECT countryName,
ROW_NUMBER() OVER (ORDER BY countryName) rn
FROM persons
WHERE countryName > country.countryName
) q
WHERE rn = 1
)
FROM country
WHERE countryName IS NOT NULL
)
INSERT
INTO @country (countryName)
SELECT countryName
FROM country
WHERE countryName IS NOT NULL
OPTION (MAXRECURSION 0)
SELECT p.firstName, c.id
FROM persons p
JOIN @country c
ON c.countryName = p.countryName
GROUP BY can also use a sort operator in the background (grouping is based on 'sort and compare', like IComparable in C#).

SQL Queries instead of Cursors

I'm creating a database for a hypothetical video rental store.
All I need to do is a procedure that checks the availability of a specific movie (obviously the movie can have several copies). So I have to check whether there is a copy available to rent, and take the number of the copy (because it will affect another trigger later).
I already did everything with cursors and it actually works very well, but I need (i.e. "must") to do it without cursors, using just "pure SQL" (i.e. queries).
I'll explain briefly the scheme of my DB:
The tables that this procedure is going to use are 3: 'Copia Film' (Movie Copy) , 'Include' (Includes) , 'Noleggio' (Rent).
The Copia Film table has these attributes:
idCopia
Genere (FK references to Film)
Titolo (FK references to Film)
dataUscita (FK references to Film)
Include Table:
idNoleggio (FK references to Noleggio. Means idRent)
idCopia (FK references to Copia film. Means idCopy)
Noleggio Table:
idNoleggio (PK)
dataNoleggio (dateOfRent)
dataRestituzione (dateReturn)
dataRestituito (dateReturned)
CF (FK to Person)
Prezzo (price)
Every movie can have more than one copy.
Every copy can be available in two cases:
The copy ID is not present in the Include table (that means that the specific copy has never been rented)
The copy ID is present in the Include table and dataRestituito (dateReturned) is not null (that means that the specific copy has been rented but has already been returned)
The query I've tried to do is the following and is not working at all:
SELECT COUNT(*)
FROM NOLEGGIO
WHERE dataNoleggio IS NOT NULL AND dataRestituito IS NOT NULL AND idNoleggio IN (
SELECT N.idNoleggio
FROM NOLEGGIO N JOIN INCLUDE I ON N.idNoleggio=I.idNoleggio
WHERE idCopia IN (
SELECT idCopia
FROM COPIA_FILM
WHERE titolo='Pulp Fiction')) -- Of course the title is just an example
Well, from the query above I can't tell whether a copy of the selected movie is available, AND I can't get the copy ID if a copy of the movie is available.
(If you want, I can paste the cursors lines that work properly)
------ USING THE 'WITH SOLUTION' ----
I modified your code a little, to this:
WITH film
as
(
SELECT idCopia,titolo
FROM COPIA_FILM
WHERE titolo = 'Pulp Fiction'
),
copy_info as
(
SELECT N.idNoleggio, N.dataNoleggio, N.dataRestituito, I.idCopia
FROM NOLEGGIO N JOIN INCLUDE I ON N.idNoleggio = I.idNoleggio
),
avl as
(
SELECT film.titolo, copy_info.idNoleggio, copy_info.dataNoleggio,
copy_info.dataRestituito,film.idCopia
FROM film LEFT OUTER JOIN copy_info
ON film.idCopia = copy_info.idCopia
)
SELECT COUNT(*),idCopia FROM avl
WHERE(dataRestituito IS NOT NULL OR idNoleggio IS NULL)
GROUP BY idCopia
As I said in the comment, this code works properly if I use it just in a query, but once I try to make a procedure from this, I got errors.
The problem is the final SELECT:
SELECT COUNT(*), idCopia INTO CNT,COPYFILM
FROM avl
WHERE (dataRestituito IS NOT NULL OR idNoleggio IS NULL)
GROUP BY idCopia
The error is:
ORA-01422: exact fetch returns more than requested number of rows
ORA-06512: at "VIDEO.PR_AVAILABILITY", line 9.
So it seems the INTO clause is wrong, because the query obviously returns more than one row. What can I do? I need to take the copy ID (even just the first one in the list of rows) without using cursors.
You can try this -
WITH film
as
(
SELECT idCopia, titolo
FROM COPIA_FILM
WHERE titolo='Pulp Fiction'
),
copy_info as
(
select N.idNoleggio, N.dataNoleggio , N.dataRestituito , I.idCopia
FROM NOLEGGIO N JOIN INCLUDE I ON N.idNoleggio=I.idNoleggio
),
avl as
(
select film.titolo, copy_info.idNoleggio, copy_info.dataNoleggio,
copy_info.dataRestituito
from film LEFT OUTER JOIN copy_info
ON film.idCopia = copy_info.idCopia
)
select * from avl
where (dataRestituito IS NOT NULL OR idNoleggio IS NULL);
You should think in terms of sets, rather than records.
If you find the set of all the films that are out, you can exclude them from your stock, and the rest is rentable.
select copiafilm.* from #f copiafilm
left join
(
select idCopia from #r Noleggio
inner join #i include on Noleggio.idNoleggio = include.idNoleggio
where dataRestituito is null
) out
on copiafilm.idCopia = out.idCopia
where out.idCopia is null
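The set-based idea above can be tried on a small data set. The sketch below is an assumption-laden illustration: it runs on SQLite through Python's sqlite3 rather than Oracle, the sample rows are invented, and the table holding the rental lines is quoted as "include" only to keep the original name. It excludes every copy that is currently out, then uses COUNT and MIN so the query always returns exactly one row, which is the shape a SELECT ... INTO needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for Oracle (assumption)
conn.executescript("""
    CREATE TABLE copia_film (idCopia INTEGER PRIMARY KEY, titolo TEXT);
    CREATE TABLE noleggio (
        idNoleggio INTEGER PRIMARY KEY,
        dataNoleggio TEXT,
        dataRestituito TEXT
    );
    CREATE TABLE "include" (idNoleggio INTEGER, idCopia INTEGER);

    -- Copy 1 is out, copy 2 was rented and returned, copy 3 was never rented.
    INSERT INTO copia_film VALUES
        (1, 'Pulp Fiction'), (2, 'Pulp Fiction'), (3, 'Pulp Fiction');
    INSERT INTO noleggio VALUES
        (10, '2020-01-01', NULL),
        (11, '2020-01-02', '2020-01-05');
    INSERT INTO "include" VALUES (10, 1), (11, 2);
""")

AVAILABILITY_SQL = """
    SELECT COUNT(*), MIN(c.idCopia)
    FROM copia_film c
    LEFT JOIN (
        -- the set of copies that are currently out
        SELECT i.idCopia
        FROM noleggio n
        JOIN "include" i ON i.idNoleggio = n.idNoleggio
        WHERE n.dataRestituito IS NULL
    ) rented_out ON rented_out.idCopia = c.idCopia
    WHERE c.titolo = ?
      AND rented_out.idCopia IS NULL
"""


def availability(conn, title):
    """Return (number of available copies, lowest available copy ID or None)."""
    return conn.execute(AVAILABILITY_SQL, (title,)).fetchone()


print(availability(conn, "Pulp Fiction"))  # (2, 2)
print(availability(conn, "Jackie Brown"))  # (0, None)
```

Because the aggregates are computed without GROUP BY, a title with no available copy still yields one row, with a count of 0 and no copy ID.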
I solved the problem by changing the last query into this one:
SELECT COUNT(*),idCopia INTO CNT,idCopiaFilm
FROM avl
WHERE (dataRestituito IS NOT NULL OR idNoleggio IS NULL) AND rownum = 1
GROUP BY idCopia;
IF CNT > 0 THEN
-- FOUND AVAILABLE COPY
END IF;
EXCEPTION
WHEN NO_DATA_FOUND THEN
-- NOT FOUND AVAILABLE COPY
Thank you @Aditya Kakirde! Your suggestion almost solved the problem.

TSQL query that uses function and view is very slow

Ok, first of all thanks in advance if you read through this whole thing as it may be quite painful on several levels.
It's a long post
It's gross
It's going to probably make your brain hurt
But on the plus side, after reading through this whole thing I have a feeling the answer is very obvious and simple, so you have that going for you.
So I'll tell you the problem in a nutshell, and then in more detail:
Nutshell
I have a query in SQL Server 2008r2 that is taking a very long time to complete.
I have several tables that contain information about a child and its parent.
A child in one table can have a parent in another table which then could have a parent in another table (there are only 3 tables).
I want to be able to take a child's name as a string, figure out its hierarchy of ancestors, and return that as a period-delimited string. So Grandpappy.Grandpa.Dad.Me.
I have this all working, it's just taking forever so I'm doing something stupid, or poorly performant, or most likely both.
I have NO control over the tables, they are what they are and I can't do anything to them. I created a view and a function (which you will see below) and that is all I can control.
The table names and values below are obviously fictitious.
Detailed description
Here are the tables that indicate children and parents. In this example we will be dealing with Fruits, Vegetables, and Planets.
A Planet has no parents.
A Fruit has a Parent who is a Planet, or a Fruit.
A Vegetable has a Parent who is a Fruit, or a Planet, or a Vegetable.
Let's take a look at them...
Table 1 = Planets (I have no parents)
ID, Name
1, Earth
2, Saturn
Table 2 = Fruits (my parent is either a planet or a fruit)
ID, Name, PlanetName, FruitName
1, Kiwi, Earth, null
2, Strawberry, Saturn, null
3, Banana, null, Strawberry
Table 3 = Vegetables (my parent is planet or a fruit or a vegetable)
ID, Name, FruitName, PlanetName, VegetableName
1, Potato, Kiwi, null, null
2, Squash, null, Earth, null
3, Pumpkin, null, null, Potato
Table 4 = BigTable (this will be the one the main slow query is using. It has a column that contains just a child's name and it could be a planet or a fruit or a vegetable)
ID, Name, OneOfTheThree
1, John, Earth
2, Steve, Kiwi
3, Joe, Saturn
4, Jane, Potato
We have our tables and we have our data, what do I want to do now?
I want to create a query that looks at all of the OneOfTheThree values in the BigTable, finds out what their lineage is (who their dads, grandparents etc. are) and returns that to the caller.
So my thought was to do this:
Create a view that pulls the three tables (Planet, Fruit, Vegetable) into one single view that shows Name and Parent.
Create a function that takes in a Name. It then uses that view to find out who the Parent is for that Name. It then looks to see who the Parent is for that Parent, and on and on until the Parent is null and it stops (because that's the top of the ancestry chain... we made it all the way to Planet, who has no parents).
Create a query to query BigTable and then use the above function on BigTable's OneOfTheThree column to get the ancestry of the name in OneOfTheThree.
So I did it as follows:
My view
View = vwEverybodyAndTheirParents
-- Planets
SELECT Name, null AS Parent
FROM Planets
UNION
-- Fruits
SELECT Name, PlanetName AS Parent
FROM Fruits
UNION
-- Vegetables
SELECT Name, CASE WHEN FruitName IS NOT NULL THEN FruitName WHEN PlanetName IS NOT NULL THEN PlanetName ELSE NULL END AS Parent
FROM Vegetables
Ok, that gives me everything and its parents. Now for the function to crawl that view and give me the period-delimited string of the full ancestry:
My function
CREATE FUNCTION dbo.fnGetMyParent(@NameToGetParentsFor varchar(255))
RETURNS varchar(255)
AS
BEGIN
DECLARE @InternalName varchar(255)
DECLARE @ParentName varchar(255)
DECLARE @ConcatenatedParentStringToReturn varchar(max)
-- Start with the name itself and look up its immediate parent
SELECT @ParentName = Parent
,@ConcatenatedParentStringToReturn = Name
FROM vwEverybodyAndTheirParents
WHERE Name = @NameToGetParentsFor
-- Walk up the chain, prepending each ancestor, until there is no parent
WHILE @ParentName IS NOT NULL
BEGIN
SELECT @InternalName = Name,
@ParentName = Parent
FROM vwEverybodyAndTheirParents
WHERE Name = @ParentName
SET @ConcatenatedParentStringToReturn = RTRIM(@InternalName) + '.' + RTRIM(@ConcatenatedParentStringToReturn)
END
RETURN @ConcatenatedParentStringToReturn
END
This function works fine (though could be poorly coded and poorly performing?), so with all the above examples if I were to call it like so:
dbo.fnGetMyParent('Potato')
I get back the concatenated string of:
Earth.Kiwi.Potato
The problem
Ok, so now to finally get to the problem... the big query that takes forever:
SELECT Name,
OneOfTheThree,
dbo.fnGetMyParent(OneOfTheThree) as HeirarchyOfParents
FROM BigTable
I can see why it could take so long as for each value it executes the function which needs to then crawl a view. So...
My questions to you
How can I speed this up?
Do I need to put an index on the view?
Is my approach off, and should I do this differently?
If so, what do you recommend?
A BIG THANK YOU if you made it this far!
First of all, when using SQL you should avoid loops as much as you can (unless the situation calls for it).
Second, there is no need for the view or the function, as your query can easily be written in one go.
select
bt.Name
,bt.OneOfTheThree
,p.Name+'.'+isnull(f.Name,'')+'.'+isnull(v.Name,'')+'.'+bt.Name as HeirarchyOfParents
from BigTable bt
left join Vegetables v
on bt.OneOfTheThree = v.name
left join Fruits f
on coalesce(v.FruitName,bt.OneOfTheThree) = f.Name
left join Planets p
on coalesce(f.PlanetName,v.PlanetName,bt.OneOfTheThree) = p.Name
The last join you can remove if the table is consistent with the others, as it does not bring new information (the planet name is already there).
The improvements that you can bring here are with indexes on the tables, if you are able to do that.
Ok, with the new information, the easiest way I can think of is the following:
;with ftemp as (
select
name as path
,PlanetName
,name as root
,name as name
,FruitName as parent
,0 as cnt
from fruits
union all
select
fruits.name + '.' + ftemp.path
,ftemp.PlanetName
,root
,fruits.name
,fruits.FruitName -- parent of the row just added, so the climb can continue
,cnt+1
from fruits
join ftemp
on fruits.name= ftemp.parent
)
,fg as (
select
name
,max(cnt) as cnt
from ftemp
group by name
)
,f as (
select
ftemp.*
from ftemp
join fg
on ftemp.cnt = fg.cnt
and ftemp.name = fg.name
)
,vtemp (same idea)
,vg (same idea)
,v (same idea)
select
bt.Name
,bt.OneOfTheThree
,p.Name+'.'+isnull(f.Path+'.','')+isnull(v.Path+'.','')+bt.Name as HeirarchyOfParents
from BigTable bt
left join v
on bt.OneOfTheThree = v.name
left join f
on coalesce(v.FruitName,bt.OneOfTheThree) = f.Name
left join Planets p
on coalesce(f.PlanetName,v.PlanetName,bt.OneOfTheThree) = p.Name
With this approach, though, I have no idea what performance it will yield. So it's up to you to complete the query and test.
Hope it helps.
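As a rough illustration of the set-based direction, the whole lineage can be computed once with a recursive CTE over a single name/parent relation and then joined to BigTable, instead of calling a function per row. This is a sketch under stated assumptions, not the query above completed: it runs on SQLite through Python's sqlite3, the table everybody is a hypothetical stand-in for the vwEverybodyAndTheirParents view with every parent link filled in, and SQL Server would need its own syntax (no RECURSIVE keyword, + for concatenation, explicit casts on the path column).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for SQL Server (assumption)
conn.executescript("""
    -- Hypothetical flattened name/parent relation, standing in for the view.
    CREATE TABLE everybody (name TEXT PRIMARY KEY, parent TEXT);
    INSERT INTO everybody VALUES
        ('Earth', NULL), ('Saturn', NULL),
        ('Kiwi', 'Earth'), ('Strawberry', 'Saturn'), ('Banana', 'Strawberry'),
        ('Potato', 'Kiwi'), ('Squash', 'Earth'), ('Pumpkin', 'Potato');

    CREATE TABLE big_table (id INTEGER PRIMARY KEY, name TEXT, one_of_the_three TEXT);
    INSERT INTO big_table VALUES
        (1, 'John', 'Earth'), (2, 'Steve', 'Kiwi'),
        (3, 'Joe', 'Saturn'), (4, 'Jane', 'Potato');
""")

rows = conn.execute("""
    WITH RECURSIVE lineage(name, path) AS (
        -- roots: entries with no parent (the planets)
        SELECT name, name FROM everybody WHERE parent IS NULL
        UNION ALL
        -- each child extends its parent's path by one segment
        SELECT e.name, l.path || '.' || e.name
        FROM everybody e
        JOIN lineage l ON e.parent = l.name
    )
    SELECT b.name, b.one_of_the_three, l.path
    FROM big_table b
    JOIN lineage l ON l.name = b.one_of_the_three
    ORDER BY b.id
""").fetchall()

for row in rows:
    print(row)
# ('John', 'Earth', 'Earth')
# ('Steve', 'Kiwi', 'Earth.Kiwi')
# ('Joe', 'Saturn', 'Saturn')
# ('Jane', 'Potato', 'Earth.Kiwi.Potato')
```

Whether this beats the scalar function on the real tables would still have to be measured there.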